Deep Learning Notes from Zero to Hero

Comprehensive notes from the Machine Learning Research Lab course @ Universitas Indonesia [In Progress]

Introduction to Deep Learning

1. Fundamental Definition of Machine Learning

Tom Mitchell (1997) provided a concise definition of machine learning:

A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at task T, as measured by P, improves with experience E.

Illustrative Examples:

Major Categories of Machine Learning Tasks:

Deep Learning is a specialized subfield of machine learning that employs neural networks with multiple layers to solve complex tasks, particularly excelling in supervised learning scenarios.

2. Supervised Learning: Core Concepts

Supervised learning involves learning a function f: X → Y from labeled pairs (x, y).

Primary Types:

Classification:

Regression:

Real-World Applications:

3. Unsupervised Learning: Discovering Hidden Patterns

The primary goal is to learn the underlying distribution of input data without explicit labels.

Self-Supervised Learning:

A powerful approach that constructs pseudo-labels from the data itself, such as predicting the next word in a sentence.

Notable Examples:

4. Curve Fitting: A Fundamental Regression Example

Task Definition:

Predict target value t from scalar input x (e.g., sin(2πx) + noise)

Model Formulation:

Polynomial representation: $y(x, w) = \sum_{j=0}^{M} w_j x^j$

Error Measurement:

Squared Loss Function: $E(w) = \frac{1}{2} \sum_{n=1}^{N} (y(x_n, w) - t_n)^2$
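As a quick illustration (a minimal numpy sketch, assuming the $\sin(2\pi x)$ toy data above; the variable names are ours), we can generate noisy samples, fit the polynomial by least squares, and evaluate the squared loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: t = sin(2*pi*x) + Gaussian noise
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Degree-M polynomial y(x, w) = sum_j w_j x^j, fitted by least squares
M = 3
Phi = np.vander(x, M + 1, increasing=True)     # columns [1, x, x^2, ..., x^M]
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Squared loss E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
E = 0.5 * np.sum((Phi @ w - t) ** 2)
print(w, E)
```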

Model Complexity Considerations:

Detecting Overfitting:

Strategies for Controlling Overfitting:

Model Selection Process:

Probabilistic Framework

Motivation

Machine learning models face uncertainty: applying machine learning means dealing with uncertainty, since no machine learning model can achieve perfect accuracy.

Probability Theory

Probability Theory. Handling uncertainty requires the framework of probability theory.

Frequentist view of probability. Probability is the proportion of the frequency of an event happening in the limit of an infinite number of trials.

Bayesian View of Probability. Probability is the quantification of uncertainty.

Joint Probability $P(X, Y)$. Probability of $X=x$ AND $Y=y$ denoted as $P(X=x, Y=y)=p_{X,Y}(x, y)$

Conditional probability $P(X\mid Y)$. Probability of $X=x$ given $Y=y$ denoted as $P(X=x\mid Y=y)=p_{X\mid Y}(x\mid y)$

Marginal probability $P(X)$. Probability of $X=x$ obtained via the sum rule, denoted as $P(X=x)=p_X(x)=\sum_{y}p_{X,Y}(x, y)$ in the discrete case or $\int_{-\infty}^\infty p_{X,Y}(x, y)dy$ in the continuous case.

Product rule $P(X=x, Y=y)$. $P(X=x, Y=y)=p(Y=y\mid X=x)P(X=x)$.

Probability Distribution. A probability distribution for a random variable $X$ is a “complete specification of a way to compute probabilities of events about $X$.” It specifies a probability mass function (pmf) in the discrete case or a probability density function (pdf) in the continuous case. Typically, the pmf or pdf also depends on a set of parameters that governs its shape. Keep in mind that a pmf/pdf must be nonnegative and normalized (it sums/integrates to 1 over its support set). If a distribution is not normalized, it is called improper.

Bayes Theorem. The core principle of Bayesian statistics: how we update our belief after seeing some data (which is learning, in some sense).

\[P(Y\mid X) = \frac{P(X\mid Y)P(Y)}{P(X)}\]

| Term | Name | Intuition |
|---|---|---|
| $P(Y\mid X)$ | Posterior | What we want to know: belief about $Y$ after seeing data $X$ |
| $P(X\mid Y)$ | Likelihood | How likely is data $X$ under hypothesis $Y$? |
| $P(Y)$ | Prior | Belief about $Y$ before seeing any data |
| $P(X)$ | Evidence (aka Marginal Likelihood) | How likely is the data $X$ overall, across all possible values of $Y$? |

For those who confuse likelihood and probability, here is a genius explanation from my bro ChatGPT.

Likelihood and probability look similar but are conceptually different depending on the context. Whether we call $P(X\mid Y)$ the likelihood depends on what is fixed and what is variable.

| Term | Variable | Fixed | Interpretation |
|---|---|---|---|
| Probability $P(X\mid Y)$ | $X$ | $Y$ | Given model/parameter $Y$, how likely is the data $X$? |
| Likelihood $L(Y)=P(X\mid Y)$ | $Y$ | $X$ | Given observed data $X$, how likely is a particular model/parameter $Y$? |

Independent Event.

\[P(X,Y) = P(X)P(Y)\quad\iff\quad P(Y\mid X) = P(Y)\]

If the posterior probability equals the prior probability (additional information does not change the prior belief), then $X$ and $Y$ are independent.

Expectation. Expectation of some function $f(x)$ is the weighted average of $f$ under a probability distribution $p(x)$.

\[\mathbb E_{p(x)}[f(x)] = \sum_x f(x)p(x) \quad\text{or}\quad \int_{-\infty}^\infty f(x)p(x)dx\]

$\mathbb E_{p(x)}[f(x, y)]$ indicates expectation of a function over several variables, and the subscript $p(x)$ denotes the distribution under which the expected value of the function is computed. In this case, $\mathbb E_{p(x)}[f(x, y)]$ becomes a function of $y$.

Conditional Expectation. Conditional expectation is defined w.r.t. a conditional distribution $p(x\mid y)$.

\[\mathbb E_{p(x|y)}[f(x)] = \sum_x p(x\mid y)f(x)\quad\text{or}\quad \int_{-\infty}^\infty p(x\mid y)f(x)dx\]

Variance. Variance of $f(x)$ measures how much the value of $f$ varies over all possible values of $x$ under a certain $p(x)$.

\[\begin{align*} Var_{p(x)}[f(x)] &= \mathbb E_{p(x)}[(f(x)-\mathbb E_{p(x)}[f(x)])^2] \\ Var_{p(x)}[f(x)] &= \mathbb E_{p(x)}[f(x)^2] - \mathbb E_{p(x)}[f(x)]^2 \end{align*}\]

Covariance. Measures the extent to which two variables vary together.

\[\begin{align*} Cov_{p(x, y)}[f(x), g(y)] &= \mathbb E_{p(x, y)}[(f(x) - \mathbb E_{p(x, y)}[f(x)])(g(y) - \mathbb E_{p(x, y)}[g(y)])] \\ Cov_{p(x, y)}[f(x), g(y)] &= \mathbb E_{p(x, y)}[f(x)g(y)] - \mathbb E_{p(x, y)}[f(x)]\mathbb E_{p(x, y)}[g(y)] \end{align*}\]

If x and y are two vectors of random variables, their covariance is a matrix

\[Cov_{p(x, y)}[\vec x, \vec y] = \mathbb E_{p(x, y)}[xy^T] - \mathbb E_{p(x)}[x]\mathbb E_{p(y)}[y^T]\]

Note that $Cov(x, x) = Var(x)$

Example distributions

  1. Uniform
  2. Exponential
  3. Laplace
  4. Gaussian

Likelihood

Probability vs Likelihood. Probability is the chance that a particular outcome occurs based on the values of the distribution parameters. When calculating probability, we assume that the parameters are trustworthy. Likelihood is how well a sample/data provides support for particular values of the distribution parameter. When calculating likelihood, we’re trying to determine if we can trust the parameter values based on the data that we observe.

\[L(\theta\mid X=x) = P_\theta(x) = P(X=x; \theta) = f_\theta(x)\text{ for continuous pdf }f_\theta(x)\]

In the continuous case we use the pdf $f$, which is interpreted as a density, whereas in the discrete case we use the pmf $p$, which is interpreted as an actual probability. The integral of the likelihood over all parameter values need not equal 1.

Computing Likelihood. Given a dataset $\mathcal D=\{x_1,\dots, x_N\}$ that is i.i.d. (independent and identically distributed), the likelihood is computed as

\[L(\theta\mid \mathcal D) = \prod_{i=1}^N P(x_i;\theta) = \prod_{i=1}^N f_\theta(x_i)\]

Multiplying a large number of small values can be numerically problematic (underflow), so we often work with the log-likelihood instead.

\[LL(\theta) = \log L(\theta\mid \mathcal D) = \sum_{i=1}^N \log P(x_i;\theta)\]

Density estimation (aka parameter estimation). The task of estimating the distribution, given a finite set of observations. It is ill-posed because there are infinitely many distributions that could have generated the observed finite dataset.

One solution (among others) for density estimation is maximizing likelihood of the parameters with respect to the given dataset. Parameters obtained by likelihood maximization are usually called maximum likelihood estimators.
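A small sketch of this procedure for a 1-D Gaussian (assuming i.i.d. samples; for the Gaussian the maximum likelihood estimators are the sample mean and variance in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)      # i.i.d. dataset D

def gaussian_log_likelihood(mu, sigma2, x):
    # LL(theta) = sum_i log N(x_i | mu, sigma2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Maximum likelihood estimators for the Gaussian (closed form)
mu_mle = data.mean()
sigma2_mle = data.var()                                # 1/N convention, not 1/(N-1)

print(gaussian_log_likelihood(mu_mle, sigma2_mle, data))   # highest log-likelihood
print(gaussian_log_likelihood(0.0, 1.0, data))              # worse parameters, lower LL
```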

Information Theory

Shannon’s idea: relate information content to degree of surprise. The view is that surprise $\iff$ rarity, and rarity $\iff$ more information.

Measuring information. Suppose, for a discrete RV $X$, we observe that $X = x$ for a particular $x$, how much information do we receive? Let $h(x)$ be the measure of information.

  1. h(x) should be additive. If two unrelated facts x, y occur at the same time, the surprise seeing both of them should be the sum of the individual surprises. So,
\[h(x, y) = h(x) + h(y)\]

But, x and y are unrelated, hence also independent. So, $p(x, y) = p(x)p(y)$ and $h(x)$ should be logarithm of $p(x)$.

  2. $h(x)$ should be monotonic in the opposite direction to $p(x)$: $h(x)$ increases iff $p(x)$ decreases.

We obtain,

\[h(x) = -\log_b p(x)\]

Entropy: expected amount of information. Entropy H[x] of x: expected amount of information (i.e., amount of surprise) over all possible outcomes of x under distribution of x, which is p(x):

\[\mathbb H[x] = \mathbb E_{p(x)}[h(x)] = \sum_x p(x)h(x) = -\sum_x p(x)\log p(x)\]

Entropy can also be understood as a measure of uncertainty of a system.

Minimum and Maximum Entropy Problem. The minimum, $\mathbb H[p] = 0$, is attained when some $x_i$ has $p(x_i) = 1$ while $p(x_j) = 0$ for all other $x_j$. The maximum, however, is harder to find and more useful: it amounts to choosing the most unbiased distribution given what you know.

Solving the Maximum Entropy Problem using Lagrangian Constrained Optimization.

We aim to solve for the $p(x_i)$ (discrete case) that maximize the following Lagrangian \(\begin{align*} L(p, \lambda) &= f(p) + \lambda g(p) \\ &= -\sum_{x_i} p(x_i)\log p(x_i) + \lambda\left(\sum_i p(x_i) - 1 \right) \\ \end{align*}\)

Note that the constraint $g$ keeps the $p(x_i)$ consistent with the probability axioms (they sum to 1). Setting the derivatives to zero, we end up with $p(x_1)=p(x_2)=\dots=p(x_K)=\exp(\lambda-1)=1/K$. In other words, $\mathbb H[p]$ is maximized when $p(x)$ is the uniform (discrete) distribution.
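A quick numerical check of this (a sketch; `entropy` is our own helper, not from the course):

```python
import numpy as np

def entropy(p):
    # H[p] = -sum_i p_i log p_i, with 0 log 0 treated as 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

K = 4
print(entropy(np.full(K, 1 / K)))        # uniform: log(K) ~ 1.386, the maximum
print(entropy([0.7, 0.1, 0.1, 0.1]))     # less uniform: smaller entropy
print(entropy([1.0, 0.0, 0.0, 0.0]))     # deterministic: minimum entropy, 0
```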

Entropy for continuous variables. We can extend entropy to continuous variables; this extension is called the differential entropy.

\[\mathbb H[x] = \lim_{\Delta\to 0}\left(-\sum_i p(x_i)\Delta\log p(x_i) \right) = -\int_x p(x)\log p(x)dx\]

One can show using calculus of variations and Lagrange function that the distribution that maximizes the differential entropy is the Gaussian.

Cross Entropy. Cross-entropy compares two probability distributions: the true distribution $p(x)$ and the approximating distribution $q(x)$. It tells you how many “nats” (or “bits”) you need on average to encode data drawn from $p(x)$ using the coding scheme designed for $q(x)$.

\[\begin{align*} \mathbb H_q(p) &= \mathbb H(p, q) = -\sum_x p(x)\log q(x) \\ \mathbb H_p(p) &= \mathbb H[p] \end{align*}\]

Note that we use parentheses for cross entropy instead of square brackets; the subscript denotes the coding distribution and the argument denotes the data distribution. In machine learning, the data distribution corresponds to the true (unknown) distribution of the prediction outcome, while the coding distribution corresponds to the approximating distribution produced by the model.

Kullback-Leibler (KL) Divergence. The average additional amount of information (in nats) needed to specify the value $x$ when using the approximating distribution $q(x)$ instead of the true distribution $p(x)$:

\[\mathbb {KL}(p\,\|\,q) = \mathbb H_q(p) - \mathbb H[p] = \sum_x p(x)\log\frac{p(x)}{q(x)}\]

KL divergence can be interpreted as a measure of dissimilarity between two distributions $p(x)$ and $q(x)$.

KL Divergence and Likelihood. If the training data comes from the unknown distribution $p(x)$, we approximate $p(x)$ using some parametric distribution $q(x;\theta)$. $\theta$ can be determined by minimizing $\mathbb {KL}(p\,\|\,q)$, but $p(x)$ is unknown. Since we have training data $\{x_i\}$ drawn from $p(x)$, we can estimate the expectation as a finite sum over these points.
\[\begin{align*} \mathbb{KL}(p\,\|\,q) &= -\int p(x)\log\frac{q(x; \theta)}{p(x)} dx \\ &\approx \frac{1}{N}\sum_{i=1}^N\left(-\log\frac{q(x_i;\theta)}{p(x_i)}\right) = \frac{1}{N}\sum_{i=1}^N(-\log q(x_i;\theta) + \log p(x_i)) \end{align*}\]
Here $\log p(x_i)$ is independent of $\theta$, and $-\log q(x_i;\theta)$ is the negative log-likelihood. Therefore, minimizing $\mathbb{KL}(p\,\|\,q)$ is equivalent to maximizing the log-likelihood of $\theta$ under $q$. In deep learning, the KL divergence is used in variational autoencoders, generative adversarial networks, and diffusion models.
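A small numeric sketch of these identities for discrete distributions (the pmfs `p` and `q` below are made up for illustration): $\mathbb{KL}(p\,\|\,q)=\mathbb H_q(p)-\mathbb H[p]$, and a Monte-Carlo average of $-\log q(x_i)$ over samples drawn from $p$ estimates the cross-entropy, i.e. the negative log-likelihood term.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])     # "true" distribution
q = np.array([0.4, 0.4, 0.2])     # model / approximating distribution

entropy_p     = -np.sum(p * np.log(p))        # H[p]
cross_entropy = -np.sum(p * np.log(q))        # H_q(p): data from p, code for q
kl_pq         =  np.sum(p * np.log(p / q))    # KL(p || q)

print(np.isclose(kl_pq, cross_entropy - entropy_p))    # True

# Monte-Carlo view: with samples x_i ~ p, the average of -log q(x_i)
# estimates H_q(p), so minimizing it over the parameters of q
# maximizes the log-likelihood.
rng = np.random.default_rng(0)
samples = rng.choice(len(p), size=100_000, p=p)
nll = -np.mean(np.log(q[samples]))
print(nll, cross_entropy)                               # close to each other
```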

Linear Regression

Task. Predict continuous target t from input $x\in R^D$

Model.

\[y(x, w) = \sum_{j=0}^{M-1} w_j\phi_j(x) = w^T\phi(x)\]

Basis Functions. The model is linear in the weights, but not necessarily in the inputs.

Probabilistic Modeling. Assume noise in targets $t$ such that

\[t = y(x, w) + \epsilon,\quad \epsilon\sim\mathcal N(0, \sigma^2)\]

Where

\[y(x, w) = w^T\phi(x)\]

Probabilistically, we can write

\[p(t\mid x, w, \sigma^2) = \mathcal N(t\mid y(x, w), \sigma^2)\]

Given a dataset of inputs $X={x_1,\dots, x_N}$ with corresponding targets $t=(t_1,\dots,t_N)^T$ where data points are i.i.d. The likelihood of parameters $w, \sigma^2$ defined as

\[\begin{align*} L(w, \sigma^2\mid X, t)&=p(t\mid X,w,\sigma^2)=\prod_{i=1}^N\mathcal N(t_i\mid w^T\phi(x_i), \sigma^2) \\ LL(w, \sigma^2\mid X, t)&= \sum_{i=1}^N \log\mathcal N(t_i\mid w^T\phi(x_i), \sigma^2) \\ LL(w, \sigma^2\mid X, t)&= N\log \frac{1}{\sigma\sqrt{2\pi}} -\frac{1}{2}\sum_{i=1}^N \left(\frac{t_i-w^T\phi(x_i)}{\sigma}\right)^2 \\ LL(w, \sigma^2\mid X, t)&= -\frac{N}{2}\log (2\pi\sigma^2) - \frac{1}{\sigma^2}\left(\underbrace{\frac{1}{2}\sum_{i=1}^N ({t_i-w^T\phi(x_i)})^2}_\text{Sum of Squares Error}\right) \\ \end{align*}\]

If $\sigma^2$ is assumed to be fixed/constant, maximizing the log-likelihood above is equivalent to minimizing the sum-of-squares error.

The maximization of the log-likelihood in previous equation is done with respect to two parameters: $w$ and $\sigma^2$, leading to solutions for both defined as:

\[w_{MLE} = \arg\max_w\log p(t\mid X, w, \sigma^2)\quad \sigma^2_{MLE}=\arg\max_{\sigma^2} \log p(t\mid X, w, \sigma^2)\]

Both are called maximum likelihood estimators for the model. For homoskedastic linear regression model, $\sigma^2$ is actually assumed to be constant, leading to $w_{MLE}$ as a least-square solution to the sum-of-squares error.

Design Matrix. A matrix $X\in \mathbb R^{N\times D}$ for $N$ data points with feature dimension $D$. The basis function $\phi: \mathbb R^D\to \mathbb R^M$ is applied to each row of $X$ to yield a (row) vector of length $M$; stacking these rows gives $\Phi\in\mathbb R^{N\times M}$. The vector $\vec y=[y_i]_i$ is also commonly used together with the design matrix.

Maximizing the log likelihood. We solve this in closed form by setting the gradient to zero, working in the design-matrix form with $\Phi \in \mathbb R^{N\times M}$.

\[\begin{align*} &\frac{\partial}{\partial w}\left[- \frac{N}{2}\log (2\pi\sigma^2) - \frac{1}{\sigma^2}\left(\frac{1}{2}\sum_{i=1}^N ({t_i-w^T\phi(x_i)})^2\right)\right] \\ &= \frac{\partial}{\partial w}\left[- \frac{N}{2}\log (2\pi\sigma^2) - \frac{1}{2\sigma^2} (\vec t-\Phi w)^T(\vec t-\Phi w)\right] \\ &= -\frac{1}{2\sigma^2}\frac{\partial}{\partial w}(\vec t-\Phi w)^T(\vec t-\Phi w) \\ &= -\frac{1}{2\sigma^2}\cdot 2(\vec t-\Phi w)^T(-\Phi) \\ &= \frac{1}{\sigma^2}(\vec t-\Phi w)^T\Phi = 0 \\ &\iff (\vec t^T-w^T\Phi^T)\Phi = \vec t^T\Phi - w^T\Phi^T\Phi = 0 \\ &\iff \Phi^T\Phi w = \Phi^T \vec t \iff w = (\Phi^T\Phi)^{-1}\Phi^T \vec t \end{align*}\]

This is the normal equation: $w_{MLE} = (\Phi^T\Phi)^{-1}\Phi^T \vec t$. It can lead to numerical difficulties when $\Phi^T\Phi$ is close to singular, which can be addressed using the SVD or with sequential/online stochastic gradient descent algorithms.
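A minimal sketch of the normal equation in code (assuming a polynomial design matrix on toy data, with names of our own choosing); in practice an SVD-based solver such as `np.linalg.lstsq` or the pseudo-inverse is preferable to explicitly inverting $\Phi^T\Phi$ when it is near-singular:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)

Phi = np.vander(x, M, increasing=True)              # design matrix, shape (N, M)

# Normal equation: w = (Phi^T Phi)^{-1} Phi^T t
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Numerically safer alternatives when Phi^T Phi is near-singular
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # SVD-based least squares
w_pinv = np.linalg.pinv(Phi) @ t                    # Moore-Penrose pseudo-inverse

print(np.allclose(w_normal, w_lstsq), np.allclose(w_lstsq, w_pinv))
```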

Regularized Least Square Solving for Overfitting Problem. We change the objective to minimize for

\[\begin{align*} &E_D(w) + \lambda E_W(w) \\ &=\frac{1}{2}\sum_i (t_i-w^T\phi(x_i))^2 + \frac{\lambda}{2}w^Tw \\ \end{align*}\]

We use the $\ell_2$ regularizer for $E_W$. This method, also called weight decay, allows the model to shrink and effectively turn off some of the parameters $w$.
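The regularized objective also has a closed form, $w = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T \vec t$. A short sketch (using a hypothetical high-degree polynomial basis so the effect of the regularizer is visible):

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    # Minimizes 1/2 ||t - Phi w||^2 + (lam/2) ||w||^2   (weight decay)
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)
Phi = np.vander(x, 10, increasing=True)             # high-degree polynomial basis

# Larger lambda shrinks the weight vector toward zero, controlling overfitting
for lam in (0.0, 1e-3, 1e-1, 10.0):
    print(lam, np.linalg.norm(ridge_solution(Phi, t, lam)))
```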

Multiple Regression. Conditional distribution of the $K$-dimensional target vector $\vec t$.

\[\begin{align*} p(\vec t\mid x, W, \sigma^2) &= \mathcal N(\vec t\mid W^T\phi(x), \sigma^2 I) \\ &= \frac{1}{\sqrt{(2\pi)^K\vert\sigma^2 I\vert}}\exp\left(-\frac{1}{2}(\vec t - W^T\phi(x))^T(\sigma^2 I)^{-1}(\vec t - W^T\phi(x))\right) \end{align*}\]

The above Gaussian is assumed isotropic: equal variance in each dimension (hence equal density in every direction) and all dimensions independent. Its density contours are therefore spherical and axis-aligned.

Decision Theory

Approximating a Distribution with a Decision. The probabilistic model $p(t\mid x, w_{ML}, \sigma^2_{ML}) = \mathcal N(t\mid y(x, w_{ML}), \sigma^2_{ML})$ is a predictive distribution expressing uncertainty over the value of $t$ for some new input $x$. But often we need to predict a specific value for $t$, rather than returning the whole distribution, especially if we must take a specific action. For example, if our aim is to determine the optimal level of radiation to use for treating a tumor and the model predicts a probability distribution over the dose, then we need to use that distribution to determine the specific dose to administer. We formalize this using decision theory.

Formal Objective. Suppose we choose $f(x)$ as our prediction when the true value is $t$. This choice incurs a penalty/cost given by a loss function $L(t, f(x))$. We define the objective as the expected loss

\[\mathbb E[L] = \int\int L(t, f(x))\,p(x, t)\,dx\,dt\]

Example of $L(t, f(x))$ is squared loss.

\[L(t, f(x)) = (f(x)-t)^2\]

Choosing the decision function $f(x)$. Minimizing the expected loss above with the squared loss, using calculus of variations, we obtain

\[f^*(x) = \int tp(t\mid x)\,dt=\mathbb E_t[t\mid x]\]

$f^*(x) = \mathbb E_t[t\mid x]$ is called the regression function.

This is just the conditional expectation of $t$ given $x$. For a Gaussian conditional distribution, this becomes the following.

\[\mathbb E[t\mid x] = \int tp(t\mid x)\,dt = y(x, w)\]

Bias Variance Tradeoff

Derivation of Bias Variance Tradeoff.

\[\mathbb E[L] = \int (f(x)-\mathbb E[t\mid x])^2p(x) dx + \int var(t\mid x)p(x)dx\]

Note that the term $\mathbb E[t\mid x]$ is just one choice of the regression function. The second term is independent of $f(x)$ and arises from the intrinsic noise of the data; it represents the minimum achievable value of the expected loss.

Consider the first integrand for a particular dataset $D$: $(f(x; D) - h(x))^2$, where we write $h(x)=\mathbb E[t\mid x]$. We run experiments over an ensemble of datasets, so we can take the average of this quantity over many possible datasets.

Expanding the average produces three terms, and the cross term vanishes in expectation over datasets. The first term is the squared bias of the model; the second term is the variance of the model's predictions across datasets.
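Explicitly, inserting and subtracting $\mathbb E_D[f(x;D)]$ inside the square gives

\[\begin{align*} \mathbb E_D\big[(f(x;D) - h(x))^2\big] &= \mathbb E_D\big[(f(x;D) - \mathbb E_D[f(x;D)] + \mathbb E_D[f(x;D)] - h(x))^2\big] \\ &= \underbrace{(\mathbb E_D[f(x;D)] - h(x))^2}_{\text{bias}^2} + \underbrace{\mathbb E_D\big[(f(x;D) - \mathbb E_D[f(x;D)])^2\big]}_{\text{variance}} \end{align*}\]

since the cross term is proportional to $\mathbb E_D\big[f(x;D) - \mathbb E_D[f(x;D)]\big] = 0$.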

We end up with expected loss $= \text{bias}^2 + \text{variance} + \text{irreducible noise}$.

Classification

Task. Predicting a class label $y\in\{1,\dots, K\}$ from input $x\in R^D$.

Three approaches to solving the classification task.

  1. Discriminant function: directly assigns $x$ to a specific class.
  2. Discriminative probabilistic model: train the conditional distribution $p(C_k\mid x)$ directly using a training set, and then use the model to decide the class of $x$.
  3. Generative probabilistic model: model the class-conditional density $p(x\mid C_k)$ and the class prior $p(C_k)$, then compute the posterior $p(C_k\mid x)$ via Bayes theorem; after that, use the posterior distribution to decide the class of $x$.

Maximum Likelihood on the Classification Task. The likelihood is defined the same way for regression and classification; all that matters is how we define the model. Suppose we model the classification task as below.

\[p(y = c\mid x; w) = f_c(x; w)\]

where $f_c(x; w)$ is a parametric function. The likelihood then takes the form

\[\begin{align*} L(w\mid X, y) = \prod_{i=1}^N p(y_i = c_i\mid x_i; w) \\ LL(w\mid X, y) = \sum_{i=1}^N \log p(y_i = c_i\mid x_i; w) \end{align*}\]

Linear discriminant function: Two-class case. Suppose we model

\[y(x) = w^Tx + w_0\]

The decision to assign $x$ into class $\hat y$ is

\[\hat y = \begin{cases} C_1\quad\text{if } y(x) \ge 0\\ C_2\quad\text{otherwise} \end{cases}\]

Decision boundary corresponds to the $y(x)=0$, $D-1$ dimensional hyperplane within $D$ dimensional input space.

Decision boundary is orthogonal to $w$. Under the same model, we can show that for any two points $x_0, x_1$ lying on the decision boundary (i.e., $y(x_0) = y(x_1) = 0$), the difference vector $(x_0-x_1)$ is orthogonal to $w$.

\[\begin{align*} x_0, x_1 : y(x_0) = y(x_1) = 0 \\ w^Tx_0 = w^Tx_1 \implies w^T(x_0 - x_1) = 0 \\ \langle w, x_0-x_1\rangle = 0 \end{align*}\]

where $\langle \cdot, \cdot \rangle$ is the standard dot product. This property holds for any two points $x_0, x_1$ on the decision boundary, so $w$ is normal to the boundary. Consequently, any $x$ can be written as its orthogonal projection onto the boundary plus $r\frac{w}{\Vert w\Vert}$, where $r = y(x)/\Vert w\Vert$ is the signed distance from $x$ to the boundary.

Multiple classes. The trick above (defining a decision region by $y_c(x)>0$) draws a single one-versus-rest boundary for each class $c$, which can leave ambiguously classified regions; instead, we can use $K$ discriminants $y_k(x)$ and assign $x$ to the class with the largest $y_k(x)$.

Using Least Squares to find $W$. For a one-hot target $t_i\in R^K$ with prediction $W^T\phi(x_i)\in R^K$, we stack rows into matrix form: $T=[t_i^T]_i\in R^{N\times K}$, $X=[\phi(x_i)^T]_i\in R^{N\times D}$, and $W=[w_1,\dots,w_K]\in R^{D\times K}$ (the class weight vectors stacked as columns).

\[\begin{align*} E_D(W) &= \frac{1}{2}\sum_i^N \Vert W^T\phi(x_i) - t_i \Vert^2\\ &= \frac{1}{2}\sum_i^N \sum_j^K \left((W^T\phi(x_i))_j - t_{ij}\right)^2\\ &= \frac{1}{2}\sum_i^N \sum_j^K (XW-T)_{ij}^2\\ &= \frac{1}{2}{\Vert XW-T \Vert^2_F} \\ &= \frac{1}{2} \operatorname{Tr}\left((XW-T)^T(XW-T)\right) \\ \end{align*}\]

Decision Theory

Prediction and Action. Prediction in supervised learning is modeled probabilistically as a posterior distribution $p(y\mid x)$. Intuitively, our follow up action for prediction is to choose the value of $y$ that makes $p(y\mid x)$ the highest. But is this mathematically justified?

Mathematical Justification. We assume an agent/decision maker has a set of possible actions $a\in\mathcal A$. Each action $a$ has costs and benefits that depend on the (hidden) state of nature $y$. These dependencies are modeled by a loss function $\ell(y, a)$ giving the loss incurred if the agent takes action $a$ when the state of nature is $y$.

Posterior Expected Risk. We define the posterior expected risk as

\[R(a\mid x) = \mathbb E_{p(y\mid x)}[\ell(y, a)] = \sum_{y\in\mathcal Y} p(y\mid x)\ell(y, a)\]

Optimal Policy (Bayes Estimator). The optimal policy is the one that minimizes the posterior expected risk.

\[a^* = \arg\min_{a\in\mathcal A} R(a\mid x)\]

Alternative: Use a utility function (Maximum Expected Utility Principle). Define $U(y, a)$ expressing the desirability of each possible action-state pair, then seek the action that maximizes expected utility.

\[a^* = \arg\max_{a\in\mathcal A} \mathbb E_{p(y\mid x)}[U(y, a)]\]

Decision on Classification, Zero-one loss for Binary Classification. Assign loss 1 if action is wrong, 0 otherwise.

\[\ell_{zo}(y, a) = \mathbb I\{y\ne a\}\]

| | $y=0$ | $y=1$ |
|---|---|---|
| $a=0$ | 0 | 1 |
| $a=1$ | 1 | 0 |

Posterior expected loss for each action $a$ is

\[\begin{align*} R(a\mid x) &= \sum_{y\in\mathcal Y} p(y\mid x)\ell_{zo}(y, a) \\ &= \begin{cases} p(y=1\mid x) & \text{if } a=0 \\ p(y=0\mid x) & \text{if } a=1 \end{cases} \\ &= p(y\neq a\mid x) = 1-p(y=a\mid x) \end{align*}\]

Optimal policy is

\[\begin{align*} a^* &= \arg\min_{a\in\{0,1\}} R(a\mid x) \\ &= \arg\min_{a\in\{0,1\}} 1-p(y=a\mid x) \\ &= \arg\max_{a\in\{0,1\}} p(y=a\mid x) \end{align*}\]

Maximum a posteriori (MAP) estimate. This maximization decision rule corresponds to the mode of the posterior distribution. Specifically, for the binary case:

\[a^* = \begin{cases} 0&\text{if }p(y=0\mid x)\ge p(y=1\mid x) \\ 1&\text{otherwise} \end{cases}\]

Cost Sensitive Classification. Instead of the zero-one loss, we may have the following cost matrix, where $\ell_{ij}$ is the cost of predicting $j$ when the true label is $i$.

| True label \ Predicted | $a = 0$ | $a = 1$ |
|---|---|---|
| $y = 0$ | $\ell_{00}$ | $\ell_{01}$ |
| $y = 1$ | $\ell_{10}$ | $\ell_{11}$ |
\[\begin{align*} R(a\mid x) &= \sum_{y\in\mathcal Y} p(y\mid x)\ell_{y a} \\ R(a = 0\mid x) &= \ell_{00}p(y=0\mid x) + \ell_{10}p(y=1\mid x) \\ R(a = 1\mid x) &= \ell_{01}p(y=0\mid x) + \ell_{11}p(y=1\mid x) \end{align*}\]

Decision rule is to pick $a$ that minimizes the posterior expected loss:

\(\begin{align*} a^* &= \arg\min_{a\in\{0,1\}} R(a\mid x) \\ &= \min\left\{\underbrace{\ell_{00}p(y=0\mid x) + \ell_{10}p(y=1\mid x)}_{a=0},\;\underbrace{\ell_{01}p(y=0\mid x) + \ell_{11}p(y=1\mid x)}_{a=1}\right\}\\ \end{align*}\) which turns into the following decision rule (comparing the two risks)

\[\begin{align*} \ell_{00}p(y=0\mid x) + \ell_{10}p(y=1\mid x) &\dots \ell_{01}p(y=0\mid x) + \ell_{11}p(y=1\mid x) \\ (\ell_{00}-\ell_{01})p(y=0\mid x) + (\ell_{10}-\ell_{11})p(y=1\mid x) &\dots 0 \\ (\ell_{00}-\ell_{01})p(y=0\mid x) &\dots (\ell_{11}-\ell_{10})p(y=1\mid x) \\ \frac{p(y=0\mid x)}{p(y=1\mid x)} &\dots \frac{\ell_{11}-\ell_{10}}{\ell_{00}-\ell_{01}} \\ a = \begin{cases} 0&\text{if }\frac{p(y=0\mid x)}{p(y=1\mid x)}\ge\frac{\ell_{11}-\ell_{10}}{\ell_{00}-\ell_{01}} \\ 1&\text{otherwise} \end{cases} \end{align*}\]
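A tiny sketch of this rule in code (the loss-matrix values and posteriors below are made-up examples): compute $R(a\mid x)$ for each action and take the argmin.

```python
import numpy as np

# loss[y, a]: cost of taking action a when the true label is y
loss = np.array([[0.0, 1.0],    # y = 0: l_00, l_01
                 [5.0, 0.0]])   # y = 1: l_10, l_11 (missing a positive is 5x worse)

def bayes_action(posterior, loss):
    # posterior: p(y | x) as a length-2 vector; returns argmin_a R(a | x)
    risk = posterior @ loss      # R(a|x) = sum_y p(y|x) * loss[y, a]
    return int(np.argmin(risk)), risk

print(bayes_action(np.array([0.7, 0.3]), loss))   # picks 1 although p(y=0|x) > p(y=1|x)
print(bayes_action(np.array([0.95, 0.05]), loss)) # picks 0 when the posterior is confident
```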

Reject Option Classification. Sometimes we want to say “I don’t know” instead of returning an answer (called the reject option). The loss will then become

\[\ell(y, a) = \begin{cases} 0&\text{if }a=y\\ \lambda_r&\text{if }a=\text{reject}\\ \lambda_e&\text{otherwise} \end{cases}\]

The optimal decision rule that minimizes the posterior risk is (proof left as an exercise):

\[a^* = \begin{cases} \arg\max\limits_{y\in\{1,\dots,C\}} p(y|x) & \text{if } \max\limits_{y\in\{1,\dots,C\}} p(y|x) > \lambda = 1 - \frac{\lambda_r}{\lambda_e} \\ \text{reject} & \text{otherwise} \end{cases}\]

Evaluation Metrics

Confusion Matrix. We can define the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) with respect to a threshold $\tau$:

\[\begin{align*} TP &= \sum_{i=1}^N \mathbb I\{y_i = 1, \hat y_i = 1\} \quad TN &= \sum_{i=1}^N \mathbb I\{y_i = 0, \hat y_i = 0\} \\ FP &= \sum_{i=1}^N \mathbb I\{y_i = 0, \hat y_i = 1\} \quad FN &= \sum_{i=1}^N \mathbb I\{y_i = 1, \hat y_i = 0\} \end{align*}\]

Thus we can define the following confusion matrix:

| Truth \ Prediction | 0 | 1 | Row sum |
|---|---|---|---|
| 0 | TN | FP | $N$ |
| 1 | FN | TP | $P$ |
| Col sum | $\hat N$ | $\hat P$ | $N + P$ |

Metric from Confusion Matrix.

\[\begin{align*} \text{Accuracy} &= \frac{TP + TN}{N + P} = \frac{TP + TN}{TP + TN + FP + FN} \\ \end{align*}\]

Metrics w.r.t. rowwise normalization (i.e., truths).

\[\quad TNR_\tau = \frac{TN_\tau}{N_\tau} = \frac{TN_\tau}{FP_\tau + TN_\tau} = p(\hat y = 0|y = 0; \tau)\] \[FPR_\tau = \frac{FP_\tau}{N_\tau} = \frac{FP_\tau}{FP_\tau + TN_\tau} = p(\hat y = 1|y = 0; \tau)\] \[FNR_\tau = \frac{FN_\tau}{P_\tau} = \frac{FN_\tau}{TP_\tau + FN_\tau} = p(\hat y = 0|y = 1; \tau)\] \[TPR_\tau = \frac{TP_\tau}{P_\tau} = \frac{TP_\tau}{TP_\tau + FN_\tau} = p(\hat y = 1|y = 1; \tau)\]

Metrics w.r.t. columnwise normalization (i.e., predictions).

\[NPV_\tau = \frac{TN_\tau}{\hat N_\tau} = \frac{TN_\tau}{TN_\tau + FN_\tau} = p(y = 0|\hat y = 0; \tau)\] \[FOR_\tau = \frac{FN_\tau}{\hat N_\tau} = \frac{FN_\tau}{TN_\tau + FN_\tau} = p(y = 1|\hat y = 0; \tau)\] \[FDR_\tau = \frac{FP_\tau}{\hat P_\tau} = \frac{FP_\tau}{FP_\tau + TP_\tau} = p(y = 0|\hat y = 1; \tau)\] \[Prec_\tau = PPV_\tau = \frac{TP_\tau}{\hat P_\tau} = \frac{TP_\tau}{FP_\tau + TP_\tau} = p(y = 1|\hat y = 1; \tau)\]

Choosing from a number of different thresholds. With ROC curves, we can consider a set of different thresholds and compare the resulting performance. For each threshold $\tau$, we plot $FPR_\tau$ versus $TPR_\tau$ as a point and draw a curve connecting them. This gives the receiver operating characteristic (ROC) curve.

ROC curve is insensitive to changes in class distribution. Changes in the proportion of positive to negative instances in the (test) set do not cause change to the curve.

Precision-Recall (PR) curve.

F-scores. The $F_\beta$ score is a combined statistic of precision (Prec) and recall (Rec). It is useful for looking at a single threshold. Unlike the ROC curve, the F-score is affected by changes in the class distribution.
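A sketch of computing these counts and the threshold-dependent metrics from predicted scores (synthetic labels and scores, with helper names of our own):

```python
import numpy as np

def binary_metrics(y_true, scores, tau):
    y_hat = (scores >= tau).astype(int)
    TP = np.sum((y_true == 1) & (y_hat == 1))
    TN = np.sum((y_true == 0) & (y_hat == 0))
    FP = np.sum((y_true == 0) & (y_hat == 1))
    FN = np.sum((y_true == 1) & (y_hat == 0))
    tpr = TP / (TP + FN)                                   # recall / sensitivity
    fpr = FP / (FP + TN)
    prec = TP / (TP + FP) if TP + FP > 0 else 0.0
    f1 = 2 * prec * tpr / (prec + tpr) if prec + tpr > 0 else 0.0
    return tpr, fpr, prec, f1

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
scores = np.clip(0.7 * y + rng.normal(scale=0.3, size=1000), 0, 1)  # noisy scores

# Sweeping tau traces out the ROC curve (one (FPR, TPR) point per threshold)
for tau in (0.3, 0.5, 0.7):
    print(tau, binary_metrics(y, scores, tau))
```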

Classification Modeling

Three approaches for training and inference.

  1. Generative Models. Train class conditional densities $p(x\mid C_k)$ for each class $C_k$ individually. Then use Bayes theorem to compute the posterior probabilities:
\[p(C_k\mid x) = \frac{p(x\mid C_k)p(C_k)}{p(x)} = \frac{p(x\mid C_k)p(C_k)}{\sum_{i=1}^C p(x\mid C_i)p(C_i)}\]

“Generative” because by sampling from them, it is possible to generate synthetic data points in the input space.

  2. Discriminative Models. Train the posterior class probabilities $p(C_k\mid x)$ directly. Then, use decision theory to assign each new $x$ to one of the classes.

  3. Discriminant Functions. Find a discriminant function $f(x)$ that maps each input $x$ directly onto a class label, e.g., for binary classification $f$ might be binary valued such that $f = 0$ represents class $C_1$ and $f = 1$ represents class $C_2$.

Why compute the posterior instead of a discriminant function directly?

Discriminative Classifiers, Logistic Regression. Intuitively, we classify

\[f(x) = \begin{cases} 1&\text{if }w^Tx + b > 0 \\ 0&\text{otherwise} \end{cases} = \mathbb I\left(\log \frac{p(y=1\mid x)}{p(y=0\mid x)}>0\right)\]

The equation above assumes a linear decision boundary in the input space. Equivalently, we can formulate $f(x)$ through the log-odds.

\[\begin{align*} & w^Tx + b = \log\underbrace{\frac{p(y=1\mid x)}{p(y=0\mid x)}}_{\text{Log Odds}} \\ &\iff p(y=1\mid x) = \frac{1}{1+\exp(-(w^Tx+b))} \end{align*}\]

The aforementioned model is called (binary) logistic regression

\[p(y=1\mid x) = \sigma(w^Tx+b)\]

MLE for Logistic Regression. Writing $a_n = w^Tx_n + b$ for the logit of example $n$, the negative log-likelihood is \(\begin{align*} NLL(w) &= -\frac{1}{N} \log p(D|w) = -\frac{1}{N} \log \prod_{n=1}^N Ber(y_n|\sigma(a_n)) \\ &= -\frac{1}{N} \sum_{n=1}^N \log[\sigma(a_n)^{y_n} \cdot (1-\sigma(a_n))^{1-y_n}] \\ &= -\frac{1}{N} \sum_{n=1}^N [y_n \log \sigma(a_n) + (1-y_n) \log(1-\sigma(a_n))] \\ &= \frac{1}{N} \sum_{n=1}^N H(y_n,\sigma(a_n)) \\ \nabla_w NLL(w) &= \frac{1}{N} \sum_{i=1}^N (\sigma(a_i)-y_i)x_i \\ \end{align*}\) where $H(\cdot,\cdot)$ denotes the binary cross-entropy.
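A minimal gradient-descent sketch of this MLE (assuming the bias is folded into $w$ by appending a constant feature; the data and hyperparameters below are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logreg(X, y, lr=0.1, iters=2000):
    # X: (N, D) with a constant 1-column appended so the bias lives inside w
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iters):
        p = sigmoid(X @ w)                 # sigma(a_n)
        grad = X.T @ (p - y) / N           # (1/N) sum_n (sigma(a_n) - y_n) x_n
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)    # synthetic labels
Xb = np.hstack([X, np.ones((N, 1))])               # append bias feature
w = fit_logreg(Xb, y)
acc = np.mean((sigmoid(Xb @ w) > 0.5) == y)
print(w, acc)
```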

Cons that occur on linearly separable datasets: the maximum of the likelihood is not attained at any finite $w$; training keeps increasing $\Vert w\Vert$ and the predictions become overconfident, so regularization or early stopping is needed.

Multinomial Softmax. We generalize to multiple classes $C=\{1, 2, 3, \dots\}$ based on the following idea

\[f_c(x; w_c) = \mathbb I(p(y=c\mid x)>p(y=1\mid x))\]

Long story short, this implies

\[p(y=k\mid x) = \frac{\exp(w_k^Tx)}{\sum_{i=1}^C \exp(w_i^Tx)}\]

Note that every class $k$ has its own weight vector $w_k$. The value of $p(y = c\mid x)$ is invariant under translating all the logits $w_k^Tx$ by the same constant.

\[\mathcal S(Wx) = \left[\frac{e^{w_1^Tx}}{\sum_{i=1}^C e^{w_i^Tx}}, \dots, \frac{e^{w_C^Tx}}{\sum_{i=1}^C e^{w_i^Tx}}\right]^T\]

Since $\mathcal S(Wx)$ is a vector of C probability values, softmax regression is based on a categorical distribution M whose parameters are precisely those C probability values:

\[p(y\mid x; \theta) = \mathcal M(y\mid \theta), \theta=\mathcal S(Wx)\]

Log-Sum-Exp Trick. When working with the softmax distribution, we often want to compute the normalized probability. Define $a_i=w_i^Tx$.

\(p(y=c\mid x) = \frac{\exp(a_c)}{\sum_{i=1}^C \exp(a_i)} = \frac{\exp(a_c)}{Z(a)}\) where $Z(a)=\sum_{i=1}^C\exp(a_i)$ is the normalizer, shared across all $c$.

Naive computation of this expression may not be numerically stable: the exponentials can overflow or underflow.

Let's derive a numerically better version.

\[\begin{align*} p(y=c\mid x) &= \frac{\exp(a_c)}{\sum_{i=1}^C \exp(a_i)} = \frac{\exp(a_c)}{ \exp \log \sum_{i=1}^C \exp(a_i)} \\ &= \exp\left(a_c - \log \sum_{i=1}^C \exp(a_i)\right) \\ &= \exp\left(a_c - \log (\exp(m)\sum_{i=1}^C \exp(a_i-m))\right) \\ &= \exp\left(a_c - m + \log \sum_{i=1}^C \exp(a_i-m)\right) \end{align*}\]

We choose $m = \max_i a_i$ so that the largest exponent is zero (its exponential is exactly one). Even if some of the other terms underflow, the answer is still sensible.
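The trick in code (a sketch of numerically stable softmax/log-softmax obtained by subtracting the maximum logit):

```python
import numpy as np

def log_softmax(a):
    # log p(y=c) = a_c - m - log(sum_i exp(a_i - m)),  with m = max_i a_i
    m = np.max(a)
    return a - m - np.log(np.sum(np.exp(a - m)))

def softmax(a):
    return np.exp(log_softmax(a))

a = np.array([1000.0, 1001.0, 1002.0])        # naive exp(a) overflows to inf
print(softmax(a))                              # [0.0900, 0.2447, 0.6652]

with np.errstate(over='ignore', invalid='ignore'):
    print(np.exp(a) / np.sum(np.exp(a)))       # nan (inf/inf) without the trick
```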

Deep Neural Networks

Motivation

Motivation. When one uses linear models, one strongly assumes that the input-output mapping is linear. The model can be made more flexible by applying non-linear feature transformations, but hand-crafting such transformations is very restrictive and suffers from the curse of dimensionality. Fortunately, real data does not occupy all regions of the input space.

Data-dependent basis functions.

Hierarchical Non Linearity.

Multilayer Perceptron Formulation.

\[z_j^{(\ell)}= h(a_j^{(\ell)}) = h\left(\sum_{i=1}^{n_{\ell-1}} w_{ij}^{(\ell)} z_i^{(\ell-1)} + b_j^{(\ell)}\right), \qquad z^{(\ell)} = h^{(\ell)}\left(W^{(\ell)}z^{(\ell-1)} + b^{(\ell)}\right)\]
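A sketch of a two-layer forward pass following this formulation (randomly initialized weights, tanh hidden activation, linear outputs; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(z_prev, W, b, h=np.tanh):
    # z^(l) = h(W^(l) z^(l-1) + b^(l)), applied row-wise to a batch
    return h(z_prev @ W + b)

# Layer sizes: 3 inputs -> 5 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(3, 5)) * 0.5, np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)) * 0.5, np.zeros(2)

x = rng.normal(size=(4, 3))                  # batch of 4 inputs
z1 = layer(x, W1, b1)                        # hidden activations
y = layer(z1, W2, b2, h=lambda a: a)         # linear output layer
print(y.shape)                               # (4, 2)
```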

DNN as universal approximator, no free lunch theorem, and inductive bias

Heteroskedastic and Homoskedastic non linear regression.

Representation, Transfer, Multitask, and Meta Learning.

Contrastive Learning. Learn a representation such that positive input pairs are close and negative input pairs are far apart in embedding space, so the error function is defined w.r.t. pairs of inputs. An example is the InfoNCE loss.

\[E(w) = -\log\frac{\exp(f_w(x)^Tf_w(x^+))}{\exp(f_w(x)^Tf_w(x^+))+\sum_{i=1}^N \exp(f_w(x)^Tf_w(x_i^-))}\]
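A sketch of this loss for a single anchor (the embedding vectors below are placeholders standing in for $f_w(\cdot)$ outputs), using log-sum-exp for numerical stability:

```python
import numpy as np

def info_nce(anchor, positive, negatives):
    # E(w) = -log exp(a.p) / (exp(a.p) + sum_i exp(a.n_i))
    pos_logit = anchor @ positive
    neg_logits = negatives @ anchor
    logits = np.concatenate([[pos_logit], neg_logits])
    m = logits.max()                                   # log-sum-exp trick
    log_denom = m + np.log(np.sum(np.exp(logits - m)))
    return -(pos_logit - log_denom)

rng = np.random.default_rng(0)
d = 8
anchor = rng.normal(size=d)
positive = anchor + 0.1 * rng.normal(size=d)           # close to the anchor
negatives = rng.normal(size=(16, d))                   # unrelated samples
print(info_nce(anchor, positive, negatives))           # small loss: positive is close
```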

Vector Calculus

Univariate, Multivariate, and Vector-Valued Functions.

\[f: \mathbb R^D\to\mathbb R^K\]

An example of a multivariate real-valued function is $f(\mathbf x) = \mathbf x^T\mathbf x$ with $\mathbf x=(x_1, x_2)$, specified as

\[\begin{align*} f: \mathbb R^2\to\mathbb R \\ \mathbf x\to x_1^2+x_2^2 \end{align*}\]

Difference Quotient and Derivative of a Univariate Real-Valued Function. For a univariate real-valued function $f(x)$, the difference quotient is given by

\[\frac{\delta f}{\delta x} = \frac{f(x+\delta x) - f(x)}{\delta x}\]

while the derivative is given by

\[\frac{df}{dx} = \lim_{\delta x\to 0} \frac{f(x+\delta x) - f(x)}{\delta x}\]

If the limit exists, then $f$ is differentiable at $x$, and the derivative is the slope of the tangent to $f$ at $x$. The derivative of $f$ at $x$ points in the direction of steepest ascent of $f$.

Partial Derivatives of a Multivariate Real-Valued Function. For a multivariate real-valued function $f(\mathbf x)$, the partial derivative with respect to $x_i$ is given by

\[\begin{align*} &f: \mathbb R^D\to \mathbb R \\ &\frac{\partial f}{\partial x_1} = \lim_{\delta \to 0} \frac{f(x_1+\delta, x_2, \dots, x_D) - f(\mathbf x)}{\delta} \\ &\frac{\partial f}{\partial x_i} = \lim_{\delta \to 0} \frac{f(x_1, \dots, x_i+\delta, \dots, x_D) - f(\mathbf x)}{\delta} \\ \end{align*}\]

The gradient (which for a real-valued $f$ coincides with the Jacobian) collects all partial derivatives of $f$ into a row vector.

\[\nabla_{\bf x} f = \frac{df}{d\bf x} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_D}\right]\]

Basic rules of partial differentiation

\[\begin{align*} \frac{\partial}{\partial\mathbf x} (f(x)+g(x)) &= \frac{\partial f}{\partial\mathbf x} + \frac{\partial g}{\partial\mathbf x} \\ \frac{\partial}{\partial\mathbf x} (f(x)g(x)) &= \frac{\partial f}{\partial\mathbf x}g(x) + f(x)\frac{\partial g}{\partial\mathbf x} \\ \frac{\partial}{\partial\mathbf x}(g\circ f)(\mathbf x) = \frac{\partial}{\partial\mathbf x}(g(f(\mathbf x))) &= \frac{\partial g}{\partial f}\frac{\partial f}{\partial\mathbf x} \end{align*}\]

Compact writing of chain rule as matrix multiplication only makes sense if the gradient is defined as a row vector. Otherwise, we will need to introduce transposes, which complicates the operation, especially when we are dealing with tensors of higher dimensions.

Vector-Valued Functions. We generalize to $f:\mathbb R^n\to\mathbb R^m$.

\[\begin{align*} f&: \mathbb R^n\to\mathbb R^m \\ \frac{\partial\mathbf f}{\partial x_i} &= \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix}, \qquad \frac{d\mathbf f}{d\mathbf x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \end{align*}\]

The partial derivative of $\mathbf f$ w.r.t. a single coordinate $x_i$ is a column vector; collecting these columns over all $i$ gives the Jacobian matrix $J_{ij}=\frac{\partial f_i}{\partial x_j}$, whose rows are the gradients of the component functions $f_i$.

Numerator Layout vs Denominator Layout. What we use before is numerator layout where i-th output function $f_i$ corresponds to the i-th row of the jacobian $\frac{d\mathbf f}{d\mathbf x}$ output matrix. Some authors use denominator layout which require transpose operation.

In Summary.

\[\begin{align*} f&:\mathbb R\to\mathbb R, \frac{df}{dx} \in R \\ f&:\mathbb R^n\to\mathbb R, \nabla_x f=\frac{df}{dx} \in R^{1\times n} \\ f&:\mathbb R\to\mathbb R^m, \frac{d\mathbf f}{dx} \in R^{m\times 1} \\ f&:\mathbb R^n\to\mathbb R^m, \nabla_x\mathbf f \in R^{m\times n} \\ \end{align*}\]

In general, for $f: \mathbb R^n\to\mathbb R^m$, the derivative is in the dimension of output times input, $\frac{d\mathbf f}{d\mathbf x} \in R^{m\times n}$.

Gradients of Matrices. Gradient of matrices w.r.t. vectors (or other matrices) are multidimensional tensors. The rule is the same, output dimension times input dimension.

\[\begin{align*} A\in \mathbb R^{m\times n}, B\in \mathbb R^{p\times q} \\ J = \frac{d\mathbf A}{d\mathbf B} \in \mathbb R^{(m\times n)\times(p\times q)}, J_{ijkl} = \frac{\partial A_{ij}}{\partial B_{kl}} \end{align*}\]

Flattening Approach to Gradients of Matrices. Given a matrix $A \in \mathbb{R}^{m \times n}$, flatten it into a vector $\text{vec}(A) \in \mathbb{R}^{mn}$. Compute the gradient $\frac{d\text{vec}(A)}{d\text{vec}(B)}$ where $B \in \mathbb{R}^{p \times q}$, and reshape the resulting matrix back into the tensor shape $(m\times n)\times(p\times q)$.

\[\begin{align*} A\in \mathbb R^{m\times n}, B\in \mathbb R^{p\times q} \\ \text{vec}(A) \in \mathbb R^{mn}, \text{vec}(B) \in \mathbb R^{pq} \\ J = \frac{d\text{vec}(A)}{d\text{vec}(B)} \in \mathbb R^{mn\times pq} \rightarrow_{\text{reshape}} \mathbb R^{(m\times n)\times(p\times q)} \end{align*}\]

Kronecker product. Allows two matrices of arbitrary dimension to be multiplied. Given matrices $A \in R^{M\times N}$, $B \in R^{P\times Q}$, its Kronecker product $A \otimes B$ is a $(MP \times NQ)$-matrix:

\[\begin{align*} A \otimes B = \begin{bmatrix} A_{11}B & \cdots & A_{1N}B \\ \vdots & \ddots & \vdots \\ A_{M1}B & \cdots & A_{MN}B \end{bmatrix} \end{align*}\]

Vector by Matrix Derivative using Vectorization and Kronecker Product.

Kronecker Product for Tensor Derivative.

Forward and Backward Propagation

Forward Autodiff. Good for functions with few inputs and many outputs.

\[\begin{align*} \bar v_k &= \frac{\partial{v_k}}{\partial{x_c}},\text{ for all }k\text{ for some }c \\ v_i \text{ connected to } v_j &\implies \bar{v_i} = \frac{\partial v_i}{\partial x_c} = \frac{\partial v_i}{\partial v_j}\frac{\partial v_j}{\partial x_c} = \frac{\partial v_i}{\partial v_j}\bar v_j \end{align*}\]

Backward Autodiff. Good for functions with many inputs and few outputs.

\[\begin{align*} \bar v_k &= \frac{\partial{f}}{\partial{v_k}},\text{ for all }k \\ v_i \text{ connected to } v_j &\implies \bar{v_i} = \frac{\partial f}{\partial v_i} = \frac{\partial v_j}{\partial v_i} \frac{\partial f}{\partial v_j} = \frac{\partial v_j}{\partial v_i} \bar{v_j} \end{align*}\]
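A tiny worked sketch of reverse-mode accumulation on $f(x_1,x_2)=\sin(x_1 x_2)+x_1^2$ (our own toy graph), with the adjoints $\bar v_k$ propagated backwards and checked against finite differences:

```python
import numpy as np

def f_and_grad(x1, x2):
    # Forward pass: record intermediate values v_k
    v1 = x1 * x2
    v2 = np.sin(v1)
    v3 = x1 ** 2
    f = v2 + v3
    # Backward pass: adjoints v_bar_k = df/dv_k, propagated from the output
    f_bar = 1.0
    v2_bar = f_bar * 1.0
    v3_bar = f_bar * 1.0
    v1_bar = v2_bar * np.cos(v1)             # dv2/dv1 = cos(v1)
    x1_bar = v1_bar * x2 + v3_bar * 2 * x1   # x1 feeds both v1 and v3
    x2_bar = v1_bar * x1
    return f, np.array([x1_bar, x2_bar])

x = np.array([0.7, -1.3])
f, g = f_and_grad(*x)

# Finite-difference check of the gradient
eps = 1e-6
num = [(f_and_grad(*(x + eps * e))[0] - f) / eps for e in np.eye(2)]
print(g, num)
```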