Deep Learning Notes from Zero to Hero

Comprehensive notes from the Machine Learning Research Lab course @ Universitas Indonesia [In Progress]

Introduction to Deep Learning

1. Fundamental Definition of Machine Learning

Tom Mitchell (1997) provided a concise definition of machine learning:

A computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance at task T, as measured by P, improves with experience E.

Illustrative Examples:

Major Categories of Machine Learning Tasks:

Deep Learning is a specialized subfield of machine learning that employs neural networks with multiple layers to solve complex tasks, particularly excelling in supervised learning scenarios.

2. Supervised Learning: Core Concepts

Supervised learning involves learning a function f: X → Y from labeled pairs (x, y).

Primary Types:

Classification:

Regression:

Real-World Applications:

3. Unsupervised Learning: Discovering Hidden Patterns

The primary goal is to learn the underlying distribution of input data without explicit labels.

Self-Supervised Learning:

A powerful approach that constructs pseudo-labels from the data itself, such as predicting the next word in a sentence.

Notable Examples:

4. Curve Fitting: A Fundamental Regression Example

Task Definition:

Predict target value t from scalar input x (e.g., sin(2πx) + noise)

Model Formulation:

Polynomial representation: $y(x, w) = \sum_{j=0}^{M} w_j x^j$

Error Measurement:

Squared Loss Function: $E(w) = \frac{1}{2} \sum_{n=1}^{N} (y(x_n, w) - t_n)^2$
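As a quick illustration (a minimal numpy sketch, assuming the $\sin(2\pi x)$ toy data above; the variable names are ours), we can generate noisy samples, fit the polynomial by least squares, and evaluate the squared loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: t = sin(2*pi*x) + Gaussian noise
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)

# Degree-M polynomial y(x, w) = sum_j w_j x^j, fitted by least squares
M = 3
Phi = np.vander(x, M + 1, increasing=True)     # columns [1, x, x^2, ..., x^M]
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# Squared loss E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2
E = 0.5 * np.sum((Phi @ w - t) ** 2)
print(w, E)
```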

Model Complexity Considerations:

Detecting Overfitting:

Strategies for Controlling Overfitting:

Model Selection Process:

Probabilistic Framework

Motivation

Machine learning models face uncertainty: applying machine learning means dealing with uncertainty, since no machine learning model can achieve perfect accuracy.

Probability Theory

Probability Theory. Handling uncertainty requires the framework of probability theory.

Frequentist view of probability. Probability is the proportion of the frequency of an event happening in the limit of an infinite number of trials.

Bayesian View of Probability. Probability is the quantification of uncertainty.

Joint Probability $P(X, Y)$. Probability of $X=x$ AND $Y=y$ denoted as $P(X=x, Y=y)=p_{X,Y}(x, y)$

Conditional probability $P(X\mid Y)$. Probability of $X=x$ given $Y=y$ denoted as $P(X=x\mid Y=y)=p_{X\mid Y}(x\mid y)$

Marginal probability $P(X)$. Probability of $X=x$ obtained via the sum rule, denoted as $P(X=x)=p_X(x)=\sum_{y}p_{X,Y}(x, y)$ in the discrete case or $\int_{-\infty}^\infty p_{X,Y}(x, y)dy$ in the continuous case.

Product rule $P(X=x, Y=y)$. $P(X=x, Y=y)=p(Y=y\mid X=x)P(X=x)$.

Probability Distribution. A probability distribution for a random variable $X$ is a “complete specification of a way to compute probabilities of events about $X$.” It specifies a probability mass function (pmf) in the discrete case or a probability density function (pdf) in the continuous case. Typically, the pmf or pdf also depends on a set of parameters that governs its shape. Keep in mind that a pmf/pdf must be nonnegative and normalized (it sums/integrates to 1 over its support set). If a distribution is not normalized, it is called improper.

Bayes Theorem. The core principle of Bayesian statistics: how we update our belief after seeing some data (which is learning, in some sense).

\[P(Y\mid X) = \frac{P(X\mid Y)P(Y)}{P(X)}\]

| Term | Name | Intuition |
|---|---|---|
| $P(Y\mid X)$ | Posterior | What we want to know: belief about $Y$ after seeing data $X$ |
| $P(X\mid Y)$ | Likelihood | How likely is data $X$ under hypothesis $Y$? |
| $P(Y)$ | Prior | Belief about $Y$ before seeing any data |
| $P(X)$ | Evidence (aka Marginal Likelihood) | How likely is the data $X$ overall, across all possible values of $Y$? |

For those who confuse likelihood and probability, here is a genius explanation from my bro ChatGPT.

Likelihood and probability look similar but are conceptually different depending on the context. Whether we call $P(X\mid Y)$ the likelihood depends on what is fixed and what is variable.

| Term | Variable | Fixed | Interpretation |
|---|---|---|---|
| Probability $P(X\mid Y)$ | $X$ | $Y$ | Given model/parameter $Y$, how likely is the data $X$? |
| Likelihood $L(Y)=P(X\mid Y)$ | $Y$ | $X$ | Given observed data $X$, how likely is a particular model/parameter $Y$? |

Independent Event.

\[P(X,Y) = P(X)P(Y)\quad\iff\quad P(Y\mid X) = P(Y)\]

If the posterior probability equals the prior probability (additional information does not change the prior belief), then $X$ and $Y$ are independent.

Expectation. Expectation of some function $f(x)$ is the weighted average of $f$ under a probability distribution $p(x)$.

\[\mathbb E_{p(x)}[f(x)] = \sum_x f(x)p(x) \quad\text{or}\quad \int_{-\infty}^\infty f(x)p(x)dx\]

$\mathbb E_{p(x)}[f(x, y)]$ indicates expectation of a function over several variables, and the subscript $p(x)$ denotes the distribution under which the expected value of the function is computed. In this case, $\mathbb E_{p(x)}[f(x, y)]$ becomes a function of $y$.

Conditional Expectation. Conditional expectation is defined w.r.t. a conditional distribution $p(x\mid y)$.

\[\mathbb E_{p(x|y)}[f(x)] = \sum_x p(x\mid y)f(x)\quad\text{or}\quad \int_{-\infty}^\infty p(x\mid y)f(x)dx\]

Variance. Variance of $f(x)$ measures how much the value of $f$ varies over all possible values of $x$ under a certain $p(x)$.

\[\begin{align*} Var_{p(x)}[f(x)] &= \mathbb E_{p(x)}[(f(x)-\mathbb E_{p(x)}[f(x)])^2] \\ Var_{p(x)}[f(x)] &= \mathbb E_{p(x)}[f(x)^2] - \mathbb E_{p(x)}[f(x)]^2 \end{align*}\]

Covariance. Measures the extent to which two variables vary together.

\[\begin{align*} Cov_{p(x, y)}[f(x), g(y)] &= \mathbb E_{p(x, y)}[(f(x) - \mathbb E_{p(x, y)}[f(x)])(g(y) - \mathbb E_{p(x, y)}[g(y)])] \\ Cov_{p(x, y)}[f(x), g(y)] &= \mathbb E_{p(x, y)}[f(x)g(y)] - \mathbb E_{p(x, y)}[f(x)]\mathbb E_{p(x, y)}[g(y)] \end{align*}\]

If x and y are two vectors of random variables, their covariance is a matrix

\[Cov_{p(x, y)}[\vec x, \vec y] = \mathbb E_{p(x, y)}[xy^T] - \mathbb E_{p(x)}[x]\mathbb E_{p(y)}[y^T]\]

Note that $Cov(x, x) = Var(x)$

Example distributions

  1. Uniform
  2. Exponential
  3. Laplace
  4. Gaussian

Likelihood

Probability vs Likelihood. Probability is the chance that a particular outcome occurs based on the values of the distribution parameters. When calculating probability, we assume that the parameters are trustworthy. Likelihood is how well a sample/data provides support for particular values of the distribution parameter. When calculating likelihood, we’re trying to determine if we can trust the parameter values based on the data that we observe.

\[L(\theta\mid X=x) = P_\theta(x) = P(X=x; \theta) = f_\theta(x)\text{ for continuous pdf }f_\theta(x)\]

In the continuous case we use the pdf $f$, which is interpreted as a density, whereas in the discrete case we use the pmf $p$, which is interpreted as an actual probability. The integral of the likelihood over all parameter values need not equal 1.

Computing Likelihood. Given a dataset $\mathcal D=\{x_1,\dots, x_N\}$ that is i.i.d. (independent and identically distributed), the likelihood is computed as

\[L(\theta\mid \mathcal D) = \prod_{i=1}^N P(x_i;\theta) = \prod_{i=1}^N f_\theta(x_i)\]

Multiplying a large number of small values can be numerically problematic (underflow), so we often work with the log-likelihood instead.

\[LL(\theta) = \log L(\theta\mid \mathcal D) = \sum_{i=1}^N \log P(x_i;\theta)\]

Density estimation (aka parameter estimation). The task of estimating the distribution, given a finite set of observations. It is ill-posed because there are infinitely many distributions that could have generated the observed finite dataset.

One solution (among others) for density estimation is maximizing likelihood of the parameters with respect to the given dataset. Parameters obtained by likelihood maximization are usually called maximum likelihood estimators.
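A small sketch of this procedure for a 1-D Gaussian (assuming i.i.d. samples; for the Gaussian the maximum likelihood estimators are the sample mean and variance in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)      # i.i.d. dataset D

def gaussian_log_likelihood(mu, sigma2, x):
    # LL(theta) = sum_i log N(x_i | mu, sigma2)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

# Maximum likelihood estimators for the Gaussian (closed form)
mu_mle = data.mean()
sigma2_mle = data.var()                                # 1/N convention, not 1/(N-1)

print(gaussian_log_likelihood(mu_mle, sigma2_mle, data))   # highest log-likelihood
print(gaussian_log_likelihood(0.0, 1.0, data))              # worse parameters, lower LL
```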

Information Theory

Shannon’s idea: relate information content to degree of surprise. The view is that surprise $\iff$ rarity, and rarity $\iff$ more information.

Measuring information. Suppose, for a discrete RV $X$, we observe that $X = x$ for a particular $x$, how much information do we receive? Let $h(x)$ be the measure of information.

  1. h(x) should be additive. If two unrelated facts x, y occur at the same time, the surprise seeing both of them should be the sum of the individual surprises. So,
\[h(x, y) = h(x) + h(y)\]

But, x and y are unrelated, hence also independent. So, $p(x, y) = p(x)p(y)$ and $h(x)$ should be logarithm of $p(x)$.

  2. $h(x)$ should be monotonic in the opposite direction to $p(x)$: $h(x)$ increases iff $p(x)$ decreases.

We obtain,

\[h(x) = -\log_b p(x)\]

Entropy: expected amount of information. Entropy H[x] of x: expected amount of information (i.e., amount of surprise) over all possible outcomes of x under distribution of x, which is p(x):

\[\mathbb H[x] = \mathbb E_{p(x)}[h(x)] = \sum_x p(x)h(x) = -\sum_x p(x)\log p(x)\]

Entropy can also be understood as a measure of uncertainty of a system.

Minimum and Maximum Entropy Problem. The minimum, $\mathbb H[p] = 0$, is attained when some $x_i$ has $p(x_i) = 1$ while $p(x_j) = 0$ for all other $x_j$. The maximum, however, is harder to find and more useful: it amounts to choosing the most unbiased distribution given what you know.

Solving the Maximum Entropy Problem using Lagrangian Constrained Optimization.

We aim to solve for the $p(x_i)$ (discrete case) that maximize the following Lagrangian \(\begin{align*} L(p, \lambda) &= f(p) + \lambda g(p) \\ &= -\sum_{x_i} p(x_i)\log p(x_i) + \lambda\left(\sum_i p(x_i) - 1 \right) \\ \end{align*}\)

Note that the constraint $g$ keeps the $p(x_i)$ consistent with the probability axioms (they sum to 1). Setting the derivatives to zero, we end up with $p(x_1)=p(x_2)=\dots=p(x_K)=\exp(\lambda-1)=1/K$. In other words, $\mathbb H[p]$ is maximized when $p(x)$ is the uniform (discrete) distribution.
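A quick numerical check of this (a sketch; `entropy` is our own helper, not from the course):

```python
import numpy as np

def entropy(p):
    # H[p] = -sum_i p_i log p_i, with 0 log 0 treated as 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

K = 4
print(entropy(np.full(K, 1 / K)))        # uniform: log(K) ~ 1.386, the maximum
print(entropy([0.7, 0.1, 0.1, 0.1]))     # less uniform: smaller entropy
print(entropy([1.0, 0.0, 0.0, 0.0]))     # deterministic: minimum entropy, 0
```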

Entropy for continuous variables. We can extend entropy to continuous variables; this extension is called the differential entropy.

\[\mathbb H[x] = \lim_{\Delta\to 0}\left(-\sum_i p(x_i)\Delta\log p(x_i) \right) = -\int_x p(x)\log p(x)dx\]

One can show using calculus of variations and Lagrange function that the distribution that maximizes the differential entropy is the Gaussian.

Cross Entropy. Cross-entropy compares two probability distributions: the true distribution $p(x)$ and the approximating distribution $q(x)$. It tells you how many “nats” (or “bits”) you need on average to encode data drawn from $p(x)$ using the coding scheme designed for $q(x)$.

\[\begin{align*} \mathbb H_q(p) &= \mathbb H(p, q) = -\sum_x p(x)\log q(x) \\ \mathbb H_p(p) &= \mathbb H[p] \end{align*}\]

Note that we use parentheses for cross entropy instead of square brackets; the subscript denotes the coding distribution and the argument denotes the data distribution. In machine learning, the data distribution corresponds to the true (unknown) distribution of the prediction outcome, while the coding distribution corresponds to the approximating distribution produced by the model.

Kullback-Leibler (KL) Divergence. The average additional amount of information (in nats) needed to specify the value $x$ when using the approximating distribution $q(x)$ instead of the true distribution $p(x)$:

\[\mathbb {KL}(p\,\|\,q) = \mathbb H_q(p) - \mathbb H[p] = \sum_x p(x)\log\frac{p(x)}{q(x)}\]

KL divergence can be interpreted as a measure of dissimilarity between two distributions $p(x)$ and $q(x)$.

KL Divergence and Likelihood. If the training data comes from the unknown distribution $p(x)$, we approximate $p(x)$ using some parametric distribution $q(x;\theta)$. $\theta$ can be determined by minimizing $\mathbb {KL}(p\,\|\,q)$, but $p(x)$ is unknown. Since we have training data $\{x_i\}$ drawn from $p(x)$, we can estimate the expectation as a finite sum over these points.
\[\begin{align*} \mathbb{KL}(p\,\|\,q) &= -\int p(x)\log\frac{q(x; \theta)}{p(x)} dx \\ &\approx \frac{1}{N}\sum_{i=1}^N\left(-\log\frac{q(x_i;\theta)}{p(x_i)}\right) = \frac{1}{N}\sum_{i=1}^N(-\log q(x_i;\theta) + \log p(x_i)) \end{align*}\]
Here $\log p(x_i)$ is independent of $\theta$, and $-\log q(x_i;\theta)$ is the negative log-likelihood. Therefore, minimizing $\mathbb{KL}(p\,\|\,q)$ is equivalent to maximizing the log-likelihood of $\theta$ under $q$. In deep learning, the KL divergence is used in variational autoencoders, generative adversarial networks, and diffusion models.
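A small numeric sketch of these identities for discrete distributions (the pmfs `p` and `q` below are made up for illustration): $\mathbb{KL}(p\,\|\,q)=\mathbb H_q(p)-\mathbb H[p]$, and a Monte-Carlo average of $-\log q(x_i)$ over samples drawn from $p$ estimates the cross-entropy, i.e. the negative log-likelihood term.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])     # "true" distribution
q = np.array([0.4, 0.4, 0.2])     # model / approximating distribution

entropy_p     = -np.sum(p * np.log(p))        # H[p]
cross_entropy = -np.sum(p * np.log(q))        # H_q(p): data from p, code for q
kl_pq         =  np.sum(p * np.log(p / q))    # KL(p || q)

print(np.isclose(kl_pq, cross_entropy - entropy_p))    # True

# Monte-Carlo view: with samples x_i ~ p, the average of -log q(x_i)
# estimates H_q(p), so minimizing it over the parameters of q
# maximizes the log-likelihood.
rng = np.random.default_rng(0)
samples = rng.choice(len(p), size=100_000, p=p)
nll = -np.mean(np.log(q[samples]))
print(nll, cross_entropy)                               # close to each other
```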

Linear Regression

Task. Predict continuous target t from input $x\in R^D$

Model.

\[y(x, w) = \sum_{j=0}^{M-1} w_j\phi_j(x) = w^T\phi(x)\]

Basis Functions. The model is linear in the weights, but not necessarily in the inputs.

Probabilistic Modeling. Assume noise in targets $t$ such that

\[t = y(x, w) + \epsilon,\quad \epsilon\sim\mathcal N(0, \sigma^2)\]

Where

\[y(x, w) = w^T\phi(x)\]

Probabilistically, we can write

\[p(t\mid x, w, \sigma^2) = \mathcal N(t\mid y(x, w), \sigma^2)\]

Given a dataset of inputs $X={x_1,\dots, x_N}$ with corresponding targets $t=(t_1,\dots,t_N)^T$ where data points are i.i.d. The likelihood of parameters $w, \sigma^2$ defined as

\[\begin{align*} L(w, \sigma^2\mid X, t)&=p(t\mid X,w,\sigma^2)=\prod_{i=1}^N\mathcal N(t_i\mid w^T\phi(x_i), \sigma^2) \\ LL(w, \sigma^2\mid X, t)&= \sum_{i=1}^N \log\mathcal N(t_i\mid w^T\phi(x_i), \sigma^2) \\ LL(w, \sigma^2\mid X, t)&= N\log \frac{1}{\sigma\sqrt{2\pi}} -\frac{1}{2}\sum_{i=1}^N \left(\frac{t_i-w^T\phi(x_i)}{\sigma}\right)^2 \\ LL(w, \sigma^2\mid X, t)&= -\frac{N}{2}\log (2\pi\sigma^2) - \frac{1}{\sigma^2}\left(\underbrace{\frac{1}{2}\sum_{i=1}^N ({t_i-w^T\phi(x_i)})^2}_\text{Sum of Squares Error}\right) \\ \end{align*}\]

If $\sigma^2$ is assumed to be fixed/constant, maximizing the log-likelihood above is equivalent to minimizing the sum-of-squares error.

The maximization of the log-likelihood in previous equation is done with respect to two parameters: $w$ and $\sigma^2$, leading to solutions for both defined as:

\[w_{MLE} = \arg\max_w\log p(t\mid X, w, \sigma^2)\quad \sigma^2_{MLE}=\arg\max_{\sigma^2} \log p(t\mid X, w, \sigma^2)\]

Both are called maximum likelihood estimators for the model. For homoskedastic linear regression model, $\sigma^2$ is actually assumed to be constant, leading to $w_{MLE}$ as a least-square solution to the sum-of-squares error.

Design Matrix. A matrix $X\in \mathbb R^{N\times D}$ for $N$ data points with feature dimension $D$. The basis function $\phi: \mathbb R^D\to \mathbb R^M$ is applied to each row of $X$ to yield a (row) vector of length $M$; stacking these rows gives $\Phi\in\mathbb R^{N\times M}$. The vector $\vec y=[y_i]_i$ is also commonly used together with the design matrix.

Maximizing the log likelihood. We solve this in closed form by setting the gradient to zero, working in the design-matrix form with $\Phi \in \mathbb R^{N\times M}$.

\[\begin{align*} &\frac{\partial}{\partial w}\left[- \frac{N}{2}\log (2\pi\sigma^2) - \frac{1}{\sigma^2}\left(\frac{1}{2}\sum_{i=1}^N ({t_i-w^T\phi(x_i)})^2\right)\right] \\ &= \frac{\partial}{\partial w}\left[- \frac{N}{2}\log (2\pi\sigma^2) - \frac{1}{2\sigma^2} (\vec t-\Phi w)^T(\vec t-\Phi w)\right] \\ &= -\frac{1}{2\sigma^2}\frac{\partial}{\partial w}(\vec t-\Phi w)^T(\vec t-\Phi w) \\ &= -\frac{1}{2\sigma^2}\cdot 2(\vec t-\Phi w)^T(-\Phi) \\ &= \frac{1}{\sigma^2}(\vec t-\Phi w)^T\Phi = 0 \\ &\iff (\vec t^T-w^T\Phi^T)\Phi = \vec t^T\Phi - w^T\Phi^T\Phi = 0 \\ &\iff \Phi^T\Phi w = \Phi^T \vec t \iff w = (\Phi^T\Phi)^{-1}\Phi^T \vec t \end{align*}\]

This is the normal equation: $w_{MLE} = (\Phi^T\Phi)^{-1}\Phi^T \vec t$. It can lead to numerical difficulties when $\Phi^T\Phi$ is close to singular, which can be addressed using the SVD or with sequential/online stochastic gradient descent algorithms.
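A minimal sketch of the normal equation in code (assuming a polynomial design matrix on toy data, with names of our own choosing); in practice an SVD-based solver such as `np.linalg.lstsq` or the pseudo-inverse is preferable to explicitly inverting $\Phi^T\Phi$ when it is near-singular:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=N)

Phi = np.vander(x, M, increasing=True)              # design matrix, shape (N, M)

# Normal equation: w = (Phi^T Phi)^{-1} Phi^T t
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Numerically safer alternatives when Phi^T Phi is near-singular
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)   # SVD-based least squares
w_pinv = np.linalg.pinv(Phi) @ t                    # Moore-Penrose pseudo-inverse

print(np.allclose(w_normal, w_lstsq), np.allclose(w_lstsq, w_pinv))
```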

Regularized Least Square Solving for Overfitting Problem. We change the objective to minimize for

\[\begin{align*} &E_D(w) + \lambda E_W(w) \\ &=\frac{1}{2}\sum_i (t_i-w^T\phi(x_i))^2 + \frac{\lambda}{2}w^Tw \\ \end{align*}\]

We use the $\ell_2$ regularizer for $E_W$. This method, also called weight decay, allows the model to shrink and effectively turn off some of the parameters $w$.
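The regularized objective also has a closed form, $w = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T \vec t$. A short sketch (using a hypothetical high-degree polynomial basis so the effect of the regularizer is visible):

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    # Minimizes 1/2 ||t - Phi w||^2 + (lam/2) ||w||^2   (weight decay)
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)
Phi = np.vander(x, 10, increasing=True)             # high-degree polynomial basis

# Larger lambda shrinks the weight vector toward zero, controlling overfitting
for lam in (0.0, 1e-3, 1e-1, 10.0):
    print(lam, np.linalg.norm(ridge_solution(Phi, t, lam)))
```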

Multiple Regression. Conditional distribution of the $K$-dimensional target vector $\vec t$.

\[\begin{align*} p(\vec t\mid x, W, \sigma^2) &= \mathcal N(\vec t\mid W^T\phi(x), \sigma^2 I) \\ &= \frac{1}{\sqrt{(2\pi)^K\vert\sigma^2 I\vert}}\exp\left(-\frac{1}{2}(\vec t - W^T\phi(x))^T(\sigma^2 I)^{-1}(\vec t - W^T\phi(x))\right) \end{align*}\]

The above Gaussian is assumed isotropic: equal variance in each dimension (hence equal density in every direction) and all dimensions independent. Its density contours are therefore spherical and axis-aligned.

Decision Theory

Approximating a Distribution with a Decision. The probabilistic model $p(t\mid x, w_{ML}, \sigma^2_{ML}) = \mathcal N(t\mid y(x, w_{ML}), \sigma^2_{ML})$ is a predictive distribution expressing uncertainty over the value of $t$ for some new input $x$. But often we need to predict a specific value for $t$, rather than returning the whole distribution, especially if we must take a specific action. For example, if our aim is to determine the optimal level of radiation to use for treating a tumor and the model predicts a probability distribution over the dose, then we need to use that distribution to determine the specific dose to administer. We formalize this using decision theory.

Formal Objective. Suppose we choose $f(x)$ as our prediction when the true value is $t$. This choice incurs a penalty/cost given by a loss function $L(t, f(x))$. We define the objective as the expected loss

\[\mathbb E[L] = \int\int L(t, f(x))\,p(x, t)\,dx\,dt\]

Example of $L(t, f(x))$ is squared loss.

\[L(t, f(x)) = (f(x)-t)^2\]

Choosing the decision function $f(x)$. Minimizing the expected loss above with the squared loss, using calculus of variations, we obtain

\[f^*(x) = \int tp(t\mid x)\,dt=\mathbb E_t[t\mid x]\]

$f^*(x) = \mathbb E_t[t\mid x]$ is called the regression function.

This is just the conditional expectation of $t$ given $x$. For a Gaussian conditional distribution, this becomes the following.

\[\mathbb E[t\mid x] = \int tp(t\mid x)\,dt = y(x, w)\]

Bias Variance Tradeoff

Derivation of Bias Variance Tradeoff.

\[\mathbb E[L] = \int (f(x)-\mathbb E[t\mid x])^2p(x) dx + \int var(t\mid x)p(x)dx\]

Note that the term $\mathbb E[t\mid x]$ is just one choice of the regression function. The second term is independent of $f(x)$ and arises from the intrinsic noise of the data; it represents the minimum achievable value of the expected loss.

Consider the first integrand for a particular dataset $D$: $(f(x; D) - h(x))^2$, where we write $h(x)=\mathbb E[t\mid x]$. We run experiments over an ensemble of datasets, so we can take the average of this quantity over many possible datasets.

Expanding the average produces three terms, and the cross term vanishes in expectation over datasets. The first term is the squared bias of the model; the second term is the variance of the model's predictions across datasets.
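Explicitly, inserting and subtracting $\mathbb E_D[f(x;D)]$ inside the square gives

\[\begin{align*} \mathbb E_D\big[(f(x;D) - h(x))^2\big] &= \mathbb E_D\big[(f(x;D) - \mathbb E_D[f(x;D)] + \mathbb E_D[f(x;D)] - h(x))^2\big] \\ &= \underbrace{(\mathbb E_D[f(x;D)] - h(x))^2}_{\text{bias}^2} + \underbrace{\mathbb E_D\big[(f(x;D) - \mathbb E_D[f(x;D)])^2\big]}_{\text{variance}} \end{align*}\]

since the cross term is proportional to $\mathbb E_D\big[f(x;D) - \mathbb E_D[f(x;D)]\big] = 0$.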

We end up with expected loss $= \text{bias}^2 + \text{variance} + \text{irreducible noise}$.

Classification

Task. Predicting a class label $y\in\{1,\dots, K\}$ from input $x\in R^D$.

Three approaches to solving the classification task.

  1. Discriminant function: directly assigns $x$ to a specific class.
  2. Discriminative probabilistic model: train the conditional distribution $p(C_k\mid x)$ directly using a training set, and then use the model to decide the class of $x$.
  3. Generative probabilistic model: model the class-conditional density $p(x\mid C_k)$ and the class prior $p(C_k)$, then compute the posterior $p(C_k\mid x)$ via Bayes theorem; after that, use the posterior distribution to decide the class of $x$.

Maximum Likelihood on the Classification Task. The likelihood is defined the same way for regression and classification; all that matters is how we define the model. Suppose we model the classification task as below.

\[p(y = c\mid x; w) = f_c(x; w)\]

where $f_c(x; w)$ is a parametric function. The likelihood then takes the form

\[\begin{align*} L(w\mid X, y) = \prod_{i=1}^N p(y_i = c_i\mid x_i; w) \\ LL(w\mid X, y) = \sum_{i=1}^N \log p(y_i = c_i\mid x_i; w) \end{align*}\]

Linear discriminant function: Two-class case. Suppose we model

\[y(x) = w^Tx + w_0\]

The decision to assign $x$ into class $\hat y$ is

\[\hat y = \begin{cases} C_1\quad\text{if } y(x) \ge 0\\ C_2\quad\text{otherwise} \end{cases}\]

Decision boundary corresponds to the $y(x)=0$, $D-1$ dimensional hyperplane within $D$ dimensional input space.

Decision boundary is orthogonal to $w$. Under the same model, we can show that for any two points $x_0, x_1$ lying on the decision boundary (i.e., $y(x_0) = y(x_1) = 0$), the difference vector $(x_0-x_1)$ is orthogonal to $w$.

\[\begin{align*} x_0, x_1 : y(x_0) = y(x_1) = 0 \\ w^Tx_0 = w^Tx_1 \implies w^T(x_0 - x_1) = 0 \\ \langle w, x_0-x_1\rangle = 0 \end{align*}\]

where $\langle \cdot, \cdot \rangle$ is the standard dot product. This property holds for any two points $x_0, x_1$ on the decision boundary, so $w$ is normal to the boundary. Consequently, any $x$ can be written as its orthogonal projection onto the boundary plus $r\frac{w}{\Vert w\Vert}$, where $r = y(x)/\Vert w\Vert$ is the signed distance from $x$ to the boundary.

Multiple classes. The trick above (defining a decision region by $y_c(x)>0$) draws a single one-versus-rest boundary for each class $c$, which can leave ambiguously classified regions; instead, we can use $K$ discriminants $y_k(x)$ and assign $x$ to the class with the largest $y_k(x)$.

Using Least Squares to find $W$. For a one-hot target $t_i\in R^K$ with prediction $W^T\phi(x_i)\in R^K$, we stack rows into matrix form: $T=[t_i^T]_i\in R^{N\times K}$, $X=[\phi(x_i)^T]_i\in R^{N\times D}$, and $W=[w_1,\dots,w_K]\in R^{D\times K}$ (the class weight vectors stacked as columns).

\[\begin{align*} E_D(W) &= \frac{1}{2}\sum_i^N \Vert W^T\phi(x_i) - t_i \Vert^2\\ &= \frac{1}{2}\sum_i^N \sum_j^K \left((W^T\phi(x_i))_j - t_{ij}\right)^2\\ &= \frac{1}{2}\sum_i^N \sum_j^K (XW-T)_{ij}^2\\ &= \frac{1}{2}{\Vert XW-T \Vert^2_F} \\ &= \frac{1}{2} \operatorname{Tr}\left((XW-T)^T(XW-T)\right) \\ \end{align*}\]

Decision Theory

Prediction and Action. Prediction in supervised learning is modeled probabilistically as a posterior distribution $p(y\mid x)$. Intuitively, our follow up action for prediction is to choose the value of $y$ that makes $p(y\mid x)$ the highest. But is this mathematically justified?

Mathematical Justification. We assume an agent/decision maker has a set of possible actions $a\in\mathcal A$. Each action $a$ has costs and benefits that depend on the (hidden) state of nature $y$. These dependencies are modeled by a loss function $\ell(y, a)$ giving the loss incurred if the agent takes action $a$ when the state of nature is $y$.

Posterior Expected Risk. We define the posterior expected risk as

\[R(a\mid x) = \mathbb E_{p(y\mid x)}[\ell(y, a)] = \sum_{y\in\mathcal Y} p(y\mid x)\ell(y, a)\]

Optimal Policy (Bayes Estimator). The optimal policy is the one that minimizes the posterior expected risk.

\[a^* = \arg\min_{a\in\mathcal A} R(a\mid x)\]

Alternative: Use a utility function (Maximum Expected Utility Principle). Define $U(y, a)$ expressing the desirability of each possible action-state pair, then seek the action that maximizes expected utility.

\[a^* = \arg\max_{a\in\mathcal A} \mathbb E_{p(y\mid x)}[U(y, a)]\]

Decision on Classification, Zero-one loss for Binary Classification. Assign loss 1 if action is wrong, 0 otherwise.

\[\ell_{zo}(y, a) = \mathbb I\{y\ne a\}\]

| | $y=0$ | $y=1$ |
|---|---|---|
| $a=0$ | 0 | 1 |
| $a=1$ | 1 | 0 |

Posterior expected loss for each action $a$ is

\[\begin{align*} R(a\mid x) &= \sum_{y\in\mathcal Y} p(y\mid x)\ell_{zo}(y, a) \\ &= \begin{cases} p(y=1\mid x) & \text{if } a=0 \\ p(y=0\mid x) & \text{if } a=1 \end{cases} \\ &= p(y\neq a\mid x) = 1-p(y=a\mid x) \end{align*}\]

Optimal policy is

\[\begin{align*} a^* &= \arg\min_{a\in\{0,1\}} R(a\mid x) \\ &= \arg\min_{a\in\{0,1\}} 1-p(y=a\mid x) \\ &= \arg\max_{a\in\{0,1\}} p(y=a\mid x) \end{align*}\]

Maximum a posteriori (MAP) estimate. This maximization decision rule corresponds to the mode of the posterior distribution. Specifically, for the binary case:

\[a^* = \begin{cases} 0&\text{if }p(y=0\mid x)\ge p(y=1\mid x) \\ 1&\text{otherwise} \end{cases}\]

Cost Sensitive Classification. Instead of the zero-one loss, we may have the following cost matrix, where $\ell_{ij}$ is the cost of predicting $j$ when the true label is $i$.

| True label \ Predicted | $a = 0$ | $a = 1$ |
|---|---|---|
| $y = 0$ | $\ell_{00}$ | $\ell_{01}$ |
| $y = 1$ | $\ell_{10}$ | $\ell_{11}$ |
\[\begin{align*} R(a\mid x) &= \sum_{y\in\mathcal Y} p(y\mid x)\ell_{y a} \\ R(a = 0\mid x) &= \ell_{00}p(y=0\mid x) + \ell_{10}p(y=1\mid x) \\ R(a = 1\mid x) &= \ell_{01}p(y=0\mid x) + \ell_{11}p(y=1\mid x) \end{align*}\]

Decision rule is to pick $a$ that minimizes the posterior expected loss:

\(\begin{align*} a^* &= \arg\min_{a\in\{0,1\}} R(a\mid x) \\ &= \min\left\{\underbrace{\ell_{00}p(y=0\mid x) + \ell_{10}p(y=1\mid x)}_{a=0},\;\underbrace{\ell_{01}p(y=0\mid x) + \ell_{11}p(y=1\mid x)}_{a=1}\right\}\\ \end{align*}\) which turns into the following decision rule (comparing the two risks)

\[\begin{align*} \ell_{00}p(y=0\mid x) + \ell_{10}p(y=1\mid x) &\dots \ell_{01}p(y=0\mid x) + \ell_{11}p(y=1\mid x) \\ (\ell_{00}-\ell_{01})p(y=0\mid x) + (\ell_{10}-\ell_{11})p(y=1\mid x) &\dots 0 \\ (\ell_{00}-\ell_{01})p(y=0\mid x) &\dots (\ell_{11}-\ell_{10})p(y=1\mid x) \\ \frac{p(y=0\mid x)}{p(y=1\mid x)} &\dots \frac{\ell_{11}-\ell_{10}}{\ell_{00}-\ell_{01}} \\ a = \begin{cases} 0&\text{if }\frac{p(y=0\mid x)}{p(y=1\mid x)}\ge\frac{\ell_{11}-\ell_{10}}{\ell_{00}-\ell_{01}} \\ 1&\text{otherwise} \end{cases} \end{align*}\]
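A tiny sketch of this rule in code (the loss-matrix values and posteriors below are made-up examples): compute $R(a\mid x)$ for each action and take the argmin.

```python
import numpy as np

# loss[y, a]: cost of taking action a when the true label is y
loss = np.array([[0.0, 1.0],    # y = 0: l_00, l_01
                 [5.0, 0.0]])   # y = 1: l_10, l_11 (missing a positive is 5x worse)

def bayes_action(posterior, loss):
    # posterior: p(y | x) as a length-2 vector; returns argmin_a R(a | x)
    risk = posterior @ loss      # R(a|x) = sum_y p(y|x) * loss[y, a]
    return int(np.argmin(risk)), risk

print(bayes_action(np.array([0.7, 0.3]), loss))   # picks 1 although p(y=0|x) > p(y=1|x)
print(bayes_action(np.array([0.95, 0.05]), loss)) # picks 0 when the posterior is confident
```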

Reject Option Classification. Sometimes we want to say “I don’t know” instead of returning an answer (called the reject option). The loss will then become

\[\ell(y, a) = \begin{cases} 0&\text{if }a=y\\ \lambda_r&\text{if }a=\text{reject}\\ \lambda_e&\text{otherwise} \end{cases}\]

The optimal decision rule that minimizes the posterior risk is (proof left as an exercise):

\[a^* = \begin{cases} \arg\max\limits_{y\in\{1,\dots,C\}} p(y|x) & \text{if } \max\limits_{y\in\{1,\dots,C\}} p(y|x) > \lambda = 1 - \frac{\lambda_r}{\lambda_e} \\ \text{reject} & \text{otherwise} \end{cases}\]

Evaluation Metrics

Confusion Matrix. We can define the number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) with respect to a threshold $\tau$:

\[\begin{align*} TP &= \sum_{i=1}^N \mathbb I\{y_i = 1, \hat y_i = 1\} \quad TN &= \sum_{i=1}^N \mathbb I\{y_i = 0, \hat y_i = 0\} \\ FP &= \sum_{i=1}^N \mathbb I\{y_i = 0, \hat y_i = 1\} \quad FN &= \sum_{i=1}^N \mathbb I\{y_i = 1, \hat y_i = 0\} \end{align*}\]

Thus we can define the following confusion matrix:

| Truth \ Prediction | 0 | 1 | Row sum |
|---|---|---|---|
| 0 | TN | FP | $N$ |
| 1 | FN | TP | $P$ |
| Col sum | $\hat N$ | $\hat P$ | $N + P$ |

Metric from Confusion Matrix.

\[\begin{align*} \text{Accuracy} &= \frac{TP + TN}{N + P} = \frac{TP + TN}{TP + TN + FP + FN} \\ \end{align*}\]

Metrics w.r.t. rowwise normalization (i.e., truths).

\[\quad TNR_\tau = \frac{TN_\tau}{N_\tau} = \frac{TN_\tau}{FP_\tau + TN_\tau} = p(\hat y = 0|y = 0; \tau)\] \[FPR_\tau = \frac{FP_\tau}{N_\tau} = \frac{FP_\tau}{FP_\tau + TN_\tau} = p(\hat y = 1|y = 0; \tau)\] \[FNR_\tau = \frac{FN_\tau}{P_\tau} = \frac{FN_\tau}{TP_\tau + FN_\tau} = p(\hat y = 0|y = 1; \tau)\] \[TPR_\tau = \frac{TP_\tau}{P_\tau} = \frac{TP_\tau}{TP_\tau + FN_\tau} = p(\hat y = 1|y = 1; \tau)\]

Metrics w.r.t. columnwise normalization (i.e., predictions).

\[NPV_\tau = \frac{TN_\tau}{\hat N_\tau} = \frac{TN_\tau}{TN_\tau + FN_\tau} = p(y = 0|\hat y = 0; \tau)\] \[FOR_\tau = \frac{FN_\tau}{\hat N_\tau} = \frac{FN_\tau}{TN_\tau + FN_\tau} = p(y = 1|\hat y = 0; \tau)\] \[FDR_\tau = \frac{FP_\tau}{\hat P_\tau} = \frac{FP_\tau}{FP_\tau + TP_\tau} = p(y = 0|\hat y = 1; \tau)\] \[Prec_\tau = PPV_\tau = \frac{TP_\tau}{\hat P_\tau} = \frac{TP_\tau}{FP_\tau + TP_\tau} = p(y = 1|\hat y = 1; \tau)\]

Choosing from a number of different thresholds. With ROC curves, we can consider a set of different thresholds and compare the resulting performance. For each threshold $\tau$, we plot $FPR_\tau$ versus $TPR_\tau$ as a point and draw a curve connecting them. This gives the receiver operating characteristic (ROC) curve.

ROC curve is insensitive to changes in class distribution. Changes in the proportion of positive to negative instances in the (test) set do not cause change to the curve.

Precision-Recall (PR) curve.

F-scores. The $F_\beta$ score is a combined statistic of precision (Prec) and recall (Rec). It is useful for looking at a single threshold. Unlike the ROC curve, the F-score is affected by changes in the class distribution.
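A sketch of computing these counts and the threshold-dependent metrics from predicted scores (synthetic labels and scores, with helper names of our own):

```python
import numpy as np

def binary_metrics(y_true, scores, tau):
    y_hat = (scores >= tau).astype(int)
    TP = np.sum((y_true == 1) & (y_hat == 1))
    TN = np.sum((y_true == 0) & (y_hat == 0))
    FP = np.sum((y_true == 0) & (y_hat == 1))
    FN = np.sum((y_true == 1) & (y_hat == 0))
    tpr = TP / (TP + FN)                                   # recall / sensitivity
    fpr = FP / (FP + TN)
    prec = TP / (TP + FP) if TP + FP > 0 else 0.0
    f1 = 2 * prec * tpr / (prec + tpr) if prec + tpr > 0 else 0.0
    return tpr, fpr, prec, f1

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
scores = np.clip(0.7 * y + rng.normal(scale=0.3, size=1000), 0, 1)  # noisy scores

# Sweeping tau traces out the ROC curve (one (FPR, TPR) point per threshold)
for tau in (0.3, 0.5, 0.7):
    print(tau, binary_metrics(y, scores, tau))
```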

Classification Modeling

Three approaches for training and inference.

  1. Generative Models. Train class conditional densities $p(x\mid C_k)$ for each class $C_k$ individually. Then use Bayes theorem to compute the posterior probabilities:
\[p(C_k\mid x) = \frac{p(x\mid C_k)p(C_k)}{p(x)} = \frac{p(x\mid C_k)p(C_k)}{\sum_{i=1}^C p(x\mid C_i)p(C_i)}\]

“Generative” because by sampling from them, it is possible to generate synthetic data points in the input space.

  2. Discriminative Models. Train the posterior class probabilities $p(C_k\mid x)$ directly. Then, use decision theory to assign each new $x$ to one of the classes.

  3. Discriminant Functions. Find a discriminant function $f(x)$ that maps each input $x$ directly onto a class label, e.g., for binary classification $f$ might be binary valued such that $f = 0$ represents class $C_1$ and $f = 1$ represents class $C_2$.

Why compute the posterior instead of a discriminant function directly?

Discriminative Classifiers, Logistic Regression. Intuitively, we classify

\[f(x) = \begin{cases} 1&\text{if }w^Tx + b > 0 \\ 0&\text{otherwise} \end{cases} = \mathbb I\left(\log \frac{p(y=1\mid x)}{p(y=0\mid x)}>0\right)\]

The equation above assumes a linear decision boundary in the input space. Equivalently, we can formulate $f(x)$ through the log-odds.

\[\begin{align*} & w^Tx + b = \log\underbrace{\frac{p(y=1\mid x)}{p(y=0\mid x)}}_{\text{Log Odds}} \\ &\iff p(y=1\mid x) = \frac{1}{1+\exp(-(w^Tx+b))} \end{align*}\]

The aforementioned model is called (binary) logistic regression

\[p(y=1\mid x) = \sigma(w^Tx+b)\]

MLE for Logistic Regression. Writing $a_n = w^Tx_n + b$ for the logit of example $n$, the negative log-likelihood is \(\begin{align*} NLL(w) &= -\frac{1}{N} \log p(D|w) = -\frac{1}{N} \log \prod_{n=1}^N Ber(y_n|\sigma(a_n)) \\ &= -\frac{1}{N} \sum_{n=1}^N \log[\sigma(a_n)^{y_n} \cdot (1-\sigma(a_n))^{1-y_n}] \\ &= -\frac{1}{N} \sum_{n=1}^N [y_n \log \sigma(a_n) + (1-y_n) \log(1-\sigma(a_n))] \\ &= \frac{1}{N} \sum_{n=1}^N H(y_n,\sigma(a_n)) \\ \nabla_w NLL(w) &= \frac{1}{N} \sum_{i=1}^N (\sigma(a_i)-y_i)x_i \\ \end{align*}\) where $H(\cdot,\cdot)$ denotes the binary cross-entropy.
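A minimal gradient-descent sketch of this MLE (assuming the bias is folded into $w$ by appending a constant feature; the data and hyperparameters below are made up):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logreg(X, y, lr=0.1, iters=2000):
    # X: (N, D) with a constant 1-column appended so the bias lives inside w
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(iters):
        p = sigmoid(X @ w)                 # sigma(a_n)
        grad = X.T @ (p - y) / N           # (1/N) sum_n (sigma(a_n) - y_n) x_n
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)    # synthetic labels
Xb = np.hstack([X, np.ones((N, 1))])               # append bias feature
w = fit_logreg(Xb, y)
acc = np.mean((sigmoid(Xb @ w) > 0.5) == y)
print(w, acc)
```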

Cons that occur on linearly separable datasets: the maximum of the likelihood is not attained at any finite $w$; training keeps increasing $\Vert w\Vert$ and the predictions become overconfident, so regularization or early stopping is needed.

Multinomial Softmax. We generalize to multiple classes $C=\{1, 2, 3, \dots\}$ based on the following idea

\[f_c(x; w_c) = \mathbb I(p(y=c\mid x)>p(y=1\mid x))\]

Long story short, this implies

\[p(y=k\mid x) = \frac{\exp(w_k^Tx)}{\sum_{i=1}^C \exp(w_i^Tx)}\]

Note that every class $k$ has its own weight vector $w_k$. The value of $p(y = c\mid x)$ is invariant under translating all the logits $w_k^Tx$ by the same constant.

\[\mathcal S(Wx) = \left[\frac{e^{w_1^Tx}}{\sum_{i=1}^C e^{w_i^Tx}}, \dots, \frac{e^{w_C^Tx}}{\sum_{i=1}^C e^{w_i^Tx}}\right]^T\]

Since $\mathcal S(Wx)$ is a vector of C probability values, softmax regression is based on a categorical distribution M whose parameters are precisely those C probability values:

\[p(y\mid x; \theta) = \mathcal M(y\mid \theta), \theta=\mathcal S(Wx)\]

Log-Sum-Exp Trick. When working with the softmax distribution, we often want to compute the normalized probability. Define $a_i=w_i^Tx$.

\(p(y=c\mid x) = \frac{\exp(a_c)}{\sum_{i=1}^C \exp(a_i)} = \frac{\exp(a_c)}{Z(a)}\) where $Z(a)=\sum_{i=1}^C\exp(a_i)$ is the normalizer, shared across all $c$.

Naive computation of this expression may not be numerically stable: the exponentials can overflow or underflow.

Let's derive a numerically better version.

\[\begin{align*} p(y=c\mid x) &= \frac{\exp(a_c)}{\sum_{i=1}^C \exp(a_i)} = \frac{\exp(a_c)}{ \exp \log \sum_{i=1}^C \exp(a_i)} \\ &= \exp\left(a_c - \log \sum_{i=1}^C \exp(a_i)\right) \\ &= \exp\left(a_c - \log (\exp(m)\sum_{i=1}^C \exp(a_i-m))\right) \\ &= \exp\left(a_c - m + \log \sum_{i=1}^C \exp(a_i-m)\right) \end{align*}\]

We choose $m = \max_i a_i$ so that the largest exponent is zero (its exponential is exactly one). Even if some of the other terms underflow, the answer is still sensible.
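The trick in code (a sketch of numerically stable softmax/log-softmax obtained by subtracting the maximum logit):

```python
import numpy as np

def log_softmax(a):
    # log p(y=c) = a_c - m - log(sum_i exp(a_i - m)),  with m = max_i a_i
    m = np.max(a)
    return a - m - np.log(np.sum(np.exp(a - m)))

def softmax(a):
    return np.exp(log_softmax(a))

a = np.array([1000.0, 1001.0, 1002.0])        # naive exp(a) overflows to inf
print(softmax(a))                              # [0.0900, 0.2447, 0.6652]

with np.errstate(over='ignore', invalid='ignore'):
    print(np.exp(a) / np.sum(np.exp(a)))       # nan (inf/inf) without the trick
```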

Deep Neural Networks

Motivation

Motivation. When one uses linear models, one strongly assumes that the input-output mapping is linear. The model can be made more flexible by applying non-linear feature transformations, but hand-crafting such transformations is very restrictive and suffers from the curse of dimensionality. Fortunately, real data does not occupy all regions of the input space.

Data-dependent basis functions.

Hierarchical Non Linearity.

Multilayer Perceptron Formulation.

\[z_j^{(\ell)}= h(a_j^{(\ell)}) = h\left(\sum_{i=1}^{n_{\ell-1}} w_{ij}^{(\ell)} z_i^{(\ell-1)} + b_j^{(\ell)}\right), \qquad z^{(\ell)} = h^{(\ell)}\left(W^{(\ell)}z^{(\ell-1)} + b^{(\ell)}\right)\]
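A sketch of a two-layer forward pass following this formulation (randomly initialized weights, tanh hidden activation, linear outputs; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(z_prev, W, b, h=np.tanh):
    # z^(l) = h(W^(l) z^(l-1) + b^(l)), applied row-wise to a batch
    return h(z_prev @ W + b)

# Layer sizes: 3 inputs -> 5 hidden units -> 2 outputs
W1, b1 = rng.normal(size=(3, 5)) * 0.5, np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)) * 0.5, np.zeros(2)

x = rng.normal(size=(4, 3))                  # batch of 4 inputs
z1 = layer(x, W1, b1)                        # hidden activations
y = layer(z1, W2, b2, h=lambda a: a)         # linear output layer
print(y.shape)                               # (4, 2)
```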

DNN as universal approximator, no free lunch theorem, and inductive bias

Heteroskedastic and Homoskedastic non linear regression.

Representation, Transfer, Multitask, and Meta Learning.

Contrastive Learning. Learn a representation such that positive input pairs are close and negative input pairs are far apart in embedding space, so the error function is defined w.r.t. pairs of inputs. An example is the InfoNCE loss.

\[E(w) = -\log\frac{\exp(f_w(x)^Tf_w(x^+))}{\exp(f_w(x)^Tf_w(x^+))+\sum_{i=1}^N \exp(f_w(x)^Tf_w(x_i^-))}\]
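A sketch of this loss for a single anchor (the embedding vectors below are placeholders standing in for $f_w(\cdot)$ outputs), using log-sum-exp for numerical stability:

```python
import numpy as np

def info_nce(anchor, positive, negatives):
    # E(w) = -log exp(a.p) / (exp(a.p) + sum_i exp(a.n_i))
    pos_logit = anchor @ positive
    neg_logits = negatives @ anchor
    logits = np.concatenate([[pos_logit], neg_logits])
    m = logits.max()                                   # log-sum-exp trick
    log_denom = m + np.log(np.sum(np.exp(logits - m)))
    return -(pos_logit - log_denom)

rng = np.random.default_rng(0)
d = 8
anchor = rng.normal(size=d)
positive = anchor + 0.1 * rng.normal(size=d)           # close to the anchor
negatives = rng.normal(size=(16, d))                   # unrelated samples
print(info_nce(anchor, positive, negatives))           # small loss: positive is close
```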

Vector Calculus

Univariate, Multivariate, and Vector-Valued Functions.

\[f: \mathbb R^D\to\mathbb R^K\]

An example of a multivariate real-valued function is $f(\mathbf x) = \mathbf x^T\mathbf x$ with $\mathbf x=(x_1, x_2)$, specified as

\[\begin{align*} f: \mathbb R^2\to\mathbb R \\ \mathbf x\to x_1^2+x_2^2 \end{align*}\]

Difference Quotient and Derivative of a Univariate Real-Valued Function. For a univariate real-valued function $f(x)$, the difference quotient is given by

\[\frac{\delta f}{\delta x} = \frac{f(x+\delta x) - f(x)}{\delta x}\]

while the derivative is given by

\[\frac{df}{dx} = \lim_{\delta x\to 0} \frac{f(x+\delta x) - f(x)}{\delta x}\]

If the limit exists, then $f$ is differentiable at $x$, and the derivative is the slope of the tangent to $f$ at $x$. The derivative of $f$ at $x$ points in the direction of steepest ascent of $f$.

Partial Derivatives of a Multivariate Real-Valued Function. For a multivariate real-valued function $f(\mathbf x)$, the partial derivative with respect to $x_i$ is given by

\[\begin{align*} &f: \mathbb R^D\to \mathbb R \\ &\frac{\partial f}{\partial x_1} = \lim_{\delta \to 0} \frac{f(x_1+\delta, x_2, \dots, x_D) - f(\mathbf x)}{\delta} \\ &\frac{\partial f}{\partial x_i} = \lim_{\delta \to 0} \frac{f(x_1, \dots, x_i+\delta, \dots, x_D) - f(\mathbf x)}{\delta} \\ \end{align*}\]

The gradient (which for a real-valued $f$ coincides with the Jacobian) collects all partial derivatives of $f$ into a row vector.

\[\nabla_{\bf x} f = \frac{df}{d\bf x} = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_D}\right]\]

Basic rules of partial differentiation

\[\begin{align*} \frac{\partial}{\partial\mathbf x} (f(x)+g(x)) &= \frac{\partial f}{\partial\mathbf x} + \frac{\partial g}{\partial\mathbf x} \\ \frac{\partial}{\partial\mathbf x} (f(x)g(x)) &= \frac{\partial f}{\partial\mathbf x}g(x) + f(x)\frac{\partial g}{\partial\mathbf x} \\ \frac{\partial}{\partial\mathbf x}(g\circ f)(\mathbf x) = \frac{\partial}{\partial\mathbf x}(g(f(\mathbf x))) &= \frac{\partial g}{\partial f}\frac{\partial f}{\partial\mathbf x} \end{align*}\]

Compact writing of chain rule as matrix multiplication only makes sense if the gradient is defined as a row vector. Otherwise, we will need to introduce transposes, which complicates the operation, especially when we are dealing with tensors of higher dimensions.

Vector-Valued Functions. We generalize to $f:\mathbb R^n\to\mathbb R^m$.

\[\begin{align*} f&: \mathbb R^n\to\mathbb R^m \\ \frac{\partial\mathbf f}{\partial x_i} &= \begin{bmatrix} \frac{\partial f_1}{\partial x_i} \\ \vdots \\ \frac{\partial f_m}{\partial x_i} \end{bmatrix}, \qquad \frac{d\mathbf f}{d\mathbf x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \end{align*}\]

The partial derivative of $\mathbf f$ w.r.t. a single coordinate $x_i$ is a column vector; collecting these columns over all $i$ gives the Jacobian matrix $J_{ij}=\frac{\partial f_i}{\partial x_j}$, whose rows are the gradients of the component functions $f_i$.

Numerator Layout vs Denominator Layout. What we use before is numerator layout where i-th output function $f_i$ corresponds to the i-th row of the jacobian $\frac{d\mathbf f}{d\mathbf x}$ output matrix. Some authors use denominator layout which require transpose operation.

In Summary.

\[\begin{align*} f&:\mathbb R\to\mathbb R, \frac{df}{dx} \in R \\ f&:\mathbb R^n\to\mathbb R, \nabla_x f=\frac{df}{dx} \in R^{1\times n} \\ f&:\mathbb R\to\mathbb R^m, \frac{d\mathbf f}{dx} \in R^{m\times 1} \\ f&:\mathbb R^n\to\mathbb R^m, \nabla_x\mathbf f \in R^{m\times n} \\ \end{align*}\]

In general, for $f: \mathbb R^n\to\mathbb R^m$, the derivative is in the dimension of output times input, $\frac{d\mathbf f}{d\mathbf x} \in R^{m\times n}$.

Gradients of Matrices. Gradient of matrices w.r.t. vectors (or other matrices) are multidimensional tensors. The rule is the same, output dimension times input dimension.

\[\begin{align*} A\in \mathbb R^{m\times n}, B\in \mathbb R^{p\times q} \\ J = \frac{d\mathbf A}{d\mathbf B} \in \mathbb R^{(m\times n)\times(p\times q)}, J_{ijkl} = \frac{\partial A_{ij}}{\partial B_{kl}} \end{align*}\]

Flattening Approach to Gradients of Matrices. Given a matrix $A \in \mathbb{R}^{m \times n}$, flatten it into a vector $\text{vec}(A) \in \mathbb{R}^{mn}$. Compute the gradient $\frac{d\text{vec}(A)}{d\text{vec}(B)}$ where $B \in \mathbb{R}^{p \times q}$, and reshape the resulting matrix back into the tensor shape $(m\times n)\times(p\times q)$.

\[\begin{align*} A\in \mathbb R^{m\times n}, B\in \mathbb R^{p\times q} \\ \text{vec}(A) \in \mathbb R^{mn}, \text{vec}(B) \in \mathbb R^{pq} \\ J = \frac{d\text{vec}(A)}{d\text{vec}(B)} \in \mathbb R^{mn\times pq} \rightarrow_{\text{reshape}} \mathbb R^{(m\times n)\times(p\times q)} \end{align*}\]

Kronecker product. Allows two matrices of arbitrary dimension to be multiplied. Given matrices $A \in R^{M\times N}$, $B \in R^{P\times Q}$, its Kronecker product $A \otimes B$ is a $(MP \times NQ)$-matrix:

\[\begin{align*} A \otimes B = \begin{bmatrix} A_{11}B & \cdots & A_{1N}B \\ \vdots & \ddots & \vdots \\ A_{M1}B & \cdots & A_{MN}B \end{bmatrix} \end{align*}\]

Vector by Matrix Derivative using Vectorization and Kronecker Product.

Kronecker Product for Tensor Derivative.

Forward and Backward Propagation

Forward Autodiff. Good for functions with few inputs and many outputs.

\[\begin{align*} \bar v_k &= \frac{\partial{v_k}}{\partial{x_c}},\text{ for all }k\text{ for some }c \\ v_i \text{ connected to } v_j &\implies \bar{v_i} = \frac{\partial v_i}{\partial x_c} = \frac{\partial v_i}{\partial v_j}\frac{\partial v_j}{\partial x_c} = \frac{\partial v_i}{\partial v_j}\bar v_j \end{align*}\]

Backward Autodiff. Good for functions with many inputs and few outputs.

\[\begin{align*} \bar v_k &= \frac{\partial{f}}{\partial{v_k}},\text{ for all }k \\ v_i \text{ connected to } v_j &\implies \bar{v_i} = \frac{\partial f}{\partial v_i} = \frac{\partial v_j}{\partial v_i} \frac{\partial f}{\partial v_j} = \frac{\partial v_j}{\partial v_i} \bar{v_j} \end{align*}\]
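A tiny worked sketch of reverse-mode accumulation on $f(x_1,x_2)=\sin(x_1 x_2)+x_1^2$ (our own toy graph), with the adjoints $\bar v_k$ propagated backwards and checked against finite differences:

```python
import numpy as np

def f_and_grad(x1, x2):
    # Forward pass: record intermediate values v_k
    v1 = x1 * x2
    v2 = np.sin(v1)
    v3 = x1 ** 2
    f = v2 + v3
    # Backward pass: adjoints v_bar_k = df/dv_k, propagated from the output
    f_bar = 1.0
    v2_bar = f_bar * 1.0
    v3_bar = f_bar * 1.0
    v1_bar = v2_bar * np.cos(v1)             # dv2/dv1 = cos(v1)
    x1_bar = v1_bar * x2 + v3_bar * 2 * x1   # x1 feeds both v1 and v3
    x2_bar = v1_bar * x1
    return f, np.array([x1_bar, x2_bar])

x = np.array([0.7, -1.3])
f, g = f_and_grad(*x)

# Finite-difference check of the gradient
eps = 1e-6
num = [(f_and_grad(*(x + eps * e))[0] - f) / eps for e in np.eye(2)]
print(g, num)
```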