Reinforcement Learning Notes from Zero to Hero

Notes taken from the MRL course @ Universitas Indonesia [In progress]

Overview of the Field

Understanding Reinforcement Learning Categories is a Must. Reinforcement Learning (RL) spans a wide range of algorithms, each based on different assumptions, optimization principles, and feedback structures. Unlike supervised learning, RL comprises a diverse set of approaches—each requiring different theory, tools, and implementation strategies—which makes the learning curve steep but rewarding.

Picture taken from S. L. Brunton (2022, January 3), Overview of Methods

Model-Based vs Model-Free: Availability of Transitions. A core distinction in RL is whether the transition dynamics—represented as $P(s' \mid s, a)$, the probability of ending up in state $s'$ after taking action $a$ in state $s$—are known. In model-based RL, these dynamics are known or learned and explicitly used to plan optimal behavior. In model-free RL, we assume no access to this underlying probability, which reflects most real-world situations.

Model-Free RL: Learning Without Transition Knowledge. When we don't know $P(s' \mid s, a)$, we resort to model-free methods, which learn from direct interaction. These approaches can be further divided into:

  1. Gradient-Free Methods, where updates are not derived from explicit gradients.
    • Off-Policy methods (e.g., Q-learning, DQN) learn an optimal policy (action determiner) from data generated by a different behavior policy, allowing better exploration.
    • On-Policy methods (e.g., SARSA, TD learning) learn from and about the same policy, often with more stable but sample-inefficient updates.
  2. Gradient-Based Methods, such as Policy Gradient approaches, optimize the expected cumulative reward via gradient ascent in the parameter space of a policy network (a minimal sketch follows this list). \(\theta_{new} = \theta_{old} + \alpha\nabla_\theta\mathbb E[R_{\Sigma, \theta}]\)
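To make the policy-gradient update concrete, here is a minimal sketch (not from the course) of gradient ascent on a softmax policy for a hypothetical 2-armed bandit; the arm means, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])   # hypothetical expected reward of each arm
theta = np.zeros(2)                 # softmax policy parameters
alpha = 0.05                        # step size (assumed)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(5000):
    pi = softmax(theta)                   # pi(a | theta)
    a = rng.choice(2, p=pi)               # sample an action from the current policy
    r = rng.normal(true_means[a], 1.0)    # observe a noisy reward
    grad_log_pi = -pi                     # gradient of log pi(a | theta) for a softmax
    grad_log_pi[a] += 1.0
    theta += alpha * r * grad_log_pi      # REINFORCE-style ascent on E[R]

print(softmax(theta))  # probability mass should concentrate on the better arm
```

This is the simplest Monte-Carlo estimate of $\nabla_\theta\mathbb E[R]$; practical policy-gradient methods add baselines, batching, and neural policies.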

Deep RL: A Revolution in Expressiveness. Deep Reinforcement Learning integrates deep neural networks with RL algorithms to handle high-dimensional state and action spaces. Deep RL enabled breakthroughs in games (e.g., Atari, Go), robotics, and complex control systems. It also blurs traditional boundaries: DQNs bring deep learning to value-based methods, while Actor-Critic methods combine policy gradients with value estimation. This integration brings power, but also introduces stability and sample-efficiency challenges.

Problem Formulation: Markov Decision Process

This section is a personal revision summarizing a week of lectures in the Machine Reinforcement Learning course at Universitas Indonesia by Vektor Dewanto.

Picture taken from Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

Formal Definition of MDP. The Markov Decision Process (MDP) is a foundational framework used across disciplines—from AI and control theory to economics and operations research. An MDP is defined as the tuple $M = (\mathcal S, \mathcal A, \mathcal T , \dot p, p, r)$, where $\mathcal S$ is the state space, $\mathcal A$ is the action space, $\mathcal T$ is the set of decision epochs (time steps), $\dot p$ is the initial state distribution, $p$ is the state-transition probability function, and $r$ is the reward function.

Remember that $s^0$ represents state 0, and $s_1^0$ represents state 0 at time $t=1$.

Let's Assume Stationary MDPs for Simplicity. In this setting, the state and action sets remain unchanged over time, and our goal is to find a stationary optimal policy. Formally, this means that the state space, action space, transition probabilities, and reward function are independent of time:

\[\mathcal S_t = \mathcal S,\quad \mathcal A_t = \mathcal A,\quad p_t=p,\quad r_t=r\]

This assumption can be pictured in chess: the set of available actions for any given board configuration never changes, which simplifies modeling. In contrast, non-stationary MDPs allow these components to vary over time, which reflects many real-world problems.

Assuming Infinite Horizon and Discrete Time. In addition to stationarity, we often assume an infinite time horizon (i.e., the agent interacts with the environment indefinitely) and discrete time steps. Finite-horizon MDPs are also studied, especially in planning problems, but they often lead to time-dependent policies. We also assume finite state and action spaces, which enables representation in a computer's finite memory and guarantees convergence.

Markovian: The Markov Assumption on Time Dependency. A core assumption of MDPs is the Markov property, which states that the future state depends only on the current state and action, not on the entire history. This simplifies the transition function to:

\[\Pr(S_{t+1}\mid s_t, a_t)\]

rather than

\[\Pr(S_{t+1}\mid s_0, a_0, s_1, a_1, \dots, s_t, a_t)\]

This is also often called the first-order Markov assumption. While real-world scenarios often violate this assumption, we either redefine the state to be more informative or use memory-based models (e.g., RNNs in Deep RL) when necessary.

MDP Visual Interpretation. A helpful intuition for this is shown in the diagram below. The agent transitions between various states like “Wake Up”, “Netflix”, “Code & Debug”, and “Deploy”, receiving different rewards depending on its action.

This environment is Markovian: the next state and reward depend only on the current state and action—not on how the agent reached that state. For example, regardless of whether the agent arrived at “Code & Debug” from “Nap” or “Wake Up”, choosing “Sleep” will always lead to “Sleep” with a reward of 3.

Policy: The Solution to the MDP. Policy $\pi:\mathcal S\times \mathcal A\to[0, 1]$ defines the agent’s behavior by specifying the probability of choosing each action in each state.

\[\pi(a\mid s)=\Pr(A_t=a\mid S_t=s)\]

There are two main classes of policies:

  1. $\prod_{SR}:$ the set of Stationary Randomized (stochastic) policies.
  2. $\prod_{SD}:$ the set of Stationary Deterministic policies, in which one action per state is always chosen with probability 1.

A deterministic policy $\pi_{SD}\in\prod_{SD}$, with $\pi_{SD}:\mathcal S\to\mathcal A$, maps each state to a single action. It can be viewed as a special case of a stochastic policy.

\[\pi(a\mid s) = \begin{cases} 1\quad\text{if }a=\pi_{SD}(s) \\ 0\quad\text{otherwise} \end{cases}\]

This shows that $\prod_{SD}\subset\prod_{SR}$.

Matrix Representation of a Policy. A policy can also be represented as a matrix, where each row corresponds to a state and each column corresponds to an action: \(\pi = \begin{bmatrix} \pi(a^0\mid s^0) & \pi(a^1\mid s^0) & \dots & \pi(a^{\vert\mathcal A\vert-1}\mid s^0) \\ \pi(a^0\mid s^1) & \pi(a^1\mid s^1) & \dots & \pi(a^{\vert\mathcal A\vert-1}\mid s^1) \\ \vdots & \vdots & \ddots & \vdots \\ \pi(a^0\mid s^{\vert\mathcal S\vert-1}) & \pi(a^1\mid s^{\vert\mathcal S\vert-1}) & \dots & \pi(a^{\vert\mathcal A\vert-1}\mid s^{\vert\mathcal S\vert-1}) \\ \end{bmatrix} \in R^{\mathcal \vert S\vert \times\mathcal \vert A\vert}\)

We see that the $i$-th row, $\pi(\cdot\mid s^i)$, represents the action distribution at state $s^i$; the greedy action is the one with maximum probability in that row. Please also note that each row must sum to one: $\sum_a \pi(a\mid s)=1$.
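As a quick illustration (a sketch with made-up numbers, not from the lecture), a stochastic policy over $\vert\mathcal S\vert=3$ states and $\vert\mathcal A\vert=2$ actions stored as a matrix, with the row-sum check and the greedy action per state:

```python
import numpy as np

# Hypothetical policy matrix: rows = states, columns = actions.
pi = np.array([
    [0.9, 0.1],   # pi(. | s^0)
    [0.5, 0.5],   # pi(. | s^1)
    [0.2, 0.8],   # pi(. | s^2)
])

assert np.allclose(pi.sum(axis=1), 1.0)   # each row is a distribution over actions
greedy_actions = pi.argmax(axis=1)        # most probable action in each state
print(greedy_actions)                     # -> [0 0 1] (the tie in s^1 breaks toward a^0)
```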

Markov Chain: Unique Action MDP. Every MDP with a fixed policy $\pi$ induces a Markov Chain (MC). A Markov Chain can be seen as an MDP where each state has a unique action. Because $p(s'\mid s, a)$ is given and we fix $\pi$, we can derive $P_\pi(s'\mid s):\mathcal S\times\mathcal S\to[0, 1]$, called the stochastic transition matrix or one-step state-transition matrix, denoted as a matrix $P\in R^{\vert\mathcal S\vert\times\vert\mathcal S\vert}$. Formally,

\[\begin{align} P_\pi(s_{t+1}\mid s_t) &= \mathbb E_{a\sim \pi}[ p(s_{t+1}\mid s_t, a) ] \\ P_\pi(s_{t+1}\mid s_t) &= \sum_{a\in\mathcal A} \pi(a\mid s_t)\cdot p(s_{t+1}\mid s_t, a) \\ P_{\pi, ij} = P_{ij} &= P_\pi(s^j\mid s^i)\quad\text{(Matrix Form)} \end{align}\]

$P_{ij}$ describes the probability of ending at $s^j$ given current state $s^i$.

$P$ is stochastic, meaning that

  1. Probability entries: $0\le P_{ij}\le 1$
  2. $\sum_j P_{ij} = 1$: we will always end up in one of the states $s^j$ given current state $s^i$ (a numerical sketch follows below).
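A small numerical sketch (made-up transition and policy numbers) of how $P_\pi$ falls out of $p(s'\mid s, a)$ and $\pi(a\mid s)$:

```python
import numpy as np

# Hypothetical MDP with |S| = 2 states and |A| = 2 actions; p[s, a, s'] = p(s' | s, a).
p = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from s^0 under a^0, a^1
    [[0.5, 0.5], [0.3, 0.7]],   # transitions from s^1 under a^0, a^1
])
pi = np.array([
    [0.6, 0.4],                 # pi(. | s^0)
    [0.1, 0.9],                 # pi(. | s^1)
])

# P_pi[s, s'] = sum_a pi(a | s) * p(s' | s, a)
P = np.einsum("sa,sax->sx", pi, p)
print(P)                                   # -> [[0.52, 0.48], [0.32, 0.68]]
assert np.all((0.0 <= P) & (P <= 1.0))     # entries are probabilities
assert np.allclose(P.sum(axis=1), 1.0)     # each row sums to one (stochastic)
```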

K-th Step Transitions. Given a Markov Chain (i.e., a fixed policy), the distribution over states after $k$ steps can be computed from

\[P^k = P\times P\times \dots\times P\quad(k\text{-times})\]

$P^k_{ij}$ now represents the probability of starting (at $t=0$) at $s^i$ and ending up in $s^j$ after $k$ steps.
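Continuing the sketch above (the $P_\pi$ values are the assumed ones from the previous snippet), the $k$-step matrix is just a repeated matrix product:

```python
import numpy as np

P = np.array([[0.52, 0.48],
              [0.32, 0.68]])         # P_pi from the previous sketch

P_k = np.linalg.matrix_power(P, 10)  # P^k for k = 10
print(P_k)                           # P_k[i, j] = Pr(S_k = s^j | S_0 = s^i)
```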

State Classification in Markov Chains. Knowing the class of a Markov Chain lets us guarantee the convergence of an algorithm. In Markov Chains (MC), states can be classified into two main categories:

  1. Recurrent: the probability of eventually returning to $s$ when starting from $s$ is 1. Such a state will be visited infinitely many times.
  2. Transient: there is a non-zero probability of never returning to $s$. Such a state may eventually be left forever.

Markov Chain Classification. Several theorems rely on the class of the Markov Chain at hand.

  1. Recurrent Markov Chain: All states are recurrent, meaning that from any state it is possible to eventually return to that state. There are no transient states in a recurrent Markov Chain.

  2. Unichain Markov Chain: Contains a single recurrent class and possibly some transient states. A recurrent class is a set of states in which each state can be reached from any other state in the class, and once entered, the process cannot leave the class. Transient states are those that, once left, cannot be returned to.

  3. Multichain Markov Chain: Contains more than one recurrent class, and possibly some transient states that do not belong to any recurrent class.

Theorem. In a finite MDP, the Markov Chain induced by any policy has at least one recurrent class. We classify MDPs accordingly: an MDP is recurrent (respectively, unichain) if the Markov Chain induced by every stationary deterministic policy is recurrent (respectively, unichain), and multichain otherwise.

Markov Chain Limiting State-Transition Matrix P. Denote by $P_\pi^\star$ the limiting state-transition matrix, i.e., the transition matrix after convergence over time.

\[P_\pi^\star = \lim_{k\to\infty} P^k\]

Since computers cannot compute indefinitely until reaching infinity, we need practical methods to approximate $P_\pi^\star$. Here are two common approaches:

  1. Iterative method: enumerate $k$ and stop after a predefined number of iterations or once a convergence tolerance is met, $\Vert P^k-P^{k-1}\Vert_c < \epsilon$, for a preferred matrix norm $c$ and real threshold $\epsilon$ (sketched in code below, after the closed form).
  2. Closed-form solution,

\(\begin{align} P_\pi^\star P &= P_\pi^\star \\ P_\pi^\star P - P_\pi^\star &= 0 \\ P_\pi^\star(P-I) &= 0 \\ \left(P_\pi^\star(P-I)\right)^T &= 0 \\ (P-I)^T(P_\pi^\star)^T &= 0 \end{align}\) with the constraint that $P_\pi^\star$ is a stochastic matrix, i.e.

\[\sum_j P_{\pi, ij}^\star = 1\quad\forall i\]
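A sketch of the iterative method on the assumed $P_\pi$ from the earlier snippets, using the Frobenius norm and a small tolerance (both arbitrary choices):

```python
import numpy as np

P = np.array([[0.52, 0.48],
              [0.32, 0.68]])            # P_pi (assumed, from the earlier sketch)

P_star = P.copy()
eps = 1e-10                             # convergence tolerance (assumed)
for _ in range(10_000):
    P_next = P_star @ P                 # one more application of P
    if np.linalg.norm(P_next - P_star) < eps:   # Frobenius norm as the matrix norm c
        break
    P_star = P_next

print(P_star)   # every row converges to the limiting distribution p_pi^star
```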

Theorem. In Unichain Markov Chains, all rows of the limiting state transition matrix $P_\pi^\star$ are identical

\[P_\pi^\star = \begin{bmatrix} (p_\pi^\star)^T \\ (p_\pi^\star)^T \\ \vdots \\ (p_\pi^\star)^T \end{bmatrix}\]

Then the matrix $P_\pi^\star$ can be represented by a single vector $p_\pi^\star$ that satisfies the linear system of equations

\[(P-I)^Tp_\pi^\star = \vec 0\]

Because $(P-I)^T$ is singular, the homogeneous system alone does not pin down $p_\pi^\star$; together with the normalization constraint we have one more equation than unknowns (an overdetermined but consistent system). The trick, sketched in code after the steps below, is to

  1. Replace the last row of $(P-I)^T$ with $[1\;1\;\dots\;1]$ and store it in matrix $A$.
  2. Replace the corresponding entry on the right-hand side (which was originally a zero vector) with 1.
  3. Solve the modified system to get the $p_\pi^\star$:
\[Ap_\pi^\star = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}\]
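The same $p_\pi^\star$ obtained with the row-replacement trick (a sketch reusing the assumed $P_\pi$ from before):

```python
import numpy as np

P = np.array([[0.52, 0.48],
              [0.32, 0.68]])    # P_pi (assumed)

n = P.shape[0]
A = (P - np.eye(n)).T           # (P - I)^T
A[-1, :] = 1.0                  # replace the last row with [1 1 ... 1]
b = np.zeros(n)
b[-1] = 1.0                     # corresponding right-hand-side entry becomes 1

p_star = np.linalg.solve(A, b)  # the stationary distribution p_pi^star
print(p_star)                   # -> [0.4, 0.6], matching the rows of the iterative P_star
```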

Optimality Criteria and Optimal Policies: the solution concept for a Markov Reward Process (MRP). An MRP is a Markov Chain in which, in addition to transitioning between states according to a probability distribution, each state has an associated reward. We measure policies with respect to some optimality criterion, denoted $v(\pi, s)$. We parameterize by the starting state $s$ since the starting state clearly affects the value; a more general form is easily derived by marginalizing over the initial state distribution, $\mathbb E_{s\sim\dot p}\,v(\pi, s)$. We then define the optimal policy as the one that maximizes the value function $v$.

\[\pi^\star = \arg\max_{\pi\in\prod_{SR}} v(\pi, s),\quad \forall s\in\mathcal S\]

Theorem. In a finite MDP, an optimal deterministic policy $\pi\in\prod_{SD}$ exists. This means we can restrict the search to the finite set $\prod_{SD}$, so a brute-force solution exists with time complexity $O(\vert\mathcal A\vert^{\vert\mathcal S\vert})$ (recall that a deterministic policy is a function from states to actions, so we enumerate every possible action choice in every state).
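Since $\prod_{SD}$ is finite, brute force simply enumerates all $\vert\mathcal A\vert^{\vert\mathcal S\vert}$ deterministic policies. A sketch of the enumeration (state/action counts are made up, and the evaluation of each candidate is deferred to the value functions introduced next):

```python
from itertools import product

n_states, n_actions = 3, 2   # assumed small sizes for illustration

# Every deterministic policy is a mapping state -> action, i.e. a tuple of length |S|.
all_policies = list(product(range(n_actions), repeat=n_states))
print(len(all_policies))     # |A| ** |S| = 2 ** 3 = 8 candidate policies

# Brute force would evaluate v(pi, s) for each candidate and keep the best one.
```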

Kinds of value function $v$. Note that all of these value functions are functions of the policy and the state. Given a fixed policy $\pi$, $v$ is in effect a vector of size $\vert\mathcal S\vert$. Which policy is optimal depends on which value function (optimality criterion) is used.

\[v_{total}(\pi, s) = \mathbb E\left[\sum_{t=0}^\infty r(S_t, A_t) \mid \pi, S_0=s \right]\]

Under this criterion, $v_{total}(\pi, s)$ typically diverges as $t\to\infty$, making it unusable for infinite-horizon MDPs.

\[v_g(\pi, s) = \lim_{t_{max}\to\infty} \mathbb E\left[ \frac{1}{t_{max}}\sum_{t=0}^{t_{max}} r(S_t, A_t) \mid \pi, S_0=s \right]\]

This average reward can be calculated in closed form from the limiting (converged) transition matrix together with the state rewards. The discounted criterion, $v_\gamma(\pi, s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t\, r(S_t, A_t)\mid \pi, S_0=s\right]$ with $\gamma\in[0,1)$, likewise has a closed form:

\[\vec v_g(\pi) = P_\pi^\star r_\pi\] \[v_\gamma(\pi) = (I-\gamma P_\pi)^{-1} r_\pi\]

The discount factor $\gamma$ ensures convergence by weighting future rewards less as they are further into the future.
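A sketch of both closed forms on a small MRP, with an assumed reward vector $\vec r_\pi$ and the $P_\pi$, $P_\pi^\star$ values from the earlier snippets:

```python
import numpy as np

P = np.array([[0.52, 0.48],
              [0.32, 0.68]])          # P_pi (assumed)
P_star = np.array([[0.4, 0.6],
                   [0.4, 0.6]])       # limiting matrix P_pi^star (identical rows: unichain)
r = np.array([1.0, 0.0])              # assumed reward vector r_pi

v_g = P_star @ r                      # average reward: identical entries for a unichain
gamma = 0.9
v_gamma = np.linalg.solve(np.eye(2) - gamma * P, r)   # discounted value (I - gamma*P)^-1 r

print(v_g)       # -> [0.4, 0.4]
print(v_gamma)
```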

\[v_b(\pi, s) = \lim_{T\to\infty} \mathbb E_{a\sim\pi}\left[ \sum_{t=1}^T \big(r(S_t, A_t)-v_g(\pi)\big)\mid\pi, S_0=s \right]\]

The summation accumulates the difference between the immediate reward $r$ and the long-run average reward $v_g(\pi)$, so $v_b$ can be interpreted as a measure of bias. It is derived from the Bellman Average Reward Evaluation Equation (BEE), explained in the following paragraphs.

Theorem. In a unichain Markov Chain, the average reward is the same for every starting state.

\(v_g(s^0) = v_g(s^1) = \dots = v_g(s^{\vert\mathcal S\vert-1})\) This means that $v_g(s)$ does not depend on the initial state (or on $\dot p$), so we only need to compute $v_g(s^0)$ once and then define $\vec v_g = v_g(s^0) \cdot \vec 1$.

Forms of the Reward Function. Note that all of these rewards assume a fixed policy $\pi$, and their vector form follows the usual convention of stacking states along the first dimension.

\[r(s, a, s') = \mathbb E[R\mid s, a, s'] = \sum_{r\in R} r\cdot p(r\mid s, a, s')\] \[r(s, a) = \mathbb E_{s'\sim p(s, a)}[r(s, a, s')] = \sum_{s'\in S} r(s, a, s')\cdot p(s'\mid s, a)\] \[\begin{align*} r(s) &= \mathbb E_{a\sim \pi}[r(s, a)] = \sum_{a\in\mathcal A}r(s, a)\cdot \pi(a\mid s) \\ \vec r_\pi &= \begin{bmatrix} r(s^0) \\ r(s^1) \\ \vdots \\ r(s^{\vert\mathcal S\vert-1}) \\ \end{bmatrix} \end{align*}\]
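A sketch of collapsing $r(s, a, s')$ down to the vector $\vec r_\pi$, reusing the assumed $p$ and $\pi$ tensors from the $P_\pi$ snippet (the reward tensor here is all ones purely to show the plumbing):

```python
import numpy as np

p = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
])                                   # p[s, a, s'] (assumed)
pi = np.array([[0.6, 0.4],
               [0.1, 0.9]])          # pi[s, a] (assumed)
r_sas = np.ones((2, 2, 2))           # assumed r(s, a, s')

r_sa = np.einsum("sax,sax->sa", r_sas, p)   # r(s, a) = sum_s' r(s, a, s') p(s' | s, a)
r_pi = np.einsum("sa,sa->s", r_sa, pi)      # r(s)    = sum_a  r(s, a)    pi(a | s)
print(r_pi)                                 # the vector r_pi, stacked over states
```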

Best Visualization for an MDP. An MDP can intuitively be imagined as a directed acyclic graph in which each edge represents a possible transition.

Picture taken from Martin L. Puterman and Timothy C. Y. Chan. Introduction to Markov Decision Processes. 2023

Bellman Average Reward Evaluation Equation (BEE). A fundamental equation; the idea is to relate the immediate and future rewards in a single equation.

\[\begin{align*} v_b(\pi, s) + v_g(\pi) &= \mathbb E_{a\sim\pi}[r(s, a)+\mathbb E_{s'\sim p(\cdot\mid s, a)} v_b(s')] \\ v_b(\pi, s) + v_g(\pi) &= \mathbb E_{a\sim\pi}[r(s, a)+\sum_{s'\in\mathcal S} p(s'\mid s, a) v_b(s')] \end{align*}\]

However, there are $\vert\mathcal S\vert + 1$ unknowns—$\vert\mathcal S\vert$ values $v_b(\pi, s)$ (one for each state) plus the single unknown $v_g(\pi)$—but only $\vert\mathcal S\vert$ equations. This forms an under-determined system of linear equations, which is resolved in the following paragraphs.

Making the Under-Determined BEE Well-Determined. To solve the under-determined Bellman Average Reward Evaluation Equation (BEE) system of linear equations (SLE), we introduce an additional constraint by setting the bias of a reference state to zero (a numerical sketch follows the steps below).

  1. Choose a reference state, say $s^0$, and set its bias to zero: $v_b(\pi, s^0) = 0$
  2. Rewrite the formula
\[\begin{cases} v_b(\pi, s) + v_g(\pi) &= \mathbb E_{a\sim\pi}[r(s, a)+\sum_{s'\in\mathcal S} p(s'\mid s, a) v_b(s')]\quad s\neq s^0 \\ v_g(\pi) &= \mathbb E_{a\sim\pi}[r(s, a)+\sum_{s'\in\mathcal S} p(s'\mid s, a) v_b(s')]\quad s= s^0 \\ \end{cases} \\ v_b+v_g\vec1 = r_\pi + P_\pi v_b \\ \begin{bmatrix} 0 \\ v_b(\pi, s^1) \\ \vdots \\ v_b(\pi, s^{\vert\mathcal S\vert -1}) \end{bmatrix} + \begin{bmatrix} v_g(\pi) \\ v_g(\pi) \\ \vdots \\ v_g(\pi) \end{bmatrix} = r_\pi + P_\pi \begin{bmatrix} 0 \\ v_b(\pi, s^1) \\ \vdots \\ v_b(\pi, s^{\vert\mathcal S\vert -1}) \end{bmatrix}\]
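A sketch of the reference-state trick as a single linear solve, using the assumed $P_\pi$ and $\vec r_\pi$ from the earlier snippets; the unknown vector is $[v_g, v_b(s^1), \dots]$ once $v_b(s^0)$ is pinned to zero:

```python
import numpy as np

P = np.array([[0.52, 0.48],
              [0.32, 0.68]])     # P_pi (assumed)
r = np.array([1.0, 0.0])         # r_pi (assumed)
n = P.shape[0]

# Each equation: v_g + v_b(s) - sum_{s'} P[s, s'] * v_b(s') = r(s), with v_b(s^0) = 0.
I_minus_P = np.eye(n) - P
M = np.zeros((n, n))
M[:, 0] = 1.0                    # coefficient of v_g in every equation
M[:, 1:] = I_minus_P[:, 1:]      # columns for v_b(s^1), ..., dropping the s^0 column
x = np.linalg.solve(M, r)

v_g = x[0]                                  # -> 0.4, matching p_pi^star . r_pi
v_b = np.concatenate(([0.0], x[1:]))        # prepend the reference state's bias (0)
print(v_g, v_b)
```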

Closed-form Solution for $v_b(\pi)$. Remember that $\vec v_g(\pi)=P_\pi^\star \vec r_\pi$. Using this definition, we can manipulate the equation in matrix form and then solve it directly.

\[\begin{align*} \vec v_b(\pi) + v_g(\pi)\cdot\vec 1 &= \vec r_\pi + P_\pi\vec v_b(\pi) \\ \vec v_b(\pi) - P_\pi\vec v_b(\pi) &= \vec r_\pi - v_g(\pi)\cdot\vec 1 \\ (I-P_\pi)\vec v_b(\pi) &= \vec r_\pi - P_\pi^\star \vec r_\pi \\ \vec v_b(\pi) &= (I-P_\pi)^{-1}(\vec r_\pi - P_\pi^\star \vec r_\pi) \\ \vec v_b(\pi) &= (I-P_\pi)^{-1}\vec r_\pi - (I-P_\pi)^{-1}P_\pi^\star \vec r_\pi \\ \vec v_b(\pi) &= [\,(I-P_\pi)^{-1}- (I-P_\pi)^{-1}P_\pi^\star\,] \vec r_\pi \\ \vec v_b(\pi) &= [\,(I-P_\pi)^{-1}( I - P_\pi^\star\,)] \vec r_\pi \\ \vec v_b(\pi) &= \left( \underbrace{(I - P_\pi + P_\pi^\star)^{-1} - P_\pi^\star}_{H_\pi} \right) \vec r_\pi \\ \vec v_b(\pi) &= H_\pi \vec r_\pi \\ \end{align*}\]

Note that $H_\pi$ is no longer a stochastic matrix. The last expression is derived using the special property that $P_\pi^\star$ is a fixed point, which implies $(I−A)^{−1}(I−B)=(I−A+B)^{−1}−B$ here. Trust me, you don't want to see the proof; it is really specific. I accept and trust this result from our ancestors completely, without question. Ask GPT in case you need it; it is capable of deriving this.

Bellman Optimality Equation (BOE). A fundamental concept: it describes the relationship between the value of a state under an optimal policy and the expected return from that state. The optimality condition ensures the equation holds only for the optimal policy, denoted $\pi^\star$. It is written as

\[v_b(\pi^\star, s) + v_g(\pi^\star) = \max_{a\in\mathcal A}\left[r(s, a) + \sum_{s'\in\mathcal S} p(s'\mid s, a)v_b(\pi^\star, s') \right]\]

To derive the optimal policy $\pi^\star$, we first obtain the optimal $v_b^\star$ and then act greedily with respect to the right-hand side (RHS) of the BOE. Although there may be multiple optimal policies $\pi^\star$, there exists only a single optimal $v_g^\star$ and a single optimal $v_b^\star$.

BOE vs BEE.

Model Based RL

Policy Iteration Algorithm, Exact Average Reward.

  1. Initialize the iteration index and a random starting policy: $k=0,\ \pi^0$
  2. Exact Policy Evaluation: obtain $v_b(\pi^k)$ through the closed form or by solving the BEE SLE.
  3. Exact Policy Improvement: update $\pi^k$ to be greedy with respect to $v_b$. See the equation below (and the code sketch after this list).
  4. Check for convergence: if the current $v_b$ is the same as before, stop; otherwise increment $k$ and repeat from step 2.
\[\pi^{k+1}(s) = \arg\max_{a\in\mathcal A}\left(r(s, a) + \sum_{s'\in\mathcal S} p(s'\mid s, a)\,v_b(\pi^k, s') \right)\]
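A compact sketch of the whole loop on a made-up 2-state, 2-action MDP (all numbers assumed): evaluation reuses the BEE reference-state solve from earlier, and improvement acts greedily on the RHS of the BOE.

```python
import numpy as np

# Assumed toy MDP: p[s, a, s'] and r[s, a].
p = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
])
r = np.array([[1.0, 0.0],
              [0.5, 2.0]])
n_s, n_a = r.shape

def evaluate(policy):
    """Exact average-reward evaluation of a deterministic policy via the BEE + reference state."""
    P = p[np.arange(n_s), policy]            # P_pi[s, s'] = p(s' | s, policy(s))
    r_pi = r[np.arange(n_s), policy]         # r_pi[s] = r(s, policy(s))
    M = np.zeros((n_s, n_s))
    M[:, 0] = 1.0                            # column for v_g
    M[:, 1:] = (np.eye(n_s) - P)[:, 1:]      # columns for v_b(s^1), ..., with v_b(s^0) = 0
    x = np.linalg.solve(M, r_pi)
    return x[0], np.concatenate(([0.0], x[1:]))   # v_g, v_b

policy = np.zeros(n_s, dtype=int)            # arbitrary initial deterministic policy
v_g, v_b = evaluate(policy)
for _ in range(100):
    q = r + np.einsum("sax,x->sa", p, v_b)   # RHS of the BOE for every (s, a)
    new_policy = q.argmax(axis=1)            # greedy improvement
    new_v_g, new_v_b = evaluate(new_policy)
    if np.isclose(new_v_g, v_g) and np.allclose(new_v_b, v_b):   # value-based stopping rule
        break
    policy, v_g, v_b = new_policy, new_v_g, new_v_b

print(policy, v_g, v_b)
```

The stopping rule compares values rather than policies, which also sidesteps cycling between multiple optimal policies (see the considerations below).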

Policy Iteration Implementation Considerations.

  1. Handling multiple optimal policies: the algorithm should be modified to handle multiple optimal policies. If several optimal policies exist, the algorithm may switch back and forth between them indefinitely. Since there is only a single optimal $v_b^\star$ and $v_g^\star$, we can use convergence of the values as the stopping criterion instead.
  2. Synchronous updates: updating $\pi^{k+1}$ for all states at once is called a synchronous update. In RL, asynchronous updates are often used for large state spaces.
  3. Stable policy: stability is achieved when the policy no longer changes in any state. If the action changes in one or more states, the policy is still evolving and not yet stable.

#TODO: Still long journey