Notes taken from the MRL course @ Universitas Indonesia [in progress]
Understanding Reinforcement Learning Categories is a Must. Reinforcement Learning (RL) spans a wide range of algorithms, each based on different assumptions, optimization principles, and feedback structures. Unlike in supervised learning, the diversity of RL approaches—each requiring different theory, tools, and implementation strategies—makes the learning curve steep but rewarding.
Model-Based vs Model-Free: Availability of Transitions. A core distinction in RL is whether the transition dynamics—represented as $P(s' \mid s, a)$, the probability of ending up in state $s'$ after taking action $a$ in state $s$—are known. In model-based RL, these dynamics are known or learned and explicitly used to plan optimal behavior. In model-free RL, we assume no access to this underlying probability, which reflects most real-world situations.
Model-Free RL: Learning Without Transition Knowledge. When we don’t know $P(s’ \mid s, a)$, we resort to model-free methods, which learn from direct interaction. These approaches can be further divided into:
Deep RL: A Revolution in Expressiveness. Deep Reinforcement Learning integrates deep neural networks with RL algorithms to handle high-dimensional state and action spaces. Deep RL enabled breakthroughs in games (e.g., Atari, Go), robotics, and complex control systems. Deep RL also blurs traditional boundaries: DQNs bring deep learning to value-based methods, while Actor-Critic methods combine policy gradients with value estimation. This integration brings power, but also introduces stability and sample-efficiency challenges.
This section is a personal revision of material that summarizes a week of lectures on Machine Reinforcement Learning at the University of Indonesia by Vektor Dewanto.
Formal Definition of MDP. The Markov Decision Process (MDP) is a foundational framework used across disciplines—from AI and control theory to economics and operations research. An MDP is defined as the tuple $M = (\mathcal S, \mathcal A, \mathcal T , \dot p, p, r)$, where:
Remember that $s^0$ represents state 0, and $s_1^0$ represents state 0 at time $t=1$.
Let's Assume Stationary MDPs for Simplicity. In this setting, the state and action sets remain unchanged over time, and our goal is to find a stationary optimal policy. Formally, this means the state space, action space, transition probability, and reward function are independent of time:
\[\mathcal S_t = \mathcal S,\quad \mathcal A_t = \mathcal A,\quad p_t=p,\quad r_t=r\]This assumption can be imagined in chess as follows: the available actions for any board configuration never change, which simplifies modeling. In contrast, non-stationary MDPs allow these components to vary over time, which reflects many real-world problems.
Assuming Infinite Horizon and Discrete Time. In addition to stationarity, we often assume an infinite time horizon (i.e., the agent interacts with the environment indefinitely) and discrete time steps. Finite-horizon MDPs are also studied, especially in planning problems, but often lead to time-dependent policies. We also assume finite state and action spaces, which enables representation in a computer's finite memory and guarantees convergence.
Markovian: The Markov Assumption on Time Dependency. A core assumption of MDPs is the Markov property, which states that the future state depends only on the current state and action, not on the entire history. This simplifies the transition function to:
\[\Pr(S_{t+1}\mid s_t, a_t)\]rather than
\[\Pr(S_{t+1}\mid s_0, a_0, s_1, a_1, \dots, s_t, a_t)\]This is also often called the first-order Markov assumption. Real-world scenarios often violate this assumption; in that case we either redefine the state to be more informative or use memory-based models (e.g., RNNs in Deep RL) when necessary.
MDP Visual Interpretation. A helpful intuition for this is shown in the diagram below. The agent transitions between various states like “Wake Up”, “Netflix”, “Code & Debug”, and “Deploy”, receiving different rewards depending on its action.
This environment is Markovian: the next state and reward depend only on the current state and action—not on how the agent reached that state. For example, regardless of whether the agent arrived at “Code & Debug” from “Nap” or “Wake Up”, the outcome of choosing “Sleep” will always transition to “Sleep” with a reward of 3.
Policy: The Solution to the MDP. A policy $\pi:\mathcal S\times \mathcal A\to[0, 1]$ defines the agent's behavior by specifying the probability of choosing each action in each state.
\[\pi(a\mid s)=\Pr(A_t=a\mid S_t=s)\]There are two main classes of policies:
A deterministic policy $\pi_{SD}\in\prod_{SD}:\mathcal S\to\mathcal A$ maps each state to a single action. It can be viewed as a special case of a stochastic policy.
\[\pi(a\mid s) = \begin{cases} 1\quad\text{if }a=\pi_{SD}(s) \\ 0\quad\text{otherwise} \end{cases}\]This shows that $\prod_{SD}\subset\prod_{SR}$.
Matrix Representation of a Policy. A policy can also be represented as a matrix, where each row corresponds to a state and each column corresponds to an action: \(\pi = \begin{bmatrix} \pi(a^0\mid s^0) & \pi(a^1\mid s^0) & \dots & \pi(a^{\vert\mathcal A\vert}\mid s^0) \\ \pi(a^0\mid s^1) & \pi(a^1\mid s^1) & \dots & \pi(a^{\vert\mathcal A\vert}\mid s^1) \\ \vdots & \vdots & \ddots & \vdots \\ \pi(a^0\mid s^{\vert\mathcal S\vert}) & \pi(a^1\mid s^{\vert\mathcal S\vert}) & \dots & \pi(a^{\vert\mathcal A\vert}\mid s^{\vert\mathcal S\vert}) \\ \end{bmatrix} \in R^{\vert\mathcal S\vert \times \vert\mathcal A\vert}\)
We see that the $i$-th row, $\pi(\cdot\mid s^i)$, represents the action distribution at state $s^i$; the greedy (most probable) action is the maximum over this row. Please also note that the sum over actions must be one: $\sum_a \pi(a\mid s)=1$.
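A minimal numpy sketch of this representation, assuming a toy MDP with three states and two actions (all numbers here are made up):

```python
import numpy as np

# Hypothetical stochastic policy for |S| = 3 states and |A| = 2 actions.
# Row i holds the action distribution pi(. | s^i); each row must sum to 1.
pi = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

assert np.allclose(pi.sum(axis=1), 1.0)  # every row is a probability distribution

greedy_actions = pi.argmax(axis=1)       # most probable action in each state
print(greedy_actions)                    # [0 0 1] (ties broken toward the lower index)
```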
Markov Chain: Unique Action MDP. Every MDP with a fixed policy $\pi$ induces a Markov Chain (MC). A Markov Chain can be seen as an MDP where each state has a unique action. Because $p(s'\mid s, a)$ is given and we fix $\pi$, we can derive $P_\pi(s'\mid s):\mathcal S\times\mathcal S\to[0, 1]$, called the stochastic transition matrix or one-step state-transition matrix, denoted as the matrix $P\in R^{\vert\mathcal S\vert\times\vert\mathcal S\vert}$. Formally,
\[\begin{align} P_\pi(s_{t+1}\mid s_t) &= \mathbb E_{a\sim \pi}[ p(s_{t+1}\mid s_t, a) ] \\ P_\pi(s_{t+1}\mid s_t) &= \sum_{a\in\mathcal A} \pi(a\mid s_t)\cdot p(s_{t+1}\mid s_t, a) \\ P_{\pi, ij} = P_{ij} &= P_\pi(s^j\mid s^i)\quad\text{(Matrix Form)} \end{align}\]$P_{ij}$ describes the probability of ending at $s^j$ given current state $s^i$.
$P$ is stochastic, meaning that each of its rows is a probability distribution: $\sum_{j} P_{ij} = 1$ for every state $s^i$.
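A sketch of this marginalization in numpy, assuming the dynamics are stored as a hypothetical tensor `p[s, a, s_next]` and the policy as a matrix like the one above (toy numbers):

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical dynamics tensor: p[s, a, s'] = p(s' | s, a); each (s, a) slice sums to 1.
rng = np.random.default_rng(0)
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)

# Hypothetical policy matrix: pi[s, a] = pi(a | s).
pi = np.array([[0.9, 0.1],
               [0.5, 0.5],
               [0.2, 0.8]])

# P_pi(s' | s) = sum_a pi(a | s) * p(s' | s, a)
P = np.einsum('sa,sax->sx', pi, p)

assert np.allclose(P.sum(axis=1), 1.0)  # P is row-stochastic, as stated above
```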
K-th Step Transitions. Given a Markov Chain (fixed policy), the probability of being in each state after $k$ steps can be computed as
\[P^k = P\times P\times \dots\times P\quad(k\text{-times})\]$P^k_{ij}$ now represents the probability of starting (at $t=0$) in $s^i$ and ending up in $s^j$ after $k$ steps.
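For instance, with numpy (any row-stochastic `P`, such as the one derived above, would do; the matrix below is made up):

```python
import numpy as np

# Toy row-stochastic one-step transition matrix.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])

P10 = np.linalg.matrix_power(P, 10)  # P^k for k = 10
# P10[i, j] is the probability of being in s^j after 10 steps when starting in s^i.
print(P10.round(3))
```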
State Classification in Markov Chain. Knowing the class of a Markov Chain lets us guarantee the convergence of certain algorithms. In Markov Chains (MC), states can be classified into two main categories:
Markov Chains Classification. Several theorems later on depend on which class a Markov Chain belongs to.
Recurrent Markov Chain: All states are recurrent, meaning that from any state it is possible to return to that state eventually. There are no transient states in a recurrent Markov Chain.
Unichain Markov Chain: Contains a single recurrent class and possibly some transient states. A recurrent class is a set of states where each state can be reached from any other state within the class, and once entered, the process cannot leave the class. Transient states are those that, once left, cannot be returned to.
Multichain Markov Chain: Contains more than one recurrent class, and possibly transient states that do not belong to any recurrent class. A numerical sketch for checking these classes is shown after this list.
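As a rough numerical sketch (not from the lecture): compute pairwise reachability from the one-step matrix, then call a state recurrent if every state it can reach can also reach it back; the resulting recurrent/transient split reveals which class the chain falls into.

```python
import numpy as np

def recurrent_states(P, tol=1e-12):
    """Boolean mask of recurrent states for a row-stochastic matrix P."""
    n = P.shape[0]
    reach = np.eye(n, dtype=bool) | (P > tol)        # one-step reachability (plus self)
    for k in range(n):                               # Floyd-Warshall-style transitive closure
        reach |= np.outer(reach[:, k], reach[k, :]).astype(bool)
    # s^i is recurrent iff every state reachable from s^i can reach s^i back.
    return np.array([all(reach[j, i] for j in range(n) if reach[i, j])
                     for i in range(n)])

# Toy unichain example: s^0 is transient, {s^1, s^2} is the single recurrent class.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7],
              [0.0, 0.6, 0.4]])
print(recurrent_states(P))  # [False  True  True]
```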
Theorem. In a finite MDP, the Markov Chain induced by any policy contains at least one recurrent class. We classify MDPs as follows:
Markov Chains Limiting State Transition Matrix $P$. Denote by $P_\pi^\star$ the limiting state transition matrix, i.e., the transition matrix after it has converged over time.
\[P_\pi^\star = \lim_{k\to\infty} P^k\]Since computers cannot compute indefinitely until reaching infinity, we need practical methods to approximate $P_\pi^\star$. Here are two common approaches:
\(\begin{align} P_\pi^\star P &= P_\pi^\star \\ P_\pi^\star P - P_\pi^\star &= 0 \\ P_\pi^\star(P-I) &= 0 \\ \left(P_\pi^\star(P-I)\right)^T &= 0 \\ (P-I)^T(P_\pi^\star)^T &= 0 \end{align}\) with the constraint that $P_\pi^\star$ is a stochastic matrix, i.e.
\[\sum_{j=1}^{\vert\mathcal S\vert} P_{\pi, ij}^\star = 1\quad\forall i\]Theorem. In a unichain Markov Chain, all rows of the limiting state transition matrix $P_\pi^\star$ are identical:
\[P_\pi^\star = \begin{bmatrix} (p_\pi^\star)^T \\ (p_\pi^\star)^T \\ \vdots \\ (p_\pi^\star)^T \end{bmatrix}\]Then the matrix $P_\pi^\star$ can be represented by a single vector $p_\pi^\star$ that satisfies the linear system of equations
\[(P-I)^Tp_\pi^\star = \vec 0\]Because this homogeneous system alone only determines $p_\pi^\star$ up to scale, the trick is to append the normalization constraint $\sum_i p_{\pi, i}^\star = 1$ and solve the resulting overdetermined (but consistent) system, e.g., by least squares.
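A numpy sketch of that trick (assuming a unichain, so the solution is unique; the chain below is made up): stack the normalization row under $(P-I)^T$ and solve the stacked system with least squares.

```python
import numpy as np

def limiting_distribution(P):
    """Solve (P - I)^T p = 0 together with sum(p) = 1 (unichain assumption)."""
    n = P.shape[0]
    A = np.vstack([(P - np.eye(n)).T,   # |S| homogeneous equations
                   np.ones((1, n))])    # +1 normalization equation
    b = np.zeros(n + 1)
    b[-1] = 1.0
    p_star, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p_star

# Toy unichain transition matrix (same one as in the classification sketch).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7],
              [0.0, 0.6, 0.4]])
p_star = limiting_distribution(P)
print(p_star.round(4))                  # every row of P* equals this vector
print(np.allclose(p_star @ P, p_star))  # True: p* is a fixed point of P
```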
Optimality Criteria and Optimal Policies as the Solution of a Markov Reward Process (MRP). An MRP is a Markov Chain where, in addition to transitioning between states based on a probability distribution, each state also has an associated reward. We need to measure a policy with respect to some optimality criterion, denoted $v(\pi, s)$. We parameterize by the starting state $s$ since the starting state will surely affect the quality; a more general form can easily be derived by marginalizing over the initial state distribution, $\mathbb E_{s\sim\dot p}[v(\pi, s)]$. We then define the optimal policy as the one that maximizes the value function $v$.
\[\pi^\star = \arg\max_{\pi\in\prod_{SR}} v(\pi, s),\quad \forall s\in\mathcal S\]Theorem. In a finite MDP, an optimal deterministic policy $\pi\in\prod_{SD}$ exists. This means we can restrict the search to the $\prod_{SD}$ domain, which is finite, hence a brute-force solution exists with time complexity $O(\vert\mathcal A\vert^{\vert\mathcal S\vert})$ (remember that a deterministic policy is a function from states to actions, so we enumerate every possible action choice for every state).
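To make that counting concrete, a tiny sketch (sizes are made up): every deterministic policy is one action choice per state, so the candidates form a Cartesian product of size $\vert\mathcal A\vert^{\vert\mathcal S\vert}$.

```python
from itertools import product

n_states, n_actions = 3, 2

# One deterministic policy = one tuple (action for s^0, action for s^1, action for s^2).
all_policies = list(product(range(n_actions), repeat=n_states))
print(len(all_policies))   # |A|^|S| = 2^3 = 8 candidates
print(all_policies[:3])    # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
# A brute-force solver would evaluate v(pi, s) for each tuple and keep the best one.
```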
Kinds of value function $v$. Note that all these value functions are functions of the policy and the state. Given a fixed policy $\pi$, $v$ is actually a vector of size $\vert\mathcal S\vert$. Which policy is optimal depends on which value function (optimality criterion) is used.
Using this criterion, as $t\to\infty$, $v_{total}(\pi, s)\to\infty$, which makes it unusable in infinite-horizon MDPs.
This average reward can be calculated using the closed-form formula below, obtained from the converged transition probability together with the state rewards.
\[\vec v_g(\pi) = P_\pi^\star r_\pi\]The discount factor $\gamma$ ensures convergence by weighting future rewards less as they are further into the future.
The summation term represents the immediate reward $r$, while $v_g(\pi)$ represents the long-term reward; their difference can therefore be interpreted as a measure of bias. This is derived from the Bellman Average Reward Evaluation Equation (BEE), which will be explained in the following paragraphs.
Theorem. In a unichain Markov Chain, the average reward is the same for every state.
\(v_g(s^0) = v_g(s^1) = \dots = v_g(s^{\vert\mathcal S\vert})\) This means that $v_g(s)$ does not depend on the initial state distribution $\dot p$, so we only need to compute $v_g(s^0)$ once and then define $\vec v_g = v_g(s^0) \cdot \vec 1$.
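A quick numpy sanity check of this (toy chain and made-up rewards; $P_\pi^\star$ is approximated here simply by raising $P_\pi$ to a large power):

```python
import numpy as np

P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.3, 0.7],
                 [0.0, 0.6, 0.4]])  # toy chain induced by some fixed policy
r_pi = np.array([1.0, 0.0, 2.0])    # made-up expected reward for each state

P_star = np.linalg.matrix_power(P_pi, 1000)  # crude approximation of P*_pi
v_g = P_star @ r_pi                          # v_g(pi) = P*_pi r_pi

print(v_g.round(4))  # unichain: every entry is the same scalar gain
```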
Form of the Reward Function. Note that all these reward definitions assume a fixed policy $\pi$, and their form follows the usual convention of stacking states along the first dimension.
Best Visualization for MDP. An MDP can intuitively be imagined as a directed graph in which each edge represents a possible transition.
Bellman Average Reward Evaluation Equation (BEE). A fundamental equation; the idea is to express the immediate and the future reward in a single equation.
\[\begin{align*} v_b(\pi, s) + v_g(\pi) &= \mathbb E_{a\sim\pi}[r(s, a)+\mathbb E_{s'\sim p(\cdot\mid s, a)} v_b(s')] \\ v_b(\pi, s) + v_g(\pi) &= \mathbb E_{a\sim\pi}[r(s, a)+\sum_{s'\in\mathcal S} p(s'\mid s, a) v_b(s')] \end{align*}\]However, there are $\vert\mathcal S\vert + 1$ unknowns and only $\vert\mathcal S\vert$ equations (the $\vert\mathcal S\vert$ biases $v_b(\pi, s)$, one for each state, plus the single unknown $v_g(\pi)$). This forms an under-determined system of linear equations, which will be solved in the following paragraphs.
Making the Under-Determined BEE Well-Determined. To solve the under-determined Bellman Average Reward Evaluation Equation (BEE) system of linear equations (SLE), we introduce an additional constraint by setting the bias of a reference state to zero.
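A numpy sketch of this reference-state trick (toy chain and rewards as in the earlier sketches; here $v_b(s^0)$ is pinned to zero): the unknowns are the $\vert\mathcal S\vert$ bias values plus the single gain.

```python
import numpy as np

P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.3, 0.7],
                 [0.0, 0.6, 0.4]])  # toy chain induced by a fixed policy
r_pi = np.array([1.0, 0.0, 2.0])    # made-up expected reward for each state
n = P_pi.shape[0]
ref = 0                             # reference state whose bias is pinned to 0

# Unknown vector x = [v_b(s^0), ..., v_b(s^(n-1)), v_g].
# BEE rows:        (I - P_pi) v_b + v_g * 1 = r_pi
# Constraint row:  v_b(s_ref) = 0
A = np.zeros((n + 1, n + 1))
A[:n, :n] = np.eye(n) - P_pi
A[:n, n] = 1.0
A[n, ref] = 1.0
b = np.append(r_pi, 0.0)

x = np.linalg.lstsq(A, b, rcond=None)[0]  # least squares also covers the square case
v_b, v_g = x[:n], x[n]
print(round(v_g, 4), v_b.round(4))
```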
Closed-form Solution for $v_b(\pi)$. Remember that $\vec v_g(\pi)=P_\pi^\star \vec r_\pi$. Using this definition, we can manipulate the BEE in matrix form and then solve it directly.
\[\begin{align*} \vec v_b(\pi) + v_g(\pi)\cdot\vec 1 &= \vec r_\pi + P_\pi\vec v_b(\pi) \\ \vec v_b(\pi) - P_\pi\vec v_b(\pi) &= \vec r_\pi - v_g(\pi)\cdot\vec 1 \\ (I-P_\pi)\vec v_b(\pi) &= \vec r_\pi - P_\pi^\star \vec r_\pi \\ \vec v_b(\pi) &= (I-P_\pi)^{-1}(\vec r_\pi - P_\pi^\star \vec r_\pi) \\ \vec v_b(\pi) &= (I-P_\pi)^{-1}\vec r_\pi - (I-P_\pi)^{-1}P_\pi^\star \vec r_\pi \\ \vec v_b(\pi) &= [\,(I-P_\pi)^{-1}- (I-P_\pi)^{-1}P_\pi^\star\,] \vec r_\pi \\ \vec v_b(\pi) &= [\,(I-P_\pi)^{-1}( I - P_\pi^\star\,)] \vec r_\pi \\ \vec v_b(\pi) &= \left( \underbrace{(I - P_\pi + P_\pi^\star)^{-1} - P_\pi^\star}_{H_\pi} \right) \vec r_\pi \\ \vec v_b(\pi) &= H_\pi \vec r_\pi \\ \end{align*}\]Note that $H_\pi$ is not a stochastic matrix anymore. Strictly speaking, $(I-P_\pi)$ is singular (since $P_\pi\vec 1 = \vec 1$), so the intermediate inverses above are only formal; the final expression with $H_\pi$ is the one that is well-defined. It is derived using the special case that $P_\pi^\star$ is a fixed point, which implies $(I-A)^{-1}(I-B)=(I-A+B)^{-1}-B$. Trust me, you don't want to see the proof, it's really specific. I accept and trust this work from our ancestors completely without question. Ask GPT in case you need it; it's capable of deriving this.
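A numpy sketch of this closed form (same toy chain as before; $P_\pi^\star$ approximated by a large matrix power), with a check that the resulting bias actually satisfies the BEE:

```python
import numpy as np

P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.3, 0.7],
                 [0.0, 0.6, 0.4]])           # toy chain induced by a fixed policy
r_pi = np.array([1.0, 0.0, 2.0])             # made-up expected reward for each state
I = np.eye(P_pi.shape[0])

P_star = np.linalg.matrix_power(P_pi, 1000)  # crude approximation of P*_pi
H_pi = np.linalg.inv(I - P_pi + P_star) - P_star
v_b = H_pi @ r_pi                            # bias vector
v_g = P_star @ r_pi                          # gain vector

# Sanity check: the BEE in matrix form, v_b + v_g = r_pi + P_pi v_b.
print(np.allclose(v_b + v_g, r_pi + P_pi @ v_b))  # True (up to the P* approximation)
```

Note that this bias is normalized differently from the reference-state sketch above (here $P_\pi^\star \vec v_b = \vec 0$ instead of $v_b(s^0)=0$); the two solutions differ only by a constant shift.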
Bellman Optimality Equation (BOE). A fundamental concept: it describes the relationship between the value of a state under an optimal policy and the expected return from that state. The optimality condition ensures the equation holds only for the optimal policy, denoted $\pi^\star$. It is written as
\[v_b(\pi^\star, s) + v_g(\pi^\star) = \max_{a\in\mathcal A}\left[r(s, a) + \sum_{s'\in\mathcal S} p(s'\mid s, a)v_b(\pi^\star, s') \right]\]To derive the optimal policy $\pi^\star$, we first obtain the optimal $v_b^\star$ and then act greedily with respect to the right-hand side (RHS) of the BOE, as sketched below. Although there may be multiple optimal policies $\pi^\star$, there exists only a single optimal $v_g^\star$ and a single optimal $v_b^\star$.
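A sketch of that greedy step, assuming the full dynamics tensor `p[s, a, s']`, a reward table `r[s, a]`, and some optimal bias vector `v_b_star` are already available (all names and numbers here are hypothetical):

```python
import numpy as np

def greedy_policy(p, r, v_b_star):
    """Pick argmax_a [ r(s, a) + sum_s' p(s' | s, a) v_b*(s') ] in every state (RHS of the BOE)."""
    q = r + np.einsum('sax,x->sa', p, v_b_star)  # bracketed term for every (s, a) pair
    return q.argmax(axis=1)                      # deterministic policy: one action per state

# Toy example: 3 states, 2 actions, random (but valid) dynamics.
rng = np.random.default_rng(1)
p = rng.random((3, 2, 3))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((3, 2))
v_b_star = rng.random(3)              # stand-in for the optimal bias obtained elsewhere
print(greedy_policy(p, r, v_b_star))  # one greedy action index per state
```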
BOE vs BEE.
Policy Iteration Algorithm, Exact Average Reward.
Policy Iteration Implementation Considerations.
#TODO: Still long journey