World Models - From Dyna to Foundation Simulators


More than three decades of teaching machines to imagine. This post traces the intellectual lineage of world models: from Sutton’s 1990 Dyna planning loop, through latent-dynamics RL, to billion-parameter video simulators and LeCun’s architecture for machine common sense.


The throughline

A world model is a learned internal simulator: given the current state and a proposed action, it predicts what the world will look like next. An agent with a good world model can plan without acting, rehearse without risk, and generalize beyond its training distribution.

The core idea is almost embarrassingly simple. Humans do it constantly: you do not need to drop your phone to predict it will fall. You run the simulation internally. The question is how to give that capability to machines, and how compressed, accurate, and fast that internal simulator needs to be.

The field has answered this question in four distinct waves over thirty years:

  1. Tabular planning (1990s): small state spaces, exact transitions, classical MBRL
  2. Latent dynamics (2018-2022): learn a compressed representation, plan and imagine there
  3. Foundation simulators (2022-2024): scale to pixels and video; one model, many environments
  4. Predictive architectures (2022-present): predict in representation space, not pixel space; LeCun’s JEPA program

Each wave built on the previous one’s failures. Understanding why each generation broke down is at least as important as understanding how the next generation fixed it.

[Figure: timeline of the four waves. Wave 1, tabular (1990-2017): Dyna, Dyna-2, linear models; breaks at scale and pixels. Wave 2, latent dynamics (2018-2022): Ha and Schmidhuber, PlaNet, Dreamer V1/V2/V3; breaks at compounding error. Wave 3, foundation (2022-2024): GAIA-1, Genie, Cosmos, GameNGen, DIAMOND; breaks at pixel-prediction cost. Wave 4, predictive (2022-present): I-JEPA, V-JEPA, AMI / LeWorldModel; ongoing. Common thread across all waves: compress observations, predict futures in compressed space, plan or train with imagined rollouts.]
Four waves of world model research and the failure mode that motivated each transition.

Act I - Roots (1990-2017)

Sutton’s Dyna, 1990

The intellectual seed of every modern world model is Richard S. Sutton’s Dyna architecture, introduced in 1990 and summarized in Dyna, an Integrated Architecture for Learning, Planning, and Reacting (SIGART Bulletin, 1991).

The insight is elegant. An agent’s interaction with the environment generates data, and that data can fit two things at once: a value function or policy (standard RL) and an environment model that predicts the next state and reward from the current state and action. Once you have the model, you can generate synthetic transitions and update the value function on them, getting more updates per real interaction. Sutton called the model-generated updates planning and the real-environment updates direct RL.

The Dyna update rule per real step $(s, a, r, s')$:

\[Q(s, a) \;\leftarrow\; Q(s, a) + \alpha\Big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\Big]\]

Then for $k$ planning steps, sample a previously visited $(s, a)$ from memory, query the model for $(\hat{r}, \hat{s}')$, and apply the same update. Each real interaction now drives $k + 1$ updates, so when the model is accurate the Q-table can converge in far fewer real steps.
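The loop above can be sketched in plain Python. This is a minimal illustration that assumes a deterministic environment, so the model can be a lookup table. The `env_step` interface, the five-state `chain_step` corridor, and all hyperparameter values are hypothetical choices for the example, not details from Sutton’s paper.

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, episodes=50, k=10,
           alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q. env_step(s, a) -> (reward, next_state, done) is an assumed interface."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(s, a)], defaults to 0
    model = {}               # model[(s, a)] = (r, s_next, done), learned from real steps

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def update(s, a, r, s_next, done):
        # Same Q-learning update for real and simulated transitions.
        target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s_next, done = env_step(s, a)
            update(s, a, r, s_next, done)        # direct RL on the real transition
            model[(s, a)] = (r, s_next, done)    # model learning
            for _ in range(k):                   # planning: k simulated updates
                (ps, pa), (pr, pn, pd) = rng.choice(list(model.items()))
                update(ps, pa, pr, pn, pd)
            s = s_next
    return Q

def chain_step(s, a):
    """Hypothetical corridor with states 0..4; reaching state 4 pays reward 1 and ends the episode."""
    s_next = min(max(s + a, 0), 4)
    done = s_next == 4
    return (1.0 if done else 0.0), s_next, done

Q = dyna_q(chain_step, start_state=0, actions=[-1, 1])
```

Setting `k=0` recovers plain Q-learning, which makes it easy to compare how many real steps each variant needs on the same task.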

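[Figure: the Dyna cycle. The real environment returns (s, a, r, s') to the agent, which updates Q and the world model; the model supplies simulated transitions for k planning updates of Q; the agent acts with a.]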
The Dyna cycle: real interaction updates both the policy and the model; the model then generates k synthetic transitions for additional policy updates at no real cost.

Through the 1990s and 2000s, Dyna variants proliferated: Dyna-2 (Silver et al.), prioritized sweeping, trajectory sampling. All used tabular or linear approximations. They worked in small, discrete state spaces and fell apart the moment states became continuous or observations were images.

The deep learning revolution changed the premise. If you could learn a model from raw pixels using a neural network, you could extend Dyna to arbitrary observation spaces. That is exactly what the next decade explored.


Act II - The 2018 Manifesto

Ha & Schmidhuber: World Models (NeurIPS 2018 workshop)

David Ha and Jürgen Schmidhuber’s paper World Models (arXiv 2018, presented at the NeurIPS Creativity workshop) is the founding document of modern world model research. It is simultaneously a research paper and a manifesto, and it remains the best introduction to the central idea.

Their architecture has three components, called V, M, and C:

V (Vision): Variational Autoencoder. A VAE compresses each frame $x_t \in \mathbb{R}^{64 \times 64 \times 3}$ into a latent vector $z_t \in \mathbb{R}^{32}$.

M (Memory): MDN-RNN. A mixture-density network built on an LSTM takes $(z_t, a_t)$ and predicts the distribution of the next latent $z_{t+1}$. Specifically it outputs the parameters of a mixture of Gaussians:

\[p(z_{t+1} \mid z_t, a_t, h_t) \;=\; \sum_{k=1}^{K} \pi_k \; \mathcal{N}(z_{t+1};\; \mu_k,\; \sigma_k^2)\]

where $h_t$ is the LSTM hidden state. The stochastic sampling means the model does not collapse to the mean prediction; it can represent multi-modal futures.
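To make the mixture density concrete, here is a small sketch for a single scalar latent dimension. The mixture parameters are hard-coded for illustration; in the actual architecture they would be produced by the LSTM from $(z_t, a_t, h_t)$. The function names and the two-mode example are assumptions for this sketch.

```python
import math
import random

def mdn_log_prob(z_next, pi, mu, sigma):
    """log p(z_next) under a 1-D Gaussian mixture with weights pi, means mu, std devs sigma."""
    log_terms = [
        math.log(p) - 0.5 * math.log(2 * math.pi * s * s) - 0.5 * ((z_next - m) / s) ** 2
        for p, m, s in zip(pi, mu, sigma)
    ]
    top = max(log_terms)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in log_terms))

def mdn_sample(pi, mu, sigma, rng):
    """Pick a mixture component by its weight, then sample from that Gaussian."""
    k = rng.choices(range(len(pi)), weights=pi)[0]
    return rng.gauss(mu[k], sigma[k])

# Two equally likely futures, far apart: the mean (0.0) is itself very unlikely.
pi, mu, sigma = [0.5, 0.5], [-2.0, 2.0], [0.3, 0.3]
rng = random.Random(0)
samples = [mdn_sample(pi, mu, sigma, rng) for _ in range(1000)]
```

Training would minimize the negative of `mdn_log_prob` on observed next latents. Sampling lands near one of the two modes instead of at their average, which is the multi-modality the text describes.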

C (Controller): Linear policy. A single linear layer maps $(z_t, h_t)$ to action $a_t$. It has very few parameters by design. The complexity sits in V and M, making C fast to optimize with CMA-ES.

[Figure: V-M-C pipeline. Frame x_t -> V (VAE encoder, 64x64x3 -> 32) -> z_t; M (MDN-RNN): (z_t, a_t, h_t) -> p(z_{t+1}), with hidden state h_t; C (linear layer): (z_t, h_t) -> a_t.]
The V-M-C architecture. V compresses frames to z; M predicts future z from past z, action, and hidden state; C maps (z, h) to actions. Most capacity lives in V and M, keeping C tiny enough for CMA-ES.

The most provocative result: the agent can be trained entirely inside the dream. The M model generates imagined trajectories; C is optimized on them; the resulting policy transfers back to the real environment. In the VizDoom Take Cover task, a controller trained 100% in the dream transferred to the real environment and scored well above the task’s solve threshold.

“We can imagine ourselves performing some task that hasn’t yet happened, and evaluate the probable outcomes.”

Limitations that the next generation addressed:


Act III - Structured Latent Dynamics (2019-2021)

PlaNet: Learning Latent Dynamics for Planning from Pixels (ICML 2019)

Hafner, Lillicrap, Fischer, Villegas, Ha, Lee, and Davidson at Google Brain introduced PlaNet (deep planning network). The key insight: if you are going to plan, the world model should be explicitly designed for planning, not just for prediction.

PlaNet’s central contribution is the Recurrent State-Space Model (RSSM), which separates state into two complementary pathways:

\[h_t = f_\theta(h_{t-1},\, z_{t-1},\, a_{t-1}) \qquad \text{(deterministic GRU)}\]

\[z_t \sim \begin{cases} q_\phi(z_t \mid h_t, o_t) & \text{posterior (with observation)} \\ p_\psi(z_t \mid h_t) & \text{prior (planning, no observation)} \end{cases}\]

The deterministic $h_t$ provides long-range memory; the stochastic $z_t$ lets the model represent multiple possible futures. At planning time you use only the prior, rolling the model forward without touching the real environment.

[Figure: RSSM unrolled over three steps. Deterministic path (memory): h_{t-1} -> h_t -> h_{t+1} through the GRU, driven by a_{t-1} and a_t. Stochastic path (multiple futures): z_{t-1}, z_t, z_{t+1}, with o_t feeding the posterior for z_t.]
The RSSM. The GRU (blue) carries deterministic memory; the stochastic state z (green) is sampled from the posterior during training and from the prior during planning. The dashed arrow from o_t only exists during training.

Planning uses CEM (Cross-Entropy Method): sample $J$ action sequences, evaluate under the world model, keep the top $K$, refit the proposal. PlaNet achieves 50x better sample efficiency than model-free methods on six DMControl tasks.
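The CEM loop is short enough to sketch directly. In this illustration the learned world model is replaced by a hypothetical `toy_score` function (a 1-D point that should reach position 3), and the population size, elite count, and iteration count are arbitrary assumptions, not PlaNet’s settings.

```python
import random
import statistics

def cem_plan(score, horizon, iters=10, pop=200, top_k=20, seed=0):
    """Cross-Entropy Method over action sequences of length `horizon`.

    score(actions) -> float stands in for the return predicted by rolling
    the world model forward under the prior (an assumed interface).
    """
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        # Sample candidate action sequences from the current proposal.
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        # Keep the top_k sequences by predicted return.
        elites = sorted(candidates, key=score, reverse=True)[:top_k]
        # Refit a diagonal Gaussian proposal to the elites.
        mean = [statistics.fmean([e[t] for e in elites]) for t in range(horizon)]
        std = [statistics.pstdev([e[t] for e in elites]) + 1e-6 for t in range(horizon)]
    return mean

def toy_score(actions, target=3.0):
    """Toy dynamics: position is the running sum of clipped actions; each step costs the distance to target."""
    pos, total = 0.0, 0.0
    for a in actions:
        pos += max(-1.0, min(1.0, a))
        total -= abs(pos - target)
    return total

plan = cem_plan(toy_score, horizon=5)
```

In a receding-horizon setup only the first action of `plan` would be executed before replanning from the next state.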

Dreamer: Dream to Control (ICLR 2020)

Hafner, Lillicrap, Ba, and Norouzi replaced the CEM planner with a learned actor-critic in imagination. The world model is trained by maximizing the ELBO (evidence lower bound) over sequences:

\[\mathcal{L}_\text{WM} = \mathbb{E}_{q}\!\left[\sum_t \underbrace{\log p(o_t \mid z_t, h_t)}_\text{reconstruction} + \underbrace{\log p(r_t \mid z_t, h_t)}_\text{reward pred.} - \underbrace{\beta\,\text{KL}\!\left[q(z_t \mid h_t, o_t) \;\|\; p(z_t \mid h_t)\right]}_\text{regularization}\right]\]

The actor $\pi_\phi$ and critic $V_\psi$ are then trained purely on imagined rollouts from the RSSM using $\lambda$-returns:

\[V^\lambda_t = r_t + \gamma\Big[(1-\lambda) V_\psi(z_{t+1}, h_{t+1}) + \lambda V^\lambda_{t+1}\Big]\]
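The recursion is easiest to compute backward from the end of the imagined rollout. A minimal sketch, assuming the rollout is given as plain lists and that the final $\lambda$-return is bootstrapped from the critic; the function name and argument layout are illustrative.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Backward recursion for V^lambda_t.

    rewards[t] is r_t for t = 0..H-1. values[t] is the critic estimate at
    step t for t = 0..H; the extra last entry bootstraps the recursion.
    """
    horizon = len(rewards)
    assert len(values) == horizon + 1
    returns = [0.0] * horizon
    next_return = values[horizon]  # assumed boundary condition: V^lambda_H = V(z_H, h_H)
    for t in reversed(range(horizon)):
        next_return = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns
```

With `lam=0` this reduces to the one-step TD target $r_t + \gamma V_\psi(z_{t+1}, h_{t+1})$; with `lam=1` it becomes the discounted sum of imagined rewards plus a bootstrap at the horizon.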

This is a clean separation: real data shapes the model; imagined data shapes the policy. Dreamer established state-of-the-art on DMControl from pixels using 20x fewer environment interactions than D4PG.

DreamerV2: Mastering Atari with Discrete World Models (ICLR 2021)

Hafner, Lillicrap, Norouzi, and Ba replaced Gaussian latents with categorical representations: each $z_t$ is a vector of one-hot variables drawn from learned categorical distributions. The gradient flows through the Straight-Through (ST) estimator:

\[z_t \;\sim\; \text{Cat}\!\left(\text{softmax}(f_\phi(h_t, o_t))\right), \qquad \nabla_\phi \approx \nabla_\phi \hat{z}_t \big|_{\hat{z}_t \;\leftarrow\; z_t}\]

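where $\hat{z}_t = z_t + \text{softmax}(f_\phi(h_t, o_t)) - \text{sg}\big(\text{softmax}(f_\phi(h_t, o_t))\big)$ and $\text{sg}(\cdot)$ is stop-gradient: the forward pass uses the one-hot sample $z_t$, while the backward pass uses the gradient of the softmax probabilities. Categorical latents are sharper than Gaussians: each slot encodes a discrete, interpretable feature rather than a blurry mean.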

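DreamerV2 is the first agent to reach human-level performance on the 55-game Atari benchmark by learning behaviors purely inside a separately trained world model. At 200M frames it surpasses the final scores of Rainbow and IQN under the same compute budget.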


Act IV - Scaling and Generalization (2022-2023)

DreamerV3: Mastering Diverse Domains (2023)

Hafner, Pasukonis, Ba, and Lillicrap pushed for a single agent that can handle any task without hyperparameter tuning. DreamerV3 introduces symlog predictions to handle rewards of vastly different scales:

\[\text{symlog}(x) = \text{sign}(x)\,\ln\!\left(|x| + 1\right), \qquad \text{symexp}(x) = \text{sign}(x)\!\left(e^{|x|} - 1\right)\]

Targets are transformed with symlog before computing the loss; predictions are exponentiated back with symexp. A reward of $10^6$ and a reward of $0.01$ now live in the same comfortable range.
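Both transforms are one-liners. The sketch below uses `math.log1p` and `math.expm1` for accuracy near zero; the function names mirror the equation, and the sample values are only there to show the compression.

```python
import math

def symlog(x):
    """sign(x) * ln(|x| + 1): compresses large magnitudes, near-identity around zero."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)

# Rewards spanning eight orders of magnitude end up within a narrow band.
compressed = [symlog(r) for r in (-1e6, -0.01, 0.0, 0.01, 1e6)]
```

The network regresses onto `symlog(target)` and its output is passed through `symexp` whenever a value in the original units is needed.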

Additional improvements include KL balancing, free bits to prevent posterior collapse, Block GRU with RMSNorm and SiLU activations, and the LaProp optimizer.

The headline result: DreamerV3 collects diamonds in Minecraft from scratch, without human demonstrations or curricula, the first algorithm to do so. Diamonds require 20+ sequential crafting and mining steps spanning minutes of real time.

IRIS: Transformers are Sample-Efficient World Models (ICLR 2023, notable top 5%)

Micheli, Alonso, and Fleuret at Geneva replaced the RSSM with a transformer world model. IRIS tokenizes each frame with a VQ-VAE into $L$ discrete tokens, then predicts the next frame’s tokens, reward, and done-flag autoregressively:

\[\mathcal{L}_\text{WM} = -\sum_{t} \log p_\theta\!\left(v_{t+1}^{(1:L)},\, r_{t+1},\, d_{t+1} \;\Big|\; v_{\le t},\, a_{\le t}\right)\]

where $v_t^{(1:L)}$ are the $L$ VQ-VAE tokens for frame $t$. This is exactly a language model loss over a (frame, action) vocabulary. The transformer’s attention can, in principle, look back at any earlier token directly instead of squeezing history through a fixed-size recurrent state, although autoregressive rollouts still accumulate error.

On Atari 100k, IRIS achieves a mean human-normalized score of 1.046, outperforming humans on 10 of 26 games.

TD-MPC2: Scalable, Robust World Models for Continuous Control (ICLR 2024)

Hansen, Su, and Wang at UC San Diego learned a task-oriented latent dynamics model and terminal value function jointly via temporal-difference learning. The world model is not required to reconstruct observations; it only predicts latent features that matter for value estimation. A single 317M-parameter agent generalizes across 80 tasks, multiple embodiments, and action spaces.

| Method | Single-task score | Multi-task (80 tasks) |
|---|---|---|
| SAC (model-free) | competitive | collapses |
| DreamerV3 | competitive | limited |
| TD-MPC2 (317M) | competitive | state of the art |

Act V - Foundation World Models (2023-2024)

The RL-focused papers above all treat the world model as an internal component of an agent. A parallel thread asks: what if the world model is a standalone foundation model, pre-trained on massive video datasets, and then adapted to downstream tasks?

GAIA-1: A Generative World Model for Autonomous Driving (Wayve, 2023)

Wayve’s GAIA-1 (arXiv 2309.17080) is a 9-billion-parameter autoregressive video model trained on driving data. It factorizes future video autoregressively over discrete video tokens, conditioned on text $c$, past video, and ego-vehicle actions:

\[p(v_{1:T} \mid c,\, a_{1:T}) \;=\; \prod_{t=1}^T \prod_{\ell=1}^{L} p_\theta\!\left(v_t^{(\ell)} \;\Big|\; v_t^{(1:\ell-1)},\, v_{<t},\, a_{<t},\, c\right)\]

GAIA-1 supports three conditioning modalities: video history, text descriptions, and ego-vehicle actions (steer, accelerate, brake). Scaling to 9B parameters showed that larger models produce more temporally consistent and physically plausible futures, echoing LLM scaling laws.

UniSim: Learning Interactive Real-World Simulators (Google, 2023)

Google’s UniSim (arXiv 2310.06114) trained a 5.6B-parameter generative model to simulate real-world interactions across diverse domains: robot manipulation, navigation, human-object interaction. Given a starting frame and a high-level instruction $c$ (“open the drawer”), UniSim generates video of the resulting interaction by conditioning a video diffusion model:

\[p_\theta(x_{1:T} \mid x_0, c) \;=\; \prod_{t=1}^T p_\theta(x_t \mid x_{t-1},\, c)\]

The goal is zero-shot sim-to-real transfer: train policies entirely inside the simulator, then deploy in the real world without fine-tuning.

DIAMOND: Diffusion for World Modeling (NeurIPS 2024 Spotlight)

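Alonso, Jelley, Micheli, Kanervisto, Storkey, Pearce, and Fleuret replaced the discrete-token world model with a diffusion process. DIAMOND trains a denoising diffusion model to generate the next frame. In simplified noise-prediction form (the paper itself builds on the EDM formulation):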

\[\mathcal{L}_\text{DIAMOND} \;=\; \mathbb{E}_{k,\,\epsilon}\Big[\big\|\epsilon - \epsilon_\theta\!\left(x_{t+1}^{(k)},\, k,\, x_{t-K:t},\, a_t\right)\big\|^2\Big]\]

where $x_{t+1}^{(k)} = \sqrt{\bar\alpha_k}\,x_{t+1} + \sqrt{1-\bar\alpha_k}\,\epsilon$ is the noisy target frame at diffusion step $k$, and $\epsilon_\theta$ is the denoising network conditioned on the past $K$ frames and current action.
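The noising step and the loss can be illustrated without a neural network. In this sketch a frame is a flat list of floats, and the denoiser is replaced by an oracle that inverts the noising formula, purely to check the algebra. The helper names and the value of $\bar\alpha_k$ are assumptions for the example.

```python
import math
import random

def noisy_target(x_next, alpha_bar_k, rng):
    """Return (x^(k), eps) with x^(k) = sqrt(abar_k) * x + sqrt(1 - abar_k) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x_next]
    x_k = [math.sqrt(alpha_bar_k) * x + math.sqrt(1.0 - alpha_bar_k) * e
           for x, e in zip(x_next, eps)]
    return x_k, eps

def denoising_loss(eps, eps_pred):
    """Squared error between true and predicted noise, summed over pixels."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
frame = [0.2, -0.5, 1.0]                      # stand-in for the next frame x_{t+1}
x_k, eps = noisy_target(frame, 0.5, rng)
# Oracle "network": recover the noise exactly from the clean frame.
oracle = [(xk - math.sqrt(0.5) * x) / math.sqrt(0.5) for xk, x in zip(x_k, frame)]
loss = denoising_loss(eps, oracle)            # numerically zero for the oracle
```

A real denoiser sees only `x_k`, the step `k`, the past frames, and the action, so its loss is positive and training pushes it toward the oracle’s answer.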

Compressing into a small codebook (as IRIS does) discards visual information that helps the agent distinguish subtly different states. Diffusion preserves this detail at the cost of slower generation.

On Atari 100k, DIAMOND achieves a mean human-normalized score of 1.46, the new state of the art for agents trained entirely within a world model.

GameNGen: Diffusion Models are Real-Time Game Engines (Google, 2024)

Valevski et al. (arXiv 2408.14837) fine-tuned Stable Diffusion to simulate DOOM in real time at 20 FPS on a single TPU. Each frame is generated by conditioning on past frames and the current action:

\[x_{t+1} \;\sim\; p_\theta\!\left(\cdot \;\Big|\; x_{t-K:t},\, a_t\right), \qquad \text{(diffusion ancestral sampling)}\]

Human raters shown short clips were only slightly better than random chance at telling GameNGen from real DOOM. The model handles turning, strafing, enemy AI, damage, and door mechanics, all emergently from next-frame prediction.


Act VI - Generative and Interactive Worlds

Genie: Generative Interactive Environments (ICML 2024, Oral)

Bruce, Dennis, Edwards, and colleagues at Google DeepMind (arXiv 2402.15391) trained Genie on 200,000 hours of internet gaming videos with no action labels. The 11B-parameter model has three components:

  1. ST Tokenizer: spatiotemporal video tokens
  2. Latent Action Model: unsupervised; clusters visual transitions into actions
  3. Dynamics Model: autoregressive over frame tokens plus the latent action
Genie's three components. The latent action model infers actions from video transitions with no labels; the dynamics model generates the next frame given past tokens and the inferred action.

The key result: give Genie any image (photograph, sketch, synthetic render) and it generates a frame-by-frame controllable interactive world, where the action space was never labeled during training.

Genie 2 (December 2024) extended this to 3D. Genie 3 (August 2025, public January 2026), integrated into Waymo’s “Waymo World Model,” generates extreme long-tail driving scenarios (tornados, floods, unusual obstacles) from text prompts.

Sora as a World Model (OpenAI, February 2024)

OpenAI’s Sora technical report positioned the model as a world simulator. Sora is a Diffusion Transformer (DiT): patches of video are treated as tokens, and a transformer denoises them jointly:

\[\epsilon \approx \epsilon_\theta\!\left(x_t,\, t,\, c\right), \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

where $x_t$ is the noisy video, $t$ is the diffusion step, and $c$ is the text condition. The transformer operates on flattened spacetime patches, so it can model long-range temporal dependencies within a clip.

Whether Sora is “really” a world model is contested: it cannot be queried for actions, exposes no state space, and has no planning interface. But it generates 3D-consistent, causally plausible video up to one minute long. OpenAI published few architecture details beyond this high-level description.


Act VII - JEPA and LeCun’s Thesis

The Position Paper (LeCun, 2022)

In June 2022, Yann LeCun published “A Path Towards Autonomous Machine Intelligence.” His central critique: predicting pixels/tokens forces the model to expend capacity on irrelevant low-level details. What matters for reasoning is high-level abstract state, not pixel values.

His proposed alternative: predict an abstract embedding of the future from an abstract embedding of the present. The JEPA objective (simplified) is:

\[\min_{\theta, \phi} \; \mathbb{E}\Big[ \big\| s_y - \hat{s}_y \big\|^2 \Big], \qquad s_y = \text{enc}_\phi(y), \quad \hat{s}_y = \text{pred}_\theta(s_x, z)\]

where $x$ is the context with embedding $s_x = \text{enc}_\phi(x)$, $y$ is the target, $z$ is a latent variable carrying unpredictable information, and in this simplified form $\phi$ is shared between the context and target encoders. A collapse-prevention mechanism (VICReg, masking, or EMA targets) ensures the encoder does not learn trivial constant representations.

The key contrast with generative models:

|   | Generative (Dreamer, GAIA) | JEPA |
|---|---|---|
| Prediction target | $x_{t+1}$ (pixels/tokens) | $\text{enc}(x_{t+1})$ (representation) |
| Decoder required | yes | no |
| Capacity spent on | every pixel | only informative features |
| Collapse risk | low (reconstruction forces signal) | must prevent explicitly |

I-JEPA: Image-based JEPA (CVPR 2023)

Assran, Duval, Misra, Bojanowski, Vincent, Rabbat, LeCun, and Ballas at Meta introduced I-JEPA. A large fraction of image patches is masked; the predictor reconstructs their embeddings from the unmasked context. The target encoder is an exponential moving average (EMA) of the online encoder, providing stable targets:

\[\phi_\text{target} \;\leftarrow\; m\,\phi_\text{target} + (1-m)\,\phi_\text{online}, \qquad m \in [0.996, 1.0)\]
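The EMA rule is applied parameter by parameter after each optimizer step. A minimal sketch, assuming parameters are stored as a name-to-float dictionary; a real implementation would loop over tensors the same way. The single weight `w` is a toy stand-in.

```python
def ema_update(target, online, m=0.996):
    """In place: target <- m * target + (1 - m) * online, for every parameter."""
    for name, value in online.items():
        target[name] = m * target[name] + (1.0 - m) * value
    return target

# Toy check: the target drifts slowly toward a fixed online value.
target = {"w": 0.0}
online = {"w": 1.0}
for _ in range(1000):
    ema_update(target, online)
```

After $n$ steps toward a constant online value the gap shrinks by a factor $m^n$, so with $m = 0.996$ the target lags by a few hundred steps. That lag is what keeps the regression targets stable.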

I-JEPA matches or outperforms masked autoencoders (MAE) on ImageNet classification while using far fewer training iterations, because it does not waste capacity reconstructing pixels.

V-JEPA: Video-based JEPA (Meta, 2024)

Bardes, Garrido, Ponce, Chen, Rabbat, LeCun, Assran, and Ballas extended I-JEPA to video by masking spatiotemporal tubes and predicting their embeddings. The training loss averages the prediction error over the masked positions:

\[\mathcal{L} = \frac{1}{|M|}\sum_{(i,j,t) \in M} \Big\|\text{pred}_\theta\!\left(\text{enc}_\phi(x_\text{context}),\, \text{pos}(i,j,t)\right) - \bar{s}_{i,j,t}\Big\|^2\]

where $M$ is the set of masked spatiotemporal positions and $\bar{s}_{i,j,t}$ is the EMA-target embedding at that position. V-JEPA trains on 2+ million unlabeled videos and learns physical and temporal regularities without explicit supervision.

V-JEPA 2 (June 2025) combined 1M hours of internet video with robot trajectory data, demonstrating physical reasoning and short-horizon planning for robotic manipulation.

AMI Labs and the “LeWorldModel” (2025-present)

In late 2025, Yann LeCun confirmed the launch of Advanced Machine Intelligence (AMI) Labs, explicitly building on JEPA. The accompanying paper “Value-guided action planning with JEPA world models” (arXiv 2601.00844) augments the JEPA world model with a learned value function for multi-step planning:

\[\pi^*(s_0) = \arg\max_{a_{0:H-1}} \sum_{t=0}^{H-1} \gamma^t V_\psi(\hat{s}_t), \qquad \hat{s}_{t+1} = \text{pred}_\theta(\hat{s}_t, a_t)\]

Planning happens entirely in representation space: roll out the predictor $H$ steps, score with $V_\psi$, select the action sequence that maximizes predicted value. No pixel generation is needed. A 15M-parameter JEPA model trained on a single GPU achieves competitive navigation and manipulation results, helping motivate AMI’s reported valuation of USD 3-5 billion.
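The planning objective can be sketched with toy stand-ins for the learned components. Here the latent state is a single float, `pred` and `value` are hypothetical hand-written functions, and the search is exhaustive over a small discrete action set. The source does not say which optimizer the paper uses, so the exhaustive search is purely an assumption for illustration.

```python
import itertools

def plan_in_latent(s0, pred, value, actions, horizon, gamma=0.99):
    """Score every action sequence by sum_t gamma^t * V(s_hat_t) and return the best one."""
    best_seq, best_score = None, float("-inf")
    for seq in itertools.product(actions, repeat=horizon):
        s, score = s0, 0.0
        for t, a in enumerate(seq):
            score += gamma ** t * value(s)   # score the current predicted state
            s = pred(s, a)                   # roll the predictor forward, no pixels involved
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq, best_score

# Toy latent dynamics: actions shift the state; value peaks at s = 2.
best_seq, best_score = plan_in_latent(
    s0=0.0,
    pred=lambda s, a: s + a,
    value=lambda s: -abs(s - 2.0),
    actions=(-1, 0, 1),
    horizon=3,
)
```

Note that under this objective the last action $a_{H-1}$ only affects $\hat{s}_H$, which is never scored, so ties on the final action are expected.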


The Frontier (2025)

NVIDIA Cosmos (January 2025)

NVIDIA released Cosmos (arXiv 2501.03575) as a family of open-weight World Foundation Models for physical AI: Cosmos-Predict (generate future video from multi-modal inputs), Cosmos-Transfer (domain adaptation for robotics/AV), and Cosmos-Reason (vision-language model with physical AI reasoning trained on 200M curated video clips).

The landscape in mid-2026

| System | Org | Scale | Domain | Key contribution |
|---|---|---|---|---|
| DreamerV3 | Google | 200M | Any RL env | Fixed hyperparams, Minecraft diamonds |
| IRIS | Geneva | - | Atari | Transformer WM, 1.046 HNS |
| DIAMOND | Geneva/Edinburgh | - | Atari | Diffusion WM, 1.46 HNS |
| TD-MPC2 | UC San Diego | 317M | Continuous ctrl | TD latent model |
| GAIA-1/2 | Wayve | 9B | Driving | Autoregressive driving video |
| GameNGen | Google | - | Doom | Real-time game simulation |
| Genie/2/3 | DeepMind | 11B+ | Interactive env | Unsupervised action space |
| Cosmos | NVIDIA | variable | Robotics/AV | Physical AI foundation model |
| V-JEPA 2 | Meta | large | Video/robotics | Rep.-space prediction |
| AMI/LeWorldModel | AMI Labs | 15M+ | General | JEPA + value planning |

Open Problems

1. Compounding error over long horizons

Every world model degrades as you roll it further into the future. If each per-step prediction error $\epsilon_t$ is at most $\epsilon$ and the dynamics are $L$-Lipschitz with $L \neq 1$, the error at horizon $H$ is bounded by:

\[\epsilon_H \;\leq\; \sum_{t=1}^{H} L^{H-t}\,\epsilon_t \;\leq\; \frac{L^H - 1}{L - 1}\,\epsilon\]

For $L > 1$ (which is generic), the error bound grows exponentially with horizon. This is why DreamerV3 uses 15-step imagined rollouts: the error after 100 steps is catastrophic for policy optimization.
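A few lines of arithmetic show how quickly the bound blows up. The values $L = 1.1$ and $\epsilon = 0.01$ below are arbitrary illustrative choices, not measurements from any model.

```python
def error_bound(eps, lipschitz, horizon):
    """Worst-case accumulated error: sum over t = 1..H of L^(H - t) * eps."""
    return sum(lipschitz ** (horizon - t) * eps for t in range(1, horizon + 1))

short = error_bound(eps=0.01, lipschitz=1.1, horizon=15)    # DreamerV3-style rollout length
long_ = error_bound(eps=0.01, lipschitz=1.1, horizon=100)
ratio = long_ / short   # more than three orders of magnitude apart
```

With $L = 1$ the same sum grows only linearly in $H$, which is why the Lipschitz constant of the learned dynamics matters as much as the per-step error.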

The gap: language models maintain coherence over thousands of tokens because grammar and semantics tightly constrain adjacent tokens. Physical states are not so constrained; a 1-pixel perturbation can cascade into a completely different trajectory.

2. Partial observability and state estimation

Most world model benchmarks give the agent near-complete visual access to the environment. Real tasks involve partial observability, requiring a belief state $b_t$ that is updated as new observations arrive:

\[b_{t+1}(s') \;\propto\; p(o_{t+1} \mid s')\!\sum_{s} T(s,\, a_t,\, s')\,b_t(s)\]

The RSSM handles this via the stochastic $z_t$ pathway, but a single Gaussian vector cannot represent the full posterior over multiple distinct hypotheses about occluded objects. Mixture-of-belief architectures remain an open research direction.
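For a small discrete state space the belief update can be computed exactly. The sketch below uses a hypothetical two-state world (an object hidden on the left or the right) with a noisy sensor; the state names, the dictionary layout for $T$ and the observation model, and the 0.8 sensor accuracy are all assumptions for the example.

```python
def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step.

    belief[s]          : current probability of state s
    T[s][action][s2]   : transition probability T(s, a, s')
    O[s2][observation] : observation likelihood p(o | s')
    """
    states = list(belief)
    unnorm = {
        s2: O[s2][observation] * sum(T[s][action][s2] * belief[s] for s in states)
        for s2 in states
    }
    z = sum(unnorm.values())   # normalizer implied by the proportionality sign
    return {s2: p / z for s2, p in unnorm.items()}

T = {"left": {"stay": {"left": 1.0, "right": 0.0}},
     "right": {"stay": {"left": 0.0, "right": 1.0}}}
O = {"left": {"see_left": 0.8, "see_right": 0.2},
     "right": {"see_left": 0.2, "see_right": 0.8}}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "stay", "see_left", T, O)
b2 = belief_update(b1, "stay", "see_left", T, O)
```

The belief stays a full distribution over hypotheses at every step. That is the property a single Gaussian latent struggles to reproduce when the hypotheses are far apart.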

3. Causal reasoning vs. correlation modeling

A world model trained on observational data learns correlations, not causes. Pearl’s causal hierarchy defines three levels:

| Level | Operation | Example |
|---|---|---|
| L1: observation | $p(Y \mid X)$ | “Rain predicts traffic jams” |
| L2: intervention | $p(Y \mid \text{do}(X))$ | “Seeding road with ice causes jams” |
| L3: counterfactual | $p(Y_x \mid X', Y')$ | “Would there have been a jam if I hadn’t intervened?” |

Current world models operate at L1 and partially at L2 (actions are treated as do-operators). L3 is essentially open for learned models.

4. Compositional generalization

Can a world model trained in an office environment correctly simulate a kitchen it has never seen? The challenge is compositional: the model knows about cups, knows about kitchens, but has not seen their combination. Systematic compositional generalization is a known failure mode of deep networks and especially acute for world models that must maintain physical consistency across novel combinations.

5. The right prediction space

LeCun’s critique of pixel-prediction is largely accepted. But what is the right abstract prediction space? The main candidates so far:

\[\text{RSSM (Gaussian):}\;\; \mathcal{N}(\mu, \sigma^2) \qquad \text{DreamerV2 (categorical):}\;\; \text{Cat}(K \times C) \qquad \text{JEPA (EMA):}\;\; \text{enc}(x) \in \mathbb{R}^d\]

No consensus exists. Each trades off expressiveness, training stability, and planning compatibility differently.

6. Grounding and action conditioning

Genie and JEPA-style models learn rich world representations but lack stable action grounding: the learned latent action space is not aligned with real physical controls. The ideal alignment would satisfy:

\[\text{enc}(\text{env after doing }a) \;\approx\; \text{pred}(\text{enc}(\text{env}),\; a)\]

Bridging the representation-learning world model (JEPA branch) and the RL world model (Dreamer branch) is the most pressing open problem in the field.


Fun Projects for Your Portfolio

These are concrete, scoped projects doable on a single GPU in days to weeks, each addressing an open problem from the list above.

Project 1: RSSM from Scratch on DMControl

Re-implement the RSSM from PlaNet in under 300 lines of PyTorch. Train on Cartpole Swingup from pixels. The key evaluation: plot prediction error vs. horizon for the prior (planning mode, no observations) vs. the posterior (inference mode, with observations):

\[\text{err}(H) = \mathbb{E}\!\left[\|o_{t+H} - \hat{o}_{t+H}\|^2\right], \quad H \in \{1, 5, 10, 20, 50\}\]

This directly visualizes the compounding-error problem from Open Problem 1.

Portfolio signal: deep MBRL implementation; quantified understanding of latent dynamics.

Project 2: Mini-IRIS on a Simple Game

Train a transformer world model (VQ-VAE tokenizer + small GPT) on a MinAtar or MiniGrid environment. Implement the sequence modeling loss $\mathcal{L}_\text{WM}$ from the IRIS section. Plot sample efficiency curves: imagined-rollout agent vs. model-free PPO baseline.

Portfolio signal: transformer-based MBRL; discrete representation learning.

Project 3: World Model as a Data Augmenter

Take a pre-trained DreamerV3 world model on one Atari game. Intervene on the latent $z_t$ of a specific frame (shift object position by $\delta$ in latent space), roll forward, and use the resulting trajectories as counterfactual augmentation. Measure robustness improvement:

\[\Delta_\text{robust} = \text{success rate with aug.} - \text{success rate without aug.}\]

Portfolio signal: causal intervention via latents; intersection of world models and data augmentation.

Project 4: Horizon-Aware Policy Optimization

Train DreamerV3 but maintain a per-step confidence $c_t \in [0, 1]$ that decays as imagined rollouts lengthen. Weight the $\lambda$-return by confidence:

\[\tilde{V}^\lambda_t = c_t \cdot V^\lambda_t + (1 - c_t) \cdot V_\psi(z_t, h_t)\]
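One simple way to start is an exponential confidence schedule. The sketch below assumes $c_t = \text{decay}^t$, which is only one possible choice; the project leaves the schedule open, and a learned or ensemble-based confidence would slot into the same function.

```python
def confidence_weighted(lambda_returns, critic_values, decay=0.9):
    """Blend V^lambda_t with V_psi(z_t, h_t) using c_t = decay ** t (assumed schedule)."""
    blended = []
    for t, (ret, val) in enumerate(zip(lambda_returns, critic_values)):
        c_t = decay ** t   # confidence shrinks with imagined depth
        blended.append(c_t * ret + (1.0 - c_t) * val)
    return blended
```

At $t = 0$ the target is the plain $\lambda$-return; deep into the rollout it falls back toward the critic’s own estimate, which limits how much the policy can gain from late-horizon model errors.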

Test whether this prevents the policy from exploiting model errors in long-horizon rollouts.

Portfolio signal: practical safety in MBRL; directly addresses Open Problem 1.

Project 5: JEPA vs. Pixel Prediction for Downstream Planning

Train two small world models on the same dataset: one RSSM + decoder (pixel prediction), one RSSM without decoder (JEPA-style, predict next latent). Compare planning performance on a downstream task at the same rollout budget $H$:

[Figure: error (y-axis) vs. planning horizon H in steps (x-axis, 0 to 20), with one curve for pixel prediction and one for JEPA prediction.]
Hypothesized error-vs-horizon curves: JEPA-style representation-space prediction should accumulate error more slowly than pixel-space prediction because it avoids modeling irrelevant variation.

Portfolio signal: direct experimental test of the LeCun hypothesis.

Project 6: World Model for Text-to-Motion Planning

Apply a JEPA-style world model to text-conditioned robot motion. Train on video of human actions + text captions. At test time: given a command, roll the world model forward over candidate action sequences, score with CLIP alignment reward, execute the best:

\[a^*_{0:H} = \arg\max_{a_{0:H}} \sum_{t=0}^H \text{CLIP-sim}\!\left(\text{pred}_\theta(\hat{s}_t, a_t),\; \text{enc}_\text{text}(c)\right)\]

No action labels needed during training.

Portfolio signal: multimodal world models; language conditioning; zero-shot transfer.

Project 7: Measuring World Model Calibration

A world model is useful only if its uncertainty estimates are trustworthy. Build an evaluation suite measuring calibration of a world model: at horizon $H$, what fraction of true future observations falls within the predicted $p$% confidence interval?

\[\text{Calib. error} = \left|\hat{P}(o_{t+H} \in C_p) - p\right|, \qquad p \in \{50\%, 80\%, 95\%\}\]
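The metric itself is a coverage count. A minimal sketch for scalar predictions, assuming the world model’s predictive distribution has already been turned into a $p$-level interval per sample; extracting those intervals from DreamerV3 or IRIS is the real work of the project and is not shown. The numbers are made up to exercise the function.

```python
def calibration_error(intervals, observations, p):
    """Return (|coverage - p|, coverage) for predicted p-level intervals.

    intervals[i] = (lo, hi) is the model's interval for sample i and
    observations[i] is the value that actually occurred.
    """
    hits = sum(lo <= o <= hi for (lo, hi), o in zip(intervals, observations))
    coverage = hits / len(observations)
    return abs(coverage - p), coverage

# Made-up example: 8 of 10 true values fall inside the predicted intervals.
intervals = [(0.0, 1.0)] * 10
observations = [0.5] * 8 + [1.5, -0.2]
err_80, cov = calibration_error(intervals, observations, p=0.80)
err_95, _ = calibration_error(intervals, observations, p=0.95)
```

Here the intervals would be well calibrated as 80% intervals but overconfident if the model claimed they were 95% intervals.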

Test across DreamerV3 and IRIS on multiple Atari games. Identify which games and which horizons have the worst calibration. Measurement papers of this kind are highly valued at NeurIPS and ICLR.

Portfolio signal: principled evaluation methodology; the often-neglected calibration problem in MBRL.


This survey covers the main lineage from Sutton’s 1990 Dyna through the foundation world models of 2024-2025. The field is moving fast: world-model papers now appear across the top venues (NeurIPS, ICML, ICLR, EMNLP for language-grounded variants, CVPR/ICCV for vision), often with dedicated workshops. If you want to find the frontier, watch the ICLR 2025 workshop on world models and the NeurIPS 2025 proceedings.