A decade of teaching machines to imagine. This post traces the intellectual lineage of world models: from Sutton’s 1990 Dyna planning loop, through latent-dynamics RL, to billion-parameter video simulators and LeCun’s architecture for machine common sense.
A world model is a learned internal simulator: given the current state and a proposed action, it predicts what the world will look like next. An agent with a good world model can plan without acting, rehearse without risk, and generalize beyond its training distribution.
The core idea is almost embarrassingly simple. Humans do it constantly: you do not need to drop your phone to predict it will fall. You run the simulation internally. The question is how to give that capability to machines, and how compressed, accurate, and fast that internal simulator needs to be.
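The field has answered this question in four distinct waves over thirty years: Dyna-style planning with a learned model, latent-dynamics RL agents, foundation-scale video simulators, and joint-embedding predictive architectures (JEPA).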
Each wave built on the previous one’s failures. Understanding why each generation broke down is at least as important as understanding how the next generation fixed it.
The intellectual seed of every modern world model is Richard S. Sutton’s 1990 paper Dyna: An Integrated Architecture for Learning, Planning, and Reacting (SIGART).
The insight is elegant: an agent’s interaction with the environment generates data. You can use that data to fit two things simultaneously: a direct policy (standard RL) and an environment model (predict next state and reward given current state and action). Once you have the model, you can generate synthetic transitions and update the policy on them, getting more gradient updates per real interaction. Sutton called the model-generated updates planning and the real-environment updates direct RL.
The Dyna update rule per real step $(s, a, r, s')$:
\[Q(s, a) \;\leftarrow\; Q(s, a) + \alpha\Big[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\Big]\]
Then for $k$ planning steps, sample a previously visited $(s, a)$ from memory, query the model for $(\hat{r}, \hat{s}')$, and apply the same update. Each real interaction now drives $k + 1$ updates instead of one, so when the model is accurate the agent typically needs far fewer real steps to converge.
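The loop described above can be sketched in a few lines of tabular Python. This is a minimal illustration, not Sutton's original code: the corridor environment, the deterministic dictionary model, the random tie-breaking, and all hyperparameter values are assumptions chosen to keep the example small.

```python
import random
from collections import defaultdict


def dyna_q(step, actions, start, episodes=30, k=10,
           alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    """Tabular Dyna-Q sketch. `step(s, a)` returns (reward, next_state, done)."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(state, action)], initialised to 0
    model = {}               # deterministic model: (s, a) -> (r, s', done)

    def update(s, a, r, s2, done):
        # One Q-learning backup; terminal states bootstrap from 0.
        best_next = 0.0 if done else max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            if rng.random() < eps:
                a = rng.choice(actions)
            else:  # greedy, with random tie-breaking
                a = max(actions, key=lambda b: (Q[(s, b)], rng.random()))
            r, s2, done = step(s, a)
            update(s, a, r, s2, done)          # direct RL on the real transition
            model[(s, a)] = (r, s2, done)      # model learning
            for _ in range(k):                 # planning on remembered pairs
                ps, pa = rng.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            s = s2
    return Q


def corridor_step(s, a):
    """Hypothetical toy environment: states 0..5, reward 1 on reaching state 5."""
    s2 = max(0, min(5, s + a))
    return (1.0 if s2 == 5 else 0.0), s2, s2 == 5


Q = dyna_q(corridor_step, actions=[-1, 1], start=0)
greedy_policy = [max([-1, 1], key=lambda b: Q[(s, b)]) for s in range(5)]
print(greedy_policy)
```

Setting `k=0` turns this into plain Q-learning, which makes it easy to compare how many real steps each variant needs on the same environment.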
Through the 1990s and 2000s, Dyna variants proliferated: Dyna-2 (Silver et al.), prioritized sweeping, trajectory sampling. All used tabular or linear approximations. They worked in small, discrete state spaces and fell apart the moment states became continuous or observations were images.
The deep learning revolution changed the premise. If you could learn a model from raw pixels using a neural network, you could extend Dyna to arbitrary observation spaces. That is exactly what the next decade explored.
David Ha and Jürgen Schmidhuber’s paper World Models (arXiv 2018; a version appeared at NeurIPS 2018 as “Recurrent World Models Facilitate Policy Evolution”) is the founding document of modern world model research. It is simultaneously a research paper and a manifesto, and it remains the best introduction to the central idea.
Their architecture has three components, called V, M, and C:
V (Vision): Variational Autoencoder. A VAE compresses each frame $x_t \in \mathbb{R}^{64 \times 64 \times 3}$ into a latent vector $z_t \in \mathbb{R}^{32}$.
M (Memory): MDN-RNN. A mixture-density network built on an LSTM takes $(z_t, a_t)$ and predicts the distribution of the next latent $z_{t+1}$. Specifically it outputs the parameters of a mixture of Gaussians:
\[p(z_{t+1} \mid z_t, a_t, h_t) \;=\; \sum_{k=1}^{K} \pi_k \; \mathcal{N}(z_{t+1};\; \mu_k,\; \sigma_k^2)\]where $h_t$ is the LSTM hidden state. The stochastic sampling means the model does not collapse to the mean prediction; it can represent multi-modal futures.
C (Controller): Linear policy. A single linear layer maps $(z_t, h_t)$ to action $a_t$. It has very few parameters by design. The complexity sits in V and M, making C fast to optimize with CMA-ES.
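The mixture-density output of M can be illustrated for a single latent dimension. The sketch below assumes the network has already produced mixture weights, means, and standard deviations (here hard-coded, hypothetical values); it shows the log-likelihood that would be maximized during training and the sampling step used when dreaming.

```python
import math
import random


def mdn_log_prob(z, pi, mu, sigma):
    """log p(z) under a 1-D Gaussian mixture (weights pi, means mu, std devs sigma)."""
    terms = [
        math.log(p) - 0.5 * math.log(2 * math.pi) - math.log(s)
        - 0.5 * ((z - m) / s) ** 2
        for p, m, s in zip(pi, mu, sigma)
    ]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


def mdn_sample(pi, mu, sigma, rng):
    """Pick a component by its weight, then sample from that Gaussian."""
    k = rng.choices(range(len(pi)), weights=pi)[0]
    return rng.gauss(mu[k], sigma[k])


# Hypothetical bimodal prediction: the next latent is near -2 or near +2.
pi, mu, sigma = [0.5, 0.5], [-2.0, 2.0], [0.1, 0.1]
rng = random.Random(0)
samples = [mdn_sample(pi, mu, sigma, rng) for _ in range(1000)]
```

Every sample lands near one of the two modes, while the mixture mean (0) has very low density. A single-Gaussian head trained on the same data would place its mean exactly there.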
The most provocative result: the agent can be trained entirely inside the dream. The M model generates imagined trajectories; C is optimized on them; the resulting policy transfers back to the real environment. In the VizDoom experiment, a controller trained only in the dream cleared the task’s solve threshold when deployed in the real game, provided the dream was made noisier (a higher sampling temperature in M) so that C could not exploit the model’s flaws.
“We can imagine ourselves performing some task that hasn’t yet happened, and evaluate the probable outcomes.”
Limitations that the next generation addressed:
Hafner, Lillicrap, Fischer, Villegas, Ha, Lee, and Davidson at Google Brain introduced PlaNet (deep planning network). The key insight: if you are going to plan, the world model should be explicitly designed for planning, not just for prediction.
PlaNet’s central contribution is the Recurrent State-Space Model (RSSM), which separates state into two complementary pathways:
\[h_t = f_\theta(h_{t-1},\, z_{t-1},\, a_{t-1}) \qquad \text{(deterministic GRU)}\] \[z_t \sim \begin{cases} q_\phi(z_t \mid h_t, o_t) & \text{posterior (with observation)} \\ p_\psi(z_t \mid h_t) & \text{prior (planning, no observation)} \end{cases}\]The deterministic $h_t$ provides long-range memory; the stochastic $z_t$ lets the model represent multiple possible futures. At planning time you use only the prior, rolling the model forward without touching the real environment.
Planning uses CEM (Cross-Entropy Method): sample $J$ action sequences, evaluate under the world model, keep the top $K$, refit the proposal. PlaNet achieves 50x better sample efficiency than model-free methods on six DMControl tasks.
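The sample, evaluate, keep-the-elite, refit loop can be sketched on a toy problem. This is not PlaNet's implementation: the scalar dynamics `toy_model`, the reward `toy_reward`, and the values of $J$, $K$, horizon, and iteration count are all assumptions for illustration, and the "world model" here is a hand-written function rather than a learned RSSM.

```python
import random
import statistics


def cem_plan(model, reward, s0, horizon=5, iters=15, J=200, K=20, seed=0):
    """Cross-Entropy Method over action sequences, evaluated inside `model`."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        scored = []
        for _ in range(J):  # sample J candidate action sequences
            seq = [rng.gauss(m, s) for m, s in zip(mean, std)]
            state, total = s0, 0.0
            for a in seq:   # roll out in the model only
                state = model(state, a)
                total += reward(state)
            scored.append((total, seq))
        scored.sort(key=lambda item: item[0], reverse=True)
        elite = [seq for _, seq in scored[:K]]  # keep the top K
        # Refit the Gaussian proposal to the elite set, per time step.
        mean = [statistics.fmean(col) for col in zip(*elite)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elite)]
    return mean


def toy_model(s, a):
    """Hypothetical 1-D dynamics: move by the action, clipped to [-1, 1]."""
    return s + max(-1.0, min(1.0, a))


def toy_reward(s):
    """Hypothetical reward: negative squared distance to the goal at 3."""
    return -(s - 3.0) ** 2


plan = cem_plan(toy_model, toy_reward, s0=0.0)
final_state, plan_return = 0.0, 0.0
for a in plan:
    final_state = toy_model(final_state, a)
    plan_return += toy_reward(final_state)
```

In an MPC setting only the first action of `plan` would be executed before replanning from the next observed state.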
In Dreamer, Hafner, Lillicrap, Ba, and Norouzi replaced the CEM planner with an actor-critic learned in imagination. The world model is trained by maximizing an evidence lower bound (ELBO) over sequences:
\[\mathcal{L}_\text{WM} = \mathbb{E}_{q}\!\left[\sum_t \underbrace{\log p(o_t \mid z_t, h_t)}_\text{reconstruction} + \underbrace{\log p(r_t \mid z_t, h_t)}_\text{reward pred.} - \underbrace{\beta\,\text{KL}\!\left[q(z_t \mid h_t, o_t) \;\|\; p(z_t \mid h_t)\right]}_\text{regularization}\right]\]The actor $\pi_\phi$ and critic $V_\psi$ are then trained purely on imagined rollouts from the RSSM using $\lambda$-returns:
\[V^\lambda_t = r_t + \gamma\Big[(1-\lambda) V_\psi(z_{t+1}, h_{t+1}) + \lambda V^\lambda_{t+1}\Big]\]This is a clean separation: real data shapes the model; imagined data shapes the policy. Dreamer established state-of-the-art on DMControl from pixels using 20x fewer environment interactions than D4PG.
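The $\lambda$-return recursion is easiest to compute backwards from the end of an imagined rollout. A minimal sketch, assuming the rollout has length $H$, that `values` carries one extra entry for bootstrapping, and that the recursion is closed with $V^\lambda_H = V_\psi(z_H, h_H)$:

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute V^lambda_t for t = 0..H-1.

    rewards[t] is r_t for t = 0..H-1.
    values[t] is the critic estimate at step t for t = 0..H (one extra entry).
    """
    horizon = len(rewards)
    out = [0.0] * horizon
    next_return = values[horizon]  # bootstrap: V^lambda_H := V(z_H, h_H)
    for t in reversed(range(horizon)):
        next_return = rewards[t] + gamma * (
            (1 - lam) * values[t + 1] + lam * next_return
        )
        out[t] = next_return
    return out


# lam = 1 gives discounted Monte Carlo returns (plus the final bootstrap);
# lam = 0 gives one-step TD targets r_t + gamma * V_{t+1}.
mc = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
td = lambda_returns([1.0, 1.0, 1.0], [0.0, 10.0, 20.0, 30.0], gamma=0.5, lam=0.0)
```

Intermediate values of `lam` interpolate between the two, trading the bias of the critic against the variance (and model error) of long imagined rollouts.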
Hafner, Lillicrap, Norouzi, and Ba replaced Gaussian latents with categorical representations: each $z_t$ is a vector of one-hot variables drawn from learned categorical distributions. The gradient flows through the Straight-Through (ST) estimator:
\[z_t \;\sim\; \text{Cat}(p_t), \qquad p_t = \text{softmax}\!\left(f_\phi(h_t, o_t)\right), \qquad \hat{z}_t = p_t + \text{sg}(z_t - p_t)\]
where $\text{sg}(\cdot)$ is stop-gradient. In the forward pass $\hat{z}_t$ equals the one-hot sample $z_t$; in the backward pass its gradient is that of the probabilities $p_t$. Categorical latents can commit to one discrete option per slot instead of averaging over modes the way a unimodal Gaussian mean does.
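DreamerV2 was the first agent to reach human-level performance on the 55-game Atari benchmark by learning behaviors purely inside a separately trained world model. At 200M frames and a comparable single-GPU budget, it surpassed Rainbow and IQN.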
Hafner, Pasukonis, Ba, and Lillicrap pushed for a single agent that can handle any task without hyperparameter tuning. DreamerV3 introduces symlog predictions to handle rewards of vastly different scales:
\[\text{symlog}(x) = \text{sign}(x)\,\ln\!\left(|x| + 1\right), \qquad \text{symexp}(x) = \text{sign}(x)\!\left(e^{|x|} - 1\right)\]
Targets are transformed with symlog before computing the loss; predictions are mapped back with symexp, its exact inverse. A reward of $10^6$ and a reward of $0.01$ now live in the same comfortable range (about $13.8$ and $0.01$ after symlog).
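Both transforms are one-liners. The sketch below uses `math.log1p` and `math.expm1` for accuracy near zero, which is an implementation choice made here rather than something the formula requires.

```python
import math


def symlog(x):
    """sign(x) * ln(|x| + 1): compresses large magnitudes, near-identity around 0."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x):
    """sign(x) * (exp(|x|) - 1): the inverse of symlog."""
    return math.copysign(math.expm1(abs(x)), x)


rewards = [-1e6, -0.01, 0.0, 0.01, 1e6]
compressed = [symlog(r) for r in rewards]
recovered = [symexp(c) for c in compressed]
```

A regression head trained on `compressed` sees targets within roughly $\pm 14$ for this range, and `symexp` restores the original scale at prediction time.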
Additional improvements include KL balancing, free bits to prevent posterior collapse, Block GRU with RMSNorm and SiLU activations, and the LaProp optimizer.
The headline result: DreamerV3 collects diamonds in Minecraft from scratch, without human demonstrations or curricula, the first algorithm to do so. Diamonds require 20+ sequential crafting and mining steps spanning minutes of real time.
Micheli, Alonso, and Fleuret at Geneva replaced the RSSM with a transformer world model. IRIS tokenizes each frame with a VQ-VAE into $L$ discrete tokens, then predicts the next frame’s tokens, reward, and done-flag autoregressively:
\[\mathcal{L}_\text{WM} = -\sum_{t} \log p_\theta\!\left(v_{t+1}^{(1:L)},\, r_{t+1},\, d_{t+1} \;\Big|\; v_{\le t},\, a_{\le t}\right)\]
where $v_t^{(1:L)}$ are the $L$ VQ-VAE tokens for frame $t$. This is a language-model loss over a vocabulary of frame tokens, with actions as additional conditioning tokens. Attention gives the transformer direct access to every past token in its context window instead of forcing history through a fixed-size recurrent state, although autoregressive rollouts still accumulate error.
On Atari 100k, IRIS achieves a mean human-normalized score of 1.046, outperforming humans on 10 of 26 games.
In TD-MPC2, Hansen, Su, and Wang at UC San Diego learn a task-oriented latent dynamics model and a terminal value function jointly via temporal-difference learning. The world model is not required to reconstruct observations; it only predicts latent features that matter for value estimation. A single 317M-parameter agent generalizes across 80 tasks, multiple embodiments, and action spaces.
| Method | Single-task score | Multi-task (80 tasks) |
|---|---|---|
| SAC (model-free) | competitive | collapses |
| DreamerV3 | competitive | limited |
| TD-MPC2 (317M) | competitive | state of the art |
The RL-focused papers above all treat the world model as an internal component of an agent. A parallel thread asks: what if the world model is a standalone foundation model, pre-trained on massive video datasets, and then adapted to downstream tasks?
Wayve’s GAIA-1 (arXiv 2309.17080) is a 9-billion-parameter autoregressive video model trained on driving data. It factorizes future video autoregressively over discrete video tokens, conditioned on text $c$, past video, and ego-vehicle actions:
\[p(v_{1:T} \mid c,\, a_{1:T}) \;=\; \prod_{t=1}^T \prod_{\ell=1}^{L} p_\theta\!\left(v_t^{(\ell)} \;\Big|\; v_t^{(1:\ell-1)},\, v_{<t},\, a_{<t},\, c\right)\]GAIA-1 supports three conditioning modalities: video history, text descriptions, and ego-vehicle actions (steer, accelerate, brake). Scaling to 9B parameters showed that larger models produce more temporally consistent and physically plausible futures, echoing LLM scaling laws.
Google’s UniSim (arXiv 2310.06114) trained a 5.6B-parameter generative model to simulate real-world interactions across diverse domains: robot manipulation, navigation, human-object interaction. Given a starting frame and a high-level instruction $c$ (“open the drawer”), UniSim generates video of the resulting interaction by conditioning a video diffusion model:
\[p_\theta(x_{1:T} \mid x_0, c) \;=\; \prod_{t=1}^T p_\theta(x_t \mid x_{t-1},\, c)\]The goal is zero-shot sim-to-real transfer: train policies entirely inside the simulator, then deploy in the real world without fine-tuning.
Alonso, Jelley, Micheli, and colleagues replaced the discrete-token world model with a diffusion process. DIAMOND trains a denoising diffusion model to generate the next frame; in simplified noise-prediction form the loss is:
\[\mathcal{L}_\text{DIAMOND} \;=\; \mathbb{E}_{k,\,\epsilon}\Big[\big\|\epsilon - \epsilon_\theta\!\left(x_{t+1}^{(k)},\, k,\, x_{t-K:t},\, a_t\right)\big\|^2\Big]\]where $x_{t+1}^{(k)} = \sqrt{\bar\alpha_k}\,x_{t+1} + \sqrt{1-\bar\alpha_k}\,\epsilon$ is the noisy target frame at diffusion step $k$, and $\epsilon_\theta$ is the denoising network conditioned on the past $K$ frames and current action.
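The noising step and the loss can be made concrete on a toy "frame" of three pixel values. This sketch assumes a DDPM-style linear beta schedule, which is an illustrative choice; the paper's exact noise schedule and network parameterization may differ, and the denoising network itself is omitted (a perfect prediction stands in for it).

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha-bar_k for a linear beta schedule (an assumption)."""
    out, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, k, alpha_bar, rng):
    """Return the noisy frame x^(k) and the noise eps that produced it."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar[k]
    xk = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xk, eps


def eps_prediction_loss(eps, eps_pred):
    """Mean squared error between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


alpha_bar = alpha_bar_schedule()
rng = random.Random(0)
next_frame = [0.5, -1.0, 2.0]            # hypothetical flattened x_{t+1}
noisy, eps = add_noise(next_frame, 500, alpha_bar, rng)

# With the true noise in hand, the clean frame is recoverable in closed form.
a = alpha_bar[500]
recovered = [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(noisy, eps)]
```

In training, `eps_pred` would come from $\epsilon_\theta$ conditioned on the noisy frame, the step $k$, the past frames, and the action.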
Compressing into a small codebook (as IRIS does) discards visual information that helps the agent distinguish subtly different states. Diffusion preserves this detail at the cost of slower generation.
On Atari 100k, DIAMOND achieves a mean human-normalized score of 1.46, the new state of the art for agents trained entirely within a world model.
Valevski et al. (arXiv 2408.14837) fine-tuned Stable Diffusion to simulate DOOM in real time at 20 FPS on a single TPU. Each frame is generated by conditioning on past frames and the current action:
\[x_{t+1} \;\sim\; p_\theta\!\left(\cdot \;\Big|\; x_{t-K:t},\, a_t\right), \qquad \text{(diffusion ancestral sampling)}\]
In the paper’s human study, raters shown short clips were only slightly better than chance at telling GameNGen from real DOOM. The model reproduces turning, strafing, enemy behavior, damage, and door mechanics, all learned from next-frame prediction.
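Bruce, Dennis, Edwards, and colleagues at Google DeepMind (arXiv 2402.15391) trained Genie on 200,000 hours of internet gaming videos with no action labels. The 11B-parameter model has three components: a spatiotemporal video tokenizer, a latent action model that infers a small discrete set of actions from pairs of frames, and an autoregressive dynamics model that predicts the next frame’s tokens from past tokens and a latent action.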
The key result: give Genie any image (photograph, sketch, synthetic render) and it generates a frame-by-frame controllable interactive world, where the action space was never labeled during training.
Genie 2 (December 2024) extended this to 3D. Genie 3 (August 2025, public January 2026), integrated into Waymo’s “Waymo World Model,” generates extreme long-tail driving scenarios (tornados, floods, unusual obstacles) from text prompts.
OpenAI’s Sora technical report positioned the model as a world simulator. Sora is a Diffusion Transformer (DiT): patches of video are treated as tokens, and a transformer denoises them jointly:
\[\epsilon_\theta\!\left(x_t,\, t,\, c\right) \approx \epsilon, \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
where $x_t$ is the noisy video, $t$ is the diffusion step, and $c$ is the text condition. This is the generic denoising form, not a formula taken from the report. The transformer operates on flattened spacetime patches, so it can model long-range temporal dependencies within a clip.
Whether Sora is “really” a world model is contested: it cannot be queried for actions, exposes no state space, and has no planning interface. But it often generates 3D-consistent, physically plausible video up to one minute long. OpenAI described the high-level design but published few architecture details.
In June 2022, Yann LeCun published “A Path Towards Autonomous Machine Intelligence.” His central critique: predicting pixels or tokens forces the model to spend capacity on irrelevant low-level details. What matters for reasoning is high-level abstract state, not pixel values.
His proposed alternative: predict an abstract embedding of the future from an abstract embedding of the present. The JEPA objective (simplified) is:
\[\min_{\theta, \phi} \; \mathbb{E}\Big[ \big\| s_y - \hat{s}_y \big\|^2 \Big], \qquad s_x = \text{enc}_\phi(x), \quad s_y = \text{enc}_\phi(y), \quad \hat{s}_y = \text{pred}_\theta(s_x, z)\]
where $x$ is the context, $y$ is the target, and $z$ is a latent variable carrying information about $y$ that cannot be predicted from $x$. In this simplified form $\phi$ is shared between the context and target encoders; practical variants often use an EMA copy for the target. A collapse-prevention mechanism (VICReg-style regularization, masking, or EMA targets) keeps the encoder from learning a trivial constant representation.
The key contrast with generative models:
| | Generative (Dreamer, GAIA) | JEPA |
|---|---|---|
| Prediction target | $x_{t+1}$ (pixels/tokens) | $\text{enc}(x_{t+1})$ (representation) |
| Decoder required | yes | no |
| Capacity spent on | every pixel | only informative features |
| Collapse risk | low (reconstruction forces signal) | must prevent explicitly |
Assran, Duval, Misra, Bojanowski, LeCun, and others at Meta introduced I-JEPA. A large fraction of image patches is masked; the predictor reconstructs their embeddings from the unmasked context. The target encoder is an exponential moving average (EMA) of the online encoder, providing stable targets:
\[\phi_\text{target} \;\leftarrow\; m\,\phi_\text{target} + (1-m)\,\phi_\text{online}, \qquad m \in [0.996, 1.0)\]I-JEPA matches or outperforms masked autoencoders (MAE) on ImageNet classification while using far fewer training iterations, because it does not waste capacity reconstructing pixels.
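The EMA rule is a per-parameter blend. A minimal sketch on plain Python lists standing in for parameter tensors (the list representation and the toy values are assumptions for illustration):

```python
def ema_update(target, online, m=0.996):
    """phi_target <- m * phi_target + (1 - m) * phi_online, element-wise."""
    return [m * t + (1.0 - m) * o for t, o in zip(target, online)]


# If the online weights sat still at 1.0, the target would approach them
# geometrically: after n updates it equals 1 - m**n.
target = [0.0]
for _ in range(10):
    target = ema_update(target, [1.0], m=0.9)
```

With $m$ close to 1 the target encoder changes slowly, so the regression targets $s_y$ stay nearly fixed over many online updates. That slow drift is what makes them "stable".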
Bardes, Ponce, LeCun, and colleagues extended I-JEPA to video by masking spatiotemporal tubes and predicting their embeddings. The loss is averaged over the masked positions:
\[\mathcal{L} = \frac{1}{|M|}\sum_{(i,j,t) \in M} \Big\|\text{pred}_\theta\!\left(\text{enc}_\phi(x_\text{context}),\, \text{pos}(i,j,t)\right) - \bar{s}_{i,j,t}\Big\|^2\]where $M$ is the set of masked spatiotemporal positions and $\bar{s}_{i,j,t}$ is the EMA-target embedding at that position. V-JEPA trains on 2+ million unlabeled videos and learns physical and temporal regularities without explicit supervision.
V-JEPA 2 (June 2025) combined 1M hours of internet video with robot trajectory data, demonstrating physical reasoning and short-horizon planning for robotic manipulation.
In late 2025, Yann LeCun confirmed the launch of Advanced Machine Intelligence (AMI) Labs, explicitly building on JEPA. The accompanying paper “Value-guided action planning with JEPA world models” (arXiv 2601.00844) augments the JEPA world model with a learned value function for multi-step planning:
\[\pi^*(s_0) = \arg\max_{a_{0:H-1}} \sum_{t=0}^{H-1} \gamma^t V_\psi(\hat{s}_t), \qquad \hat{s}_{t+1} = \text{pred}_\theta(\hat{s}_t, a_t)\]
Planning happens entirely in representation space: roll out the predictor $H$ steps, score with $V_\psi$, and select the action sequence that maximizes predicted value. No pixel generation is needed. A 15M-parameter JEPA model trained on a single GPU is reported to achieve competitive navigation and manipulation results. AMI’s valuation is put at \$3-5 billion.
NVIDIA released Cosmos (arXiv 2501.03575) as a family of open-weight World Foundation Models for physical AI: Cosmos-Predict (generate future video from multi-modal inputs), Cosmos-Transfer (domain adaptation for robotics/AV), and Cosmos-Reason (vision-language model with physical AI reasoning trained on 200M curated video clips).
| System | Org | Scale | Domain | Key contribution |
|---|---|---|---|---|
| DreamerV3 | Google DeepMind | 200M | Any RL env | Fixed hyperparams, Minecraft diamonds |
| IRIS | Geneva | - | Atari | Transformer WM, 1.046 HNS |
| DIAMOND | Edinburgh/Paris | - | Atari | Diffusion WM, 1.46 HNS |
| TD-MPC2 | UC San Diego | 317M | Continuous ctrl | TD latent model |
| GAIA-1/2 | Wayve | 9B | Driving | Autoregressive driving video |
| GameNGen | Google | - | Doom | Real-time game simulation |
| Genie/2/3 | DeepMind | 11B+ | Interactive env | Unsupervised action space |
| Cosmos | NVIDIA | variable | Robotics/AV | Physical AI foundation model |
| V-JEPA 2 | Meta | large | Video/robotics | Rep.-space prediction |
| AMI/LeWorldModel | AMI Labs | 15M+ | General | JEPA + value planning |
Every world model degrades as you roll it further into the future. If the per-step prediction error is $\epsilon$ and the dynamics are $L$-Lipschitz, the error at horizon $H$ is bounded by:
\[\epsilon_H \;\leq\; \sum_{t=1}^{H} L^{H-t}\,\epsilon_t \;\leq\; \frac{L^H - 1}{L - 1}\,\epsilon\]
where $\epsilon_t \leq \epsilon$ is the error introduced at step $t$. For $L > 1$ (the typical case), the bound grows exponentially with horizon. This is one reason DreamerV3 imagines only 15 steps ahead: by 100 steps the accumulated error can swamp the learning signal for policy optimization.
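Plugging numbers into the bound shows how quickly it blows up. The values $\epsilon = 0.01$ and $L = 1.1$ below are illustrative assumptions, not measurements from any particular model.

```python
def error_bound(eps, lipschitz, horizon):
    """Closed form of sum_{t=1}^{H} L^(H-t) * eps for a constant per-step error."""
    if lipschitz == 1.0:
        return horizon * eps  # the geometric sum degenerates to H * eps
    return eps * (lipschitz ** horizon - 1.0) / (lipschitz - 1.0)


def error_bound_by_sum(eps, lipschitz, horizon):
    """Same quantity, summed term by term as a cross-check."""
    return sum(lipschitz ** (horizon - t) * eps for t in range(1, horizon + 1))


bound_15 = error_bound(0.01, 1.1, 15)
bound_100 = error_bound(0.01, 1.1, 100)
print(f"H=15: {bound_15:.3f}   H=100: {bound_100:.1f}")
```

Under these assumptions the bound at 100 steps is more than three orders of magnitude larger than at 15 steps. It is a worst-case bound, so real rollouts can do better, but it explains why short imagination horizons are the default.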
The gap: language models maintain coherence over thousands of tokens because grammar and semantics tightly constrain adjacent tokens. Physical states are not so constrained; a 1-pixel perturbation can cascade into a completely different trajectory.
Most world model benchmarks give the agent near-complete visual access to the environment. Real tasks involve partial observability, requiring a belief state $b_t$ that is updated as new observations arrive:
\[b_{t+1}(s') \;\propto\; p(o_{t+1} \mid s')\!\sum_{s} T(s,\, a_t,\, s')\,b_t(s)\]The RSSM handles this via the stochastic $z_t$ pathway, but a single Gaussian vector cannot represent the full posterior over multiple distinct hypotheses about occluded objects. Mixture-of-belief architectures remain an open research direction.
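For a small discrete state space the belief update can be computed exactly. The sketch below assumes dictionary-based transition and observation tables and a hypothetical two-state "hidden object" example; none of these names come from a specific library.

```python
def belief_update(belief, action, obs, T, O):
    """Exact discrete Bayes filter step.

    belief: dict state -> probability
    T[(s, a)]: dict next_state -> probability
    O[s]: dict observation -> probability of seeing it in state s
    """
    # Prediction: push the belief through the transition model.
    predicted = {}
    for s, b in belief.items():
        for s2, p in T[(s, action)].items():
            predicted[s2] = predicted.get(s2, 0.0) + p * b
    # Correction: weight by the observation likelihood, then normalise.
    unnorm = {s2: O[s2].get(obs, 0.0) * p for s2, p in predicted.items()}
    z = sum(unnorm.values())
    if z == 0.0:
        raise ValueError("observation has zero probability under the model")
    return {s2: v / z for s2, v in unnorm.items()}


# Hypothetical example: an occluded object is on the left or the right and
# does not move; a noisy sensor reports the correct side 80% of the time.
T = {("left", "wait"): {"left": 1.0}, ("right", "wait"): {"right": 1.0}}
O = {"left": {"hear_left": 0.8, "hear_right": 0.2},
     "right": {"hear_left": 0.2, "hear_right": 0.8}}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "wait", "hear_left", T, O)
b2 = belief_update(b1, "wait", "hear_left", T, O)
```

The belief here stays a full distribution over hypotheses. A learned latent has to approximate this object, which is where a single unimodal vector becomes limiting.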
A world model trained on observational data learns correlations, not causes. Pearl’s causal hierarchy defines three levels:
| Level | Operation | Example |
|---|---|---|
| L1: observation | $p(Y \mid X)$ | “Rain predicts traffic jams” |
| L2: intervention | $p(Y \mid \text{do}(X))$ | “Seeding road with ice causes jams” |
| L3: counterfactual | $p(Y_x \mid X', Y')$ | “Would there have been a jam if I hadn’t intervened?” |
Current world models operate at L1 and partially at L2 (actions are treated as do-operators). L3 is essentially open for learned models.
Can a world model trained in an office environment correctly simulate a kitchen it has never seen? The challenge is compositional: the model knows about cups, knows about kitchens, but has not seen their combination. Systematic compositional generalization is a known failure mode of deep networks and especially acute for world models that must maintain physical consistency across novel combinations.
LeCun’s critique of pixel-prediction is largely accepted. But what is the right abstract prediction space? The main candidates differ in the form of the latent:
\[\text{RSSM (Gaussian):}\;\; \mathcal{N}(\mu, \sigma^2) \qquad \text{DreamerV2 (categorical):}\;\; \text{Cat}(K \times C) \qquad \text{JEPA (EMA):}\;\; \text{enc}(x) \in \mathbb{R}^d\]No consensus exists. Each trades off expressiveness, training stability, and planning compatibility differently.
Genie and JEPA-style models learn rich world representations but lack stable action grounding: the learned latent action space is not aligned with real physical controls. The ideal alignment would satisfy:
\[\text{enc}(\text{env after doing }a) \;\approx\; \text{pred}(\text{enc}(\text{env}),\; a)\]Bridging the representation-learning world model (JEPA branch) and the RL world model (Dreamer branch) is the most pressing open problem in the field.
These are concrete, scoped projects doable on a single GPU in days to weeks, each addressing an open problem from the list above.
Re-implement the RSSM from PlaNet in under 300 lines of PyTorch. Train on Cartpole Swingup from pixels. The key evaluation: plot prediction error vs. horizon for the prior (planning mode, no observations) vs. the posterior (inference mode, with observations):
\[\text{err}(H) = \mathbb{E}\!\left[\|o_{t+H} - \hat{o}_{t+H}\|^2\right], \quad H \in \{1, 5, 10, 20, 50\}\]
This directly visualizes the compounding-error problem from Open Problem 1.
Portfolio signal: deep MBRL implementation; quantified understanding of latent dynamics.
Train a transformer world model (VQ-VAE tokenizer + small GPT) on a MinAtar or MiniGrid environment. Implement the sequence modeling loss $\mathcal{L}_\text{WM}$ from the IRIS section. Plot sample efficiency curves: imagined-rollout agent vs. model-free PPO baseline.
Portfolio signal: transformer-based MBRL; discrete representation learning.
Take a pre-trained DreamerV3 world model on one Atari game. Intervene on the latent $z_t$ of a specific frame (shift object position by $\delta$ in latent space), roll forward, and use the resulting trajectories as counterfactual augmentation. Measure robustness improvement:
\[\Delta_\text{robust} = \text{success rate with aug.} - \text{success rate without aug.}\]Portfolio signal: causal intervention via latents; intersection of world models and data augmentation.
Train DreamerV3 but maintain a per-step confidence $c_t \in [0, 1]$ that decays as imagined rollouts lengthen. Weight the $\lambda$-return by confidence:
\[\tilde{V}^\lambda_t = c_t \cdot V^\lambda_t + (1 - c_t) \cdot V_\psi(z_t, h_t)\]Test whether this prevents the policy from exploiting model errors in long-horizon rollouts.
Portfolio signal: practical safety in MBRL; directly addresses Open Problem 1.
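One simple way to realize the confidence weighting is an exponential decay in the imagination step, $c_t = d^{\,t}$. That schedule, the decay value, and the function name are assumptions for illustration; a learned or ensemble-based confidence would slot into the same place.

```python
def confidence_weighted_targets(lam_returns, values, decay=0.9):
    """Blend lambda-returns with critic values: c_t * V^lambda_t + (1 - c_t) * V_t.

    lam_returns[t] and values[t] refer to imagined step t; c_t = decay ** t,
    so targets lean on the critic more as the rollout gets longer.
    """
    blended = []
    for t, (g, v) in enumerate(zip(lam_returns, values)):
        c = decay ** t
        blended.append(c * g + (1.0 - c) * v)
    return blended


# Toy check: optimistic imagined returns of 10 against a critic that says 0.
targets = confidence_weighted_targets([10.0, 10.0, 10.0], [0.0, 0.0, 0.0], decay=0.5)
```

With `decay=1.0` the function returns the unmodified $\lambda$-returns, which gives a clean baseline for the ablation.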
Train two small world models on the same dataset: one RSSM + decoder (pixel prediction), one RSSM without decoder (JEPA-style, predict next latent). Compare planning performance on a downstream task at the same rollout budget $H$:
Portfolio signal: direct experimental test of the LeCun hypothesis.
Apply a JEPA-style world model to text-conditioned robot motion. Train on video of human actions + text captions. At test time: given a command, roll the world model forward over candidate action sequences, score with CLIP alignment reward, execute the best:
\[a^*_{0:H} = \arg\max_{a_{0:H}} \sum_{t=0}^H \text{CLIP-sim}\!\left(\text{pred}_\theta(\hat{s}_t, a_t),\; \text{enc}_\text{text}(c)\right)\]No action labels needed during training.
Portfolio signal: multimodal world models; language conditioning; zero-shot transfer.
A world model is useful only if its uncertainty estimates are trustworthy. Build an evaluation suite measuring calibration of a world model: at horizon $H$, what fraction of true future observations falls within the predicted $p$% confidence interval?
\[\text{Calib. error} = \left|\hat{P}(o_{t+H} \in C_p) - p\right|, \qquad p \in \{50\%, 80\%, 95\%\}\]Test across DreamerV3 and IRIS on multiple Atari games. Identify which games and which horizons have the worst calibration. Measurement papers of this kind are highly valued at NeurIPS and ICLR.
Portfolio signal: principled evaluation methodology; the often-neglected calibration problem in MBRL.
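A starting point for the calibration metric, assuming the model's prediction for some scalar quantity (a reward, or one latent feature) is Gaussian with a predicted mean and standard deviation. Image observations would need a different notion of confidence region; the synthetic data below is purely illustrative.

```python
import random
from statistics import NormalDist


def coverage(y_true, mu, sigma, p):
    """Fraction of true values inside the central p-interval of N(mu, sigma^2)."""
    z_crit = NormalDist().inv_cdf(0.5 + p / 2.0)
    hits = sum(abs(y - m) <= z_crit * s for y, m, s in zip(y_true, mu, sigma))
    return hits / len(y_true)


def calibration_error(y_true, mu, sigma, p):
    """|empirical coverage - nominal coverage| at level p."""
    return abs(coverage(y_true, mu, sigma, p) - p)


# Synthetic check: the truth is N(0, 1). One model reports sigma = 1
# (calibrated), another reports sigma = 0.5 (overconfident).
rng = random.Random(0)
n = 5000
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
mu = [0.0] * n
well_calibrated = calibration_error(y, mu, [1.0] * n, 0.95)
overconfident = calibration_error(y, mu, [0.5] * n, 0.95)
```

Repeating the computation for each horizon $H$ and each level $p$ yields the calibration table the project calls for.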
This survey covers the main lineage from Sutton’s 1990 Dyna through the foundation world models of 2024-2025. The field is moving fast: world-model workshops and sessions now appear regularly at the major venues (NeurIPS, ICML, ICLR, EMNLP for language-grounded variants, CVPR/ICCV for vision). If you want to find the frontier, watch the ICLR 2025 workshop on world models and the NeurIPS 2025 proceedings.