World Models
World models are a class of AI systems that construct an internal representation of the environment — learning how the world works rather than merely mapping inputs to outputs. By building a compressed model of reality, an AI agent can simulate future states, test hypothetical actions, and plan without needing to act physically in the world. The concept draws on cognitive science, where the brain is understood to maintain internal simulations of the external environment that guide anticipation and decision-making.
The term was popularised in the deep learning era by David Ha and Juergen Schmidhuber in their 2018 paper "World Models," which demonstrated that a neural network could learn a compressed latent representation of a video game environment and then train a controller entirely within that imagined space. This was an early demonstration that an agent could "dream" — planning inside a simulated model rather than the real environment.
Architecture and Core Concepts
A world model typically consists of three components. A perception module encodes raw sensory input into a compact latent representation. A dynamics model predicts how that latent state will evolve over time given a sequence of actions. A reward or value model estimates the desirability of predicted future states.
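The three components can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the class name `WorldModel`, the hand-set linear weights, and the distance-based reward are hypothetical stand-ins for what would normally be learned neural networks.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class WorldModel:
    """Toy world model with hand-set linear parts (hypothetical, not learned)."""
    encoder_weights: List[Vector]  # rows map an observation to one latent value each
    transition_gain: float         # how strongly an action shifts the latent state
    goal: Vector                   # latent state that the reward model prefers

    def encode(self, observation: Sequence[float]) -> Vector:
        """Perception module: compress an observation into a latent vector."""
        return [sum(w * x for w, x in zip(row, observation))
                for row in self.encoder_weights]

    def predict_next(self, latent: Sequence[float],
                     action: Sequence[float]) -> Vector:
        """Dynamics model: predict the next latent state for one action."""
        return [z + self.transition_gain * a for z, a in zip(latent, action)]

    def reward(self, latent: Sequence[float]) -> float:
        """Reward model: negative squared distance to the goal latent."""
        return -sum((z - g) ** 2 for z, g in zip(latent, self.goal))


model = WorldModel(
    encoder_weights=[[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
    transition_gain=1.0,
    goal=[1.0, 0.0],
)
latent = model.encode([0.0, 0.0, 2.0, 2.0])            # 4-D observation -> 2-D latent
next_latent = model.predict_next(latent, [1.0, -2.0])  # imagined next state
print(latent, next_latent, model.reward(next_latent))
```

The point of the sketch is the interface rather than the arithmetic: the dynamics and reward models operate only on latent vectors, so future states can be evaluated without producing or consuming raw observations.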
At decision time, the agent uses the dynamics model to "roll out" hypothetical action sequences into the future, evaluating outcomes without executing those actions in the real environment. This resembles accounts of human reasoning in which consequences are evaluated by mental simulation rather than physical trial and error.
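One simple way to use such rollouts for planning is random shooting: sample many candidate action sequences, score each one inside the model, and keep the best. The sketch below assumes a one-dimensional toy task; `toy_dynamics`, `toy_reward`, and the planner's parameters are hypothetical, and practical systems often use more refined optimisers in place of uniform sampling.

```python
import random
from typing import Callable, List, Sequence, Tuple

Dynamics = Callable[[float, float], float]
Reward = Callable[[float], float]


def rollout_return(state: float, actions: Sequence[float],
                   dynamics: Dynamics, reward: Reward) -> float:
    """Sum the predicted rewards along one imagined trajectory."""
    total = 0.0
    for action in actions:
        state = dynamics(state, action)  # imagined step, no real environment call
        total += reward(state)
    return total


def plan_random_shooting(state: float, dynamics: Dynamics, reward: Reward,
                         horizon: int = 5, candidates: int = 200,
                         seed: int = 0) -> Tuple[List[float], float]:
    """Return the best of `candidates` randomly sampled action sequences."""
    rng = random.Random(seed)
    best_actions: List[float] = []
    best_return = float("-inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        score = rollout_return(state, actions, dynamics, reward)
        if score > best_return:
            best_actions, best_return = actions, score
    return best_actions, best_return


def toy_dynamics(position: float, action: float) -> float:
    """Hypothetical learned dynamics: the action shifts the position."""
    return position + action


def toy_reward(position: float) -> float:
    """Hypothetical learned reward: being closer to position 3.0 is better."""
    return -abs(position - 3.0)


plan, predicted_return = plan_random_shooting(0.0, toy_dynamics, toy_reward)
do_nothing_return = rollout_return(0.0, [0.0] * 5, toy_dynamics, toy_reward)
print(plan[0], predicted_return, do_nothing_return)
```

A common design choice is to execute only the first action of the chosen plan and then replan from the newly observed state, which limits how far the agent commits to a possibly inaccurate prediction.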
Model-Based versus Model-Free Reinforcement Learning
Reinforcement learning is often divided into model-free and model-based approaches. Model-free agents, such as those trained with proximal policy optimisation (PPO), learn a policy directly from experience without explicitly modelling the environment. Model-based agents, which use world models, typically achieve better sample efficiency because they can generate additional training experience, or plan ahead, inside the learned model instead of collecting every sample from the real environment.
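The sample-efficiency argument can be made concrete with a Dyna-style sketch on a six-state chain. Both agents below take the same five real steps; the second also replays transitions remembered in a tabular model. The environment, constants, and function names are hypothetical, and the deterministic most-recent-first replay sweep is a simplification (textbook Dyna-Q samples remembered transitions at random).

```python
from typing import Dict, List, Tuple

N_STATES = 6            # chain of states 0..5; reaching state 5 ends the episode
LEFT, RIGHT = 0, 1
ALPHA, GAMMA = 0.5, 0.9

QTable = List[List[float]]


def step(state: int, action: int) -> Tuple[int, float]:
    """Real environment: move along the chain, reward 1.0 on reaching the end."""
    nxt = max(0, state - 1) if action == LEFT else min(N_STATES - 1, state + 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)


def q_update(q: QTable, s: int, a: int, r: float, s2: int) -> None:
    """One Q-learning backup; the terminal row is never updated and stays at zero."""
    target = r + GAMMA * max(q[s2])
    q[s][a] += ALPHA * (target - q[s][a])


def run_episode(planning_sweeps: int) -> QTable:
    """Walk right once; after each real step, replay the learned model."""
    q: QTable = [[0.0, 0.0] for _ in range(N_STATES)]
    model: Dict[Tuple[int, int], Tuple[float, int]] = {}
    state = 0
    while state != N_STATES - 1:
        nxt, reward = step(state, RIGHT)       # the only real interaction
        q_update(q, state, RIGHT, reward, nxt)
        model[(state, RIGHT)] = (reward, nxt)  # remember what happened
        for _ in range(planning_sweeps):       # imagined experience
            for (s, a), (r, s2) in reversed(list(model.items())):
                q_update(q, s, a, r, s2)
        state = nxt
    return q


def learned_pairs(q: QTable) -> int:
    """Count state-action pairs whose value estimate is no longer zero."""
    return sum(1 for row in q for value in row if value != 0.0)


model_free_q = run_episode(planning_sweeps=0)
dyna_q = run_episode(planning_sweeps=1)
print(learned_pairs(model_free_q), learned_pairs(dyna_q))
```

With no planning, only the final transition receives a non-zero value after the episode. With one replay sweep per real step, the reward information propagates back along the whole chain from the same five real interactions.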
The trade-off is that learned world models introduce compounding error: inaccuracies in the dynamics model accumulate over long rollout horizons, potentially misleading the planner. Recent research focuses on uncertainty-aware world models that quantify when the model is reliable and fall back to real-world interaction when necessary.
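Both effects can be seen in a small numerical sketch. It assumes a scalar system whose true dynamics multiply the state by 1.05 and a hypothetical ensemble of three slightly different learned models; ensemble disagreement is used here as one possible uncertainty signal, not as the method of any particular paper.

```python
from typing import Callable, List

StepFn = Callable[[float], float]


def make_linear_model(gain: float) -> StepFn:
    """Return a one-step model x -> gain * x (a stand-in for a learned network)."""
    def step(x: float) -> float:
        return gain * x
    return step


def rollout(step_fn: StepFn, x0: float, horizon: int) -> List[float]:
    """Apply a one-step model repeatedly, feeding each prediction back in."""
    states = [x0]
    for _ in range(horizon):
        states.append(step_fn(states[-1]))
    return states


true_step = make_linear_model(1.05)                             # hypothetical truth
ensemble = [make_linear_model(g) for g in (1.04, 1.05, 1.07)]   # hypothetical models

# Compounding error: a gain error of 0.02 per step grows with the horizon.
truth = rollout(true_step, 1.0, 20)
imagined = rollout(ensemble[2], 1.0, 20)
errors = [abs(a - b) for a, b in zip(imagined, truth)]


def trusted_horizon(x0: float, max_horizon: int, threshold: float) -> int:
    """Count imagined steps taken before ensemble members disagree too much."""
    states = [x0] * len(ensemble)  # each member rolls out its own trajectory
    for t in range(1, max_horizon + 1):
        states = [model(s) for model, s in zip(ensemble, states)]
        if max(states) - min(states) > threshold:
            return t - 1           # stop before the step that looked unreliable
    return max_horizon


horizon = trusted_horizon(1.0, max_horizon=20, threshold=0.5)
print(errors[1], errors[20], horizon)
```

The error after one step is about 0.02, but after twenty steps it exceeds 1.2. Cutting rollouts off once the ensemble spread passes a threshold is one way a planner can avoid relying on predictions the model is unsure about.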
Joint Embedding Predictive Architecture
Yann LeCun at Meta AI proposed an alternative to standard generative models in his 2022 position paper "A Path Towards Autonomous Machine Intelligence." He argued that predicting raw sensory observations — pixels, audio frames — is unnecessarily hard and generates poorly grounded representations. Instead, he proposed the Joint Embedding Predictive Architecture (JEPA), in which a model predicts the abstract latent representation of a future observation rather than the observation itself.
JEPA avoids the need for pixel-level reconstruction and focuses learning on semantically meaningful structure. Variants include Image-JEPA (I-JEPA, 2023) and Video-JEPA (V-JEPA, 2024), both developed at Meta FAIR and trained on large unlabelled image and video datasets. These models learn robust visual representations without contrastive negatives or human annotations.
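The core of the objective can be sketched in a few lines. The example assumes linear encoders, a linear predictor, a squared-error loss, and a target encoder maintained as an exponential moving average of the context encoder; the function names and all numbers are hypothetical and chosen only to show that the loss is computed between embeddings rather than between raw inputs.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Sequence[float]) -> Vector:
    """Apply a linear map (a stand-in for an encoder or predictor network)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def jepa_loss(context: Sequence[float], target: Sequence[float],
              context_encoder: Matrix, target_encoder: Matrix,
              predictor: Matrix) -> float:
    """Mean squared error between predicted and actual target embeddings."""
    predicted = linear(predictor, linear(context_encoder, context))
    actual = linear(target_encoder, target)  # treated as a fixed regression target
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def ema_update(target_encoder: Matrix, context_encoder: Matrix,
               decay: float = 0.99) -> Matrix:
    """Move the target encoder slowly towards the context encoder."""
    return [[decay * t + (1.0 - decay) * c for t, c in zip(t_row, c_row)]
            for t_row, c_row in zip(target_encoder, context_encoder)]


encoder = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]  # 4-D input -> 2-D embedding
context = [1.0, 2.0, 0.0, 0.0]   # visible part of the input
target = [0.0, 0.0, 1.0, 2.0]    # hidden part whose embedding is predicted

identity_predictor = [[1.0, 0.0], [0.0, 1.0]]
zero_predictor = [[0.0, 0.0], [0.0, 0.0]]

good = jepa_loss(context, target, encoder, encoder, identity_predictor)
bad = jepa_loss(context, target, encoder, encoder, zero_predictor)
print(good, bad)
```

Note that the loss compares two-dimensional embeddings and never reconstructs the four-dimensional input, which is the property that distinguishes this family from pixel-level generative objectives.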
Recent Developments
After leaving Meta in late 2025, Yann LeCun founded AMI Labs, which raised over USD 1 billion in a funding round announced in March 2026. AMI Labs focuses on building world-model systems that can reason, plan, and act in physical environments, in contrast to the autoregressive token-prediction paradigm that dominates current large language models.
LeWorldModel, introduced in 2026, demonstrated a stable end-to-end JEPA trained directly from pixels on a single GPU, addressing long-standing instability issues in joint embedding approaches. This result suggested that scalable world model training no longer required prohibitive compute budgets.
World models have also become central to autonomous driving. Waymo, Tesla, and Chinese autonomous vehicle companies use learned world models to simulate rare edge cases — such as unusual traffic patterns or adverse weather — that would be impractical to encounter during real-world data collection. Model-based simulation enables safe testing of policy changes at scale.
Applications
In game-playing AI, world models underpin systems such as MuZero (DeepMind, 2020), which mastered chess, shogi, Go, and Atari games without being given the rules, learning both the dynamics of the environment and a strong policy from scratch. MuZero plans with Monte Carlo tree search over latent states produced by its learned model; it matched AlphaZero's superhuman performance in the board games and set a new state of the art on the Atari benchmark at the time of publication.
In robotics, world models enable manipulation systems to plan multi-step sequences — grasping, rotating, and placing objects — by imagining the physical consequences of each action before executing. This is essential for dexterous manipulation in unstructured environments, which requires anticipating how objects will deform or slip.
In drug discovery, latent-space world models allow researchers to simulate how molecular modifications will affect binding affinity and toxicity, guiding the design of candidate compounds before expensive wet-lab synthesis.
Relationship to Large Language Models
Large language models such as GPT-4 and Claude can be interpreted as implicit world models of textual knowledge: they encode factual associations and causal relationships in their weights, enabling a limited form of reasoning through language. However, critics including LeCun argue that text-trained models lack grounding in physical reality — they do not represent space, time, or causality in the way a model trained on sensorimotor experience would.
Efforts to bridge this gap include multimodal models trained on video, embodied AI agents that receive visual and proprioceptive inputs, and simulation environments such as Habitat and Isaac Lab that provide rich physical grounding for learned world models.
References
- Ha, D., and Schmidhuber, J. (2018). World Models. arXiv:1803.10122.
- LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview preprint.
- Assran, M., et al. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. CVPR 2023.
- Schrittwieser, J., et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature, 588, 604-609.
- AMI Labs. (2026). AMI Labs Launch Announcement. ami-labs.ai.