Benchmarking World-Model Learning
Abstract: Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended (models should support many different tasks unknown ahead of time) and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and that scaling compute improves performance in some environments but not others. WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring) to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
Explain it Like I'm 14
Overview
This paper is about how to fairly and effectively test whether AI “agents” learn good world models—internal understandings of how a world works so they can predict, plan, and adapt. The authors introduce a new testing method called WorldTest and a set of video-game-like environments called AutumnBench to see how well different agents, including humans, learn and use these world models.
Objectives
The paper asks a simple question: How should we evaluate world-model learning in a way that:
- Lets agents interact with a changing world instead of just watching examples,
- Judges them by their behavior (what they do), not by a specific internal format (like pixels or programs),
- Encourages goal-free exploration at first (no scores or rewards),
- Then tests them in a related but different environment to see if they truly learned general rules?
Methods
Think of a world model like the mental map you build of your kitchen: where things are, how appliances behave, and what changes when you cook in a different kitchen. A good world model helps you:
- Predict hidden things (how cooked food is under a lid),
- Notice when the rules change (the knives are in a different drawer),
- Plan steps to achieve goals (finish a recipe efficiently).
The authors designed a two-phase evaluation, called WorldTest:
- Interaction Phase: Agents freely explore an environment without any rewards or scores—like wandering in a new kitchen to learn where things are and how they work.
- Test Phase: Agents are then tested in a different but related environment (think: a similar kitchen with some changes). The tasks are varied and not told upfront, so the agent’s world model must be general and flexible.
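The two-phase structure is easy to picture as an evaluation harness. Below is a minimal, hypothetical sketch in Python; the `Environment`/`Agent` interface and the task objects are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of the WorldTest two-phase protocol.
# The environment, agent, and task interfaces are illustrative
# assumptions, not the paper's actual implementation.

def worldtest(agent, interaction_env, test_env, tasks, budget=1000):
    # Phase 1: reward-free interaction. The agent acts freely and
    # observes outcomes, but no reward or score is ever provided.
    obs = interaction_env.reset()
    for _ in range(budget):
        action = agent.explore(obs)          # agent chooses how to probe
        obs = interaction_env.step(action)   # only observations come back

    # Phase 2: scored tests in a different but related environment.
    # The tasks were unknown during phase 1, so the agent's world
    # model must be general rather than tuned to one objective.
    scores = {}
    for task in tasks:  # e.g., masked-frame, planning, change detection
        test_env.reset(task)
        scores[task.name] = task.score(agent, test_env)
    return scores
```

The key design choice is that nothing scored in phase 2 is visible in phase 1, which is what pushes agents toward learning the environment's general rules.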
To make this work, they created AutumnBench:
- A collection of 43 simple, interactive “grid worlds” (like small puzzle-video-game boards) and 129 tasks.
- Tasks fall into three families:
- Masked-frame prediction: The agent predicts missing parts of the world, like guessing what’s behind a wall or what a hidden tile looks like.
- Planning: The agent decides sequences of actions to reach a goal, similar to figuring out the right steps to complete a puzzle.
- Change detection in causal dynamics: The agent notices when the “rules” of the world change, like a door that used to open but now stays locked.
Importantly, the evaluation is behavior-based. This means they don’t force the agent to use a specific type of internal representation (like predicting next video frames). Instead, they judge the agent by the outcomes: Can it solve the tasks after exploring?
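For concreteness, here is a minimal sketch of behavior-based scoring for a masked-frame task. The grid encoding is an assumption (a 2D list of cell symbols, with `None` marking hidden cells); the agent is treated as a black box that simply returns its guesses.

```python
# Minimal sketch of black-box, behavior-based scoring for a
# masked-frame task. The grid encoding is an assumption: a 2D list
# of cell symbols, with None marking masked (hidden) cells.

def score_masked_frame(agent_predict, masked_grid, true_grid):
    """Fraction of masked cells the agent fills in correctly."""
    predicted = agent_predict(masked_grid)  # agent is a black box
    correct = total = 0
    for r, row in enumerate(masked_grid):
        for c, cell in enumerate(row):
            if cell is None:  # only hidden cells are scored
                total += 1
                correct += predicted[r][c] == true_grid[r][c]
    return correct / total if total else 1.0
```

Nothing in this score depends on how the agent represents the world internally (pixels, programs, or causal graphs); only the filled-in grid is judged.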
They compared 517 human participants with three frontier AI models across all AutumnBench tasks.
Main Findings
Here are the main results and why they matter:
- Humans outperform current AI models: Even without explicit training on the tasks, humans build flexible world models that let them do well across different types of challenges.
- Scaling compute helps, but inconsistently: Making AI models bigger or giving them more computation improves performance in some environments, but not others. This suggests that merely scaling up is not enough; we need better ways for agents to learn world dynamics.
- There’s lots of room for improvement: Across many tasks, models lag behind humans, showing that world-model learning is still an open problem.
Why this is important: Many existing benchmarks measure success by narrow criteria—like predicting the next image frame or maximizing rewards in the same environment they trained in. That can miss whether an agent has truly learned the underlying rules of the world. WorldTest and AutumnBench focus on generalization to new but related settings, which is closer to real-world intelligence.
Implications
This research provides a new blueprint for evaluating world-model learning:
- Reward-free exploration encourages agents to learn general rules instead of overfitting to a single objective.
- Testing in different-but-related environments checks whether the learning actually transfers, which matters in real life when situations change.
- Behavior-based scoring makes comparisons fair across different agent types (including humans), without forcing a specific internal format.
If widely adopted, this approach could:
- Drive the creation of AI systems that understand and adapt to changing environments more like humans do,
- Reveal which model designs truly learn world dynamics,
- Help the AI community track meaningful progress toward general intelligence, not just performance on narrow tasks.
Knowledge Gaps
Below is a consolidated list of limitations and open questions that remain unresolved and could guide future work:
- External validity: How well do findings transfer from discrete, 2D grid-worlds to richer settings (continuous control, 3D physics, multi-object dynamics, partial observability, and multimodal inputs like vision and language)?
- Task coverage: The benchmark currently spans three task families (masked-frame prediction, planning, change-detection). What additional task types are needed to stress-test world models (e.g., counterfactual queries, causal intervention planning, long-horizon forecasting, commonsense reasoning, temporal abstraction, and compositional skill reuse)?
- Domain shift specification: The protocol tests in “different but related” environments, but there is no formal measure of relatedness. How should similarity/divergence between interaction and test environments be parameterized and reported (e.g., causal structure overlap, transition/reward perturbation magnitude, object-set shifts)?
- Exploration strategy assessment: Which reward-free exploration objectives (curiosity, empowerment, information gain, novelty search, uncertainty-driven probing) most reliably yield transferable world models in this setting? The paper does not systematically compare exploration strategies.
- Sample efficiency and scaling laws: The interaction budget and compute scaling effects are not analyzed systematically. What are the scaling laws with respect to interaction steps, model size, planner compute, and environment complexity, and how do these interact?
- Drivers of scaling variability: Compute helps in some environments but not others; the causal factors (stochasticity, partial observability, branching factor, causal sparsity, reward sparsity, change frequency) remain unidentified.
- Planner–model confounding: Planning performance may be limited by the planner rather than the learned model. How can the benchmark isolate model quality from planner quality (e.g., standardized planners, planner-agnostic probes, or model-usage diagnostics)?
- Proxy-task validity: The relationship between masked-frame prediction accuracy and true dynamics understanding is unclear. When does success on masking translate to better causal/temporal understanding, and when can it be solved by heuristics?
- Change-detection granularity: The taxonomy, subtlety, and effect sizes of causal changes are not characterized. How sensitive are agents to small vs. large causal shifts, and what are false positive/negative rates across change types?
- Uncertainty and calibration: The benchmark does not measure whether agents learn well-calibrated epistemic uncertainty or exploit it in planning. Can uncertainty-aware scoring (e.g., proper scoring rules, risk-sensitive returns) be incorporated? A minimal proper-scoring-rule sketch appears after this list.
- Robustness to non-stationarity: Beyond discrete change-detection tasks, how robust are learned models to gradual drifts, cyclic changes, adversarial perturbations, or latent confounders across episodes?
- Continual and lifelong learning: Does knowledge persist across many environments without catastrophic forgetting? The benchmark does not test retention, consolidation, or transfer after sequential exposure to multiple domains.
- Rapid adaptation/meta-learning: Are agents allowed or able to adapt their models online during the test phase? The speed–accuracy trade-off for test-time adaptation is not evaluated or standardized.
- Human–AI comparison controls: Humans bring extensive priors. How should instructions, time limits, training, and feedback be controlled to produce commensurate comparisons, and how should inter-human variability be reported?
- Interface and modality bias: Although behavior-based, the I/O interfaces may advantage certain architectures (e.g., token-based LLMs vs. pixel-based models). How can interfaces be standardized or diversified to reduce modality bias?
- Compute- and data-normalized metrics: Results are not normalized by wall-clock, energy, or interaction steps. How should efficiency metrics (performance per FLOP/second/step) be integrated into scoring to enable fair comparisons?
- Psychometric analysis of tasks: Task difficulty and discrimination are not calibrated. Can item-response theory or similar analyses be used to estimate a latent “world-model quality” trait and to design balanced task sets?
- Cross-task metric alignment: It is unknown whether the three task families load onto a single latent factor of “world-model quality.” What is the correlation structure among tasks, and which are redundant vs. complementary?
- Overfitting and task leakage: Open-ended tests risk agents tuning exploration to anticipated tasks. How can the protocol incorporate hidden tasks, holdout environment families, or procedurally generated tests to reduce leakage?
- Interpretability and causal structure extraction: The framework does not evaluate whether learned models encode interpretable causal mechanisms. Can probes or interventions assess whether agents recover latent objects, relations, and dynamics?
- Rare-event and long-tail dynamics: The benchmark does not explicitly test learning from rare transitions or handling black-swan events. How can rare-event sampling and evaluation be incorporated without excessive interaction budgets?
- Multi-agent and social dynamics: The environments are single-agent and non-social. How should the benchmark extend to cooperative/competitive settings and theory-of-mind requirements for modeling other agents?
- Memory and external tools: The role of episodic memory, working memory limits, and external tool use (e.g., scratchpads, planners, simulators) in world-model learning is not characterized or controlled.
- Versioning and extensibility: Procedures for adding new environments/tasks while preserving comparability and backward compatibility are not specified (e.g., benchmark versioning, hidden test sets, difficulty tiers).
- Reproducibility details: Standardized seeds, environment variants, API constraints, and reporting protocols (e.g., ablations, confidence intervals, multiple-comparison controls) are not fully articulated in the presented excerpt.
- Language priors and cross-modal transfer: The benchmark does not test whether linguistic/world knowledge can accelerate model learning, nor how visual-LLMs transfer priors into interactive dynamics understanding.
- Safety and specification gaming: Behavior-based scoring may be susceptible to unintended shortcuts. How can the benchmark detect and discourage gaming or degenerate strategies that bypass genuine model learning?
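To make the uncertainty-and-calibration gap above concrete, a proper scoring rule such as the Brier score could be layered onto any prediction task where the agent reports probabilities. A minimal sketch, assuming the agent outputs a probability distribution over discrete outcomes (AutumnBench itself does not score uncertainty this way):

```python
# Minimal sketch of uncertainty-aware scoring with the Brier score,
# a proper scoring rule: an agent minimizes its expected penalty only
# by reporting its true beliefs. The probabilistic interface is an
# assumption; the benchmark as published does not use it.

def brier_score(probs, outcome):
    """probs: dict mapping each possible outcome to a probability.
    outcome: the outcome that actually occurred.
    Lower is better; 0 means full confidence in the right answer."""
    return sum((p - (o == outcome)) ** 2 for o, p in probs.items())

# Example: a well-calibrated hedge beats a confident wrong answer.
print(brier_score({"locked": 0.7, "open": 0.3}, "locked"))  # 0.18
print(brier_score({"locked": 0.0, "open": 1.0}, "locked"))  # 2.0
```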
Practical Applications
Immediate Applications
The paper introduces WorldTest (a behavior-based, reward-free exploration protocol with evaluation in a related but different environment) and AutumnBench (43 interactive environments; 129 tasks spanning masked-frame prediction, planning, and causal-dynamics change detection). The following applications can be adopted now:
- Benchmarking and model selection for world-model learning (software/AI)
- Use WorldTest + AutumnBench to compare model-based RL, LLM-agents, and hybrid systems in a black-box, behavior-scored way.
- Tools/products: internal leaderboards, evaluation dashboards, agent “gates” in CI/CD.
- Assumptions/dependencies: teams can run the open benchmark; mapping benchmark performance to target domain remains a judgment call.
- MLOps quality assurance for agentic systems (enterprise software, RPA)
- Add a WorldTest-inspired stage to pre-deployment QA: reward-free exploration in a sandboxed app, followed by tests on a modified app version to assess adaptation and planning.
- Tools/products: Agent QA harness, sandboxed UI simulators, pass/fail release criteria.
- Assumptions/dependencies: ability to build lightweight UI “twins” with controlled modifications.
- Safety and red-teaming of AI agents (cross-sector)
- Use behavior-based tests to detect brittle planning, poor change detection, or shortcutting that reward-only metrics miss.
- Tools/products: safety test batteries, model cards incorporating world-model scores.
- Assumptions/dependencies: agreed thresholds for “competency,” organizational buy-in.
- Robotics simulation tests for learning and adaptation (robotics)
- Port AutumnBench-style tasks to robot simulators to quickly vet agents’ planning and dynamics-change detection before hardware trials.
- Tools/products: world-model test suite for Gazebo/Isaac, pre-flight checklists.
- Assumptions/dependencies: available robot/digital-twin simulators; sim-to-real gap persists.
- Distribution-shift and anomaly detection evaluation (autonomy, manufacturing, IT ops)
- Use “predicting changes to causal dynamics” tasks as a proxy for detecting process drift or environment changes and for selecting agents robust to shifts (see the drift-detection sketch after this list).
- Tools/products: scenario libraries with controlled perturbations, alert calibration.
- Assumptions/dependencies: realistic perturbation generators; alignment with domain KPIs.
- Data imputation and partial observability stress tests (healthcare, finance, IoT)
- Treat masked-frame prediction as a stand-in for missing-sensor imputation or incomplete records; compare agents’ ability to infer unobserved states.
- Tools/products: offline imputation benchmarks, model comparisons for EHR/time-series.
- Assumptions/dependencies: domain datasets; privacy/compliance for sensitive data.
- Curriculum resources for teaching and research (education, academia)
- Deploy AutumnBench in courses to teach world models, exploration, and generalization; reproduce the human/model comparison to study cognitive gaps.
- Tools/products: teaching labs, homework auto-graders, open competitions.
- Assumptions/dependencies: students’ access to compute; institutional approvals for human studies.
- Cautionary deployment checks based on empirical findings (policy, product)
- Given that humans outperform frontier models and that scaling helps unevenly, require world-model evaluations beyond reward maximization before deployment in complex settings.
- Tools/products: procurement checklists, internal policy for “competency under shift.”
- Assumptions/dependencies: acceptance by governance/risk committees; clear pass criteria.
- Agent procurement and vendor evaluation (public sector, regulated industries)
- Include WorldTest-style capability assessments in RFPs to compare black-box vendor agents.
- Tools/products: standardized test suites and reporting templates.
- Assumptions/dependencies: sector-specific scenario curation; legal interoperability of tests.
- Human-in-the-loop system design (all sectors)
- Use results to route high-stakes or shift-prone tasks to humans or hybrid workflows until agent competencies improve.
- Tools/products: task triage policies, escalation rules driven by world-model scores.
- Assumptions/dependencies: reliable measurement pipelines; workforce readiness.
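As referenced in the distribution-shift item above, the change-detection task family maps naturally onto process-drift monitoring. A minimal sketch, assuming transitions are logged as (state, action, next_state) tuples; the counting scheme and threshold are illustrative, not the paper's method:

```python
# Minimal sketch of causal-dynamics change detection as drift
# monitoring: compare empirical transition frequencies from a
# reference window against a live window. The (state, action,
# next_state) logging format and the threshold are assumptions.

from collections import Counter

def transition_freqs(transitions):
    """Normalize counts of (state, action, next_state) tuples."""
    counts = Counter(transitions)
    total = len(transitions)
    return {t: c / total for t, c in counts.items()}

def drift_score(reference, live):
    """Total variation distance between the two frequency tables.
    0 = identical dynamics; 1 = completely disjoint."""
    ref, cur = transition_freqs(reference), transition_freqs(live)
    keys = set(ref) | set(cur)
    return 0.5 * sum(abs(ref.get(k, 0) - cur.get(k, 0)) for k in keys)

def rules_changed(reference, live, threshold=0.3):
    # Flag a dynamics change: a door that used to open but now stays
    # locked would shift the frequencies of its transition tuples.
    return drift_score(reference, live) > threshold
```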
Long-Term Applications
These leverage the protocol’s design principles (reward-free exploration, cross-environment testing, behavior-based scoring) but require further research, scaling, or domain adaptation:
- Sector standards and certification for general-purpose agents (policy, regulation)
- Establish world-model competency standards with cross-environment tests for safety-critical deployments (e.g., clinical, aviation, autonomy).
- Tools/products: accredited certification labs; compliance frameworks.
- Assumptions/dependencies: consensus on benchmarks; regulatory adoption; liability models.
- Sim-to-real world-model validation for autonomous systems (robotics, drones, self-driving)
- Agents explore digital twins to learn dynamics, then are evaluated in altered twins (layout changes, sensor faults) before staged real-world rollout.
- Tools/products: WorldTest-for-digital-twins, phased release pipelines.
- Assumptions/dependencies: high-fidelity twins; robust transfer; safe sandboxing in the real world.
- Adaptive grid and industrial operations (energy, manufacturing)
- World-model-driven agents for planning under contingencies (equipment outages, demand spikes), with routine tests on perturbed scenarios to prevent drift.
- Tools/products: continuous capability monitoring; shift-aware controllers.
- Assumptions/dependencies: integration with SCADA/EMS; rigorous safety cases.
- Autonomous scientific discovery and lab automation (R&D, biotech)
- Lab agents that explore instruments and protocols to learn causal dynamics, then plan novel experiments and detect setup changes.
- Tools/products: lab twin environments; experiment planners validated under shifts.
- Assumptions/dependencies: reliable simulators; experimental safety controls.
- Personalized tutoring and education agents (education technology)
- Tutors that build world models of student knowledge via exploration and adapt to curricular changes; evaluated on out-of-distribution tasks.
- Tools/products: learner-model diagnostics, adaptive curricula validated by shifted tests.
- Assumptions/dependencies: ethical data use; pedagogical validation; fairness audits.
- Logistics and emergency response planning (public safety, supply chain)
- Agents that learn city/network dynamics, then replan under disruptions (road closures, demand surges) validated through controlled scenario shifts.
- Tools/products: scenario banks; resilience drills with agent-in-the-loop.
- Assumptions/dependencies: access to real-time data; coordination with authorities.
- Financial strategy agents robust to structural breaks (finance)
- Certification of planning/imputation under regime changes (policy shocks, liquidity crises) using WorldTest-like evaluation.
- Tools/products: stress-testing harnesses, model risk governance artefacts.
- Assumptions/dependencies: realistic market simulators; stringent risk controls.
- Clinical decision-support agents with change-awareness (healthcare)
- Require agents to pass masked/shifted evaluations (new guidelines, pathogen variants) before deployment; continuous monitoring with derived tests.
- Tools/products: clinical twin testbeds; shift-aware CDS pipelines.
- Assumptions/dependencies: regulatory approval; clinical validation; privacy and safety.
- Agent operating systems with continuous capability auditing (software platforms)
- “WorldTest-as-a-service” embedded in agent platforms: ongoing reward-free exploration on product surfaces with periodic, unknown downstream tests to measure generalization.
- Tools/products: platform-level capability scores; auto-regression on capability drift.
- Assumptions/dependencies: scalable evaluation infrastructure; test set secrecy/rotation.
- Cognitive assessment inspired by AutumnBench (human factors, HR)
- Adapt tasks to measure human causal reasoning, change detection, and planning for training or assessment in specific roles.
- Tools/products: standardized cognitive task batteries.
- Assumptions/dependencies: psychometric validation; fairness and legal considerations.
Notes on cross-cutting dependencies
- Domain adaptation: translating grid-world tasks to realistic simulators or digital twins is necessary for many sectors.
- Validity and metrics: behavior-based scores must correlate with domain outcomes; external validation will be needed.
- Data/compliance: healthcare/finance deployments require privacy, auditability, and regulatory alignment.
- Compute and tooling: running open-ended, repeated evaluations at scale requires efficient infrastructure and test rotation to prevent overfitting.
- Human oversight: given current findings (humans outperform; scaling helps unevenly), human-in-the-loop and guardrails remain essential in safety-critical contexts.
Glossary
- AutumnBench: A suite of interactive grid-world environments and tasks used to instantiate the WorldTest evaluation protocol. "We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics."
- behavior-based scoring: Scoring agents purely by observed behavior rather than internal representations or proxy metrics. "WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring) to evaluate what agents learn about environment dynamics,"
- black-box: An evaluation stance where an agent’s internal workings are hidden and only inputs/outputs are assessed. "behavior-based, meaning the framework treats the agent as a black-box and evaluates only by its external behavior"
- causal dynamics: The cause-effect mechanisms that govern how an environment’s state transitions occur. "predicting changes to the causal dynamics."
- causal graphs: Graphical models that encode causal relationships among variables. "such as next-frame prediction, programs, or causal graphs"
- counterfactual: Reasoning about what would happen under hypothetical changes to conditions or actions. "Cognitive science refers to this flexible, predictive, and counterfactual understanding as world model"
- derived tests: Evaluation tasks constructed from or following an agent’s prior interaction, not fixed a priori. "WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring)"
- downstream task: A task used after an unsupervised or goal-free phase to evaluate what was learned. "Then, the agent is tested with a downstream task in the same environment."
- environment dynamics: The rules governing how an environment changes in response to states and actions (including transitions and rewards). "a world model as a representation of environment dynamics, most commonly a function mapping histories of states and actions to predictions of future states and rewards"
- frontier models: State-of-the-art, large-scale models at the leading edge of capability. "We compared 517 human participants and three frontier models on AutumnBench."
- grid-world environments: Discrete, grid-based interactive environments often used for planning and RL research. "a suite of 43 interactive grid-world environments"
- Gym-like benchmarks: Benchmarks modeled after OpenAI Gym with reward-driven tasks and standardized interfaces. "Gym-like benchmarks resemble OpenAI Gym in that they model decision processes with explicit objectives such as rewards"
- high-level predicates: Symbolic statements (e.g., about objects or relations) used as structured representations for modeling and evaluation. "requiring models to use high-level predicates (e.g., object relations or causal effects)"
- LLM-based evaluation: Using LLMs to assess the quality of model outputs or reasoning. "using LLM-based evaluation for text-based models"
- masked-frame prediction: Predicting missing or occluded frames in a sequence based on observed context. "three families: masked-frame prediction, planning, and predicting changes to the causal dynamics."
- model representation: The internal format or structure in which a learned model encodes knowledge. "agnostic to model representation, allowing comparison across approaches."
- model-learning agents: Agents that actively acquire models of their environments through interaction. "Model-learning agents should gather information to learn world models"
- next-frame prediction: Predicting the immediate next observation (e.g., image frame) from recent history. "training and evaluation are anchored to next-frame prediction"
- non-interactive benchmarks: Benchmarks where environments do not evolve over time or in response to actions; agents infer rules from static examples. "Non-interactive benchmarks test whether agents can infer underlying rules from examples and generalize to novel test cases"
- OpenAI Gym: A widely used toolkit and API standard for reinforcement learning environments. "Gym-like benchmarks resemble OpenAI Gym in that they model decision processes with explicit objectives such as rewards"
- pixel-level reconstruction error: A visual-model metric computed by comparing predicted and true images at the pixel level. "measuring pixel-level reconstruction error for visual models"
- predicate prediction accuracy: The correctness of predicting symbolic predicates about a scene or dynamics. "evaluating predicate prediction accuracy"
- proxy measures: Indirect metrics used as stand-ins for desired capabilities, which may not reflect true competence. "The reliance on potentially inadequate proxy measures further constrains evaluation"
- representation-based approaches: Evaluation methods that constrain models to specific representational formats and assess within those formats. "Representation-based approaches evaluate world models by requiring them to use specific, predefined formats"
- reward maximization: Optimizing behavior to accumulate as much reward as possible. "success is scored by reward maximization in the same environment."
- reward-free exploration: Exploration without access to or optimization of reward signals. "WorldTest provides a novel template (reward-free exploration, derived tests, and behavior-based scoring)"
- reward-free interaction: An interaction phase where agents gather information without reward signals. "separates reward-free interaction from a scored test phase in a different but related environment."
- scaling compute: Increasing computational resources (e.g., model size, training time) to improve performance. "scaling compute improves performance only in some environments but not others."
- scored test phase: A distinct evaluation phase in which agent performance is quantitatively measured after interaction. "separates reward-free interaction from a scored test phase in a different but related environment."
- two-phase protocol: An evaluation design with separate interaction (often goal-free) and subsequent testing phases. "by evaluating agents using a two-phase protocol"
- white-box model: A model whose internal structure and representations are exposed to evaluators. "these methods require a white-box model and prevent comparison between different model types or assessment against human performance."
- world model: An internal model of how an environment evolves and responds to actions, used for prediction and planning. "a world model as a representation of environment dynamics"
- world-model learning: The process of acquiring a world model from interaction or data. "approaches to evaluating world-model learning"
- WorldTest: A protocol that evaluates model-learning agents via reward-free interaction followed by testing in a related environment. "We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment."