WorldTest Framework Evaluation
- The WorldTest Framework is a protocol that assesses an agent's ability to build predictive models through reward-free interaction followed by testing in altered environments.
- By decoupling the exploration phase from the testing phase, the framework isolates true generalization performance across both artificial and human agents.
- AutumnBench, the first implementation, benchmarks tasks like masked-frame prediction, planning, and change detection to evaluate world-model learning.
The WorldTest Framework is a protocol for evaluating world-model learning in agents, explicitly designed to measure the capacity to acquire, generalize, and deploy predictive models of environment dynamics. WorldTest departs from traditional evaluation by decoupling reward-free environment interaction from a subsequent scored test phase in a modified, but related, environment. The framework is open-ended, behavior-based, and agnostic to model representation, enabling rigorous comparison across artificial and human agents. Its first concrete instantiation, AutumnBench, offers a comprehensive suite for benchmarking world-model acquisition and generalization.
1. Motivation and Rationale
The origin of WorldTest is grounded in the need for systematic, rigorous evaluation of agents' world-model learning—the ability to build predictive, flexible, and counterfactual models supporting transfer to unseen downstream tasks. Existing benchmarks focus on next-frame prediction or reward maximization in static environments, conflating training and testing, and often predisposing agents to specialize or memorize for narrow objectives. Current methods typically use non-interactive or model-dependent diagnostics, impeding direct comparison between architectures and against humans. There are no prior standards that comprehensively evaluate how well agents generalize their learned environment understanding to novel but related challenges. WorldTest addresses this by providing a representation-agnostic, interactive, and task-agnostic protocol for evaluating the true breadth of world-model knowledge.
2. Design Principles
WorldTest is defined by four central design tenets:
- Interactive Exploration: Agents actively interact with environments rather than learning from static datasets, enabling both interventions and hypothesis testing.
- Behavior-Based (Black-box) Evaluation: Agents are evaluated solely on externally exhibited behavior during test tasks; their internal state, learning mechanisms, and representations are not probed or assumed.
- Goal-Free Interaction Phase: During exploration, there are no extrinsic rewards or explicit tasks, mirroring how humans explore without immediate objectives.
- Testing in Modified Environments: The evaluation challenge is posed in a variant of the original environment—altered objectives, changed dynamics, or new observables—constructed after the exploration phase.
This architecture ensures the resulting metric reflects the depth and adaptability of the acquired world-model, not memorized policy sequences or overfitting to preset tasks.
3. Protocol and Evaluation Structure
WorldTest divides agent interaction into a dual-phase procedure:
- Interaction Phase—Agents interact with a reward-free partially observable Markov decision process (POMDP) $\mathcal{E}$, with the liberty to reset and repeat episodes, but no access to downstream objectives. This phase is explicitly designed to prompt general world-model formation.
- Test Phase—Upon termination of exploration, a hidden task generator $\mathcal{G}$ samples a derived challenge environment from the base POMDP $\mathcal{E}$ and a random seed $\sigma$, formally $\mathcal{E}' = \mathcal{G}(\mathcal{E}, \sigma)$.
Agents, equipped solely with any internal representations learned in the interaction phase, must solve explicit tasks in this new environment, with performance scored on objective outcomes rather than policy similarity or next-step accuracy.
This structure explicitly enforces generalization: agents encounter previously unseen challenge specifications, precluding scripted or overfitted policy solutions. A schematic sketch of the two-phase protocol is given below.
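As a concrete illustration, the following Python sketch shows how the two-phase WorldTest loop could be organized. The `Env`, `Agent`, and `ChallengeGenerator` interfaces, as well as the `run_worldtest` and `score_outcome` names, are hypothetical constructs introduced here for illustration; they do not correspond to a published API.

```python
# Minimal sketch of the WorldTest two-phase protocol (illustrative only).
from typing import Any, Protocol


class Env(Protocol):
    def reset(self) -> Any: ...          # start a fresh episode, return the first observation
    def step(self, action: Any) -> Any: ...  # apply an action, return the next observation


class Agent(Protocol):
    def explore(self, obs: Any) -> Any: ...              # free-exploration policy (may return "reset")
    def act(self, obs: Any, task_spec: Any) -> Any: ...  # policy for the derived challenge


class ChallengeGenerator(Protocol):
    def sample(self, base_env: Env, seed: int) -> tuple[Env, Any]:
        """Derive a modified environment and its task specification from the base POMDP."""
        ...


def run_worldtest(agent: Agent, base_env: Env, gen: ChallengeGenerator,
                  explore_steps: int, test_steps: int, seed: int = 0) -> float:
    # Phase 1: reward-free interaction -- no task, no score, resets allowed.
    obs = base_env.reset()
    for _ in range(explore_steps):
        action = agent.explore(obs)
        obs = base_env.reset() if action == "reset" else base_env.step(action)

    # Phase 2: a hidden generator derives a modified environment and an explicit task.
    test_env, task_spec = gen.sample(base_env, seed)
    obs = test_env.reset()
    for _ in range(test_steps):
        # For multiple-choice tasks, the "action" may simply be the agent's answer.
        obs = test_env.step(agent.act(obs, task_spec))

    # Scoring is behavior-based: only the outcome in the test environment counts.
    return score_outcome(obs, task_spec)


def score_outcome(final_obs: Any, task_spec: Any) -> float:
    raise NotImplementedError("task-family-specific scoring (e.g., MFP, planning, change detection)")
```

In this sketch the scoring rule is deliberately left abstract; the task-family-specific rules used by AutumnBench are described in the next sections.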
4. Openness and Representation Agnosticism
WorldTest is designed to be:
- Open-ended: There is no predetermined set of downstream tasks. The protocol is constructed to support evaluation independent of any specific pre-specified objectives.
- Representation-Agnostic: Any agent type—neural, symbolic, hybrid, or human—can be compared. The test is indifferent to internal structure, requiring only observable behavior in the derived challenge.
- Extensible: The framework can readily incorporate novel challenge generators, new POMDP variations, and additional environment classes, supporting ongoing evolution as world-model learning advances.
This facilitates uniform measurement of world-model learning regardless of advances in underlying agent architecture or environment design.
5. Instantiation: AutumnBench
AutumnBench is the first comprehensive implementation of WorldTest, offering:
- Environments: 43 interactive grid-worlds defined in the Autumn DSL (a language for 2D POMDPs), varying in grid size (3×3 to 25×25), object diversity, color palette, and stochasticity, with dynamics spanning simple physics, multi-object interactions, and structures inspired by classic games and tool use.
- Challenge Tasks: For each environment, three distinct families of scored challenges are posed, requiring broad world-model capabilities (illustrative scoring functions are sketched after this list):
1. Masked-Frame Prediction (MFP): Infer (via multiple choice) the contents of a masked region in the final observation of the trajectory:
$\text{score} = \mathbbm{1}[\text{correct}]$
2. Planning: Generate an action sequence to transform the initial state into a presented goal configuration in a subgrid:
$\text{score} = \mathbbm{1}[\, o_T|_\mathcal{S} = g \,]$
3. Change Detection (CD): Detect the (unknown) time step at which the environment dynamics shift, scored by a distance-based penalty around the true change point:
$\text{score} = \phi_{\lambda}\!\left(|\hat{t} - t^{\ast}|\right)$
where $t^{\ast}$ is the true change point, $\hat{t}$ is the agent's reported change point, and $\lambda$ defines the penalty shape.
- Scale: 129 total challenge problems across the three families.
- Human and Agent Evaluation: 517 Prolific participants and three frontier LLMs (Anthropic Claude 4 Sonnet, OpenAI o3, Google Gemini 2.5 Pro) were evaluated via matched interfaces; models interact through a text interface providing observation, action, and history feedback.
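The scoring rules for the three task families can be made concrete with the short Python sketch below. The function names and the exponential penalty used for change detection are assumptions for illustration; the benchmark specifies a distance-based penalty shaped by a parameter, but its exact functional form is not reproduced here.

```python
# Illustrative scoring rules for the three AutumnBench task families (assumed forms).
import math

import numpy as np


def score_mfp(predicted_choice: int, correct_choice: int) -> float:
    """Masked-frame prediction: indicator of choosing the correct option."""
    return float(predicted_choice == correct_choice)


def score_planning(final_obs: np.ndarray, goal: np.ndarray,
                   subgrid: tuple[slice, slice]) -> float:
    """Planning: indicator that the final observation matches the goal on the scored subgrid."""
    return float(np.array_equal(final_obs[subgrid], goal))


def score_change_detection(predicted_t: int, true_t: int, lam: float = 0.1) -> float:
    """Change detection: penalty decreasing with distance from the true change point.
    The exponential shape exp(-lam * |t_hat - t*|) is one plausible choice, not the
    benchmark's definitive formula."""
    return math.exp(-lam * abs(predicted_t - true_t))
```

Under these forms all scores lie in [0, 1], which makes aggregation across task families straightforward.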
6. Comparative Assessment and Main Empirical Findings
WorldTest supports black-box, behavior-only scoring, providing direct comparisons between humans and artificial agents, as well as among varied algorithmic approaches. Task score, exploration patterns, reset frequency, sample efficiency, and normalized perplexity are utilized as metrics. Normalized perplexity,
$\text{NP} = \frac{\exp(H)}{N},$
where $H$ is the entropy of the agent's action distribution and $N$ is the number of unique actions, encodes the focus and systematicity of exploration.
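Under this reading, normalized perplexity can be computed from a logged action sequence as in the sketch below (the function name and the natural-log convention are illustrative assumptions):

```python
# Sketch of the normalized-perplexity exploration metric: perplexity of the
# empirical action distribution divided by the number of unique actions.
# Values near 1 indicate near-uniform action usage; lower values indicate
# more concentrated, systematic exploration.
import math
from collections import Counter


def normalized_perplexity(actions: list) -> float:
    counts = Counter(actions)
    n = len(actions)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log(p) for p in probs)   # Shannon entropy (natural log)
    return math.exp(entropy) / len(counts)           # perplexity / number of unique actions


# A focused explorer scores lower than one acting uniformly at random:
print(normalized_perplexity(["up"] * 9 + ["down"]))                 # ~0.69
print(normalized_perplexity(["up", "down", "left", "right"] * 5))   # 1.0
```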
Key results from AutumnBench:
- Humans outperform all current reasoning models, often approaching perfect scores where models perform near random baselines.
- Increasing inference compute improves model performance in only 25 of 43 environments; in the remaining 18, scaling has no effect. For change detection in particular, additional computation yields no discernible improvement.
- Human exploration is characterized by systematic structure, frequent use of resets for strategic experimentation, and adaptive behavior on encountering altered dynamics. Humans demonstrate lower normalized perplexity, indicative of more systematic exploration.
- Current models under-utilize resets, rarely update beliefs in the face of contradictory evidence, and struggle with causal experimentation in test environments.
- Substantial performance headroom remains between model and human baselines, highlighting gaps in current world-model learning.
7. Properties and Significance
A summary of core WorldTest features, as realized in AutumnBench, is given below:
| Property | WorldTest/AutumnBench |
|---|---|
| Interactive | ✓ |
| Behavior-based (Black-box) | ✓ |
| Representation Agnostic | ✓ |
| Goal-Free Exploration | ✓ |
| Modified Test Environment | ✓ |
| Human/Model Comparison | ✓ |
The framework establishes a rigorous, extensible reference template for world-model evaluation—generalizing beyond grid environments—and isolates world-model quality from training paradigms, architecture specifics, or environment overfitting.
WorldTest and its instantiation, AutumnBench, establish a new standard for open, extensible, and comparative benchmarking of world-model learning. By dissociating learning from immediate task supervision and enforcing generalization to previously hidden challenges, the protocol enables diagnosis of environment understanding, exposes principal limitations in current agent methodologies, and supports advancement in both artificial and human-like world-model learning.