
WorldTest Framework Evaluation

Updated 29 October 2025
  • WorldTest Framework is a protocol that assesses an agent's ability to build predictive models through reward-free interaction and testing in altered environments.
  • By decoupling the exploration phase from the testing phase, the framework isolates true generalization performance across both artificial and human agents.
  • AutumnBench, the first implementation, benchmarks tasks like masked-frame prediction, planning, and change detection to evaluate world-model learning.

The WorldTest Framework is a protocol for evaluating world-model learning in agents, explicitly designed to measure the capacity to acquire, generalize, and deploy predictive models of environment dynamics. WorldTest departs from traditional evaluation by decoupling reward-free environment interaction from a subsequent scored test phase in a modified, but related, environment. The framework is open-ended, behavior-based, and agnostic to model representation, enabling rigorous comparison across artificial and human agents. Its first concrete instantiation, AutumnBench, offers a comprehensive suite for benchmarking world-model acquisition and generalization.

1. Motivation and Rationale

The origin of WorldTest is grounded in the need for systematic, rigorous evaluation of agents' world-model learning—the ability to build predictive, flexible, and counterfactual models supporting transfer to unseen downstream tasks. Existing benchmarks focus on next-frame prediction or reward maximization in static environments, conflating training and testing, and often predisposing agents to specialize or memorize for narrow objectives. Current methods typically use non-interactive or model-dependent diagnostics, impeding direct comparison between architectures and against humans. There are no prior standards that comprehensively evaluate how well agents generalize their learned environment understanding to novel but related challenges. WorldTest addresses this by providing a representation-agnostic, interactive, and task-agnostic protocol for evaluating the true breadth of world-model knowledge.

2. Design Principles

WorldTest is defined by four central design tenets:

  1. Interactive Exploration: Agents actively interface with environments rather than learning from static datasets, enabling both interventions and hypothesis testing.
  2. Behavior-Based (Black-box) Evaluation: Agents are evaluated solely on externally exhibited behavior during test tasks; their internal state, learning mechanisms, and representations are not probed or assumed.
  3. Goal-Free Interaction Phase: During exploration, there are no extrinsic rewards or explicit tasks, mirroring how humans explore without immediate objectives.
  4. Testing in Modified Environments: The evaluation challenge is posed in a variant of the original environment—altered objectives, changed dynamics, or new observables—constructed after the exploration phase.

This architecture ensures the resulting metric reflects the depth and adaptability of the acquired world-model, not memorized policy sequences or overfitting to preset tasks.

3. Protocol and Evaluation Structure

WorldTest divides agent interaction into a dual-phase procedure:

  • Interaction Phase—Agents interact with a reward-free partially observable Markov decision process (POMDP), $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \Omega \rangle$, with the liberty to reset and repeat episodes, but no access to downstream objectives. This phase is explicitly designed to prompt general world-model formation.
  • Test Phase—Upon termination of exploration, a hidden task generator $\tau$ samples a derived challenge $(\mathcal{M}', R, H)$ from $\mathcal{M}$ and random seed $\xi \sim P_\Xi$, formally

$\tau(\mathcal{M}, \xi) \to (\mathcal{M}', R, H)$

Agents, equipped solely with any internal representations learned in the interaction phase, must solve explicit tasks in this new environment, with performance scored on objective outcomes rather than policy similarity or next-step accuracy.

This structure explicitly enforces generalization; agents encounter previously unseen challenge specifications, precluding scripted or overfitted policy solutions.
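
To make the two-phase procedure concrete, the following is a minimal Python sketch of the protocol under assumed interfaces: the `agent`, `base_env`, and `task_generator` objects and their methods (`explore`, `act`, `reset`, `step`) are hypothetical stand-ins, not the actual AutumnBench API.

```python
def run_worldtest(agent, base_env, task_generator, seed, interaction_budget):
    """Sketch of the WorldTest protocol: reward-free interaction, then a scored
    test in a derived environment. All interfaces here are illustrative."""
    # Phase 1: reward-free interaction. The agent explores freely, may reset
    # episodes, and receives no reward signal or downstream objective.
    obs = base_env.reset()
    for _ in range(interaction_budget):
        action = agent.explore(obs)
        if action == "reset":
            obs = base_env.reset()
        else:
            obs = base_env.step(action)  # observations only, no reward

    # Phase 2: a hidden task generator tau(M, xi) -> (M', R, H) derives a
    # modified environment, reward function, and horizon from the base POMDP.
    test_env, reward_fn, horizon = task_generator(base_env, seed)

    # Scoring is purely behavioral: only outcomes in the derived challenge count.
    obs, total_score = test_env.reset(), 0.0
    for _ in range(horizon):
        action = agent.act(obs)
        obs = test_env.step(action)
        total_score += reward_fn(obs, action)
    return total_score
```

The point the sketch illustrates is that `test_env` and `reward_fn` only come into existence after exploration ends, so the agent cannot specialize to them during the interaction phase.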

4. Openness and Representation Agnosticism

WorldTest is designed to be:

  • Open-ended: There is no predetermined set of downstream tasks. The protocol is constructed to support evaluation independent of any specific pre-specified objectives.
  • Representation-Agnostic: Any agent type—neural, symbolic, hybrid, or human—can be compared. The test is indifferent to internal structure, requiring only observable behavior in the derived challenge.
  • Extensible: The framework can readily incorporate novel challenge generators, new POMDP variations, and additional environment classes, supporting ongoing evolution as world-model learning advances.

This facilitates uniform measurement of world-model learning regardless of advances in underlying agent architecture or environment design.

5. Instantiation: AutumnBench

AutumnBench is the first comprehensive implementation of WorldTest, offering:

  • Environments: 43 interactive grid-worlds defined in the Autumn DSL (for 2D POMDPs), varying in grid size (3×3 to 25×25), object diversity, palette, stochasticity, simple physics, multi-object interactions, and structure inspired by classic games and tool use.
  • Challenge Tasks: For each environment, three distinct families of scoring challenges are posed, requiring broad world-model capabilities:

    1. Masked-Frame Prediction (MFP): Infer (multiple choice) a masked region in the final trajectory observation:

    $\text{score} = \mathbbm{1}[\text{correct}]$

    2. Planning: Generate an action sequence to transform the initial state into a presented goal configuration in a subgrid:

    $\text{score} = \mathbbm{1}[\, o_T|_\mathcal{S} = g \,]$

    3. Change Detection (CD): Detect the (unknown) time step at which the environment dynamics shift (a scoring sketch for all three families follows this list):

    $\text{score}(t) = \begin{cases} 0, & t < t^*-1 \\ 1, & t \in \{t^*-1, t^*\} \\ 1.377\, f_{t^*}(t) - 1.178, & \text{otherwise} \end{cases}$

    where $t^*$ is the true change point and $f_{t^*}(t)$ defines the penalty shape.

  • Scale: 129 total challenge problems across the three families.
  • Human and Agent Evaluation: 517 Prolific participants and three frontier LLMs (Anthropic Claude 4 Sonnet, OpenAI o3, Google Gemini 2.5 Pro) tested via matched interfaces. Models interact via a text interface with observation, action, and history feedback.
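
As a rough illustration of the three scoring rules above, here is a hedged Python sketch; `penalty_shape` is a hypothetical placeholder for the penalty shape $f_{t^*}(t)$, which is not fully specified here.

```python
def mfp_score(chosen_option, correct_option):
    """Masked-frame prediction: indicator of the correct multiple-choice answer."""
    return 1.0 if chosen_option == correct_option else 0.0

def planning_score(final_obs_subgrid, goal_subgrid):
    """Planning: indicator that the final observation, restricted to the
    scored subgrid, matches the goal configuration g."""
    return 1.0 if final_obs_subgrid == goal_subgrid else 0.0

def change_detection_score(t, t_star, penalty_shape):
    """Change detection: full credit at t*-1 or t*, zero credit before, and a
    shaped penalty for late answers. `penalty_shape` stands in for f_{t*}(t)."""
    if t < t_star - 1:
        return 0.0
    if t in (t_star - 1, t_star):
        return 1.0
    return 1.377 * penalty_shape(t) - 1.178
```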

6. Comparative Assessment and Main Empirical Findings

WorldTest supports black-box, behavior-only scoring, providing direct comparisons between humans and artificial agents, as well as among varied algorithmic approaches. Task score, exploration patterns, reset frequency, sample efficiency, and normalized perplexity are utilized as metrics. Normalized perplexity,

$\text{Perplexity}_\text{norm} = \frac{2^{H(p)}-1}{K-1}$

where $H(p)$ is the entropy of the action distribution and $K$ is the number of unique actions, encodes the focus and systematicity of exploration.
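
A minimal computation sketch of this metric, assuming the empirical distribution over the agent's executed actions and base-2 entropy:

```python
import math
from collections import Counter

def normalized_perplexity(actions):
    """Normalized perplexity (2^H(p) - 1) / (K - 1) of an action sequence,
    where H(p) is the base-2 entropy of the empirical action distribution and
    K is the number of unique actions taken."""
    counts = Counter(actions)
    total = sum(counts.values())
    k = len(counts)
    if k <= 1:
        return 0.0  # a single repeated action is maximally focused
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    return (2 ** entropy - 1) / (k - 1)
```

A uniform action distribution yields a value of 1, while increasingly concentrated distributions push the value toward 0, which is why lower normalized perplexity is read as more focused, systematic exploration.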

Key results from AutumnBench:

  • Humans outperform all current reasoning models, often approaching perfect scores where models perform near random baselines.
  • Increasing inference compute improves model performance in only 25 of 43 environments; in 18 environments, scaling has no effect. Especially for change detection, additional computation does not yield discernible improvement.
  • Human exploration is characterized by systematic structure, frequent use of resets for strategic experimentation, and adaptive behavior on encountering altered dynamics. Humans demonstrate lower normalized perplexity, indicative of more systematic exploration.
  • Current models under-utilize resets, rarely update beliefs in the face of contradictory evidence, and struggle with causal experimentation in test environments.
  • Substantial performance headroom remains between model and human baselines, highlighting gaps in current world-model learning.

7. Properties and Significance

A summary of core WorldTest features, as realized in AutumnBench, is given below:

Property                        WorldTest/AutumnBench
Interactive                     ✓
Behavior-based (Black-box)      ✓
Representation Agnostic         ✓
Goal-Free Exploration           ✓
Modified Test Environment       ✓
Human/Model Comparison          ✓

The framework establishes a rigorous, extensible reference template for world-model evaluation—generalizing beyond grid environments—and isolates world-model quality from training paradigms, architecture specifics, or environment overfitting.


WorldTest and its instantiation, AutumnBench, establish a new standard for the open, extensible, and comparative benchmarking of world-model learning. By dissociating learning from immediate task supervision and enforcing generalization to previously hidden challenges, the framework enables diagnosis of environment understanding, exposes principal limitations in current agent methodologies, and supports advancement in both artificial and human-like world-model learning.
