AutumnBench: World-Model Evaluation
- AutumnBench is a model-agnostic benchmark suite that assesses world-model learning in both humans and artificial agents using interactive grid-world tasks.
- It operationalizes evaluation by separating free-form exploratory interactions from scored test phases using the WorldTest protocol, focusing on behavior-based metrics and out-of-distribution challenges.
- Experimental results reveal that current models lag behind human performance, emphasizing the need for flexible, hypothesis-driven strategies in adaptive model construction.
AutumnBench is a model-agnostic benchmark suite for evaluating world-model learning in both humans and artificial agents. It is instantiated within the WorldTest protocol, which separates reward-free exploratory interaction from scored test phases in derived environments. AutumnBench operationalizes world-model learning assessment via a diverse set of interactive grid-world environments and tasks, focusing on behavior-based metrics and explicitly out-of-distribution generalization challenges. The benchmark has exposed fundamental limitations in current model-based reasoning approaches, revealing substantial headroom in matching human-level adaptive model construction and inference.
1. WorldTest Protocol and Motivation
WorldTest is designed to address the disconnect between contemporary world-model evaluation practices—typically anchored to next-frame prediction and in-environment reward maximization—and the broader goal of equipping agents to build flexible, generalizable models of environment dynamics. The protocol consists of two phases:
- Interaction Phase: The agent interacts with a base POMDP environment $\mathcal{M}$, free of external reward or explicit objectives. The agent is allowed to act, reset, and explore without constraints, and can choose when to end exploration.
- Test Phase: A deterministic transformation produces a challenge environment, a reward function, and a horizon, all unknown during the interaction phase. The agent deploys its previously learned model to solve tasks in this modified environment and is scored purely on behavioral performance.
This methodology tests both latent model construction and open-ended generalization, supporting fair comparison between systems and human participants.
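To make the two-phase structure concrete, the following is a minimal Python sketch of the WorldTest evaluation loop. The interface names (`run_worldtest`, `derive_challenge`, `agent.explore`, `agent.act`, and so on) are illustrative assumptions, not AutumnBench's actual API.

```python
# Minimal sketch of the WorldTest two-phase protocol.
# All class and method names are hypothetical placeholders, not AutumnBench's API.

def run_worldtest(agent, base_env, derive_challenge):
    # Interaction phase: reward-free exploration of the base environment.
    obs = base_env.reset()
    while not agent.done_exploring():
        action = agent.explore(obs)       # may be a regular action, a no-op, or a reset
        obs = base_env.step(action)       # note: no reward signal is ever provided here

    # Test phase: a derived challenge environment, reward function, and horizon,
    # none of which were visible to the agent during the interaction phase.
    challenge_env, reward_fn, horizon = derive_challenge(base_env)
    obs, score = challenge_env.reset(), 0.0
    for t in range(horizon):
        action = agent.act(obs)             # the agent must rely on its learned world model
        obs = challenge_env.step(action)
        score += reward_fn(obs, action, t)  # scoring is purely behavioral
    return score
```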
2. AutumnBench Design and Environment Suite
AutumnBench comprises 43 interactive grid-world environments, specified via the Autumn DSL—a functional reactive language for 2D causal interactions. These environments vary in:
- Grid size: varies across environments, with a common default size.
- Object and color diversity: Approximately five types per environment, 1–12 colors.
- Stochasticity: 19/43 environments sample stochastic transitions.
- Domains: Physics simulations, emergent multi-agent systems, logic puzzles, Nim-like games, tool-use tasks, and phenomena such as sandcastle construction.
- Design desiderata: Structural novelty, human intuitiveness, and diverse dynamics, with extensibility for new domains and challenge types.
The design is intended to foster assessment of agent adaptability and transferable structural inference across a broad range of dynamical systems.
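As a purely illustrative summary of the dimensions listed above, the following hypothetical Python dataclass captures the metadata that characterizes each environment; it is a descriptive sketch, not AutumnBench's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EnvSpec:
    """Hypothetical per-environment metadata, mirroring the dimensions listed above."""
    name: str               # e.g. a physics simulation, logic puzzle, or Nim-like game
    grid_size: int          # side length of the square grid
    num_object_types: int   # roughly five object types per environment
    num_colors: int         # between 1 and 12 colors
    stochastic: bool        # 19 of the 43 environments sample stochastic transitions
    domain: str             # "physics", "multi-agent", "logic", "tool-use", ...
```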
3. Task Families and Challenge Types
Each environment in AutumnBench is associated with three challenge types (totaling 129 tasks):
- Masked Frame Prediction (MFP): Given partially masked action/observation sequences, agents select which of six candidates for the masked region of the final observation is correct.
- Scoring: $\text{score} = \mathbbm{1}[\text{correct}]$
- Planning: Agents attempt to reach a specified target configuration or goal state in a designated grid subregion.
- Scoring: $\text{score} = \mathbbm{1}[\,o_T|_{\mathcal{S}} = g\,]$, where $o_T|_{\mathcal{S}}$ is the final observation restricted to the designated subregion $\mathcal{S}$ and $g$ is the goal configuration.
- Change Detection (CD): During test, the environment's transition rule changes at an unknown time. The agent must report when its observations become impossible under the original dynamics.
- Change point: $t^* = \min \left\{ t \geq 1 : \mathbb{P}_{\mathcal{M}}(o_{0:t} = o'_{0:t} \mid a_{1:t}) = 0,\ \mathbb{P}_{\mathcal{M}}(o_{0:t-1} = o'_{0:t-1} \mid a_{1:t-1}) > 0 \right\}$, where $\mathcal{M}$ denotes the original dynamics and $o'_{0:t}$ the observations actually received during the test.
- Scoring: a score reflecting how accurately the agent's reported change time matches the true change point $t^*$ (a code sketch of all three scoring rules follows this list).
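To make the three scoring rules concrete, here is a minimal Python sketch under the definitions above. The MFP and Planning scores follow the indicator formulas directly; the Change Detection score is written as a hedged placeholder, since the exact smooth scoring formula is not reproduced here.

```python
import numpy as np

def mfp_score(chosen_candidate: int, correct_candidate: int) -> float:
    """Masked Frame Prediction: indicator that the chosen masked-region candidate is correct."""
    return float(chosen_candidate == correct_candidate)

def planning_score(final_obs: np.ndarray, region_mask: np.ndarray, goal: np.ndarray) -> float:
    """Planning: indicator that the final observation, restricted to the target
    subregion S (given as a boolean mask), equals the goal configuration g."""
    return float(np.array_equal(final_obs[region_mask], goal[region_mask]))

def change_detection_score(reported_t: int, true_t: int, horizon: int) -> float:
    """Change Detection: placeholder for a score that rewards reporting a change
    time close to the true change point t*. This linear penalty is only
    illustrative, not AutumnBench's exact formula."""
    return max(0.0, 1.0 - abs(reported_t - true_t) / horizon)
```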
This multidimensional task design probes both direct model prediction accuracy and the depth of agent reasoning about causality and temporal change.
4. Metrics for Behavioral and Model Learning Assessment
AutumnBench quantifies agent learning and behavior via:
- Task Scores: binary or smooth values based on challenge completion (see the formulas above).
- Normalized Perplexity: measures how focused the agent's action distribution is during exploration, computed from the entropy $H$ of that distribution and the size $|\mathcal{A}|$ of the action alphabet, and normalized so that a uniform random policy scores 1 and a fully deterministic policy scores 0 (see the sketch after this list).
- AUC of Perplexity: Area under the perplexity curve, capturing the efficiency of learning focus (lower is better).
- Exploration Strategies: Analysis of resets, no-ops, and targeted experiment design.
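The following Python sketch shows one plausible way to compute normalized perplexity and its AUC over an exploration trajectory. The specific normalization $(2^{H}-1)/(|\mathcal{A}|-1)$ is an assumption chosen to satisfy the stated endpoints (uniform random = 1, deterministic = 0), not necessarily the exact formula used by AutumnBench.

```python
import numpy as np

def normalized_perplexity(action_probs: np.ndarray) -> float:
    """Normalized perplexity of one action distribution over an alphabet of size |A|.
    Assumed normalization: (2**H - 1) / (|A| - 1), which equals 1 for a uniform
    distribution and 0 for a deterministic one, matching the stated endpoints."""
    p = np.asarray(action_probs, dtype=float)
    nonzero = p[p > 0]                                    # ignore zero-probability actions
    entropy_bits = -np.sum(nonzero * np.log2(nonzero))    # H in bits
    alphabet_size = p.size
    return float((2.0 ** entropy_bits - 1.0) / (alphabet_size - 1.0))

def perplexity_auc(per_step_action_probs) -> float:
    """Area under the normalized-perplexity curve across exploration steps,
    approximated as a unit-width sum (lower means the agent focuses earlier)."""
    curve = [normalized_perplexity(p) for p in per_step_action_probs]
    return float(np.sum(curve))
```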
These metrics enable direct comparison between human strategic experiment design and artificial agent action focus.
5. Experimental Evaluation: Humans vs Reasoning Models
AutumnBench has been used to benchmark the performance of 517 human participants (via Prolific) and three frontier large reasoning models (OpenAI o3, Anthropic Claude 4 Sonnet, Google Gemini 2.5 Pro) across all 129 challenge tasks:
- Human Baseline: the 80th-percentile aggregate human score serves as the baseline, averaging roughly 0.94 overall.
- Model Performance: typically lower than the human baseline and highly variable; state-of-the-art models perform better in stochastic than in deterministic environments.
- Scaling Effects: In ~58% of AutumnBench environments, scaling model compute improved performance. In the remainder, scaling yielded no improvement—performance plateaued, suggesting limits of existing architectures.
- Taskwise Analysis:
- Masked Frame Prediction: Models at or near chance; humans nearly perfect.
- Planning: some tasks are solvable by models, but performance is inconsistent and failures are frequent.
- Change Detection: Most challenging for models; human performance near perfect, model scores often near zero.
- Exploration Behavior: Humans utilize resets and no-ops extensively for hypothesis testing, whereas models rarely employ such strategies, indicating less flexible experiment design.
6. Implications for World-Model Learning and AI Evaluation
The results from AutumnBench highlight persistent gaps between human and model-based world-model learning:
- Representation-agnostic behavioral evaluation provides a clearer diagnostic than architecture-specific or reward-based metrics.
- Reward-free exploration with out-of-distribution testing reveals agent limitations in adaptive model construction, compositional causal reasoning, and meta-cognitive exploration.
- Model scaling is not sufficient to overcome fundamental reasoning bottlenecks; behavioral analysis shows that humans concentrate their action distributions more rapidly and generalize more flexibly.
- Strategic experiment design and hypothesis updating remain open challenges for artificial agents.
A plausible implication is that future advances in world-model learning must go beyond scale and encompass flexible, human-like learning trajectories, hypothesis management, and broad causal inference strategies.
7. Accessibility and Extensibility
AutumnBench is publicly accessible via https://autumn.basis.ai and designed for extensibility through the Autumn DSL. This enables direct access to the environments, agents, scoring protocols, and task definitions for both human and machine studies. The framework is intended to support future research in world-model learning diagnostics and robust agent development across a range of domains (Warrier et al., 22 Oct 2025).