Crafter Benchmark Overview

Updated 17 October 2025
  • Crafter is a benchmark that assesses agent performance in open-ended, visually rich survival tasks using 22 achievement milestones.
  • The environment supports both extrinsic-reward and intrinsic-motivation training, confronting agents with sparse rewards and long-horizon credit assignment.
  • Innovative methods like object-centric modeling, replay prioritization, and dynamic world modeling drive state-of-the-art advances in performance.

Crafter is a benchmark environment for evaluating the spectrum of agent capabilities in open-ended, visually rich simulation tasks. It is implemented as a procedurally generated survival game where agent performance is measured by a diverse set of semantically meaningful achievements spanning resource collection, crafting, survival, construction, and combat. Research employing Crafter has delineated the challenges of generalization, exploration, long-horizon credit assignment, and sample efficiency, and has produced a body of methodological advances and architectural insights into both reward-driven and intrinsically motivated RL paradigms.

1. Environment Design and Achievement-Based Evaluation

Crafter consists of a single procedurally generated world with visual observations and a flat categorical action space. The agent interacts via discrete actions (e.g., movement, resource collection, crafting, combat) and must contend with environmental hazards, sparse resources, and a deep technology tree of dependent achievements.

Each episode lasts up to 10,000 steps or until the agent dies. There are 22 distinct achievements, each an interpretable milestone such as collecting wood, stone, coal, or diamonds; crafting pickaxes and swords; eating food; drinking water; defeating skeletons and zombies; constructing tables and furnaces; and planting and harvesting crops.
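For orientation, interaction with the open-source `crafter` package follows the classic gym API. Below is a minimal random-agent loop (a sketch based on the reference implementation; exact field names such as `info['achievements']` may vary across versions):

```python
import crafter  # pip install crafter

env = crafter.Env()  # procedurally generated world, 64x64x3 pixel observations
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # flat categorical space (move, do, craft, ...)
    obs, reward, done, info = env.step(action)

# info['achievements'] maps each of the 22 achievement names to its unlock count,
# from which per-episode success indicators are derived.
```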

Performance is quantified by episode-level achievement success rates:

  • For each achievement $i$, the success rate $s_i$ is the fraction of episodes in which $i$ is unlocked at least once.
  • The overall score $S$ is computed as a geometric mean:

$$S \doteq \exp\left(\frac{1}{N} \sum_{i=1}^{N} \ln(1 + s_i)\right) - 1$$

where $N = 22$. This rewards agents that achieve competence across the full spectrum of tasks: the geometric mean weights progress on rare, difficult achievements more heavily than repetition of simple ones.
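Concretely, the score is straightforward to compute (a sketch; following the Crafter paper's convention, success rates $s_i$ are percentages in $[0, 100]$, so $S$ also lies in $[0, 100]$):

```python
import numpy as np

def crafter_score(success_rates):
    """Geometric-mean Crafter score from 22 per-achievement success rates (in %)."""
    s = np.asarray(success_rates, dtype=np.float64)
    return np.exp(np.log(1.0 + s).mean()) - 1.0

# Broad competence is rewarded over narrow mastery:
print(crafter_score([10.0] * 22))                # 10.0
print(crafter_score([100.0] * 11 + [0.0] * 11))  # ~9.05, despite mastering half the tasks
```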

2. Reward Modes: Extrinsic vs Intrinsic Objectives

Crafter supports two principal learning paradigms:

  • Extrinsic reward: Agents receive a $+1$ reward for unlocking each achievement for the first time per episode, plus health-based rewards. The environment therefore exposes very sparse rewards: most transitions yield zero reward, which emphasizes the challenge of long-horizon exploration and credit assignment.
  • Intrinsic objectives: Agents are trained with no environment-provided extrinsic reward, but instead rely on intrinsic motivation systems such as curiosity-driven exploration (e.g., Random Network Distillation, Plan2Explore). Evaluation remains via achievement success.

This unified protocol allows direct comparison of reward-driven agents with unsupervised or exploration-based agents, facilitating research on representation learning, exploration bonuses, and unsupervised RL in open-ended domains.
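As an illustration of the intrinsic route, Random Network Distillation derives an exploration bonus from how poorly a trained predictor matches a fixed, randomly initialized target network on the current observation; novel observations produce large errors and hence large bonuses. A minimal sketch (architecture sizes are illustrative and not tied to any particular Crafter agent):

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random Network Distillation: bonus(o) = ||predictor(o) - target(o)||^2."""

    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()

        def make():
            return nn.Sequential(
                nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

        self.target, self.predictor = make(), make()
        for p in self.target.parameters():  # the target stays frozen forever
            p.requires_grad_(False)

    def forward(self, obs):
        with torch.no_grad():
            t = self.target(obs)
        # The mean squared error doubles as intrinsic reward and predictor loss,
        # so frequently visited observations yield shrinking bonuses.
        return (self.predictor(obs) - t).pow(2).mean(dim=-1)
```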

3. Generalization, Object-Centric Models, and Task Variants

Crafter has motivated research on robust generalization and fast adaptation. The CrafterOOD suite (Stanić et al., 2022) introduces two types of distributional shift:

  • CrafterOODapp: alters the visual appearance (color, texture) of key objects (trees, cows, zombies) between training and evaluation, testing visual robustness.
  • CrafterOODnum: varies object counts (resource/enemy density), with training and evaluation conducted under differing distributions.

Baseline RL agents—PPO (feedforward, recurrent), DreamerV2—show sharp drops in OOD settings, particularly under zero-shot appearance or count shifts. Object-centric agents employing self- or cross-attention over localized image patches (OC-SA, OC-CA) generalize significantly better by representing objects in modular, interpretable slots:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right) V$$

where $Q, K, V$ are projections of image patches and $d$ is the embedding dimension. Visualization of attention maps confirms that these agents attend to salient objects and inventory states, reflecting the challenges of partial observability and memory in the environment.
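In code, this is standard scaled dot-product attention applied to patch embeddings; a minimal sketch (shapes and projection setup are illustrative, not the exact OC-SA/OC-CA architecture):

```python
import torch
import torch.nn.functional as F

def patch_attention(patches, w_q, w_k, w_v):
    """Scaled dot-product attention over image-patch embeddings.

    patches: (batch, P, d_in) for P patches; w_q, w_k, w_v: (d_in, d) projections.
    Returns (batch, P, d), each patch re-represented as a mixture of all patches.
    """
    q, k, v = patches @ w_q, patches @ w_k, patches @ w_v
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, P, P) similarity logits
    return F.softmax(scores, dim=-1) @ v
```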

4. Sample Efficiency, Replay Prioritization, and Exploration

Sparse rewards and complex dependency chains make sample efficiency and efficient credit assignment central challenges. Curious Replay (Kauvar et al., 2023) introduces replay prioritization for DreamerV3 agents by combining count-based novelty and world-model error:

$$p_i = c \cdot \beta^{v_i} + \left(|\mathcal{L}_i| + \epsilon\right)^{\alpha}$$

where $v_i$ is the replay count of experience $i$, $\mathcal{L}_i$ is the model's prediction loss for $i$, and $c, \beta, \alpha, \epsilon$ are hyperparameters. This prioritizes replay of transitions that are both novel and high-error, accelerating adaptation to new achievements and phases of the tech tree. Evaluation shows a $\sim 1.33\times$ improvement in achievement score over uniform replay.
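A sketch of this priority computation over a replay buffer (hyperparameter values here are placeholders, not the paper's settings):

```python
import numpy as np

def curious_replay_priorities(visit_counts, model_losses,
                              c=1e4, beta=0.7, alpha=0.7, eps=0.01):
    """p_i = c * beta**v_i + (|L_i| + eps)**alpha for each stored transition i."""
    v = np.asarray(visit_counts, dtype=np.float64)  # times each item was replayed
    L = np.asarray(model_losses, dtype=np.float64)  # latest world-model loss per item
    return c * beta ** v + (np.abs(L) + eps) ** alpha

# Sample transitions proportionally to priority:
# probs = p / p.sum(); idx = np.random.choice(len(p), size=batch_size, p=probs)
```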

Change-Based Exploration Transfer (CBET) (Ferrao et al., 26 Mar 2025) introduces intrinsic motivation via visitation counts and state-change rarity, improving DreamerV3’s returns in Crafter (though not in Minigrid, where induced exploration may misalign with task objectives).

5. Benchmarks, Extensions, and Instruction-Following

Crafter has inspired extensions and alternative benchmarks:

  • Craftax (Matthews et al., 26 Feb 2024): a JAX-based reimplementation achieving a $250\times$ speedup over Crafter, supporting scalable vectorized experimentation (1 billion steps on a single GPU) and introducing multi-floor procedurally generated worlds, advanced combat, and attribute systems. Craftax serves as the substrate for multimodal instruction-following benchmarks such as CrafText (Volovikova et al., 17 May 2025).
  • CrafText: a formal evaluation protocol for instruction following, featuring 3924 instructions with 3423 unique words over localization, conditional, building, and achievement tasks. Evaluations occur in a goal-based POMDP, measuring both paraphrase and object generalization. The optimization objective is:

$$\pi^* = \arg\max_{\pi}~\mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t, g) \mid o_0 \right]$$

This assesses an agent's ability to ground instructions, sequence the required subtasks, and adapt dynamically in a volatile environment.

  • CrafterDojo (Park et al., 19 Aug 2025): a suite integrating foundation models (CrafterVPT for expert behavior, CrafterCLIP for vision-language alignment, CrafterSteve-1 for instruction following), automated dataset generators (CrafterPlay, CrafterCaption), and standardized evaluation pipelines. Foundation models leverage TransformerXL architectures, contrastive learning, and classifier-free guidance for goal embedding and policy conditioning, facilitating reproducible research on embodied agents and multimodal tasks.

6. World Modeling Innovations and State-of-the-Art Agents

Crafter has been a rigorous testbed for contemporary world modeling approaches:

  • Δ-IRIS (Micheli et al., 27 Jun 2024): uses context-aware tokenization, encoding only the stochastic deltas between frames as discrete tokens, which drastically shortens token sequences and accelerates training while reaching new state-of-the-art performance (solving 17 of 22 tasks at 10M frames, average score 16.1).
  • EMERALD (Burchi et al., 5 Jul 2025): employs spatial latent states and MaskGIT parallel decoding in transformers, surpassing human expert performance (achievement score $58.1\%$ vs $50.5\%$ for humans), unlocking all 22 Crafter achievements within 10M steps, and maintaining high efficiency (27 FPS on an RTX 3090).
  • EDELINE (Lee et al., 1 Feb 2025): fuses diffusion-based visual prediction with linear-time sequence modeling (Mamba SSM), overcoming the memory bottlenecks of fixed-context windows. It obtains a mean return of 11.5 at 1M steps (25% higher than DreamerV3 XL) with 11M parameters.
  • DyMoDreamer (Zhang et al., 29 Sep 2025): introduces dynamic modulation via inter-frame differencing masks, which extract pixel-level motion cues and generate dynamic modulators encoded as categorical distributions, yielding a 9.5% improvement on Crafter by enriching reward-relevant temporal information (a minimal sketch of such masks follows this list).
  • TransDreamerV3 (Dongare et al., 20 Jun 2025): replaces GRU in the RSSM with a transformer encoder, addressing memory decay and supporting parallel, long-range updates—delivering faster reward acquisition and higher metric scores on Crafter.
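At its core, the inter-frame differencing mask referenced above reduces to thresholded pixel differences between consecutive frames; a minimal sketch (the threshold and any downstream encoding are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def motion_mask(frame_prev, frame_curr, threshold=0.05):
    """Binary mask of pixels that changed between consecutive frames.

    frame_prev, frame_curr: (H, W, C) float arrays in [0, 1].
    The resulting (H, W) mask highlights moving entities (zombies, cows,
    the agent) while suppressing the static background.
    """
    diff = np.abs(frame_curr - frame_prev).mean(axis=-1)  # per-pixel change magnitude
    return (diff > threshold).astype(np.float32)
```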

7. Significance and Research Impact

Crafter has served as a pivotal environment for identifying shortcomings and validating advances in sample efficiency, generalization, and hierarchical planning. Its unique achievement-centric evaluation protocol and support for both extrinsic and intrinsic learning paradigms make it a canonical benchmark for studying the emergence of generalist agent capabilities.

Key implications include:

  • The challenge of robust generalization in procedural, multimodal worlds, highlighted by CrafterOOD results.
  • The importance of memory—both architectural (recurrence, attention) and external (inventory, scene state)—for handling partial observability and long-term dependencies.
  • The continuous evolution of world modeling techniques, with improvements in tokenization, spatial latent designs, replay prioritization, and dynamic modulation yielding substantial sample efficiency and state-of-the-art returns.
  • The integration with multimodal benchmarks (e.g., instruction following), foundation models, and hierarchical architectures as a foundation for research toward general-purpose embodied agents.

Crafter continues to catalyze experimentation across exploration, unsupervised learning, vision-language grounding, and sequential planning, and remains a benchmark of choice for evaluating agent generality, adaptability, and multimodal reasoning in open-ended environments.
