Crafter Benchmark Overview

Updated 17 October 2025
  • Crafter is a benchmark that assesses agent performance in open-ended, visually rich survival tasks using 22 achievement milestones.
  • The environment supports both extrinsic-reward and intrinsic-motivation training, confronting agents with sparse rewards and long-horizon credit assignment.
  • Innovative methods like object-centric modeling, replay prioritization, and dynamic world modeling drive state-of-the-art advances in performance.

Crafter is a benchmark environment for evaluating the spectrum of agent capabilities in open-ended, visually rich simulation tasks. It is implemented as a procedurally generated survival game where agent performance is measured by a diverse set of semantically meaningful achievements spanning resource collection, crafting, survival, construction, and combat. Research employing Crafter has delineated the challenges of generalization, exploration, long-horizon credit assignment, and sample efficiency, and has produced a body of methodological advances and architectural insights into both reward-driven and intrinsically motivated RL paradigms.

1. Environment Design and Achievement-Based Evaluation

Crafter consists of a single procedurally generated world with visual observations and a flat categorical action space. The agent interacts via discrete actions (e.g., movement, resource collection, crafting, combat) and must contend with environmental hazards, sparse resources, and a deep technology tree of dependent achievements.

Each episode lasts up to 10,000 steps or until the agent dies. There are 22 distinct achievements, each an interpretable milestone such as collecting wood, stone, coal, or diamonds; crafting pickaxes and swords; eating food; drinking water; defeating skeletons and zombies; constructing tables and furnaces; and planting and harvesting crops.
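For orientation, interaction with the open-source `crafter` package follows the classic gym API. Below is a minimal random-agent loop (a sketch based on the reference implementation; exact field names such as `info['achievements']` may vary across versions):

```python
import crafter  # pip install crafter

env = crafter.Env()  # procedurally generated world, 64x64x3 pixel observations
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # flat categorical space (move, do, craft, ...)
    obs, reward, done, info = env.step(action)

# info['achievements'] maps each of the 22 achievement names to its unlock count,
# from which per-episode success indicators are derived.
```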

Performance is quantified by episode-level achievement success rates:

  • For each achievement $i$, the success rate $s_i$ is the fraction of episodes in which $i$ is unlocked at least once.
  • The overall score $S$ is computed as a geometric mean:

$$S \doteq \exp\left(\frac{1}{N} \sum_{i=1}^{N} \ln(1 + s_i)\right) - 1$$

where $N = 22$. This rewards agents that achieve competence across the full spectrum of tasks: the geometric mean weights progress on rare, difficult achievements more heavily than repetition of simple ones.
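Concretely, the score is straightforward to compute (a sketch; following the Crafter paper's convention, success rates $s_i$ are percentages in $[0, 100]$, so $S$ also lies in $[0, 100]$):

```python
import numpy as np

def crafter_score(success_rates):
    """Geometric-mean Crafter score from 22 per-achievement success rates (in %)."""
    s = np.asarray(success_rates, dtype=np.float64)
    return np.exp(np.log(1.0 + s).mean()) - 1.0

# Broad competence is rewarded over narrow mastery:
print(crafter_score([10.0] * 22))                # 10.0
print(crafter_score([100.0] * 11 + [0.0] * 11))  # ~9.05, despite mastering half the tasks
```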

2. Reward Modes: Extrinsic vs Intrinsic Objectives

Crafter supports two principal learning paradigms:

  • Extrinsic reward: Agents receive a $+1$ reward for unlocking each achievement for the first time per episode, plus health-based rewards. The environment therefore exposes very sparse rewards: most transitions yield zero reward, which emphasizes the challenge of long-horizon exploration and credit assignment.
  • Intrinsic objectives: Agents are trained with no environment-provided extrinsic reward, but instead rely on intrinsic motivation systems such as curiosity-driven exploration (e.g., Random Network Distillation, Plan2Explore). Evaluation remains via achievement success.

This unified protocol allows direct comparison of reward-driven agents with unsupervised or exploration-based agents, facilitating research on representation learning, exploration bonuses, and unsupervised RL in open-ended domains.
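As an illustration of the intrinsic route, Random Network Distillation derives an exploration bonus from how poorly a trained predictor matches a fixed, randomly initialized target network on the current observation; novel observations produce large errors and hence large bonuses. A minimal sketch (architecture sizes are illustrative and not tied to any particular Crafter agent):

```python
import torch
import torch.nn as nn

class RNDBonus(nn.Module):
    """Random Network Distillation: bonus(o) = ||predictor(o) - target(o)||^2."""

    def __init__(self, obs_dim, feat_dim=128):
        super().__init__()

        def make():
            return nn.Sequential(
                nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

        self.target, self.predictor = make(), make()
        for p in self.target.parameters():  # the target stays frozen forever
            p.requires_grad_(False)

    def forward(self, obs):
        with torch.no_grad():
            t = self.target(obs)
        # The mean squared error doubles as intrinsic reward and predictor loss,
        # so frequently visited observations yield shrinking bonuses.
        return (self.predictor(obs) - t).pow(2).mean(dim=-1)
```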

3. Generalization, Object-Centric Models, and Task Variants

Crafter has motivated research on robust generalization and fast adaptation. The CrafterOOD suite (Stanić et al., 2022) introduces two types of distributional shift:

  • CrafterOODapp: alters the visual appearance (color, texture) of key objects (trees, cows, zombies) between training and evaluation, testing visual robustness.
  • CrafterOODnum: varies object counts (resource/enemy density), with training and evaluation conducted under differing distributions.

Baseline RL agents—PPO (feedforward, recurrent), DreamerV2—show sharp drops in OOD settings, particularly under zero-shot appearance or count shifts. Object-centric agents employing self- or cross-attention over localized image patches (OC-SA, OC-CA) generalize significantly better by representing objects in modular, interpretable slots:

$$\text{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right) V$$

where $Q, K, V$ are projections of image patches and $d$ is the embedding dimension. Visualization of attention maps confirms that these agents attend to salient objects and inventory states, reflecting the challenges of partial observability and memory in the environment.
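In code, this is standard scaled dot-product attention applied to patch embeddings; a minimal sketch (shapes and projection setup are illustrative, not the exact OC-SA/OC-CA architecture):

```python
import torch
import torch.nn.functional as F

def patch_attention(patches, w_q, w_k, w_v):
    """Scaled dot-product attention over image-patch embeddings.

    patches: (batch, P, d_in) for P patches; w_q, w_k, w_v: (d_in, d) projections.
    Returns (batch, P, d), each patch re-represented as a mixture of all patches.
    """
    q, k, v = patches @ w_q, patches @ w_k, patches @ w_v
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (batch, P, P) similarity logits
    return F.softmax(scores, dim=-1) @ v
```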

4. Sample Efficiency, Replay Prioritization, and Exploration

Sparse rewards and complex dependency chains make sample efficiency and efficient credit assignment central challenges. Curious Replay (Kauvar et al., 2023) introduces replay prioritization for DreamerV3 agents by combining count-based novelty and world-model error:

$$p_i = c \cdot \beta^{v_i} + \left(|\mathcal{L}_i| + \epsilon\right)^{\alpha}$$

where $v_i$ is the replay count of experience $i$, $\mathcal{L}_i$ is the model's prediction loss for $i$, and $c, \beta, \alpha, \epsilon$ are hyperparameters. This prioritizes replay of transitions that are both novel and high-error, accelerating adaptation to new achievements and phases of the tech tree. Evaluation shows a $\sim 1.33\times$ improvement in achievement score over uniform replay.
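A sketch of this priority computation over a replay buffer (hyperparameter values here are placeholders, not the paper's settings):

```python
import numpy as np

def curious_replay_priorities(visit_counts, model_losses,
                              c=1e4, beta=0.7, alpha=0.7, eps=0.01):
    """p_i = c * beta**v_i + (|L_i| + eps)**alpha for each stored transition i."""
    v = np.asarray(visit_counts, dtype=np.float64)  # times each item was replayed
    L = np.asarray(model_losses, dtype=np.float64)  # latest world-model loss per item
    return c * beta ** v + (np.abs(L) + eps) ** alpha

# Sample transitions proportionally to priority:
# probs = p / p.sum(); idx = np.random.choice(len(p), size=batch_size, p=probs)
```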

Change-Based Exploration Transfer (CBET) (Ferrao et al., 26 Mar 2025) introduces intrinsic motivation via visitation counts and state-change rarity, improving DreamerV3’s returns in Crafter (though not in Minigrid, where induced exploration may misalign with task objectives).

5. Benchmarks, Extensions, and Instruction-Following

Crafter has inspired extensions and alternative benchmarks:

  • Craftax (Matthews et al., 26 Feb 2024): a JAX-based reimplementation achieving a $250\times$ speedup over Crafter, supporting scalable vectorized experimentation (1 billion steps on a single GPU) and introducing multi-floor procedurally generated worlds, advanced combat, and attribute systems. Craftax serves as the substrate for multimodal instruction-following benchmarks such as CrafText (Volovikova et al., 17 May 2025).
  • CrafText: a formal evaluation protocol for instruction following, featuring 3924 instructions with 3423 unique words over localization, conditional, building, and achievement tasks. Evaluations occur in a goal-based POMDP, measuring both paraphrase and object generalization. The optimization objective is:

$$\pi^* = \arg\max_{\pi}~\mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t, g) \mid o_0 \right]$$

This assesses an agent's ability to ground instructions, sequence the required subtasks, and adapt dynamically in a volatile environment.

  • CrafterDojo (Park et al., 19 Aug 2025): a suite integrating foundation models (CrafterVPT for expert behavior, CrafterCLIP for vision-language alignment, CrafterSteve-1 for instruction following), automated dataset generators (CrafterPlay, CrafterCaption), and standardized evaluation pipelines. Foundation models leverage TransformerXL architectures, contrastive learning, and classifier-free guidance for goal embedding and policy conditioning, facilitating reproducible research on embodied agents and multimodal tasks.

6. World Modeling Innovations and State-of-the-Art Agents

Crafter has been a rigorous testbed for contemporary world modeling approaches:

  • Δ-IRIS (Micheli et al., 27 Jun 2024): uses context-aware tokenization, encoding only the stochastic deltas between frames as discrete tokens, which drastically shortens token sequences and accelerates training while reaching new state-of-the-art performance (solving 17 of 22 tasks at 10M frames, average score 16.1).
  • EMERALD (Burchi et al., 5 Jul 2025): employs spatial latent states and MaskGIT parallel decoding in transformers, surpassing human expert performance (achievement score $58.1\%$ vs $50.5\%$ for humans), unlocking all 22 Crafter achievements within 10M steps, and maintaining high efficiency (27 FPS on an RTX 3090).
  • EDELINE (Lee et al., 1 Feb 2025): fuses diffusion-based visual prediction with linear-time sequence modeling (Mamba SSM), overcoming the memory bottlenecks of fixed-context windows. It obtains a mean return of 11.5 at 1M steps (25% higher than DreamerV3 XL) with 11M parameters.
  • DyMoDreamer (Zhang et al., 29 Sep 2025): introduces dynamic modulation via inter-frame differencing masks, which extract pixel-level motion cues and generate dynamic modulators encoded as categorical distributions, yielding a 9.5% improvement on Crafter by enriching reward-relevant temporal information (a minimal sketch of such masks follows this list).
  • TransDreamerV3 (Dongare et al., 20 Jun 2025): replaces GRU in the RSSM with a transformer encoder, addressing memory decay and supporting parallel, long-range updates—delivering faster reward acquisition and higher metric scores on Crafter.
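At its core, the inter-frame differencing mask referenced above reduces to thresholded pixel differences between consecutive frames; a minimal sketch (the threshold and any downstream encoding are illustrative, not the paper's exact pipeline):

```python
import numpy as np

def motion_mask(frame_prev, frame_curr, threshold=0.05):
    """Binary mask of pixels that changed between consecutive frames.

    frame_prev, frame_curr: (H, W, C) float arrays in [0, 1].
    The resulting (H, W) mask highlights moving entities (zombies, cows,
    the agent) while suppressing the static background.
    """
    diff = np.abs(frame_curr - frame_prev).mean(axis=-1)  # per-pixel change magnitude
    return (diff > threshold).astype(np.float32)
```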

7. Significance and Research Impact

Crafter has served as a pivotal environment for identifying shortcomings and validating advances in sample efficiency, generalization, and hierarchical planning. Its unique achievement-centric evaluation protocol and support for both extrinsic and intrinsic learning paradigms make it a canonical benchmark for studying the emergence of generalist agent capabilities.

Key implications include:

  • The challenge of robust generalization in procedural, multimodal worlds, highlighted by CrafterOOD results.
  • The importance of memory—both architectural (recurrence, attention) and external (inventory, scene state)—for handling partial observability and long-term dependencies.
  • The continuous evolution of world modeling techniques, with improvements in tokenization, spatial latent designs, replay prioritization, and dynamic modulation yielding substantial sample efficiency and state-of-the-art returns.
  • The integration with multimodal benchmarks (e.g., instruction following), foundation models, and hierarchical architectures as a foundation for research toward general-purpose embodied agents.

Crafter continues to catalyze experimentation across exploration, unsupervised learning, vision-language grounding, and sequential planning, and remains a benchmark of choice for evaluating agent generality, adaptability, and multimodal reasoning in open-ended environments.
