Crafter: 2D Survival RL Benchmark

Updated 4 July 2026

Crafter is a 2D open-world survival benchmark that integrates procedural generation, sparse rewards, and achievement-based evaluation for reinforcement learning research.
It challenges agents to develop skills in exploration, long-horizon planning, and memory through a technology tree and survival pressures in a dynamic environment.
The benchmark supports diverse research directions including object-centric generalization, world-model fidelity, and language-guided control using its rich, interpretable dynamics.

Crafter is a 2D open-world survival benchmark for reinforcement learning, introduced as a single-environment test of a broad spectrum of agent capabilities under visual input and achievement-based evaluation (Hafner, 2021). Inspired by Minecraft but deliberately simpler and faster, it combines procedural generation, sparse reward, partial observability, survival pressure, and a technology tree, so that success depends not on a single narrow skill but on exploration, memory, long-horizon reasoning, and reusable behavior (Hafner, 2021). Subsequent work has treated Crafter not only as a benchmark for reward-driven and unsupervised RL, but also as a testbed for object-centric generalization, world models, transformer memory, language-guided control, continual skill acquisition, and symbolic world modeling (Stanić et al., 2022, Micheli et al., 2024, Paglieri et al., 3 Sep 2025, Khan et al., 14 Oct 2025).

1. Benchmark rationale and research role

Crafter was proposed against a background in which widely used benchmarks tended either to evaluate one narrow skill per environment or to require training across many separate tasks. The benchmark’s stated purpose is to evaluate “a wide spectrum of general abilities within a single environment,” while remaining fast and standardized enough for efficient experimentation (Hafner, 2021). Its intended capability profile includes generalization to procedurally generated worlds, wide and deep exploration, representation learning from pixels, long-term reasoning, memory under partial observability, reusable skills, and survival under pressure (Hafner, 2021).

This design choice explains why Crafter has remained useful across method families. Model-based RL papers use it to probe memory, temporal abstraction, and world-model fidelity (Kauvar et al., 2023, Micheli et al., 2024, Burchi et al., 5 Jul 2025). OOD-generalization work uses it because the same achievement structure can be retained while object appearance or object counts are shifted (Stanić et al., 2022). LLM-agent papers use it because its goals, objects, and action dependencies can be rendered into language without eliminating the long-horizon control problem (Wu et al., 2023, Paglieri et al., 3 Sep 2025). Symbolic world-model work uses reimplementations of Crafter because its mechanics are rich enough to stress sparse rule activation and stochastic dynamics (Khan et al., 14 Oct 2025).

A persistent feature of Crafter research is that the environment is neither trivially solvable nor prohibitively complex. In the original benchmark, human experts achieved a score of 50.5 ± 6.8, while the strongest reported learned baseline, DreamerV2, reached 10.0 ± 1.2 under the 1M-step protocol (Hafner, 2021). This gap made Crafter a compact but nontrivial proxy for open-ended survival environments.

2. Environment dynamics and task structure

Crafter is a procedurally generated 2D open-world survival game. The world is a grid of $64 \times 64$ cells, with terrain including grasslands, forests, lakes, mountains, and caves (Hafner, 2021). The observation is a $64 \times 64 \times 3$ color image showing a local top-down view of the world, together with the player’s inventory and survival status along the bottom (Hafner, 2021). The action interface is a flat categorical space with 17 actions, including movement, interaction, sleep, object placement, and tool crafting (Hafner, 2021).

The survival loop is defined by four key quantities: health, food, water, and rest. Food, water, and rest decrease over time; if any of them reaches zero, the player starts losing health. Health can also be lost from monster attacks and lava, and it regenerates when the player is not hungry, thirsty, or sleepy (Hafner, 2021). This makes inaction unsafe and prevents the benchmark from collapsing into a pure collection task.
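The survival loop can be sketched as a toy update rule. This is a minimal illustration, not the environment's implementation: the names `Vitals`, `tick`, and `steps_until_death`, the shared decay interval `decay_every`, and the one-point health changes are all assumptions. Only the qualitative rule (needs decay, a depleted need drains health, satisfied needs regenerate health, maximum health of 9) comes from the description above.

```python
from dataclasses import dataclass

MAX_STAT = 9  # maximum health is 9; the same cap for the needs is an assumption


@dataclass
class Vitals:
    health: int = MAX_STAT
    food: int = MAX_STAT
    water: int = MAX_STAT
    rest: int = MAX_STAT


def tick(v: Vitals, step: int, decay_every: int = 25) -> Vitals:
    """Advance the toy loop by one step (decay_every is a made-up constant)."""
    health, food, water, rest = v.health, v.food, v.water, v.rest
    if step % decay_every == 0:
        # Needs decay over time.
        food, water, rest = max(food - 1, 0), max(water - 1, 0), max(rest - 1, 0)
        if min(food, water, rest) == 0:
            health = max(health - 1, 0)          # a depleted need drains health
        else:
            health = min(health + 1, MAX_STAT)   # otherwise health regenerates
    return Vitals(health, food, water, rest)


def steps_until_death(decay_every: int = 25) -> int:
    """Count the steps a completely idle player survives in this toy model."""
    v, step = Vitals(), 0
    while v.health > 0:
        step += 1
        v = tick(v, step, decay_every)
    return step
```

Under these toy constants an idle player always dies after a bounded number of steps, which is the sense in which inaction is unsafe.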

Crafter’s task structure is organized around resources and a technology tree. Resources include saplings, wood, stone, coal, iron, and diamonds. Simple resources enable intermediate tools, tables enable wood and stone tools, and furnaces enable iron tools (Hafner, 2021). The dependency structure is central: many achievements are not independent events but require ordered prerequisite chains. Later work repeatedly emphasizes this property when describing Crafter as a benchmark for long-horizon planning, memory, and structured exploration (Kauvar et al., 2023, Dongare et al., 20 Jun 2025).
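The prerequisite structure can be made concrete as a small dependency graph. The sketch below is a simplified, illustrative subset: the edges in `PREREQS` are an assumed reading of the tool and resource chain rather than an authoritative specification, and `chain_depth` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# Illustrative subset: achievement -> achievements assumed to come first.
PREREQS = {
    "collect_wood": set(),
    "place_table": {"collect_wood"},
    "make_wood_pickaxe": {"place_table"},
    "collect_stone": {"make_wood_pickaxe"},
    "collect_coal": {"make_wood_pickaxe"},
    "make_stone_pickaxe": {"collect_stone"},
    "place_furnace": {"collect_stone"},
    "collect_iron": {"make_stone_pickaxe"},
    "make_iron_pickaxe": {"place_furnace", "collect_coal", "collect_iron"},
    "collect_diamond": {"make_iron_pickaxe"},
}


def chain_depth(goal: str) -> int:
    """Length of the longest prerequisite chain ending at `goal`, goal included."""
    return 1 + max((chain_depth(d) for d in PREREQS[goal]), default=0)


# One valid order in which the achievements could be unlocked.
order = list(TopologicalSorter(PREREQS).static_order())
```

In this simplified graph the longest chain ends at `collect_diamond`, which illustrates why late achievements require long-horizon behavior rather than isolated actions.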

The world also contains creatures such as cows, zombies, and skeletons. Zombies and cows populate grasslands, skeletons live in caves, and nighttime reduces visibility while increasing zombie pressure, so shelter and defensive behavior matter (Hafner, 2021). The original paper reports emergent behaviors after longer training, including tunnel systems, bridges, houses, plantations, arrow dodging, blocking with stones, digging through walls, hiding in caves at night, and building self-sustaining food systems (Hafner, 2021). These observations are significant because they indicate that Crafter can elicit coherent multi-step strategies rather than isolated reflexes.

3. Achievement-based evaluation and original baselines

Crafter’s central evaluative device is its set of 22 semantically meaningful achievements (Hafner, 2021). Achievements correspond to events such as discovering resources, crafting tools, placing objects, defeating monsters, surviving sleep, growing plants, and ultimately collecting a diamond (Hafner, 2021). Because these milestones have clear semantic content, the benchmark supports debugging and capability analysis at a finer granularity than cumulative return alone.

The reward function is sparse and achievement-centric. The environment gives $+1$ whenever an achievement is unlocked for the first time during the current episode, $-0.1$ for every health point lost, and $+0.1$ for every health point regenerated (Hafner, 2021). Since maximum health is 9, the health terms affect only the first decimal place of the episode return, so rounding the return up to the nearest integer yields the number of achievements unlocked during the episode (Hafner, 2021).
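A short worked example makes the rounding argument concrete. The function names are hypothetical, and the sketch assumes the episode starts at full health, so the net health change over an episode lies between -9 and 0.

```python
import math


def episode_return(achievements_unlocked: int, health_deltas: list[int]) -> float:
    """+1 per newly unlocked achievement, +/-0.1 per health point gained or lost."""
    return achievements_unlocked + 0.1 * sum(health_deltas)


def achievements_from_return(ret: float) -> int:
    """Recover the achievement count by rounding the return up.

    Valid under the stated assumption that the net health change is in [-9, 0].
    The inner round() only guards against floating-point noise.
    """
    return math.ceil(round(ret, 6))


# Five achievements; health changes of -1, -2, +1, -2 sum to -4 points.
ret = episode_return(5, [-1, -2, +1, -2])   # 5 - 0.4
count = achievements_from_return(ret)       # rounds up to 5
```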

The benchmark’s summary metric is the geometric mean over per-achievement success rates:

$S \doteq \exp\!\left(\frac{1}{N}\sum_{i=1}^{N}\ln(1+s_i)\right)-1,$

where $s_i \in [0, 100]$ is the success rate of achievement $i$, measured in percent, and $N=22$ (Hafner, 2021). This score rewards breadth: improving a rare hard achievement matters more than marginally improving an already common easy one (Hafner, 2021). The standard protocol in the original paper allocates 1M environment steps per agent (Hafner, 2021).
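The score formula translates directly into a few lines of code. The sketch below follows the definition of $S$ given above; the function name `crafter_score` and the two example agents are illustrative, not taken from any released evaluation script.

```python
import math


def crafter_score(success_rates_percent):
    """Geometric mean of (1 + s_i), minus 1, with s_i given in percent."""
    n = len(success_rates_percent)
    return math.exp(sum(math.log(1 + s) for s in success_rates_percent) / n) - 1


# Two hypothetical agents with the same arithmetic mean success rate (50%).
narrow = [100.0] * 11 + [0.0] * 11   # masters half the achievements, never unlocks the rest
broad = [50.0] * 22                  # unlocks every achievement half of the time

score_narrow = crafter_score(narrow)  # sqrt(101) - 1, roughly 9.05
score_broad = crafter_score(broad)    # 50
```

The broad agent scores far higher than the narrow one despite the identical arithmetic mean, which is the breadth preference built into the metric.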

The original benchmark results established both difficulty and headroom:

Method	Score
Human Experts	50.5 ± 6.8
DreamerV2	10.0 ± 1.2
PPO	4.6 ± 0.3
Rainbow	4.3 ± 0.2
Plan2Explore	2.1 ± 0.1
RND	2.0 ± 0.1
Random	1.6 ± 0.0

These numbers show that reward-driven RL substantially outperformed unsupervised baselines under the original 1M-step budget, but still remained far below human experts (Hafner, 2021). They also established Dreamer-style model-based RL as a strong reference point for later Crafter work.

4. Variants, derivatives, and benchmark reimplementations

Crafter has been extended in several directions to test questions that the original benchmark leaves implicit. One line concerns OOD generalization. The paper on object-centric agents introduces CrafterOOD, a set of 15 new environments divided into CrafterOODapp for appearance shifts and CrafterOODnum for changes in object counts (Stanić et al., 2022). Under this protocol, tuned PPO baselines substantially improved on earlier Crafter results, with LSTM-SPCNN reaching 12.1 ± 0.8 at 1M steps, while object-centric agents, especially OC-SA, achieved the strongest OOD generalization and remained interpretable through attention maps (Stanić et al., 2022). The same study also showed that training beyond 1M steps changes the picture materially: PPO-SPCNN reached 30.5 by 20M steps, and tuned agents could unlock nearly all achievements (Stanić et al., 2022).

A second line concerns compute and scaling. Craftax-Classic is a ground-up JAX rewrite of Crafter intended to preserve the original dynamics while being much faster; it runs up to about 250× faster than the Python-native implementation, with a reported best-case comparison of 405,618 steps/sec versus 1,580 steps/sec (Matthews et al., 2024). The same paper argues that the original 1M-step evaluation protocol pushes Crafter toward sample-efficiency testing rather than open-ended learning, and therefore introduces Craftax, a much richer benchmark with 9 procedurally generated floors, 19 creatures, 43 actions, 65 achievements, and a maximum episode length of 100,000 timesteps (Matthews et al., 2024). Craftax is not Crafter itself, but it is explicitly presented as a descendant benchmark motivated by Crafter’s strengths and limitations.

A third line reinterprets Crafter for symbolic modeling. Crafter-OO is a reimplementation in which the environment is exposed as a structured, object-oriented symbolic state with a pure transition function $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ (Khan et al., 14 Oct 2025). This reformulation supports evaluation of executable symbolic world models on state ranking and state fidelity rather than reward maximization (Khan et al., 14 Oct 2025). A plausible implication is that Crafter’s underlying mechanics are structured enough to support both pixel-based and symbolic research programs.
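What a pure transition function $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ looks like in code can be sketched with an immutable state and a function that returns a distribution over next states. Everything concrete here is hypothetical: the `State` fields, the `"collect"` action, and the success probability are made up for illustration and are not the Crafter-OO schema.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    wood: int = 0
    facing: str = "tree"  # hypothetical: material in front of the player


def transition(state: State, action: str) -> dict[State, float]:
    """Pure T(s, a): returns next-state probabilities and never mutates `state`."""
    if action == "collect" and state.facing == "tree":
        p_success = 0.8  # made-up probability, for illustration only
        return {
            replace(state, wood=state.wood + 1): p_success,
            state: 1 - p_success,
        }
    return {state: 1.0}  # all other actions are no-ops in this toy model


s0 = State()
dist = transition(s0, "collect")
```

Because the function is pure and the state is hashable, a candidate world model of this shape can be scored by the probability it assigns to the observed next state, which is the kind of evaluation the symbolic setting makes possible.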

Together, these variants show that Crafter functions not only as a fixed benchmark, but also as a design nucleus from which faster, harder, more symbolic, and more OOD-sensitive environments can be derived.

5. World models, replay, and structured exploration on Crafter

Crafter has been especially important in work on model-based RL, replay prioritization, and structured exploration because its achievement chains and partial observability expose weaknesses in compressed state representations and uniform replay. A representative example is Curious Replay, a prioritized replay method for Dreamer-style agents. On Crafter, DreamerV3 + Curious Replay achieved 19.4 ± 1.6, compared with 14.5 ± 1.6 for DreamerV3 with uniform replay, and the paper reports faster progression through prerequisite-heavy achievements such as Make Wood Pickaxe, Collect Stone, Make Stone Pickaxe, and Collect Iron (Kauvar et al., 2023). The authors’ interpretation is that replay should focus on newly discovered, poorly predicted, and under-replayed experiences, which is especially important when the agent’s world changes effectively as new resources and affordances become available (Kauvar et al., 2023).
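The prioritization idea can be sketched as a score with two terms: one that decays with how often a transition has been replayed, and one that grows with the world model's loss on it. The functional form and the constants `c`, `beta`, `alpha`, and `eps` below are assumptions chosen for illustration; the paper should be consulted for the actual definition.

```python
import random


def priority(replay_count, model_loss, c=1e4, beta=0.7, alpha=0.7, eps=0.01):
    """Higher for rarely replayed and poorly predicted transitions (assumed form)."""
    count_term = c * beta ** replay_count          # favors new, under-replayed data
    loss_term = (abs(model_loss) + eps) ** alpha   # favors poorly predicted data
    return count_term + loss_term


def sample_index(priorities, rng):
    """Draw one buffer index with probability proportional to its priority."""
    return rng.choices(range(len(priorities)), weights=priorities, k=1)[0]


# A freshly added transition versus one that has been replayed many times.
buffer_priorities = [priority(0, 1.0), priority(50, 1.0)]
idx = sample_index(buffer_priorities, random.Random(0))
```

With these placeholder constants the count term dominates for fresh transitions, so newly discovered experience is replayed almost immediately, while the loss term decides among transitions that have already been replayed often.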

Structured exploration work makes a complementary point. SEA treats Crafter as an achievement-based domain whose internal dependency graph can be learned and exploited. That paper uses a modified Crafter in which the environment has 21 achievements, the health reward is removed, and the episode terminates if no achievement has been unlocked in the last 100 steps. On this version, SEA achieved a Crafter score of 75.52 (2.36) and a collect_diamond unlock rate of 4.21%, while baselines such as IMPALA, PPO, DreamerV2, RND, and HAL had 0.00% on the hard-set mean or on diamond collection in the reported table (Zhou et al., 2023). Because this protocol modifies the environment and evaluation assumptions, these numbers are not directly comparable with results on the original 22-achievement benchmark. The SEA results nonetheless suggest that Crafter has also become a platform for research on learning latent task structure rather than only optimizing flat reward.
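The no-progress termination rule of the modified environment is simple to state in code. The class below is a hypothetical sketch of that rule alone, not the SEA authors' wrapper; the class name and the `patience` parameter are illustrative.

```python
class NoProgressTimeout:
    """Signals termination when no achievement was unlocked in the last `patience` steps."""

    def __init__(self, patience: int = 100):
        self.patience = patience
        self.steps_since_unlock = 0

    def step(self, unlocked_new_achievement: bool) -> bool:
        """Update with one environment step; return True if the episode should end."""
        if unlocked_new_achievement:
            self.steps_since_unlock = 0
        else:
            self.steps_since_unlock += 1
        return self.steps_since_unlock >= self.patience


# Small patience for illustration: an unlock on the third step resets the counter.
timeout = NoProgressTimeout(patience=3)
signals = [timeout.step(u) for u in (False, False, True, False, False, False)]
```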

Transformer-based and high-fidelity world models have produced another sequence of Crafter results. $\Delta$-IRIS reports a Crafter score of 42.47 at 10M frames, solves 17 out of 22 tasks, and is described as an order of magnitude faster to train than previous attention-based approaches, largely by modeling stochastic deltas and summarizing state with continuous tokens (Micheli et al., 2024). EMERALD then reports a score of 58.1%, a return of 16.8 ± 0.6, 30M parameters, and 27 FPS, and is described as the first method to surpass human experts within 10M environment steps; it also unlocks all 22 achievements at least once during evaluation over 256 episodes (Burchi et al., 5 Jul 2025). Both papers present Crafter as a stress test for perceptual fidelity, temporal memory, and efficient imagination in partially observable survival tasks.

A related but more targeted architectural intervention appears in TransDreamerV3, which replaces DreamerV3’s GRU-based deterministic latent model with a transformer encoder. On Crafter, the paper states that TransDreamerV3 significantly outperforms DreamerV3, while still underperforming TransDreamer, and attributes the remaining gap to a simplified transformer implementation that does not fully utilize the context of previous states (Dongare et al., 20 Jun 2025). The Crafter result is reported qualitatively through learning curves rather than a formal metric table, but the paper consistently frames the gain as evidence that direct attention over latent history helps in memory- and planning-intensive tasks (Dongare et al., 20 Jun 2025).

6. Language-guided agents, programmatic skills, and embodied foundation models

Crafter has also become a benchmark for agents whose control loop is mediated by language, explicit planning, or executable programs. SPRING is the clearest zero-training example: it reads the LaTeX source of the Crafter paper, extracts game-relevant knowledge, reasons with a question-answer DAG, and acts from a textualized visual description (Wu et al., 2023). Under this setup, SPRING + paper with GPT-4 achieved 27.3 ± 1.2% score and 12.3 ± 0.7 reward with 0 training steps, outperforming all listed RL baselines trained for 1M steps in that comparison (Wu et al., 2023). The result is notable because it reframes Crafter as a benchmark not only for policy learning from interaction, but also for knowledge extraction and structured reasoning from documentation.

Hierarchical and planning-based LLM systems use Crafter differently. LLM Augmented Hierarchical Agents treats it as a “2D version of Minecraft” with a natural skill hierarchy; the high-level policy selects among pretrained RL skills, and an LLM supplies common-sense priors whose weight is annealed over training. On the Crafter tasks reported in the figure, the paper states that the method performs better than the baseline HRL method and that the trained policy no longer needs the LLM at deployment because the weight on the LLM prior is annealed to zero (Prakash et al., 2023).

Learning When to Plan turns Crafter into a testbed for dynamic test-time compute allocation in sequential decision-making. In that work, the environment is wrapped by BALROG, the agent emits natural-language action commands, and planning is represented by an optional <plan>...</plan> block (Paglieri et al., 3 Sep 2025). The paper reports a non-monotonic dependence on planning frequency, with performance peaking at an intermediate frequency, “e.g. every 4 steps in Crafter,” while always planning underperforms due to instability and token cost (Paglieri et al., 3 Sep 2025). After supervised fine-tuning and PPO, the dynamic planner becomes more sample-efficient than its non-planning counterpart in early and middle training and can be steered by human-written plans all the way to collect diamond, an achievement described as unseen in autonomous training runs (Paglieri et al., 3 Sep 2025).
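The optional planning block suggests a simple parsing step between the model's text output and the environment. The sketch below is an assumption about how such output could be split; the regular expression, the function name, and the example action strings are hypothetical and are not taken from BALROG or the paper's code.

```python
import re

# Non-greedy match so that only the plan block itself is captured.
PLAN_RE = re.compile(r"<plan>(.*?)</plan>", re.DOTALL)


def parse_agent_output(text: str):
    """Split model output into (plan or None, action command)."""
    match = PLAN_RE.search(text)
    plan = match.group(1).strip() if match else None
    action = PLAN_RE.sub("", text).strip()
    return plan, action


def planning_frequency(outputs) -> float:
    """Fraction of steps on which the agent chose to emit a plan."""
    return sum(parse_agent_output(o)[0] is not None for o in outputs) / len(outputs)


outputs = [
    "<plan>find a tree, collect wood, then place a table</plan>\nMove North",
    "Move North",
    "Do",
    "Do",
]
freq = planning_frequency(outputs)  # one plan in four steps
```

Tracking the fraction of steps with a plan gives a direct handle on the planning-frequency trade-off described above.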

Programmatic agents push the same benchmark in a more symbolic direction. Programmatic Skill Network (PSN) models skills as executable symbolic programs and applies continual repair, maturity-aware update gating, and structural refactoring under rollback validation (Shi et al., 7 Jan 2026). On Crafter, the paper states that PSN consistently achieves the highest cumulative reward and the most stable learning curve among the compared methods, and explicitly interprets this as evidence that its mechanisms generalize “beyond sparse, long-horizon tasks to dense-reward continual learning settings” (Shi et al., 7 Jan 2026).

Finally, CrafterDojo attempts to provide Crafter with a Minecraft-like foundation-model ecosystem. It introduces CrafterVPT, CrafterCLIP, and CrafterSteve-1, together with the CrafterPlay and CrafterCaption datasets and data-generation toolkits (Park et al., 19 Aug 2025). CrafterPlay contains 20,000 episodes and about 180M timesteps, while CrafterCaption is built from rule-based trajectory labeling and paraphrased caption augmentation (Park et al., 19 Aug 2025). In the reported retrieval benchmark, CrafterCLIP achieves R@1 = 89.8%, R@5 = 96.1%, R@10 = 90.6%, and MeanR = 1.4, far above a WebVid-trained CLIP4Clip baseline (Park et al., 19 Aug 2025). The same paper argues that hierarchical composition of behavior priors, vision-language grounding, and instruction-following controllers makes Crafter a lightweight, prototyping-friendly environment for open-ended embodied-agent research (Park et al., 19 Aug 2025).
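For readers unfamiliar with the retrieval metrics, the sketch below shows how R@k and MeanR are conventionally computed from the rank of the correct item for each query. The ranks are made-up example data, not CrafterCLIP results, and the function names are illustrative.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose correct item is ranked within the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def mean_rank(ranks):
    """Average rank of the correct item (1 is best)."""
    return sum(ranks) / len(ranks)


# Hypothetical 1-based ranks of the correct caption for ten video queries.
ranks = [1, 1, 1, 2, 1, 7, 1, 1, 3, 1]

r1 = recall_at_k(ranks, 1)     # 7 of 10 queries ranked first
r5 = recall_at_k(ranks, 5)     # 9 of 10 within the top 5
r10 = recall_at_k(ranks, 10)   # all 10 within the top 10
mean_r = mean_rank(ranks)
```

By construction R@k cannot decrease as k grows, since every item in the top 5 is also in the top 10.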

Crafter’s later history therefore shows a notable shift. It began as a compact visual RL benchmark, but it now also supports research on dynamic planning, language-grounded control, executable symbolic skills, and foundation-model-based embodied agents. That evolution reflects the original benchmark design: a procedurally generated survival world whose semantics are rich enough to sustain many methodological interpretations, but controlled enough to keep those interpretations experimentally tractable.