CrafterDojo: Scalable Embodied AI Testbed

Updated 8 March 2026

CrafterDojo is a suite of foundation models, automated data pipelines, and reference agents tailored to transform the Crafter testbed for embodied AI research.
It enables rapid prototyping by offering fast, lightweight simulation and reduced engineering overhead compared to Minecraft-based systems.
The framework integrates specialized modules—CrafterVPT, CrafterCLIP, and CrafterSteve-1—with extensive benchmarks for robust, reproducible research.

CrafterDojo is a suite of foundation models, automated data pipelines, agent implementations, benchmarks, and an open-source codebase specifically designed to transform the Crafter environment into a scalable, efficient, and Minecraft-like testbed for research in general-purpose embodied agents. Developed in response to limitations in the prevailing Minecraft-based workflow for embodied intelligence—including slow simulation and high engineering overhead—CrafterDojo introduces foundation-model workflows akin to those dominating modern Minecraft agent research, but adapted for the lightweight, fully Python-based Crafter domain. Its primary components encompass CrafterVPT for behavioral priors, CrafterCLIP for vision–language grounding, CrafterSteve-1 for instruction following, and a comprehensive set of tools and reference agents, all supported by robust datasets and data-generation pipelines (Park et al., 19 Aug 2025).

1. Motivation and Rationale

The predominant paradigm in contemporary embodied-agent research involves pretraining large-scale foundation models—spanning behavioral and vision-language priors, as well as instruction-following models—on diverse demonstration or video–text datasets. Within this framework, Minecraft has emerged as a canonical environment due to its complexity and the abundance of internet-scale data, driving the development of influential foundation models such as VPT, MineCLIP, and Steve-1. However, Minecraft’s heavy resource demands, slow simulation speed, and restricted modifiability hinder rapid prototyping and system-level experimentation.

Crafter, in contrast, offers a 2D, grid-based, Python-native analog to Minecraft that retains key open-ended challenges (procedural generation, resource collection, survival, combat, and crafting) but executes orders of magnitude faster and with lower complexity for modification. Prior research on Crafter was constrained to end-to-end RL targeting narrow tasks, primarily due to the lack of large-scale expert data and pretrained foundation models, which foreclosed the transfer of Minecraft-style workflows.

CrafterDojo was designed to address this deficit by providing:

A full triad of Minecraft-style foundation models (CrafterVPT as behavioral prior, CrafterCLIP for vision–language embedding, CrafterSteve-1 for instruction following),
Automated, scalable pipelines for behavior and caption data (CrafterPlay and CrafterCaption),
Reference agent baselines covering multiple paradigms,
Extensive benchmarks for behavioral and compositional agent evaluation,
A complete open-source repository for transparent, reproducible research.

2. Model Architecture and Core Components

2.1 CrafterVPT (C-VPT): Behavioral Policy Pretraining

CrafterVPT is a foundation model for behavioral cloning of expert policies in Crafter. Its architecture is as follows:

Input: Pixel observations $o_t \in \mathbb{R}^{3 \times 144 \times 144}$
Image encoder: ResNet maps $o_t$ to visual tokens $x_t$
Temporal model: Transformer-XL aggregates token streams $x_{1:t} \to \tilde{x}_{1:t}$
Policy head: Outputs a categorical distribution over 17 discrete actions $\pi_\theta(a_t | \tilde x_{1:t})$

The loss function employed is the negative log-likelihood over action-labeled expert trajectories: $\mathcal{L}_{\rm cvpt} = \mathbb{E}_{(o_{1:t},a_t)\sim\mathcal D_{\rm play}} \bigl[-\log \pi_\theta(a_t\,|\,o_{1:t})\bigr]$
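The loss above can be illustrated with a minimal, framework-free sketch. It assumes the policy head has already produced one vector of 17 logits per labeled step; the helper names `log_softmax` and `bc_nll` are hypothetical and are not taken from the CrafterDojo codebase.

```python
import math

NUM_ACTIONS = 17  # size of the discrete action space stated above


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def bc_nll(logits_per_step, expert_actions):
    """Mean of -log pi(a_t | history) over the labeled steps."""
    total = 0.0
    for logits, action in zip(logits_per_step, expert_actions):
        total -= log_softmax(logits)[action]
    return total / len(expert_actions)


# Uniform logits give the chance-level loss ln(17), about 2.833.
uniform = [[0.0] * NUM_ACTIONS for _ in range(4)]
print(bc_nll(uniform, [0, 5, 16, 3]))
```

A loss well below ln(17) on held-out trajectories indicates that the policy predicts expert actions better than a uniform guess.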

CrafterVPT is trained using the CrafterPlay dataset, comprising 20,000 expert PPO-RNN policy trajectories (total ≈180M frames), filtered to exclude short no-op segments.

2.2 CrafterCLIP (C-CLIP): Vision–Language Grounding

CrafterCLIP provides grounded, domain-specific vision–language retrieval:

Video encoder: Frame-wise ResNet features followed by a Transformer for aggregating 6-frame video segments.
Text encoder: CLIP-Transformer (ViT) text encoder.

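Training employs a contrastive video–text loss over paraphrased caption sets: $\mathcal{L}_{\rm cclip} = -\sum_{b=1}^B \log\frac{\exp\bigl(\mathrm{sim}(E_V(\mathbf{o}_b),\,E_T(c'_b))\bigr)}{\sum_{k=1}^B\exp\bigl(\mathrm{sim}(E_V(\mathbf{o}_b),\,E_T(c'_k))\bigr)}$

Here $E_V$ and $E_T$ are the video and text encoders, $\mathbf{o}_b$ is the $b$-th video segment in a batch of size $B$, and $c'_b$ is a paraphrase of its matched caption; the denominator scores that segment against every caption in the batch. For each template, ≈40 paraphrase captions are generated via LLM prompting, totaling about 2.44M captions across 61 action templates.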
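As a rough illustration, the sketch below computes a video-to-text contrastive loss in the standard InfoNCE-style form, in which each video segment is contrasted against every caption in the batch. Cosine similarity and the `temperature` value are assumptions made for this example; the source does not state which similarity function or temperature CrafterCLIP uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def video_to_text_infonce(video_emb, text_emb, temperature=0.07):
    """Sum over the batch of -log softmax score of each matched pair.

    video_emb[b] and text_emb[b] are assumed to be a matched pair.
    """
    loss = 0.0
    for b, v in enumerate(video_emb):
        scores = [cosine(v, t) / temperature for t in text_emb]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        loss -= scores[b] - log_z
    return loss


videos = [[1.0, 0.0], [0.0, 1.0]]    # toy video embeddings
captions = [[1.0, 0.0], [0.0, 1.0]]  # toy caption embeddings, aligned by index
aligned = video_to_text_infonce(videos, captions)
swapped = video_to_text_infonce(videos, captions[::-1])
print(aligned, swapped)  # the aligned batch has the much smaller loss
```

The loss is small only when each segment is closer to its own caption than to the other captions in the batch, which is the property that retrieval metrics such as Recall@1 later measure.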

2.3 CrafterSteve-1 (C-Steve-1): Instruction-Following Policy

CrafterSteve-1 extends CrafterVPT, integrating CrafterCLIP goal embeddings for instruction conditioning: $\begin{aligned} & x_t = \mathrm{ResNet}_\theta(o_t) \\ & \tilde x_{1:t} = \mathrm{TrXL}_\theta(x_{1:t}) \\ & \tilde x'_{1:t} = \tilde x_{1:t} + W_\theta\,z_{1:t}+b_\theta \\ & a_t\sim\pi_\theta(a_t\,|\,\tilde x'_{1:t}) \end{aligned}$

The goal embedding $z_{1:t}$ is sourced from C-CLIP, and the “Head Conditioning” mechanism is empirically preferred. Training relies on event-based hindsight relabelling from CrafterPlay.

For natural language goals at inference, a CVAE prior trained on 120k video–text pairs generates goal embeddings: $\mathcal{L}_{\rm prior} = \mathbb{E}_{(z_v,z_t)} \Bigl[\mathrm{KL}\bigl(q_\phi(z_v|z_t)\,\|\,p(z_v)\bigr) -\mathbb{E}_{c\sim q_\phi}[\log p_\phi(z_v|c,z_t)]\Bigr]$

Classifier-free guidance is used at inference with optimal guidance scale λ ≈ 1.5.
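The source gives the guidance scale but not the combination rule. The sketch below assumes the common logit-space form of classifier-free guidance, in which goal-conditioned logits are pushed away from unconditional logits by a factor λ; the function names and the toy logits are hypothetical.

```python
import math


def guided_logits(cond_logits, uncond_logits, scale=1.5):
    """Assumed rule: (1 + scale) * conditional - scale * unconditional."""
    return [(1.0 + scale) * c - scale * u
            for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


cond = [2.0, 1.0, 0.0]    # hypothetical goal-conditioned logits
uncond = [1.0, 1.0, 1.0]  # hypothetical logits with the goal dropped
# Guidance sharpens the preference for the action the goal favors.
print(softmax(cond)[0], softmax(guided_logits(cond, uncond))[0])
```

With `scale=0.0` the rule reduces to the plain conditional policy, so the scale acts as a dial between ordinary sampling and stronger adherence to the instruction.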

3. Automated Data Pipelines and Toolkit Infrastructure

3.1 CrafterPlay

CrafterPlay is the data-generation pipeline for expert behavioral trajectories. An expert PPO-RNN policy is trained for 10B steps (reaching 97.5% Crafter Score and 98.4% normalized return) and then deployed to collect 20,000 episodes of ≈9,012 steps each (20,000 × 9,012 ≈ 180M frames). Selective no-op filtering reduces the share of idle frames from 60% to 4.6%.
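The exact filtering rule is not given here. As one possible reading, the sketch below drops runs of consecutive no-op actions that are shorter than a threshold and keeps longer runs. The no-op index `NOOP = 0`, the threshold `min_run`, and both function names are assumptions made for illustration only.

```python
NOOP = 0  # assumed index of the no-op action


def filter_short_noops(actions, min_run=8):
    """Return indices of steps to keep; no-op runs shorter than min_run are dropped."""
    keep, i = [], 0
    while i < len(actions):
        if actions[i] != NOOP:
            keep.append(i)
            i += 1
            continue
        # Find the end of the current run of no-ops.
        j = i
        while j < len(actions) and actions[j] == NOOP:
            j += 1
        if j - i >= min_run:
            keep.extend(range(i, j))
        i = j
    return keep


def idle_fraction(actions):
    """Share of steps whose action is the no-op."""
    return sum(a == NOOP for a in actions) / len(actions)


actions = [0, 0, 5, 1, 0, 3, 0, 0, 0, 2]  # toy action sequence
kept = filter_short_noops(actions, min_run=3)
filtered = [actions[i] for i in kept]
print(idle_fraction(actions), idle_fraction(filtered))
```

Returning indices rather than actions lets the same selection be applied to the observation frames, so that frames and action labels stay aligned after filtering.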

3.2 CrafterCaption

CrafterCaption generates dense, aligned video–caption pairs based on a rule engine derived from 15 environment event types (61 template captions). Each step in CrafterPlay is scanned for rule matches, with matched frame segments paired with captions. Linguistic diversity is induced via an LLM, generating 40 paraphrases per template for ≈2.44M total captions.
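A toy version of this rule-matching step might look like the sketch below. The single "collect" rule, the template and paraphrase tables, and the choice to end each 6-frame segment at the step where the rule fires are all hypothetical stand-ins for the real 15 event types and 61 templates.

```python
import random

# Hypothetical tables; the real pipeline is described as having 61 templates
# with about 40 LLM-generated paraphrases each.
TEMPLATES = {("collect", "wood"): "collect wood"}
PARAPHRASES = {"collect wood": ["collect wood", "chop a tree for wood"]}
SEGMENT_LEN = 6  # frames per clip, matching the C-CLIP video encoder


def detect_events(prev_inv, inv):
    """Toy rule: an inventory count that increased is a 'collect' event."""
    return [("collect", item) for item, n in inv.items()
            if n > prev_inv.get(item, 0)]


def caption_episode(inventories, rng):
    """Return (start, end, caption) for each step where a rule fires."""
    pairs = []
    for t in range(1, len(inventories)):
        for event in detect_events(inventories[t - 1], inventories[t]):
            template = TEMPLATES.get(event)
            if template is None:
                continue  # no caption template for this event
            start = max(0, t - SEGMENT_LEN + 1)
            pairs.append((start, t + 1, rng.choice(PARAPHRASES[template])))
    return pairs


inventories = [{"wood": 0}] * 7 + [{"wood": 1}]  # wood is collected at step 7
pairs = caption_episode(inventories, random.Random(0))
print(pairs)
```

Because captions come from environment events rather than from a learned captioner, the alignment between a segment and its caption is exact by construction; the paraphrase sampling only adds linguistic variety.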

4. Reference Agent Libraries

CrafterDojo includes a set of agents designed to showcase use cases of its foundation models:

C-VPT-RL: Fine-tunes CrafterVPT with PPO+LoRA and KL constraint, probing the sufficiency of behavioral priors for multi-step tasks.
C-CLIP-Prompt: Uses C-CLIP to retrieve the most relevant demonstration for a goal, providing its embedding to C-Steve-1.
PPO-Steve (Hierarchical): A high-level PPO-based planner chooses among the 61 template captions every 10 steps; the low-level C-Steve-1 executes the chosen instruction.
Heuristic-Steve: Employs a deterministic inventory-based heuristic planner for sequential instruction generation, executed by C-Steve-1.
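The hierarchical agents in this list share one control pattern: a planner emits an instruction at a fixed interval, and an instruction-following policy acts on it in between. The sketch below shows that loop with stub components; `run_hierarchical`, the stub planner, the stub low-level policy, and the integer "observation" are hypothetical placeholders, not CrafterDojo APIs.

```python
def run_hierarchical(env_step, planner, low_level, obs, horizon, replan_every=10):
    """The planner picks an instruction every `replan_every` steps."""
    instruction, log = None, []
    for t in range(horizon):
        if t % replan_every == 0:
            instruction = planner(obs)
        action = low_level(obs, instruction)
        obs = env_step(obs, action)
        log.append((t, instruction, action))
    return log


# Stub components: the observation is just a step counter.
def stub_planner(obs):
    return "collect wood" if obs < 10 else "place table"


def stub_low_level(obs, instruction):
    return 1  # always the same action, for illustration only


def stub_env_step(obs, action):
    return obs + 1


log = run_hierarchical(stub_env_step, stub_planner, stub_low_level, 0, horizon=25)
print(log[0][1], log[10][1])  # collect wood place table
```

Swapping `stub_planner` for a learned policy corresponds to the PPO-Steve setting, while an inventory-based rule corresponds to Heuristic-Steve; the low-level executor stays the same in both cases.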

5. Benchmarking and Empirical Results

Table: Foundation Model Performance Metrics

Policy models
Model	Crafter Score (%)	Normalized Return (%)	Task
Expert Policy (PPO-RNN)	97.5	98.4	Baseline
C-VPT (base, 15.9M params)	61.0 ± 3.0	71.8 ± 0.1	Behavioral prior
C-VPT (large)	61.4 ± 4.7	71.3 ± 0.1	Behavioral prior
Dedieu et al. (2025)	31.8	69.7	Baseline

Retrieval models
Model	Recall@1 (%)	Mean Rank	Task
C-CLIP	89.8	1.4	Retrieval
CLIP4Clip	1.7	—	Retrieval (WebVid)

CrafterDojo foundation models substantially outperform prior published approaches. For behavioral priors, C-VPT nearly doubles the Crafter Score of Improved Transformer World Models (Dedieu et al.), at 61.0% versus 31.8%. C-CLIP attains a Recall@1 of 89.8% on domain-specific retrieval tasks, versus 1.7% for generic CLIP4Clip. C-Steve-1 achieves near-100% success on all benchmark single-instruction goals within 10–20 steps (compared to ≤50% for unconditional C-VPT), and PPO-Steve demonstrates robust task compositionality on multi-step, sparse-reward tasks (≥80% success in most cases).
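For readers unfamiliar with the retrieval metrics quoted above, the sketch below computes Recall@1 and mean rank from a query-by-candidate similarity matrix. It assumes the matched candidate for query `i` sits at index `i`; the evaluation protocol actually used for C-CLIP (retrieval direction, candidate pool size) is not specified here, and the toy scores are invented for illustration.

```python
def retrieval_metrics(similarity):
    """similarity[i][j]: score of query i against candidate j; the match is j == i."""
    ranks = []
    for i, row in enumerate(similarity):
        # Rank 1 means no candidate scores higher than the matched one.
        rank = 1 + sum(score > row[i] for score in row)
        ranks.append(rank)
    recall_at_1 = sum(r == 1 for r in ranks) / len(ranks)
    mean_rank = sum(ranks) / len(ranks)
    return recall_at_1, mean_rank


toy_scores = [
    [0.9, 0.1, 0.0],  # query 0: match ranked first
    [0.2, 0.8, 0.1],  # query 1: match ranked first
    [0.7, 0.3, 0.5],  # query 2: match ranked second
]
recall, mean_rank = retrieval_metrics(toy_scores)
print(recall, mean_rank)
```

A mean rank close to 1 together with a high Recall@1 means the correct item is almost always the top result, which is the pattern the C-CLIP row reports.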

Ablations demonstrate that expert training length and dataset scale yield monotonic improvements up to 10B steps and 18k episodes, that paraphrase count for captions is optimal at N≈40, and that event-based hindsight relabelling is robust for both short- and long-horizon tasks.

6. Open-Source Codebase and Reproducibility

CrafterDojo is distributed as an open-source repository (https://github.com/frechele/CrafterDojo) with modular directories for all core components, pipelines, and agent behaviors. All pretrained model weights, data generation scripts, and evaluation code are available, and standard Python installation is supported.

Example scripts are provided for rapid training and evaluation of both tiny and large model variants, with the ability to bypass compute-intensive data collection via pretrained asset downloads. All scripts are documented to facilitate rapid prototyping and extension.

7. Significance and Impact

CrafterDojo provides a “Minecraft-style” foundation-model pipeline in the lightweight Crafter domain, delivering key innovations including modular model architectures, scalable automated data pipelines, compositional agent frameworks, and extensive benchmarks. It enables embodied AI researchers to iterate on general-purpose, open-ended agent approaches efficiently and at scale, mitigating the prohibitive resource and engineering overhead associated with prior Minecraft-centric workflows (Park et al., 19 Aug 2025). The open, well-documented ecosystem positions CrafterDojo as a rapid-prototyping substrate and a proxy environment for research intended for later transfer to higher-complexity 3D worlds.

References (1)

CrafterDojo: A Suite of Foundation Models for Building Open-Ended Embodied Agents in Crafter (2025)
