MineWorld: Real-Time Minecraft World Model

Updated 29 March 2026

MineWorld is a real-time, open-source interactive world model for Minecraft that unifies perceptual and action dynamics in a single autoregressive framework.
It employs a unified visual–action Transformer with specialized tokenizers for image and action inputs, and uses diagonal-parallel decoding for efficient inference.
The model scores well on action-control metrics and serves as a baseline in research on spatial consistency and controllable video generation, despite its limited long-term memory.

MineWorld is a real-time, open-source interactive world model for Minecraft, providing a unified framework for world simulation and policy learning. Unlike classical simulation environments or pure video prediction models, MineWorld models both perceptual and action dynamics as a single, token-based autoregressive process, achieving high frame rates and strong action-controllability in an open-ended, sandbox context. It is widely referenced as a baseline in research on world models, spatial consistency, controllable video generation, foundation models, and embodied agent evaluation within Minecraft-like environments (Guo et al., 11 Apr 2025).

1. Visual–Action Autoregressive Transformer Architecture

MineWorld employs a single, unified visual–action autoregressive Transformer to simulate forward dynamics. At time step $i$, the system receives an RGB Minecraft frame $x_i$ and a user action $a_i$ (mouse/keyboard inputs). Both modalities are converted to sequences of discrete token IDs using specialized tokenizers: a fine-tuned VQ-VAE for vision, and quantization of mouse angles and keypresses for actions.

The input to the Transformer is a concatenated sequence $t = [t_1, t_2, \dots, t_N]$, with vision and action tokens interleaved across successive steps. The model factorizes the sequence likelihood autoregressively, $f_\theta(t) = \prod_{j=1}^N p_\theta(t_j \mid t_{<j}),$ where some $t_j$ are image tokens and others are quantized action tokens. The Transformer is trained with a next-token cross-entropy loss: $\mathcal{L}(\theta) = -\sum_{j=1}^N \log p_\theta(t_j \mid t_{<j}).$ This formulation couples the modeling of state transitions ($x_{i+1}$ given $(x_{\leq i}, a_{\leq i})$) and action prediction (policy modeling of $a_i$ given $(x_{\leq i}, a_{<i})$) (Guo et al., 11 Apr 2025).
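The loss above can be illustrated with a minimal sketch. The `predict_probs` callable and the toy `uniform_model` are hypothetical stand-ins for the Transformer, introduced here only to show how the sum over $-\log p_\theta(t_j \mid t_{<j})$ is accumulated; they are not part of MineWorld's released code.

```python
import math

def next_token_nll(token_ids, predict_probs):
    """Sum of -log p(t_j | t_<j) over a token sequence.

    predict_probs(prefix) is a hypothetical stand-in for the Transformer: it
    returns a dict mapping each candidate token ID to its probability.
    """
    total = 0.0
    for j, target in enumerate(token_ids):
        probs = predict_probs(token_ids[:j])   # condition on t_<j only
        total -= math.log(probs[target])
    return total

def uniform_model(prefix, vocab_size=4):
    # Toy model: ignores the prefix and spreads probability evenly.
    return {tok: 1.0 / vocab_size for tok in range(vocab_size)}

sequence = [0, 3, 1, 2]   # stand-in for interleaved image/action token IDs
loss = next_token_nll(sequence, uniform_model)
```

With the uniform toy model every token costs $\log 4$, so the loss is simply the sequence length times $\log 4$; a trained model lowers this by assigning more probability to the observed tokens.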

2. Tokenization Schemes and Input Structure

The image tokenizer is a VQ-VAE (initialized from Amused) that compresses each RGB frame into a grid of discrete tokens at $16\times$ spatial compression. The action tokenizer quantizes camera rotations (11 bins per axis) and encodes keypresses into 7 classes, with boundary tokens ([aBOS], [aEOS]), resulting in 11 tokens per action.
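A rough sketch of such an action tokenizer follows. It assumes that the 11 tokens decompose as one start token, 7 keypress-class tokens, 2 camera tokens (pitch and yaw), and one end token; it also assumes uniform binning over a clipping range of ±10 degrees. The decomposition, the uniform bins, the `MAX_ANGLE` value, and the token naming are all illustrative assumptions rather than MineWorld's actual scheme.

```python
NUM_BINS = 11          # bins per camera axis, as stated in the text
NUM_KEY_CLASSES = 7    # keypress classes, as stated in the text
MAX_ANGLE = 10.0       # assumed clipping range in degrees (hypothetical)

def quantize_angle(angle):
    """Map an angle in [-MAX_ANGLE, MAX_ANGLE] to a bin index in 0..NUM_BINS-1."""
    clipped = max(-MAX_ANGLE, min(MAX_ANGLE, angle))
    frac = (clipped + MAX_ANGLE) / (2 * MAX_ANGLE)   # 0.0 .. 1.0
    return min(NUM_BINS - 1, int(frac * NUM_BINS))

def tokenize_action(key_classes, pitch, yaw):
    """Return the token list for one action: [aBOS], 7 keys, 2 camera, [aEOS]."""
    if len(key_classes) != NUM_KEY_CLASSES:
        raise ValueError("expected one value per keypress class")
    tokens = ["[aBOS]"]
    tokens += [f"key{i}:{v}" for i, v in enumerate(key_classes)]
    tokens += [f"pitch:{quantize_angle(pitch)}", f"yaw:{quantize_angle(yaw)}"]
    tokens.append("[aEOS]")
    return tokens

example = tokenize_action([1, 0, 0, 0, 0, 0, 0], pitch=0.0, yaw=-10.0)
```

Under these assumptions a zero camera movement lands in the middle bin (index 5), and every action produces exactly 11 tokens regardless of which keys are pressed.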

At each timestep, image and action tokens are interleaved, so that the tokenized sequence follows the order $[x_1, a_1, x_2, a_2, \dots]$. For 16 time steps, the total input is approximately $16\,(n + 11)$ tokens, where $n$ is the number of image tokens per frame. This discrete, densely interleaved sequence enables the Transformer to learn complex dependencies between perceptual and control states within a fixed-length context.
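The interleaving and the resulting sequence length can be sketched as below. The toy value of 4 image tokens per frame is a placeholder chosen for readability, not MineWorld's actual per-frame token count, and the function names are illustrative.

```python
ACTION_TOKENS = 11   # tokens per action, as stated in the text

def interleave(frame_tokens, action_tokens):
    """Flatten per-step token lists into [x_1, a_1, x_2, a_2, ...]."""
    if len(frame_tokens) != len(action_tokens):
        raise ValueError("need one action per frame")
    sequence = []
    for frame, action in zip(frame_tokens, action_tokens):
        sequence.extend(frame)
        sequence.extend(action)
    return sequence

def sequence_length(num_steps, image_tokens_per_frame):
    """Total tokens for num_steps interleaved (frame, action) pairs."""
    return num_steps * (image_tokens_per_frame + ACTION_TOKENS)

# Toy example: 3 steps, 4 image tokens per frame (placeholder value).
frames = [[f"x{i}_{k}" for k in range(4)] for i in range(3)]
actions = [[f"a{i}_{k}" for k in range(ACTION_TOKENS)] for i in range(3)]
seq = interleave(frames, actions)
```

Because the per-step cost is the sum of image and action tokens, the context budget is dominated by the image tokenizer's compression ratio; the 11 action tokens add only a small fixed overhead per step.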

3. Inference: Diagonal Parallel Decoding Algorithm

Autoregressive decoding of high-resolution video tokens is computationally expensive, leading to low practical frame rates. MineWorld introduces a diagonal-parallel decoding algorithm: each frame’s token grid is filled diagonal by diagonal, with all tokens along a diagonal predicted in parallel.

Because a frame then needs one decoding step per diagonal rather than one per token, the step count drops well below that of raster-scan decoding. Empirical frame rates fall in the 4–7 FPS range across the 300M to 1.2B parameter models, substantially exceeding standard autoregressive baselines (Guo et al., 11 Apr 2025).
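The idea of diagonal-by-diagonal decoding can be sketched as follows. This is a simplified illustration that groups grid positions by plain anti-diagonals (`row + col`); MineWorld's exact ordering, dependency pattern, and reported speedup may differ. The `predict_batch` callable is a hypothetical stand-in for one parallel forward pass of the model.

```python
def diagonal_schedule(height, width):
    """Group grid positions (row, col) by anti-diagonal index row + col."""
    steps = [[] for _ in range(height + width - 1)]
    for row in range(height):
        for col in range(width):
            steps[row + col].append((row, col))
    return steps

def decode_frame(height, width, predict_batch):
    """Fill a token grid one diagonal at a time.

    predict_batch(grid, positions) is a hypothetical stand-in for one forward
    pass of the model; it returns one token per requested position.
    """
    grid = [[None] * width for _ in range(height)]
    for positions in diagonal_schedule(height, width):
        tokens = predict_batch(grid, positions)   # one parallel step
        for (row, col), tok in zip(positions, tokens):
            grid[row][col] = tok
    return grid

schedule = diagonal_schedule(3, 4)
grid = decode_frame(3, 4, lambda g, pos: [r * 4 + c for r, c in pos])
```

Under this simplified scheme an $H \times W$ grid needs $H + W - 1$ model calls instead of $H \cdot W$, which is where the reduction in sequential steps comes from; in the 3 by 4 toy grid that is 6 steps instead of 12.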

4. Evaluation Protocols and Controllability Metrics

Beyond classical video-generation criteria (FVD, PSNR, SSIM, LPIPS), MineWorld introduces metrics for action fidelity:

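Inverse Dynamics Model (IDM): Given a pair of consecutive generated frames $(x_i, x_{i+1})$, a pretrained IDM predicts the underlying action $\hat{a}_i$. Precision, recall, and F1 for discrete action classification (grouped into seven and four binary action tasks) are macro-averaged.
Camera Movement L1 Loss: For camera controls, the L1 distance is computed between predicted and true quantization bins:

$L_1 = \frac{1}{T} \sum_{i=1}^{T} \lvert \hat{c}_i - c_i \rvert,$

where $c_i$ and $\hat{c}_i$ are the true and predicted camera bin indices at step $i$.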
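A minimal sketch of how the two action-fidelity metrics might be computed is given below. The task names, the convention of returning 0 when a task has no true positives, and the averaging of the camera error over steps are assumptions made for illustration; the actual evaluation code may group and average differently.

```python
def binary_f1(y_true, y_pred):
    """F1 for one binary action task (labels are 0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0   # assumed convention when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(true_by_task, pred_by_task):
    """Unweighted mean of per-task F1 scores."""
    scores = [binary_f1(true_by_task[k], pred_by_task[k]) for k in true_by_task]
    return sum(scores) / len(scores)

def camera_l1(true_bins, pred_bins):
    """Mean absolute difference between camera bin indices."""
    return sum(abs(t - p) for t, p in zip(true_bins, pred_bins)) / len(true_bins)

true_actions = {"forward": [1, 1, 0, 0], "jump": [0, 1, 0, 1]}
idm_actions = {"forward": [1, 0, 0, 0], "jump": [0, 1, 0, 1]}
f1 = macro_f1(true_actions, idm_actions)
cam = camera_l1([5, 6, 2], [5, 4, 3])
```

Macro-averaging gives every action task equal weight, so rare actions such as jumping count as much as frequent ones such as moving forward.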

Empirical results show MineWorld (1.2B) improving on FVD, action-classification F1, and camera L1 loss compared to the diffusion-based Oasis baseline (Guo et al., 11 Apr 2025).

5. Memory and Spatial Consistency: Current Limitations

MineWorld, while effective for short sequences and local controllability, retains only a sliding window (15 frames) of context. Benchmarking on LoopNav (Lian et al., 29 May 2025) shows sharp degradation in loop-closure spatial consistency: even at the shortest navigation ranges, SSIM, LPIPS, and FVD on return segments all reflect this degradation. Without explicit long-horizon memory (such as key–value storage, map memory, or learned room embeddings), the model “hallucinates” scene elements rather than returning to consistent renderings upon revisiting locations.
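The effect of a fixed sliding window can be shown with a short sketch. The `SlidingContext` class is a hypothetical illustration, not MineWorld's implementation; only the window size of 15 frames is taken from the text.

```python
from collections import deque

class SlidingContext:
    """Keep only the most recent `window` (frame, action) steps."""

    def __init__(self, window=15):   # 15 frames, as stated in the text
        self.steps = deque(maxlen=window)

    def push(self, frame_tokens, action_tokens):
        # Appending to a full deque silently drops the oldest step.
        self.steps.append((frame_tokens, action_tokens))

    def visible_frames(self):
        return [frame for frame, _ in self.steps]

ctx = SlidingContext(window=15)
for step in range(40):   # walk a loop of 40 steps
    ctx.push(f"frame{step}", f"action{step}")
oldest_visible = ctx.visible_frames()[0]
```

After a 40-step loop the starting frame has long since left the window, so a model conditioned only on this context has no record of what the starting location looked like when the agent returns to it.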

Recommendations include augmenting with coordinate-keyed external memory, hybrid egocentric map-latent memory, or clustering and attending to “room” embeddings, thereby supporting true loop closure and planning-ready simulation (Lian et al., 29 May 2025).
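One of these recommendations, a coordinate-keyed external memory, might look roughly like the sketch below. The `CoordinateMemory` class, its cell size, its yaw quantization, and the use of a plain dictionary are all hypothetical choices made for illustration; the cited work does not prescribe this particular design.

```python
class CoordinateMemory:
    """Store one latent per quantized (x, z, yaw) cell and look it up on revisit."""

    def __init__(self, cell_size=1.0, yaw_step=90.0):
        self.cell_size = cell_size   # assumed spatial resolution of a cell
        self.yaw_step = yaw_step     # assumed angular resolution in degrees
        self.store = {}

    def _key(self, x, z, yaw):
        return (int(x // self.cell_size), int(z // self.cell_size),
                int((yaw % 360.0) // self.yaw_step))

    def write(self, x, z, yaw, latent):
        self.store[self._key(x, z, yaw)] = latent

    def read(self, x, z, yaw):
        return self.store.get(self._key(x, z, yaw))   # None if never visited

memory = CoordinateMemory()
memory.write(10.2, -3.7, 45.0, latent="frame_at_start")
revisit = memory.read(10.9, -3.1, 80.0)   # same cell after walking a loop
unseen = memory.read(50.0, 50.0, 0.0)
```

Unlike a sliding window, a lookup keyed on position does not depend on how many steps have elapsed, so the stored latent is still available when the agent closes a loop and could be supplied to the model as extra conditioning.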

6. Position Relative to Contemporary Foundation and Multiplayer Models

Subsequent models such as Matrix-Game (17B parameters, 2.7K hours of data) employ two-stage training (unlabeled pretraining followed by action-labeled controllable generation), latent diffusion, and advanced action-conditional DiTs. They significantly surpass MineWorld in both controllability (keyboard and mouse accuracy) and physical consistency (object consistency) under the GameWorld Score suite (Zhang et al., 23 Jun 2025). Multiplayer extensions (Solaris) add cross-agent memory, grounding, and view coherence via multiplayer self-attention, staged causal/bidirectional/self-forcing training, and a data collection framework for coordinated episodes (Savva et al., 25 Feb 2026). These advances highlight MineWorld’s strengths in real-time single-agent modeling, while exposing its limitations in long-horizon memory and multi-agent environments.

7. Impact, Open Source Contributions, and Future Directions

MineWorld’s key contributions to the field are:

The first open-sourced, real-time, visual–action world model for Minecraft, unifying video and control streams in a single autoregressive Transformer
A diagonal parallel decoding strategy enabling frame rates of 4–7 FPS, with practical support for real-time interactive use
Introduction of action-following controllability metrics alongside perceptual ones, establishing a new evaluation protocol in open-ended simulated world tasks

All code, three pretrained checkpoints (300M, 700M, 1.2B), and data preparation are released publicly (Guo et al., 11 Apr 2025).

Given the current absence of explicit memory, research converges on the necessity of spatially-grounded memory modules for robust planning, spatial consistency, and reliable open-world simulation. This has set the roadmap for successor models (including MineWorld 2.0), aiming at multi-agent, infinite-horizon, and semantically structured world modeling (Lian et al., 29 May 2025, Zhang et al., 23 Jun 2025, Savva et al., 25 Feb 2026).