Training Agents Inside of Scalable World Models

This presentation explores Dreamer 4, a breakthrough in reinforcement learning that trains agents entirely within learned world models. The system achieves the remarkable feat of obtaining diamonds in Minecraft using only offline data—no environment interaction—while using 100 times less data than prior approaches. By combining efficient transformer architectures with a novel shortcut forcing objective, Dreamer 4 enables real-time interactive simulation and demonstrates that vast unlabeled video can be leveraged for agent training with minimal action-labeled data.
Script
In Minecraft, obtaining diamonds requires executing over 20,000 precise actions in sequence—mining trees, crafting tools, descending into caves, and navigating deadly lava. Dreamer 4 accomplishes this entire chain without ever touching the actual game, training exclusively inside a learned simulation of the Minecraft world.
The breakthrough comes from shortcut forcing, a training objective that collapses 64 denoising steps into just 4, without sacrificing fidelity. This efficiency unlocks real-time interactive play inside the world model, processing contexts spanning nearly 10 seconds of gameplay on a single graphics card.
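To make the step-count trade-off concrete, here is a minimal sketch of few-step sampling with a model that is told its step size. It is not Dreamer 4's implementation: the `Denoiser` signature, `few_step_sample`, and the `CountingOracle` stand-in are hypothetical names chosen for illustration, and the toy "network" simply returns the true target. It only shows that the sampling loop makes one model call per step, so 4 steps cost 16 times fewer calls than 64.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical signature: (noisy sample, signal level tau, step size) -> predicted clean sample
Denoiser = Callable[[Vector, float, float], Vector]


def few_step_sample(model: Denoiser, noise: Vector, num_steps: int) -> Vector:
    """Move from pure noise (tau = 0) to a clean sample (tau = 1) in num_steps Euler steps."""
    step = 1.0 / num_steps
    x = list(noise)
    for i in range(num_steps):
        tau = i * step
        clean_guess = model(x, tau, step)  # one model call per sampling step
        # Velocity implied by the clean prediction: (clean - x) / (1 - tau)
        x = [xi + step * (ci - xi) / (1.0 - tau) for xi, ci in zip(x, clean_guess)]
    return x


class CountingOracle:
    """Toy stand-in for a trained network: returns the true target and counts its calls."""

    def __init__(self, target: Vector) -> None:
        self.target = list(target)
        self.calls = 0

    def __call__(self, x: Vector, tau: float, step: float) -> Vector:
        self.calls += 1
        return list(self.target)


target = [0.5, -1.0, 2.0]
noise = [1.5, 0.25, -0.75]

fast = CountingOracle(target)
slow = CountingOracle(target)
fast_sample = few_step_sample(fast, noise, num_steps=4)
slow_sample = few_step_sample(slow, noise, num_steps=64)
print(fast.calls, slow.calls)
```

Because the oracle is perfect, both runs land on the target. A real network only stays accurate at large step sizes if its training objective teaches it to, which is the role the script assigns to shortcut forcing.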
How does the system achieve both speed and accuracy?
The transformer architecture makes strategic trade-offs: separating spatial and temporal attention, applying time-based processing sparsely, and compressing memory usage. The three-phase training pipeline starts with vast unlabeled video, adds a small amount of action supervision, then refines the policy purely through imagined experience.
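A rough cost model shows why separating spatial and temporal attention, and using temporal layers sparsely, saves compute. The sketch below counts query-key pairs only. It assumes each layer is either spatial or temporal, and the frame count, tokens per frame, layer count, and temporal spacing are made-up illustrative values, not figures from the paper.

```python
def full_attention_pairs(frames: int, tokens: int, layers: int) -> int:
    """Every token attends to every token in every frame, in every layer."""
    return layers * (frames * tokens) ** 2


def factorized_attention_pairs(frames: int, tokens: int, layers: int, temporal_every: int) -> int:
    """Spatial layers attend within one frame; temporal layers attend across frames
    at a fixed spatial position. One layer in every `temporal_every` is temporal."""
    temporal_layers = layers // temporal_every
    spatial_layers = layers - temporal_layers
    spatial_cost = spatial_layers * frames * tokens ** 2
    temporal_cost = temporal_layers * tokens * frames ** 2
    return spatial_cost + temporal_cost


# Illustrative sizes only (assumptions, not the model's real configuration)
frames, tokens, layers, temporal_every = 192, 256, 24, 4
full = full_attention_pairs(frames, tokens, layers)
factorized = factorized_attention_pairs(frames, tokens, layers, temporal_every)
print(f"full: {full:,}  factorized: {factorized:,}  ratio: {full / factorized:.1f}x")
```

The count ignores causal masking, feed-forward layers, and memory optimizations, so it indicates the scale of the saving rather than predicting runtime.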
The results speak directly to capability. Dreamer 4 achieves a 0.7% success rate at obtaining diamonds. That is modest in absolute terms, but it makes Dreamer 4 the first system to complete this 20,000-action chain from offline data alone. Early milestones like crafting stone tools succeed over 90% of the time, with imagination training consistently outperforming pure imitation.
Perhaps more striking than task performance is data efficiency. With just 100 hours of action-labeled video, the model recovers 85% of the quality achieved with full supervision, roughly a 25-fold reduction in labeled data relative to the 2,500-hour dataset. Actions learned in the Overworld transfer to alien dimensions like the Nether without explicit training, and the model supports true interactive play, accurately simulating complex mechanics that competing systems hallucinate or fail to maintain.
Dreamer 4 demonstrates that agents can learn long-horizon behavior inside learned simulations, using predominantly unlabeled video and no environment interaction. The diamond challenge, once requiring millions of YouTube frames, now yields to 2,500 hours of offline contractor data and imagination alone. Visit EmergentMind.com to explore this research further and create your own presentation videos.