- The paper introduces Matrix-Game 2.0, a real-time video synthesis system that uses auto-regressive diffusion and few-step distillation to achieve 25 FPS.
- It employs a robust data production pipeline from Unreal Engine and GTA5 along with an action injection module to enable precise, interactive control.
- Experimental results show enhanced visual aesthetics, temporal consistency, and adaptability across diverse scenes, pointing to potential future improvements.
Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model
Introduction and Objectives
Matrix-Game 2.0 redefines interactive world modeling with a real-time video generation framework built on auto-regressive diffusion, moving beyond traditional bidirectional approaches that hinder real-time performance. The system pairs this model with a data production pipeline that generates massive, annotated video datasets from Unreal Engine and GTA5 environments. The core challenge of simulating dynamic, real-world interactions on the fly is addressed by combining an action injection module with few-step distillation, yielding real-time video synthesis at 25 FPS with diverse scene adaptability.
Figure 1: Real-time Interactive Generation Results demonstrating Matrix-Game 2.0's capability to generate high-quality interactive videos.
Architecture and Methodology
Matrix-Game 2.0 is anchored by three pivotal components:
- Data Production Pipeline: This system generates 1200 hours of high-quality video data. Unreal Engine and GTA5 environments facilitate the collection of richly annotated datasets essential for training. By leveraging a navigation mesh-based path-planning system and quaternion precision optimization, it ensures precise data with robust alignment between visual content and control signals (Figure 2 and Figure 3).
- Action Injection Module: This module injects frame-level mouse and keyboard inputs as conditioning signals, enabling precise interaction with the generated content. It grounds the model's responses in user-driven actions rather than language-driven dynamics, strengthening its purely visual and physical reasoning (Figure 4).
- Auto-Regressive Diffusion Framework: Matrix-Game 2.0 employs a self-forcing distillation approach to generate minute-level videos in real time while preserving temporal consistency, eliminating the latency inherent in traditional bidirectional models. The framework leverages KV-caching for efficient sequential generation (Figure 5).
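To make the action injection idea concrete, here is a minimal sketch of frame-level action conditioning. The real module operates inside a diffusion transformer with learned embeddings; the action set, dimensions, and function names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch: turn per-frame keyboard/mouse signals into additive
# conditioning vectors. Action names, sizes, and projections are assumptions.

KEYS = ["forward", "back", "left", "right", "jump", "attack"]  # assumed action set
DIM = 16  # assumed conditioning width

rng = np.random.default_rng(0)
key_table = rng.normal(size=(len(KEYS), DIM))   # learned embeddings in practice
mouse_proj = rng.normal(size=(2, DIM))          # projects (dx, dy) mouse deltas

def action_embedding(pressed, mouse_delta):
    """Sum embeddings of pressed keys plus a projection of mouse motion."""
    emb = np.zeros(DIM)
    for key in pressed:
        emb += key_table[KEYS.index(key)]
    emb += np.asarray(mouse_delta) @ mouse_proj
    return emb

def inject(frame_features, pressed, mouse_delta):
    """Condition one frame's latent features on that frame's actions."""
    return frame_features + action_embedding(pressed, mouse_delta)

# Condition a dummy latent frame on "forward + jump" with a small camera pan.
frame = np.zeros(DIM)
conditioned = inject(frame, ["forward", "jump"], (0.1, -0.05))
```

Because the conditioning is applied per frame, the generated video can react to a new input at every step rather than only at sequence boundaries.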
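The auto-regressive loop with KV-caching described above can be sketched as follows. This is a toy stand-in under stated assumptions: the chunk size, the 4-step distilled denoising schedule, and the `denoise` function are illustrative placeholders for the causal student model, and a Python list stands in for the cached keys/values of past frames.

```python
import numpy as np

# Toy sketch of chunk-wise auto-regressive generation with a KV-cache and a
# few-step distilled denoiser. All names and schedules are assumptions.

CHUNK = 4   # frames generated per auto-regressive step (assumed)
STEPS = 4   # few-step distilled denoising schedule (assumed)
DIM = 8     # latent width (assumed)

rng = np.random.default_rng(1)

def denoise(noisy, cache, step):
    """Stand-in for the causal student model: attends only to cached context."""
    context = np.mean(cache, axis=0) if cache else np.zeros(DIM)
    blend = (step + 1) / STEPS
    return noisy * (1 - blend) + context * blend

def generate(num_chunks):
    cache = []   # plays the role of cached keys/values of already-generated frames
    video = []
    for _ in range(num_chunks):
        chunk = rng.normal(size=(CHUNK, DIM))            # start each chunk from noise
        for step in range(STEPS):                        # few denoising steps only
            chunk = np.stack([denoise(f, cache, step) for f in chunk])
        cache.extend(chunk)   # past frames are frozen and reused, never recomputed
        video.extend(chunk)
    return np.stack(video)

video = generate(num_chunks=3)   # 12 frames, each conditioned only on the past
```

The key property mirrored here is causality: each chunk depends only on previously generated (cached) frames, which is what allows streaming output at a fixed per-frame cost instead of bidirectional attention over the full sequence.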
Figure 6: Pipelines of Matrix-Game 2.0.
Figure 2: Overview of Our Data Production Pipeline based on Unreal Engine.
Figure 7: Trajectory Examples of Collected Unreal Engine Data.
Figure 4: Overview of Matrix-Game 2.0 Architecture.
Figure 5: Causal Student Model Initialization via ODE Trajectories.
Experimental Results
Extensive experiments demonstrate Matrix-Game 2.0's robust performance across various domains. Qualitative results highlight superior visual aesthetics and temporal coherence: the model maintains high-quality output over long sequences, unlike previous models whose videos collapse into static frames after the initial interactions.
Minecraft Scene Generation: Matrix-Game 2.0 outperforms the Oasis model on extended sequences, maintaining visual quality and interaction fidelity through precise action controllability across diverse scenarios.
Wild Scene Generation: On out-of-domain scenes, the model shows significantly improved generalization, with stable style retention and fast generation (Figure 8, Figure 9).
Figure 8: Qualitative Comparisons on Wild Scene Generations.
Figure 9: Long Video Generations of Matrix-Game 2.0.
Limitations and Future Work
While Matrix-Game 2.0 delivers clear advances, it still struggles with some out-of-domain scenes, occasionally producing oversaturated or degraded video. Scaling the model architecture and diversifying the training data could address these issues. Moreover, integrating explicit memory mechanisms might improve consistency and history preservation over extended video sequences.
Figure 10: Bad cases. Matrix-Game-V2 sometimes fails when handling out-of-domain scenes, like producing over-saturated (left) or degraded (right) results.
Conclusion
Matrix-Game 2.0 represents a strategic leap in real-time interactive video generation, highlighting the transition from traditional bidirectional diffusion to a streamlined, auto-regressive paradigm. Future work focused on augmenting generalization and consistency can further establish Matrix-Game 2.0 as a foundational tool in dynamic world simulations, offering substantial applicability in gaming, autonomous systems, and augmented reality environments. The model's open-source nature encourages continued exploration and enhancement by the research community.