Matrix-Game: Interactive World Foundation Model (2506.18701v1)

Published 23 Jun 2025 in cs.CV and cs.AI

Abstract: We introduce Matrix-Game, an interactive world foundation model for controllable game world generation. Matrix-Game is trained using a two-stage pipeline that first performs large-scale unlabeled pretraining for environment understanding, followed by action-labeled training for interactive video generation. To support this, we curate Matrix-Game-MC, a comprehensive Minecraft dataset comprising over 2,700 hours of unlabeled gameplay video clips and over 1,000 hours of high-quality labeled clips with fine-grained keyboard and mouse action annotations. Our model adopts a controllable image-to-world generation paradigm, conditioned on a reference image, motion context, and user actions. With over 17 billion parameters, Matrix-Game enables precise control over character actions and camera movements, while maintaining high visual quality and temporal coherence. To evaluate performance, we develop GameWorld Score, a unified benchmark measuring visual quality, temporal quality, action controllability, and physical rule understanding for Minecraft world generation. Extensive experiments show that Matrix-Game consistently outperforms prior open-source Minecraft world models (including Oasis and MineWorld) across all metrics, with particularly strong gains in controllability and physical consistency. Double-blind human evaluations further confirm the superiority of Matrix-Game, highlighting its ability to generate perceptually realistic and precisely controllable videos across diverse game scenarios. To facilitate future research on interactive image-to-world generation, we will open-source the Matrix-Game model weights and the GameWorld Score benchmark at https://github.com/SkyworkAI/Matrix-Game.

Summary

The paper presents a 17B-parameter latent diffusion model that leverages a 3D Causal VAE and multi-modal transformer for controllable, long-form video generation in Minecraft.
The paper constructs the Matrix-Game-MC dataset with over 2,700 hours of unlabeled and 1,000 hours of fine-grained action-labeled gameplay, ensuring diverse and robust training data.
The paper introduces the GameWorld Score benchmark, offering unified evaluation across visual, temporal, controllability, and physical consistency dimensions to validate model performance.

Matrix-Game: An Interactive World Foundation Model for Controllable Game World Generation

Matrix-Game introduces a large-scale, action-controllable world model for interactive video generation in open-ended environments, with a primary focus on Minecraft. The work addresses key challenges in world modeling: scalable data acquisition, fine-grained controllability, and unified evaluation. The authors present three main contributions: (1) the Matrix-Game-MC dataset, (2) the Matrix-Game model architecture, and (3) the GameWorld Score benchmark.

Dataset Construction: Matrix-Game-MC

The dataset is constructed to support both environment understanding and action-conditioned generation. It comprises over 2,700 hours of unlabeled Minecraft gameplay videos and more than 1,000 hours of high-quality, action-labeled clips with fine-grained keyboard and mouse annotations. The data curation pipeline employs hierarchical filtering for video quality, aesthetics, menu-state, subtitles, human faces, motion, and camera movement, ensuring high-fidelity and informative training samples.
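The hierarchical filtering described above can be pictured as an ordered chain of pass/fail stages, where a clip is dropped at the first stage it fails. The following is a minimal sketch under stated assumptions: the `ClipStats` fields, the stage order, and every threshold are hypothetical placeholders, not values or code from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class ClipStats:
    """Hypothetical per-clip measurements; field names are illustrative only."""
    clip_id: str
    quality: float       # technical video quality score in [0, 1]
    aesthetics: float    # aesthetic score in [0, 1]
    in_menu: bool        # True if the clip shows a menu or inventory screen
    has_subtitles: bool
    has_face: bool       # True if a human face (e.g. streamer overlay) is visible
    motion: float        # mean motion magnitude
    camera_speed: float  # mean camera movement per frame


# Ordered filter stages. All thresholds are placeholders, not paper values.
STAGES: List[Tuple[str, Callable[[ClipStats], bool]]] = [
    ("quality", lambda c: c.quality >= 0.5),
    ("aesthetics", lambda c: c.aesthetics >= 0.4),
    ("menu", lambda c: not c.in_menu),
    ("subtitles", lambda c: not c.has_subtitles),
    ("faces", lambda c: not c.has_face),
    ("motion", lambda c: 0.05 <= c.motion <= 2.0),
    ("camera", lambda c: c.camera_speed <= 1.0),
]


def filter_clips(clips: Iterable[ClipStats]) -> Tuple[List[ClipStats], Dict[str, int]]:
    """Keep clips that pass every stage; count rejections by first failing stage."""
    kept: List[ClipStats] = []
    rejected: Dict[str, int] = {}
    for clip in clips:
        for name, passes in STAGES:
            if not passes(clip):
                rejected[name] = rejected.get(name, 0) + 1
                break
        else:
            # No stage rejected the clip.
            kept.append(clip)
    return kept, rejected
```

Counting rejections per stage is a design choice made here for illustration: it makes it easy to see which filter removes the most data when thresholds are tuned.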

For labeled data, the authors combine trajectories from MineRL-based exploration agents and procedurally generated Unreal Engine environments. The action-labeled data is carefully balanced across 14 Minecraft biomes, with explicit constraints on camera motion and engine modifications to ensure temporal and spatial consistency. This results in a dataset that is both semantically diverse and structurally robust, supporting generalization across a wide range of scenarios.

Model Architecture and Training

Matrix-Game is a 17B-parameter latent diffusion model operating in a spatiotemporally compressed latent space via a 3D Causal VAE. The model adopts an image-to-world generation paradigm, conditioning on a single reference image, motion context, and user actions (keyboard and mouse). The architecture is built around a multi-modal diffusion transformer (MMDiT), with a dedicated action control module for frame-level conditioning.

Key architectural features include:

Autoregressive Generation: The model generates long videos by conditioning each segment on the last $k$ frames of the previous segment, concatenated with a binary mask to indicate valid motion information. This design maintains local temporal consistency and mitigates error accumulation.
Action Injection: Discrete keyboard actions and continuous mouse movements are encoded and aligned with latent tokens using group operations and cross-attention mechanisms. Classifier-free guidance is applied to both motion and action signals during training to improve robustness.
Flow Matching Training: The model is trained with the rectified flow loss, chosen for efficient convergence and sampling relative to traditional DDPM training.
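The autoregressive scheme in the list above can be sketched as a loop that hands each new segment the last $k$ frames of the previous one, together with a binary mask marking which context slots hold valid motion information. This is a toy illustration, not the authors' implementation: `motion_condition`, `rollout`, and `toy_generator` are hypothetical names, frames are plain lists of floats standing in for latents, and two choices are assumptions made here (the first segment receives zero-filled context with an all-zero mask, and the reference image is passed to every segment).

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # stand-in for one latent frame


def motion_condition(prev_segment: Sequence[Frame], k: int,
                     frame_dim: int) -> Tuple[List[Frame], List[int]]:
    """Return k context frames plus a binary mask (1 = valid motion frame)."""
    tail = list(prev_segment[-k:]) if (prev_segment and k > 0) else []
    pad = k - len(tail)
    # Zero-fill missing slots at the front and mark them invalid in the mask.
    frames = [[0.0] * frame_dim for _ in range(pad)] + [list(f) for f in tail]
    mask = [0] * pad + [1] * len(tail)
    return frames, mask


def rollout(generate_segment: Callable[[Frame, List[Frame], List[int], Sequence[float]], List[Frame]],
            reference: Frame,
            actions_per_segment: Sequence[Sequence[float]],
            k: int) -> List[Frame]:
    """Generate a long video segment by segment, chaining motion context."""
    video: List[Frame] = []
    prev: List[Frame] = []
    for actions in actions_per_segment:
        context, mask = motion_condition(prev, k, len(reference))
        prev = generate_segment(reference, context, mask, actions)
        video.extend(prev)
    return video


def toy_generator(reference: Frame, context: List[Frame], mask: List[int],
                  actions: Sequence[float]) -> List[Frame]:
    """Toy stand-in for the diffusion model: each action shifts the frame."""
    # Continue from the last valid context frame, else from the reference.
    start = context[-1] if (mask and mask[-1] == 1) else reference
    frames: List[Frame] = []
    current = list(start)
    for a in actions:
        current = [x + a for x in current]
        frames.append(current)
    return frames


if __name__ == "__main__":
    print(rollout(toy_generator, [0.0], [[1, 1], [2, 2]], k=1))
```

In the toy run, the second segment continues from the final frame of the first rather than restarting from the reference, which is the behavior the motion context is meant to provide.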

Training proceeds in two stages: (1) large-scale pretraining on unlabeled data for world understanding, and (2) action-labeled fine-tuning for controllable generation. The model is initialized from HunyuanVideo I2V weights, with the text branch replaced by an image branch to focus on visual grounding.
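For reference, the rectified flow objective is commonly written as follows. The notation is an assumption made here, not taken from the paper: $x_1$ is a clean video latent, $x_0$ is Gaussian noise, $t$ is a time in $[0, 1]$, $c$ collects the conditions (reference image, motion context, and actions), and $v_\theta$ is the velocity predicted by the transformer. The paper's own convention, including the direction of time and how $t$ is sampled, may differ.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \in [0, 1]

\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{x_0,\, x_1,\, t}
  \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|_2^2
```

Because $x_t$ is a straight-line interpolation, its time derivative is the constant $x_1 - x_0$, which is exactly the regression target in the loss.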

GameWorld Score: Unified Benchmark

The GameWorld Score benchmark is introduced to provide a comprehensive, multi-dimensional evaluation of world models in interactive settings. It decomposes performance into eight dimensions across four pillars:

Visual Quality: Frame-wise image quality (MUSIQ) and aesthetic appeal (LAION predictor).
Temporal Quality: Temporal consistency (CLIP feature similarity) and motion smoothness (frame interpolation error).
Action Controllability: Keyboard and mouse accuracy, measured via an Inverse Dynamics Model trained on large-scale Minecraft data.
Physical Rule Understanding: Object consistency (DROID-SLAM reprojection error) and scenario consistency (MSE under symmetric camera motions).
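Two of the dimensions above reduce to simple statistics once the upstream models have run. The sketch below assumes per-frame feature vectors (for example, from a CLIP image encoder) and per-frame action labels predicted by an Inverse Dynamics Model are already available. The function names and the choice to average over consecutive frame pairs are assumptions for illustration; the benchmark's exact aggregation may differ.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def temporal_consistency(features: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity between consecutive per-frame feature vectors."""
    if len(features) < 2:
        raise ValueError("need at least two frames")
    sims: List[float] = [cosine(a, b) for a, b in zip(features, features[1:])]
    return sum(sims) / len(sims)


def action_accuracy(intended: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of frames where the predicted action matches the intended one."""
    if len(intended) != len(predicted) or not intended:
        raise ValueError("sequences must be non-empty and of equal length")
    hits = sum(1 for i, p in zip(intended, predicted) if i == p)
    return hits / len(intended)
```

In practice `predicted` would come from running the Inverse Dynamics Model on the generated video, so this accuracy measures whether the generated frames visibly reflect the requested action, not whether the request itself was recorded correctly.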

This benchmark enables rigorous, standardized comparison of models in terms of perceptual fidelity, controllability, and physical plausibility.

Experimental Results

Matrix-Game demonstrates strong empirical performance across all GameWorld Score dimensions, outperforming prior open-source baselines (Oasis, MineWorld) in both quantitative metrics and double-blind human evaluations. Notably:

Action Controllability: Achieves >88% accuracy on all keyboard actions and >89% on all mouse directions, with particularly high precision in fine-grained controls.
Physical Consistency: Substantial improvements in object and scenario consistency, indicating better modeling of spatial and temporal coherence.
Visual and Temporal Quality: Higher image quality and aesthetic scores, with smooth, flicker-free motion across long sequences.

The model generalizes robustly across diverse Minecraft biomes and procedurally generated Unreal Engine scenarios, maintaining high controllability and visual fidelity.

Implementation Considerations

Computational Requirements: Training a 17B-parameter model on high-resolution (720p) video data requires substantial GPU resources, mixed-precision training, and sharded data parallelism such as FSDP (Fully Sharded Data Parallel).
Data Curation: The hierarchical filtering pipeline and action-labeling strategies are critical for dataset quality and model generalization.
Inference: The autoregressive generation strategy enables long-form video synthesis, but care must be taken to manage error accumulation and maintain temporal consistency.
Deployment: The model and benchmark are open-sourced, facilitating reproducibility and further research in interactive world modeling.

Limitations and Future Directions

The authors identify two primary limitations: (1) reduced generalization in visually rare or structurally complex scenarios due to limited data coverage, and (2) incomplete modeling of physical interactions, such as object collisions. Addressing these will require expanded, physics-aware datasets and architectural enhancements for long-term temporal consistency and richer action spaces.

Future work includes:

Extending to more complex environments beyond Minecraft (e.g., Black Myth: Wukong, CS:GO).
Enriching the action space for finer-grained control.
Incorporating memory-based mechanisms for improved long-range temporal coherence.

Implications

Matrix-Game establishes a new standard for interactive, controllable world modeling in open-ended environments. The combination of large-scale, balanced datasets, a scalable and modular architecture, and a unified evaluation protocol provides a robust foundation for future research in embodied AI, generative game engines, and agent-environment simulation. The open-sourcing of model weights and benchmarks is likely to accelerate progress in this domain, enabling broader adoption and comparative analysis.

The approach demonstrates that high-capacity, action-conditioned video diffusion models can achieve precise, physically consistent, and visually compelling world generation, paving the way for more general-purpose, interactive AI systems capable of understanding and manipulating complex virtual environments.
