Matrix-Game-MC Dataset for Interactive Minecraft Worlds
- The Matrix-Game-MC dataset is a comprehensive resource that combines over 2,700 hours of unlabeled 720p gameplay with more than 1,000 hours of densely annotated action video.
- It employs a multi-stage pipeline with rigorous quality, content, and motion filtering to ensure temporally coherent and visually realistic data.
- The dataset underpins interactive world modeling by enabling high action controllability and temporal consistency across diverse Minecraft biomes and scenarios.
The Matrix-Game-MC dataset is a large-scale, high-fidelity resource designed to advance interactive world foundation models, with a primary focus on Minecraft. It serves as the backbone for training and evaluating the Matrix-Game model for controllable game world generation, emphasizing action-controllable, temporally coherent, and visually realistic video synthesis conditioned on agent actions and world context (2506.18701).
1. Composition and Structure
Matrix-Game-MC integrates extensive video and action data, organized as follows:
- Unlabeled Video Clips:
Over 2,700 hours of unlabeled 720p gameplay video, sampled from a variety of Minecraft biomes, including forest, desert, icy, mushroom, plains, and others. These clips provide broad coverage of the environmental and visual diversity present in Minecraft, supporting pretraining for general environment understanding.
- Labeled Action Clips:
Over 1,000 hours of gameplay annotated with frame-level control signals, comprising:
- Discrete keyboard actions (move up/down/left/right, jump, attack), recorded as per-frame one-hot vectors.
- Continuous mouse movements (camera pitch and yaw changes), captured per frame.
- Temporal synchronization at 16Hz between video frames and control signals.
- Biome Coverage:
The labeled dataset is explicitly balanced over 14+ biomes; for instance, Desert (7.9%), Icy (6.8%), Mushroom (6.3%), Forest (4.0%), Plains (6.0%), and a "Random" category (14%) ensure diversity and mitigate bias toward common scenarios.
- Supplementary Data:
Additional labeled data is synthesized via Unreal Engine-based simulations, providing richer physics and kinematic annotations.
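The per-frame action records described above pair a discrete key vector with continuous camera deltas at 16 Hz. A minimal sketch of such an encoding is given below; the action vocabulary, function names, and vector layout are illustrative assumptions, not the dataset's actual schema, and a multi-hot vector is used here so simultaneous key presses can be represented.

```python
import numpy as np

# Hypothetical discrete action vocabulary (ordering is illustrative,
# not taken from the dataset specification).
KEYS = ["forward", "back", "left", "right", "jump", "attack"]

def encode_frame(pressed, dyaw, dpitch):
    """Encode one 16 Hz frame: key flags plus continuous camera deltas."""
    key_vec = np.array([1.0 if k in pressed else 0.0 for k in KEYS],
                       dtype=np.float32)
    # Continuous mouse movement, captured as per-frame yaw/pitch change.
    cam_vec = np.array([dyaw, dpitch], dtype=np.float32)
    return np.concatenate([key_vec, cam_vec])

# Example: the agent moves forward while jumping and pans the camera slightly.
frame = encode_frame({"forward", "jump"}, dyaw=1.5, dpitch=-0.3)
print(frame.shape)  # (8,) — 6 key flags + 2 camera deltas
```

Because video and control streams are synchronized at 16 Hz, a clip of N frames yields an aligned (N, 8) action matrix under this layout.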
2. Data Collection, Filtering, and Annotation Pipeline
A multi-stage pipeline is employed to guarantee the visual quality, semantic relevance, and controllability of the dataset:
- Collection Sources:
- Raw videos principally sourced from public repositories such as MineDojo, amounting to an initial pool of approximately 6,000 hours.
- Labeled action data is generated by curriculum-guided VPT agents in extended MineRL environments, complemented by procedurally generated scenarios in Unreal Engine.
- Segmentation and Preprocessing:
- Scene transitions detected via TransNet V2, followed by FFmpeg-based segmentation.
- Conversion to standardized libx264 format, dropping boundary frames to eliminate transition artifacts.
- Hierarchical Filtering:
- 1. Quality Filtering:
- DOVER scoring enforces basic technical video quality.
- The LAION aesthetic predictor enforces visual quality criteria.
- 2. Content Filtering:
- Removal of non-game scenes: an inverse dynamics model (IDM) detects menus, CRAFT filters subtitle text, and DeepFace excises human faces.
- 3. Motion & Camera Filtering:
- GMFlow estimates optical flow, enforcing sufficient but smooth motion.
- Per-frame camera yaw/pitch change is capped at a fixed threshold, ensuring stable viewpoint transitions as verified again by the IDM.
- 4. Control Quality:
- Exclusion of non-gameplay frames and prevention of artifacts via engine customizations (e.g., disabling frustum-based chunk loading; terminating recording on menu triggers).
- Temporal and spatial alignment of action signals with visual frames.
- Scenario Balance:
Manual and procedural curation ensures uniform scene and biome representation.
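The hierarchical filtering stages above can be sketched as a chain of per-clip predicates. This is a simplified illustration, not the paper's implementation: the metadata fields mirror the named tools (DOVER, the LAION aesthetic predictor, GMFlow, the IDM), but every threshold value below is an assumed placeholder, since the source does not specify them.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    dover_score: float    # DOVER-style technical quality score
    aesthetic: float      # LAION-aesthetic-style score
    mean_flow: float      # mean optical-flow magnitude (GMFlow-style)
    max_cam_delta: float  # largest per-frame yaw/pitch change (degrees)
    has_menu: bool        # IDM-detected menu / non-game frames
    has_text: bool        # subtitle or overlay text detected

def passes_filters(c, min_quality=0.5, min_aesthetic=4.0,
                   flow_range=(0.5, 20.0), cam_cap=15.0):
    """Apply the stages in order; all thresholds are illustrative."""
    # Stage 1: quality filtering.
    if c.dover_score < min_quality or c.aesthetic < min_aesthetic:
        return False
    # Stage 2: content filtering (menus, subtitles).
    if c.has_menu or c.has_text:
        return False
    # Stage 3: motion must be sufficient but smooth, camera motion capped.
    lo, hi = flow_range
    if not (lo <= c.mean_flow <= hi):
        return False
    return c.max_cam_delta <= cam_cap

clip = Clip(0.8, 5.1, 3.2, 6.0, False, False)
print(passes_filters(clip))  # True for this in-range example
```

In practice each stage would run as a separate batch pass over the clip pool, so cheap checks prune the ~6,000-hour raw collection before expensive ones run.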
3. Role in Model Training and Evaluation Paradigm
Matrix-Game-MC underpins the two-stage training pipeline for the Matrix-Game world model:
- Stage 1 (Unlabeled Pretraining):
The model is exposed to the full range of Minecraft environments and motions via the unlabeled videos, learning spatial-temporal representations and general environmental dynamics without action supervision.
- Stage 2 (Action-Conditioned Training):
- The model is fine-tuned on the labeled action clips, learning to condition video generation on the per-frame keyboard and mouse signals.
- Balanced scenario exposure during this phase is critical to robust generalization, preventing overfitting to predominant locations or agent behaviors.
- The labeled data also serves as ground-truth for controllability metrics during evaluation.
- Evaluation:
Labeled videos are used for metrics such as action controllability (accuracy of generated action traces), GameWorld Score (covering visual, temporal, physical rule comprehension), and temporal/spatial consistency, employing modules such as IDM for analysis.
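One controllability metric of this kind can be sketched as per-frame agreement between actions inferred from generated video (e.g., by an IDM) and the ground-truth labels. The function below is a simplified stand-in under that assumption; it is not the GameWorld Score implementation, and the exact-match criterion is an illustrative choice.

```python
import numpy as np

def keyboard_accuracy(pred_actions, true_actions):
    """Fraction of frames whose inferred key vector exactly matches ground truth.

    pred_actions: (N, K) key vectors inferred from generated video (IDM-style).
    true_actions: (N, K) ground-truth key vectors from the labeled dataset.
    """
    pred = np.asarray(pred_actions)
    true = np.asarray(true_actions)
    return float(np.mean(np.all(pred == true, axis=1)))

true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])
pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]])
print(keyboard_accuracy(pred, true))  # 0.75 — 3 of 4 frames match
```

Continuous mouse control would instead be scored with a tolerance on yaw/pitch deltas rather than exact matching.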
4. Comparative Position and Distinctive Features
Matrix-Game-MC significantly extends prior datasets along multiple axes:
| Feature | Matrix-Game-MC | Prior datasets (e.g., OASIS, MineWorld) |
|---|---|---|
| Unlabeled video | 2,700+ hours | ~1,000–2,000 hours |
| Labeled action data | 1,000+ hours, fine-grained | ≤ 200 hours, often coarse-grained |
| Biome coverage | 14+ (explicitly balanced) | Few or unbalanced |
| Action labels | Frame-accurate, both discrete (keyboard) and continuous (mouse) | Often keyboard only, not continuous |
| Physics/kinematics labels | Present (Unreal sim) | Not typically provided |
Key advances include:
- High-quality filtering, yielding significant gains in visual and temporal consistency.
- Dense, temporally synchronized action labeling for both agent movement and camera orientation.
- Coverage of rare or complex scenarios, facilitating model generalization.
5. Empirical Impact on Modeling
Matrix-Game-MC has enabled state-of-the-art performance in the controllable game world generation setting:
- Controllability:
Models trained on the dataset achieve keyboard and mouse action accuracies of ≥ 0.95, a substantial improvement over prior scores of 0.86–0.87 (keyboard) and 0.56–0.64 (mouse) reported for MineWorld and OASIS.
- Visual and Scenario Consistency:
Achieves high object and scenario consistencies (e.g., 0.76 object, 0.93 scenario), supporting physical plausibility and temporal stability in generated sequences.
- Generalization:
Balanced coverage across biomes and motion types supports transfer to unseen scenarios and rare world compositions.
- Temporal Coherence:
Rigorous camera-motion constraints and artifact avoidance result in reduced flickering, jump-cuts, and scene inconsistencies.
6. Applications and Future Directions
Matrix-Game-MC provides a foundation for research in multiple areas:
- Interactive World and Agent Modeling:
Enables high-fidelity simulation of user interactions, agent navigation, and long-horizon planning conditioned on multi-modal agent controls.
- Evaluation Benchmarking:
The dataset is integral to GameWorld Score, a comprehensive benchmark for action controllability, physical rule understanding, and visual/temporal quality.
- Physics-Aware Generation:
Supplementary kinematics and physics annotations unlock investigation into physically consistent video synthesis.
- Transferability:
The collection and filtering pipeline are adaptable to other simulation environments, facilitating the extension to new game genres and even real-world robotics scenarios.
- Complex Action and Memory Modeling:
The length and density of data support research on long-horizon behavior, composite actions, and world models requiring memory.
7. Technical Highlights
- Action encoding:
Discrete keyboard actions are encoded as one-hot vectors; continuous mouse movements as per-frame yaw/pitch deltas.
- Temporal alignment:
Actions, video, and additional physics/kinematic ground-truth labels are synchronized at 16Hz.
- Filtering criteria:
Per-frame camera angle change is capped at a fixed threshold; object consistency is measured by 3D reprojection error across co-visible frames.
- Balanced sampling:
Explicit sampling quotas and procedural generation to ensure robust coverage by biome and scenario.
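Quota-based sampling of the kind described above can be sketched as a weighted draw over biomes. The quota values below are the percentages reported in Section 1; folding all unlisted biomes into a single "Other" bucket is an assumption for illustration.

```python
import random

# Biome quotas from the dataset description; remaining probability
# mass is assigned to the other (unlisted) biomes.
QUOTAS = {"Random": 0.14, "Desert": 0.079, "Icy": 0.068,
          "Mushroom": 0.063, "Plains": 0.060, "Forest": 0.040}

def sample_biome(quotas, rng=random):
    """Draw one biome according to explicit quotas."""
    names = list(quotas) + ["Other"]
    weights = list(quotas.values()) + [1.0 - sum(quotas.values())]
    return rng.choices(names, weights=weights, k=1)[0]

# Empirical check: the realized mix tracks the quotas.
random.seed(0)
counts = {}
for _ in range(10_000):
    b = sample_biome(QUOTAS)
    counts[b] = counts.get(b, 0) + 1
print(counts["Random"] / 10_000)  # ≈ 0.14
```

The same mechanism extends naturally to per-scenario quotas, which is how a curation pipeline can mitigate bias toward common scenes.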
Matrix-Game-MC establishes a new standard for Minecraft datasets focused on the research and development of controllable, interactive world models and presents methodologies, scale, and diversity not matched by previous resources (2506.18701).