OmniWorld-Game: 4D World Modeling Dataset

Updated 16 September 2025
  • OmniWorld-Game is a large-scale, multi-modal dataset that provides synchronized RGB, depth, camera, text, optical flow, and mask annotations for 4D world modeling.
  • It comprises over 96,000 synthetic video clips capturing detailed spatial-temporal dynamics across diverse indoor and outdoor settings.
  • The dataset supports benchmark tasks such as depth estimation, camera-controlled video generation, and motion analysis, driving state-of-the-art advancements.

The OmniWorld-Game dataset is a large-scale, multi-modal, and highly dynamic resource developed specifically for advancing 4D world modeling. It is the principal self-collected component of the broader OmniWorld dataset collection and is designed to support the development and evaluation of models that simultaneously capture spatial geometry and temporal dynamics. By providing dense, temporally consistent, and richly annotated data drawn from realistic game environments, OmniWorld-Game addresses core limitations of prior benchmarks and establishes new standards for diversity, modality coverage, and benchmarking rigor in video-centric, physically realistic simulation.

1. Dataset Composition and Modalities

OmniWorld-Game consists of high-quality synthetic video data systematically captured from game environments that feature both agent and object interactions. The dataset is distinguished by tightly synchronized, multi-modal annotations essential for 4D world modeling.

  • RGB Images: 720p color frames, providing visual appearance and photometric context.
  • Depth Maps: dense, temporally consistent per-frame depth, providing geometric structure for 3D reconstruction.
  • Camera Poses: accurate per-frame camera extrinsics from a two-stage automatic pipeline, providing frame-to-frame geometry and temporal alignment.
  • Text Captions: multi-viewpoint, semi-automated captions (Qwen2-VL-72B-Instruct), providing high-level semantics and scene description.
  • Optical Flow: dense per-frame pixel motion fields, capturing object and camera dynamics.
  • Foreground Masks: segmentation of primary dynamic objects, supporting motion grouping and dynamic object analysis.

Each modality is precisely aligned in time, creating a dataset that supports learning and evaluation across tasks such as 4D geometric reconstruction, motion analysis, video captioning, future prediction, and camera-controlled video generation. Depth maps paired with accurate camera poses facilitate robust spatio-temporal fusion, while the combination of optical flow, text, and foreground masks supports nuanced modeling of dynamic scenes.
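
The source does not detail the on-disk layout of these annotations. As a minimal sketch, the snippet below shows one way such synchronized modalities could be grouped per frame for a training loop; the directory structure, file names, and the OmniWorldFrame/load_frame helpers are illustrative assumptions, not the dataset's actual interface.

```python
from dataclasses import dataclass
from pathlib import Path

import numpy as np


@dataclass
class OmniWorldFrame:
    """One time step of a clip, with every modality aligned to the same frame index."""
    rgb: np.ndarray      # (H, W, 3) uint8 color frame (720p)
    depth: np.ndarray    # (H, W) float32 per-pixel depth
    pose: np.ndarray     # (4, 4) float32 camera extrinsic matrix for this frame
    flow: np.ndarray     # (H, W, 2) float32 forward optical flow in pixels
    fg_mask: np.ndarray  # (H, W) bool mask of primary dynamic objects
    caption: str         # clip-level text description


def load_frame(clip_dir: Path, t: int, caption: str) -> OmniWorldFrame:
    """Load frame t of one clip from a hypothetical per-modality directory layout."""
    # The file naming scheme below is an assumption made for illustration only.
    name = f"{t:05d}.npy"
    return OmniWorldFrame(
        rgb=np.load(clip_dir / "rgb" / name),
        depth=np.load(clip_dir / "depth" / name),
        pose=np.load(clip_dir / "pose" / name),
        flow=np.load(clip_dir / "flow" / name),
        fg_mask=np.load(clip_dir / "mask" / name).astype(bool),
        caption=caption,
    )
```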

2. Scale, Diversity, and Domain Coverage

The dataset achieves exceptional scale and diversity, with over 96,000 synthetic video clips comprising more than 18 million frames (totaling over 214 hours of dynamic RGB-D data). Unlike earlier synthetic datasets such as MPI Sintel or TartanAir, which are limited in duration and in the richness of their annotated modalities, OmniWorld-Game offers both longer temporal sequences (up to 16 seconds per clip, or 384 frames for some tasks) and more comprehensive annotations.

The content scenarios span both indoor and outdoor environments, range across varying lighting conditions (e.g., day/night cycles), and cover a breadth of aesthetic and domain styles, including historically themed or futuristic settings. As part of the larger OmniWorld collection, it complements simulator, robot, human, and Internet domains, thereby strengthening its generalization value for research communities interested in simulating, predicting, or reconstructing visually diverse worlds.
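
These figures are internally consistent. Assuming the 384-frame clips correspond to the 16-second duration (the frame rate itself is not stated), a quick back-of-the-envelope check in Python:

```python
# All figures except fps are quoted from the text; fps is inferred, not stated.
seconds_per_clip = 16.0
frames_per_clip = 384
fps = frames_per_clip / seconds_per_clip      # 24 frames per second, assumed uniform

hours_total = 214.0                           # "over 214 hours of dynamic RGB-D data"
implied_frames = hours_total * 3600 * fps     # ~18.5 million, matching "more than 18 million frames"
print(f"inferred fps: {fps:.0f}, implied frame count: {implied_frames / 1e6:.1f} M")
```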

3. Representation of Dynamic Interactions

Central to OmniWorld-Game is its emphasis on dynamic interaction. Data are sourced from environments where both the camera and multiple in-scene objects are in motion, yielding large-amplitude, high-velocity object movements and diverse, non-trivial camera trajectories. These dynamics differentiate the dataset from prior resources and support the critical learning of temporal consistency, non-rigid motion, and complex interactions.

Foreground masks and optical flow fields are employed to decompose the moving scene into temporally coherent layers, distinguish moving objects from static backgrounds, and support tasks like motion grouping, future trajectory prediction, and temporally consistent depth or object mapping. This comprehensive dynamic coverage is indispensable for advancing methods in 4D modeling, as it enables the exploration of spatio-temporal scenes in a manner much closer to real-world complexity.
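
The section above describes what the annotations enable rather than a specific algorithm. As a sketch of one straightforward use, the helper below (an illustrative assumption, not part of the dataset tooling) combines the foreground mask with per-frame optical flow to separate object motion from camera-induced background motion:

```python
import numpy as np


def split_motion(flow: np.ndarray, fg_mask: np.ndarray):
    """Split per-frame optical flow into dynamic-object and background components.

    flow:    (H, W, 2) forward optical flow in pixels
    fg_mask: (H, W) boolean mask of the primary dynamic objects
    """
    fg_flow = flow[fg_mask]    # motion vectors on moving objects
    bg_flow = flow[~fg_mask]   # remaining motion, dominated by camera ego-motion

    def mean_speed(vectors: np.ndarray) -> float:
        return float(np.linalg.norm(vectors, axis=-1).mean()) if vectors.size else 0.0

    stats = {
        "fg_mean_speed_px": mean_speed(fg_flow),    # average speed of dynamic pixels
        "bg_mean_speed_px": mean_speed(bg_flow),    # average speed of static-scene pixels
        "fg_area_fraction": float(fg_mask.mean()),  # share of the frame covered by moving objects
    }
    return fg_flow, bg_flow, stats
```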

4. Benchmark Tasks and Evaluation Protocols

OmniWorld-Game serves as the basis for benchmarks that systematically probe model performance on key tasks in modern 4D world modeling. Two primary research communities are addressed:

  • 3D Geometric Foundation Models: Tasks include monocular depth estimation and video depth estimation, with an emphasis on capturing temporally consistent 3D structure in the presence of dynamic content.
  • Camera Control Video Generation: Models are tested on their ability to generate realistic videos conditioned on explicit camera trajectories and scene instructions.

Key evaluation metrics include:

  • Absolute Relative Error (Abs Rel):

\text{Abs Rel} = \frac{1}{N} \sum_{i=1}^{N} \frac{|d_i - \hat{d}_i|}{d_i}

where $d_i$ and $\hat{d}_i$ are the ground-truth and predicted depths at pixel $i$, and $N$ is the number of valid pixels.

  • Threshold Accuracy ($\delta < 1.25$): the fraction of pixels for which $\max(d_i / \hat{d}_i, \hat{d}_i / d_i) < 1.25$, i.e., the prediction is within a factor of 1.25 of the ground truth (both depth metrics are sketched in code after this list).
  • Camera Pose Estimation: Assessed via rotational error, translational error, and composite metrics (e.g., CamMC).
  • Video Generation Quality: Evaluated by Fréchet Video Distance (FVD) to capture perceptual video similarity, alongside camera trajectory adherence (RotError, TransError).
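
As a minimal sketch, the two depth metrics can be computed from NumPy arrays of ground-truth and predicted depth as follows (the function names and the positive-depth validity convention are assumptions):

```python
import numpy as np


def abs_rel(gt: np.ndarray, pred: np.ndarray) -> float:
    """Mean absolute relative error over valid (positive ground-truth depth) pixels."""
    valid = gt > 0
    return float(np.mean(np.abs(gt[valid] - pred[valid]) / gt[valid]))


def delta_accuracy(gt: np.ndarray, pred: np.ndarray, threshold: float = 1.25) -> float:
    """Fraction of valid pixels with max(gt/pred, pred/gt) below the threshold."""
    valid = (gt > 0) & (pred > 0)
    ratio = np.maximum(gt[valid] / pred[valid], pred[valid] / gt[valid])
    return float(np.mean(ratio < threshold))
```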

These evaluation protocols are designed to stress-test models on extended sequences, dynamic scenes, and challenging spatial-temporal alignment, exposing the current limitations of state-of-the-art (SOTA) algorithms.

5. Empirical Impact on State-of-the-Art Methods

Fine-tuning existing SOTA approaches on OmniWorld-Game yields quantifiable performance improvements across 4D world modeling benchmarks. For example, methods such as DUSt3R, CUT3R, and MoGe-2 achieve lower mean Abs Rel errors and higher threshold accuracies after training on this dataset. In camera pose estimation, both rotational and translational errors decrease after fine-tuning, indicating that the dataset provides stronger geometric supervision over longer sequences.

For video generation with camera control, models like AC3D exhibit increased fidelity to user-specified trajectories and enhanced visual-temporal coherence. These empirical improvements indicate that the extensive annotation coverage, dynamic content, and rigorous synchronization inherent to OmniWorld-Game provide critical training signals absent from prior resources.

A plausible implication is that such datasets could drive the next wave of 4D world models capable of real-time, holistic physical understanding and prediction.

6. Significance and Vision for 4D World Modeling

OmniWorld-Game, as part of the OmniWorld initiative, is positioned as a catalyst for general-purpose 4D world models and holistic machine perception. By bridging static 3D reconstruction data and temporally rich, annotated video understanding, it enables research into joint spatial and temporal modeling—cornerstones for applications such as predictive simulation, physical scene understanding, and advanced virtual/augmented reality content generation.

The dataset supports research into the simulation of complex environments, accurate prediction of future scene dynamics, the generation of camera-controlled video content, and other tasks requiring nuanced comprehension of object interactions over time. The breadth of modalities, high-quality annotations, and focus on realistic dynamic interactions mark a step change in addressing previous constraints on scale, diversity, and spatio-temporal coverage observed in synthetic benchmarks.

The long-term vision centers on enabling machine systems to capture, reconstruct, and simulate the physical world with precision and robustness, facilitating advances not only in foundational research but also in practical domains such as robotics, autonomous driving, and embodied AI.

7. Concluding Remarks

OmniWorld-Game distinguishes itself through its massive scale, comprehensive multi-modal annotations (including RGB, depth, camera pose, text, optical flow, and foreground masks), and close alignment with the demands of 4D world modeling. By providing a challenging benchmark for geometric prediction and camera-controlled video generation, and demonstrably improving SOTA model performance through fine-tuning, it establishes itself as a foundational resource for the next generation of models bridging spatial geometry and temporal dynamics. Its design and application point toward future research characterized by truly joint spatio-temporal modeling, furthering the holistic understanding of the physical world by artificial agents (Zhou et al., 15 Sep 2025).
