- The paper's main contribution is the DynamicGen pipeline that integrates metric reconstruction, multi-object recovery, and hierarchical captioning for accurate 4D world modeling.
- It reports superior performance in video depth, camera pose, and segmentation tasks, outperforming baselines on datasets like Sintel and TUM-dynamics.
- The framework bridges the synthetic-real gap with scalable in-the-wild data, driving advancements in embodied AI and 4D vision-language models.
DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling
Motivation and Problem Statement
Scaling physically-grounded, multi-modal 4D scene understanding remains a persistent bottleneck for embodied perception, vision-language modeling, and physically-realistic digital twins. The absence of large, truly "world-scale," semantically-rich 4D datasets with metric geometry, motion, and detailed captions constrains the development of generalist AI models capable of operating on dynamic, real-world video. Most existing datasets are restricted to controlled settings (e.g., indoor or driving scenes), rely on synthetic content, or lack critical annotations such as metric depth, camera intrinsics/extrinsics, and detailed language grounding. DynamicVerse addresses this gap with an automated, scalable pipeline, DynamicGen, that extracts holistic 4D multi-modal representations from in-the-wild monocular video and delivers a large-scale, richly annotated benchmark dataset.
Figure 1: Overview of the physically-aware multi-modal world modeling framework DynamicVerse.
DynamicVerse Framework: Data Generation and Curation
DynamicGen, the core pipeline of DynamicVerse, operationalizes a physically-aware 4D world modeling system by combining metric reconstruction, dense multi-object recovery, and hierarchical semantic captioning.
Data Sources and Scale:
DynamicVerse aggregates over 100K web-scale and synthetic 4D video sequences, yielding 800K+ annotated instance masks and 10M+ frames. The diverse sources ensure broad scene and motion coverage and facilitate generalization for downstream foundation models.
Figure 2: Statistics and data sources of DynamicVerse.
DynamicGen Pipeline Structure:
- Video Curation and Filtering: Extensive filtering is required to ensure data utility. DynamicGen trains a Random Forest video-quality predictor on metrics such as depth statistics, blur, focal stability, and camera motion; VLM-based categorical exclusion further refines the curated pool (a minimal filtering sketch follows Figure 3).
- Foundation Model-Annotated Masks and Semantics: Qwen2.5-VL and SA2VA are used in tandem: Qwen2.5-VL proposes object categories and tracks dynamic actors, which are then tightly segmented by SA2VA. The system incorporates per-frame metric-scale depth via UniDepthv2, as well as geometric and tracking features via CoTracker3 and UniMatch.
- Physically-Consistent 4D Geometry Recovery: DynamicGen employs a multi-stage bundle adjustment strategy, segmenting static and non-rigid dynamic points and aligning depth, pose, and scene geometry through global optimization. Unique to this approach is the use of hierarchical dynamic masking and sliding-window flow-based refinement, which ensures both local and global consistency across diverse video content (a simplified alignment sketch follows Figure 3).
- Hierarchical Captioning: Grounded captions are generated at three semantic levels: (i) object (via DAM), (ii) scene (via structured LLM prompting of Qwen2.5-VL), and (iii) camera motion (trajectory-level descriptions via specialized VLMs). Chain-of-Thought (CoT) prompting and iterative human-in-the-loop review ensure fluency, relevance, and detail (a prompting sketch follows Figure 3).
Figure 3: The physically-aware multi-modal 4D data generation pipeline DynamicGen.
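To make the curation step concrete, here is a minimal sketch of a Random Forest clip-quality filter in the spirit of DynamicGen's curation stage; the feature set, labeled data, and acceptance threshold are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a clip-quality filter; feature names, labels, and threshold are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = [
    "depth_mean", "depth_std",   # per-clip metric-depth statistics
    "blur_score",                # e.g., variance of Laplacian, averaged over frames
    "focal_stability",           # variance of estimated focal length across frames
    "camera_motion_mag",         # mean inter-frame pose/flow magnitude
]

def fit_quality_predictor(X_labeled: np.ndarray, y_labeled: np.ndarray) -> RandomForestClassifier:
    """Fit a Random Forest on hand-labeled keep/discard clips; columns of X follow FEATURES."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_labeled, y_labeled)
    return clf

def keep_clip(clf: RandomForestClassifier, clip_features: np.ndarray, thresh: float = 0.5) -> bool:
    """Keep a clip when the predicted probability of being usable exceeds `thresh`."""
    p_keep = clf.predict_proba(clip_features.reshape(1, -1))[0, 1]
    return bool(p_keep >= thresh)
```

In practice such a predictor would be run per clip before the heavier annotation models, with the VLM-based categorical exclusion applied afterwards.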
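The geometry-recovery stage can be illustrated with a heavily simplified alignment sketch: per-frame scale corrections on predicted metric depth are optimized so that unprojected static correspondences agree across frames under the current camera poses. The fixed-pose simplification, tensor shapes, and plain gradient descent are assumptions; the actual pipeline performs multi-stage bundle adjustment with hierarchical dynamic masking and sliding-window flow-based refinement.

```python
# Simplified depth/pose consistency alignment over static correspondences (dynamic points excluded).
import torch

def unproject(uv: torch.Tensor, depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Lift pixels (N,2) with depths (N,) to camera-frame 3D points (N,3)."""
    ones = torch.ones_like(depth)
    rays = torch.linalg.solve(K, torch.stack([uv[:, 0], uv[:, 1], ones], dim=0))  # (3,N)
    return (rays * depth).T

def align_depth_scales(corr, K, poses, num_frames, iters=500, lr=1e-2):
    """corr: list of (i, j, uv_i, uv_j, d_i, d_j) static correspondences between frames i and j.
    poses: dict frame -> 4x4 camera-to-world transform (held fixed in this sketch)."""
    log_s = torch.zeros(num_frames, requires_grad=True)   # per-frame log depth scale
    opt = torch.optim.Adam([log_s], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        loss = 0.0
        for i, j, uv_i, uv_j, d_i, d_j in corr:
            Pi = unproject(uv_i, d_i * log_s[i].exp(), K)         # points in camera i
            Pj = unproject(uv_j, d_j * log_s[j].exp(), K)         # points in camera j
            Pi_w = (poses[i][:3, :3] @ Pi.T).T + poses[i][:3, 3]  # to world frame
            Pj_w = (poses[j][:3, :3] @ Pj.T).T + poses[j][:3, 3]
            loss = loss + (Pi_w - Pj_w).pow(2).sum(dim=1).mean()  # static points should coincide
        loss.backward()
        opt.step()
    return log_s.exp().detach()   # per-frame metric scale corrections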
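The three-level captioning scheme can be sketched as structured prompting. The prompt wording and the generic `vlm` callable below are assumptions; in the paper, object captions come from DAM, scene captions from structured prompting of Qwen2.5-VL, and camera-motion captions from specialized VLMs, followed by CoT refinement and human review.

```python
# Illustrative three-level caption prompting; `vlm` is a hypothetical caption-generation callable.
from typing import Callable, List

def hierarchical_captions(frames: List, object_masks: dict, vlm: Callable[[str, List], str]) -> dict:
    captions = {"objects": {}, "scene": None, "camera": None}
    for obj_id, mask in object_masks.items():
        captions["objects"][obj_id] = vlm(
            f"Describe the appearance and motion of the masked object {obj_id} only.", frames
        )
    captions["scene"] = vlm(
        "Think step by step about the layout, actors, and events, then write one "
        "concise paragraph describing the whole scene.",   # CoT-style instruction
        frames,
    )
    captions["camera"] = vlm(
        "Describe the camera trajectory (pans, tilts, zooms, translation) across the clip.",
        frames,
    )
    return captions
```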
Benchmarks and Results
DynamicVerse establishes new state-of-the-art results on classic physically-grounded video tasks: video depth estimation, camera pose estimation, and camera intrinsics estimation, alongside evaluations of moving-object segmentation and hierarchical captioning.
Video Depth Estimation
DynamicGen demonstrates superior performance versus strong baselines (e.g., MonST3R, Uni4D, Depth-Pro, Metric3D, DepthCrafter) on both Sintel and KITTI. On Sintel, it achieves an absolute relative error (AbsRel) of 0.205 and a δ<1.25 accuracy of 72.9%, outperforming all alternatives on both absolute-error and geometric-consistency metrics. Qualitatively, its reconstructions are denser, less noisy, and capture dynamic object geometry more accurately than prior methods.
Figure 4: Visual comparisons of 4D reconstruction on in-the-wild data.
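For reference, the two depth metrics quoted above follow the standard definitions: absolute relative error (AbsRel, lower is better) and the δ<1.25 inlier ratio (higher is better). A minimal per-frame sketch, with illustrative variable names:

```python
# Standard video-depth metrics over valid pixels of one frame.
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> tuple[float, float]:
    p, g = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(p - g) / g))       # AbsRel: mean |pred - gt| / gt
    ratio = np.maximum(p / g, g / p)
    delta_125 = float(np.mean(ratio < 1.25))          # fraction of pixels within a 1.25x factor
    return abs_rel, delta_125
```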
Camera Pose and Intrinsics Estimation
DynamicGen sets a new standard on absolute trajectory error (ATE) and relative pose error in translation and rotation on both synthetic (Sintel) and real-world (TUM-dynamics) video. On Sintel, it yields an ATE of 0.108 and an RPE (rotation) of 0.282, outperforming both classic and learning-based approaches such as DPVO, LEAP-VO, Robust-CVD, and MonST3R. For focal-length/camera-intrinsics estimation, DynamicGen exhibits lower absolute and relative error than strong single-image and multi-frame methods (e.g., UniDepth, DUSt3R).
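The trajectory metric can likewise be made concrete. ATE is conventionally computed by aligning the estimated camera centers to ground truth with a similarity (Umeyama) transform, as is standard for monocular methods, and reporting the RMSE of the residual translations; the sketch below implements that convention and is not code from the paper.

```python
# Conventional ATE: Umeyama similarity alignment of estimated camera centers, then RMSE.
import numpy as np

def ate_rmse(est: np.ndarray, gt: np.ndarray) -> float:
    """est, gt: (N,3) camera centers for the same timestamps."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E)          # cross-covariance between gt and est (up to 1/N)
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:              # enforce a proper rotation (det = +1)
        S[2, 2] = -1
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (E ** 2).sum()      # similarity scale
    aligned = s * (R @ E.T).T + mu_g                   # estimated centers mapped onto gt frame
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```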
Physically-Aware Segmentation
Leveraging both 2D and geometric priors, DynamicVerse delivers improved moving-object segmentation, producing robust, interpretable masks (an illustrative sketch follows Figure 5).
Figure 5: Qualitative results of moving object segmentation on YouTube-VIS, indicating superior instance-mask precision compared to prior work.
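One common way to combine a 2D motion cue with a geometric prior, in the spirit of what is described above, is to compare the observed optical flow against the flow induced by camera motion alone (from depth, intrinsics, and the relative pose) and flag large residuals as dynamic. The threshold and the exact formulation below are illustrative assumptions, not the paper's precise criterion.

```python
# Illustrative moving-object masking: observed flow vs. camera-induced (rigid) flow.
import numpy as np

def rigid_flow(depth: np.ndarray, K: np.ndarray, T_rel: np.ndarray) -> np.ndarray:
    """Flow (H,W,2) induced by camera motion T_rel (4x4, frame t -> t+1) for a static scene."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    pts = (np.linalg.inv(K) @ pix.T) * depth.reshape(-1)     # 3D points in frame t
    pts2 = T_rel[:3, :3] @ pts + T_rel[:3, 3:4]              # points expressed in frame t+1
    proj = K @ pts2
    uv2 = (proj[:2] / proj[2:]).T.reshape(H, W, 2)
    return uv2 - np.stack([u, v], axis=-1)

def moving_mask(flow_obs: np.ndarray, depth: np.ndarray, K: np.ndarray,
                T_rel: np.ndarray, thresh_px: float = 3.0) -> np.ndarray:
    """Pixels whose observed flow deviates from the camera-induced flow are marked dynamic."""
    residual = np.linalg.norm(flow_obs - rigid_flow(depth, K, T_rel), axis=-1)
    return residual > thresh_px
```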
Hierarchical Captioning
In both human and LLM-as-Judge evaluations (via G-VEval), captions generated by DynamicGen are more accurate, relevant, complete, and concise compared to direct vision-language outputs or prior datasets. Heavy use of hierarchical prompting, per-object grounding, and CoT-based refinement results in highly actionable, contextually nuanced semantic annotations.
Figure 6: Example captions on the DAVIS dataset, demonstrating object, scene, and camera-motion caption granularity.
Practical and Theoretical Implications
DynamicVerse’s scalable and physically-aware methodology has several critical implications: it helps bridge the synthetic-real gap with in-the-wild data at scale, supplies metric-grounded geometry, motion, and language annotations for training embodied-perception and 4D vision-language foundation models, and offers a practical route toward physically-realistic digital twins built from ordinary monocular video.
Limitations and Future Directions
Despite substantial progress, DynamicVerse inherits several limitations:
- Data Bias and Noise: In-the-wild video data still introduces annotation noise and class imbalance, which may bias learning and impair generalization to tail scenarios.
- Computational Overheads: Processing remains costly due to reliance on large foundation models and multi-stage optimization, which limits real-time or resource-constrained deployment.
- Coverage Limits: Even with its scale, DynamicVerse cannot exhaustively model all real-world dynamics or scene structures, especially in extreme occlusion or highly non-rigid scenarios.
Future work should address domain-specific generalization, more efficient annotation by leveraging accompanying sensor data where available, and integration with reinforcement-learning-based embodied agents for closed-loop evaluation and data augmentation.
Conclusion
DynamicVerse, through its DynamicGen pipeline, delivers both a novel methodology for physically-aware 4D multi-modal data extraction and a substantive benchmark dataset. Experimental evidence confirms state-of-the-art performance in core spatial-temporal vision tasks and establishes a scalable framework for holistic 4D understanding. The generality and annotation richness of DynamicVerse make it a critical resource for 4D scene understanding, driving broader advances in foundation models, embodied intelligence, and language-grounded scene reasoning.
Figure 8: DynamicVerse dataset, highlighting annotation density and multimodal diversity.