- The paper introduces a dual-stream teacher-student architecture that aligns RGB and perceptual modalities for enhanced physical plausibility.
- The methodology employs relation-based representation alignment and a physics-rich annotation pipeline to boost spatio-temporal consistency.
- Empirical evaluations show improved physical consistency and semantic alignment, advancing applications in simulation, content creation, and robust world modeling.
MMPhysVideo: Advancing Physical Plausibility in Video Generation via Joint Multimodal Modeling
Motivation and Problem Statement
While latent video diffusion models (VDMs) have set the state of the art in visual quality, they still struggle to adhere to physical principles, which shows up as semantic ambiguities, inconsistent spatial geometry, and implausible object dynamics. Existing strategies for integrating physics into video generation typically rely either on prompt engineering to encode physical priors or on supervised post-tuning that leverages foundation models, but both remain constrained by fundamental alignment gaps and inefficiencies. Reinforcement learning-based approaches additionally suffer from reward hacking and high training costs. A robust mechanism is therefore needed for explicit and scalable integration of physical content within generative video architectures.
MMPhysVideo Framework: Architectural Innovations
Dual-Stream Bidirectionally Controlled Teacher
MMPhysVideo recasts key perceptual cues (semantics, geometry, spatio-temporal trajectories) into a unified pseudo-RGB format, directly leveraging the compatibility of VDMs with RGB data. The framework introduces a Bidirectionally Controlled Teacher (BCT) which comprises parallel branches for RGB and perceptual modalities, sharing model weights yet operating on independent computational streams. This decoupling addresses inter-modal interference and preserves pre-trained RGB priors. Two sets of learnable task embeddings and zero-initialized bidirectional control links enable progressive pixel-wise alignment and correspondence across modalities, fostering precise physical dynamics modeling. This architecture outperforms conventional channel- or spatial-wise concatenation in both preserving modality fidelity and ensuring fine-grained cross-modal interaction.
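The role of the zero-initialized control links can be illustrated with a minimal sketch. It is not the paper's implementation: the single dense layer standing in for a transformer block, the additive form of the links, and all names (`dual_stream_block`, `link_p2r`, `link_r2p`, `emb_rgb`, `emb_per`) are assumptions made for illustration. The point it shows is that, with links at zero, each stream initially behaves exactly like the shared pre-trained layer applied on its own.

```python
def linear(weights, x):
    """Apply a dense layer; `weights` is a list of rows, `x` is a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [u + v for u, v in zip(a, b)]


def zeros(n):
    """An n x n zero matrix, used to zero-initialize a control link."""
    return [[0.0] * n for _ in range(n)]


def dual_stream_block(rgb_tok, per_tok, shared_w, link_p2r, link_r2p,
                      emb_rgb, emb_per):
    """One hypothetical block: shared weights, separate streams, two links."""
    # Task embeddings tell the shared weights which modality is being processed.
    h_rgb = linear(shared_w, add(rgb_tok, emb_rgb))
    h_per = linear(shared_w, add(per_tok, emb_per))
    # Each stream receives a projected copy of the other stream's feature
    # at the same token position (an assumed, additive form of control).
    out_rgb = add(h_rgb, linear(link_p2r, h_per))
    out_per = add(h_per, linear(link_r2p, h_rgb))
    return out_rgb, out_per
```

Under these assumptions, cross-modal influence starts at zero and grows only as the link weights are trained, which is one way to read the "progressive" alignment described above.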
Representation Alignment Distillation
To enable efficient inference, the BCT is distilled into a single-stream student model via relation-based representation alignment. Rather than regressing raw features, token similarities are aligned across spatial and temporal dimensions using a trainable projector MLP. This approach preserves the dual-stream teacher's spatio-temporal dependencies, enabling the student model to internalize physical priors without ongoing perceptual branch involvement or increased computational burden.
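A relation-based objective of this kind might look like the sketch below. It is an illustration under assumptions, not the paper's loss: cosine similarity, a mean-squared gap between similarity matrices, and the equal weighting of spatial and temporal terms are all choices made here. The trainable projector MLP is omitted, so student features are assumed to be already projected. Features are indexed as `[frame][position][channel]`.

```python
import math


def cosine_sim_matrix(tokens):
    """Pairwise cosine similarity between token feature vectors."""
    norms = [math.sqrt(sum(v * v for v in t)) or 1.0 for t in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj)
         for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]


def relation_loss(student_tokens, teacher_tokens):
    """Mean squared gap between student and teacher similarity matrices."""
    s = cosine_sim_matrix(student_tokens)
    t = cosine_sim_matrix(teacher_tokens)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


def spatio_temporal_relation_loss(student, teacher):
    """Align token relations within frames and across frames."""
    frames, positions = len(student), len(student[0])
    # Spatial relations: tokens that share a frame.
    spatial = sum(relation_loss(student[f], teacher[f])
                  for f in range(frames)) / frames
    # Temporal relations: the same position followed across frames.
    temporal = sum(
        relation_loss([student[f][p] for f in range(frames)],
                      [teacher[f][p] for f in range(frames)])
        for p in range(positions)
    ) / positions
    return spatial + temporal
```

Because only similarity matrices are compared, the student and teacher features need not share a scale or even a channel width, which is one reason a relational target can be easier to match than raw feature regression.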
MMPhysPipe: Scalable Physics-Driven Data Engine
Multi-Step Curation
MMPhysPipe systematically curates physics-rich multimodal video datasets from large-scale text-video sources. First, a Video Quality Assessment (VQA) model filters out clips with artifacts or low dynamics. Next, a vision-language model (VLM; Qwen3-VL) assesses real-world authenticity using stepwise Chain-of-Thought rules on style, context, graphics, and continuity, discarding synthetic or over-processed content.
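The two filtering stages can be sketched as a simple cascade. The function names, the clip representation, and the `min_quality` threshold are hypothetical. `quality_score` and `is_authentic` are stand-ins for the VQA model and the VLM judgment, whose real interfaces are not described here.

```python
def curate(clips, quality_score, is_authentic, min_quality=0.5):
    """Keep clips that pass a quality check and then an authenticity check."""
    kept = []
    for clip in clips:
        if quality_score(clip) < min_quality:
            continue  # dropped: artifact-heavy or low-dynamic clip
        if not is_authentic(clip):
            continue  # dropped: synthetic or over-processed clip
        kept.append(clip)
    return kept


# Usage with toy stand-ins for the two models (assumed fields).
clips = [
    {"id": "a", "quality": 0.9, "real": True},
    {"id": "b", "quality": 0.2, "real": True},
    {"id": "c", "quality": 0.8, "real": False},
]
selected = curate(clips, lambda c: c["quality"], lambda c: c["real"])
```

Running the quality check first means the second check is only invoked on clips that survive it, a design choice that would matter if the authenticity model is the more expensive of the two.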
Physical Richness and Resampling
Physical phenomena are categorized into dynamics, thermodynamics, and optics, with domain-specific primitives scored for intensity. Chain-of-Visual-Evidence (CoVE) provides interpretable reasoning for richness evaluation, mitigating VLM hallucination. To address domain imbalance, multi-label balanced resampling amplifies representation of rare phenomena, ensuring dataset diversity.
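One plausible form of multi-label balanced resampling is sketched below. The exact scheme is not specified above, so the weighting rule (inverse frequency of a sample's rarest label) and the function names are assumptions for illustration only.

```python
import random
from collections import Counter


def resampling_weights(sample_labels):
    """Weight each sample by the inverse frequency of its rarest label."""
    freq = Counter(label for labels in sample_labels for label in labels)
    return [
        max(1.0 / freq[label] for label in labels) if labels else 0.0
        for labels in sample_labels
    ]


def balanced_resample(samples, sample_labels, k, seed=0):
    """Draw k samples with replacement, favoring rare phenomena."""
    rng = random.Random(seed)
    weights = resampling_weights(sample_labels)
    return rng.choices(samples, weights=weights, k=k)
```

With this rule, a clip labeled with both a common and a rare phenomenon is treated as rare, so the multi-label case does not dilute under-represented categories.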
Multimodal Annotation
For high-intensity physical primitives, MMPhysPipe uses the SAM3 model for open-vocabulary segmentation, then applies unified 3D point tracking (SpatialTrackerV2, VGGT) within masked regions to annotate geometry and dense trajectory information. Unified pseudo-RGB videos are generated by merging these multimodal annotations, facilitating continuous information flow from global semantics to local physical granularity.
Empirical Evaluation
Quantitative Benchmarks
Extensive evaluation on physics-centric benchmarks (VideoPhy, PhyGenBench) demonstrates MMPhysVideo’s consistent improvements in Physical Consistency (PC) and Semantic Alignment (SA) across multiple advanced T2V backbones (e.g., CogVideoX-5B, Wan2.1-1.3B). For CogVideoX-5B, MMPhysVideo yields a 5.0% gain on VideoPhy and 6.2% gain on PhyGenBench average PC scores, outperforming specialized SFT (PhyT2V, VideoREPA) and multimodal baselines (OmniVDiff). Gains are preserved even with lighter backbones and smaller datasets, underscoring robustness and generality.
Qualitative Analysis
Generated outputs exhibit superior structural integrity and temporally plausible motion across dynamically complex scenarios (fluid flow, object manipulation, collision). MMPhysVideo reliably captures causal relationships and object interactions, such as maintaining correct spatial alignment in pouring and manipulation scenes, where baseline models consistently fail.
Ablations and Component Analysis
The dual-stream architecture with pixel-wise control links outperforms channel-wise and spatial-wise fusion, reducing inter-modal interference and improving PC/SA scores. Unified multimodal annotation outperforms all single-modality variants because the modalities provide complementary perceptual cues. Component-wise ablations confirm that each module of MMPhysPipe (quality filtering, reality scoring, richness labeling, resampling, multimodal annotation) cumulatively strengthens the dataset and the resulting physical plausibility.
Distillation Effectiveness
Relation-based representation alignment lets the distilled single-stream student reach parity with the dual-stream teacher on physical consistency while being optimized for inference speed. Direct feature regression degrades performance, which supports the choice of relational distillation.
General Quality Benchmarks
Evaluation on VBench confirms MMPhysVideo’s capacity to enhance subject consistency, motion smoothness, and aesthetic/imaging quality. Spatio-temporal structural metrics are particularly improved, indicating that geometric and trajectory cues are effectively internalized.
Implications and Future Directions
MMPhysVideo achieves state-of-the-art results for physically plausible video generation, demonstrating that joint RGB-perception modeling with architecture-level decoupling and explicit relational alignment can overcome the domain gaps inherent in previous schemes for integrating physical priors. Practically, this advance opens avenues for embodied agents, simulation, content creation, and robust world modeling. Theoretically, the framework offers a scalable blueprint for leveraging multimodal perception as an inherent prior in generative models, underscoring the interplay between unified data formats and architectural modularity.
Looking forward, further research may extend MMPhysVideo’s multimodal pipeline to include tactile, auditory, or force-sensor modalities; explore multi-agent physical reasoning; and unify perception-driven distillation across broader generative modeling tasks. Benchmarking against emerging world simulation platforms and real-time robotics environments will clarify practical deployment potential and inform refinement of perception-aligned generative methods.
Conclusion
MMPhysVideo introduces a principled framework for scaling physical plausibility in video generation via joint multimodal modeling. By combining a novel decoupled teacher-student architecture with an automated, physics-rich annotation pipeline, the approach enables VDMs to directly capture complex physical dynamics with explicit alignment. Empirical results validate MMPhysVideo’s consistent superiority across benchmarks and models, redefining the integration of physical priors and perception-driven supervision for generative video systems (2604.02817).