PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Published 31 Dec 2025 in cs.CV | (2512.24551v1)

Abstract: Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods mainly based on graphics or prompt extension struggle to generalize beyond simple simulated environments or learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO

Summary

The paper introduces PhyGDPO, achieving physically consistent text-to-video generation through a novel physics-aware direct preference optimization framework.
It leverages the PhyAugPipe pipeline and groupwise Plackett-Luce modeling to enrich training data and enhance motion coherence.
The study demonstrates significant empirical improvements and efficiency gains, setting a new baseline for physics-driven video synthesis.

Introduction and Motivation

Text-to-video (T2V) models that combine high visual fidelity with strong physical realism remain an open problem. Recent large-scale T2V generators excel in visual quality but frequently produce videos that violate basic physical laws, especially in complex categories such as human motion, object-object interaction, and physical phenomena. Prior approaches fall into two families: graphics-based simulation and LLM-driven prompt extension. The former generalizes poorly and is impractical in real-world scenarios; the latter inherits the LLM's own weaknesses in physical reasoning and does little to build implicit physics understanding in the generator. Well-curated datasets of physics-rich interactions are also scarce, which further impedes progress.

This work introduces PhyGDPO, a framework for instilling physical consistency in T2V generators. It rests on three technical contributions: (1) PhyAugPipe, a VLM-powered pipeline for constructing large-scale physics-rich training data; (2) a groupwise direct preference optimization (DPO) method that combines Plackett-Luce modeling with physics-guided rewards; and (3) LoRA-Switch Reference, a reference-model mechanism that reduces memory use during training. Together, these broaden data coverage and direct optimization at physical consistency.

Physics-Augmented Data Construction: PhyAugPipe

The data pipeline, PhyAugPipe, is designed to systematically harvest text-video pairs that capture diverse physical interactions and phenomena from large, unstructured corpora. It applies Qwen-2.5-72B-Instruct as a vision-language model (VLM) with chain-of-thought (CoT) parsing to each candidate, decomposing the prompt and frames into objects, materials, actions, and force relationships. From this parse the pipeline assigns a physics-richness score between 0 and 1 and keeps only candidates above a threshold for physical complexity. It also extends prompts automatically with explicit causal reasoning, although these extended prompts are not used in PhyGDPO training itself. Semantics-based action clustering over the filtered set ensures diverse coverage across challenging categories. Finally, reward-driven sampling with a specialized physics-aware VLM (VideoCon-Physics) skews the dataset toward actions where the generator struggles, which concentrates training on the cases with the most to learn.

Figure 1: The PhyAugPipe pipeline for constructing a physics-rich text-video dataset via VLM parsing, CoT reasoning, and physics-aware reward-driven sampling.
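
The filtering and sampling stages described above can be illustrated with a minimal Python sketch. It assumes each candidate already carries a VLM-assigned richness score and an action-cluster label, and that each cluster has a mean physics reward in [0, 1] for the base generator. The 0.5 threshold, the (1 - reward) weighting, the function names, and the toy data are all hypothetical choices for illustration, not the paper's actual rule.

```python
import random
from collections import defaultdict


def filter_by_richness(candidates, threshold=0.5):
    """Keep candidates whose physics-richness score (0 to 1) meets the threshold."""
    return [c for c in candidates if c["richness"] >= threshold]


def cluster_weights(cluster_rewards, eps=1e-6):
    """Turn mean physics rewards per action cluster into sampling probabilities.

    Assumption: rewards lie in [0, 1] and a cluster's weight is proportional
    to (1 - reward), so clusters the generator handles poorly are drawn more often.
    """
    raw = {name: (1.0 - reward) + eps for name, reward in cluster_rewards.items()}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def sample_training_set(candidates, cluster_rewards, n, seed=0):
    """Draw n text-video pairs (with replacement), favoring low-reward clusters."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for c in candidates:
        by_cluster[c["cluster"]].append(c)
    weights = cluster_weights({k: cluster_rewards[k] for k in by_cluster})
    names = list(weights)
    probs = [weights[k] for k in names]
    picked = []
    for _ in range(n):
        name = rng.choices(names, weights=probs, k=1)[0]
        picked.append(rng.choice(by_cluster[name]))
    return picked


# Toy data: richness scores and cluster rewards are made up for illustration.
candidates = [
    {"id": 0, "cluster": "gymnastics", "richness": 0.9},
    {"id": 1, "cluster": "gymnastics", "richness": 0.2},
    {"id": 2, "cluster": "walking", "richness": 0.7},
]
rewards = {"gymnastics": 0.2, "walking": 0.8}
kept = filter_by_richness(candidates)
subset = sample_training_set(kept, rewards, n=4)
```

Sampling by cluster first and by item second keeps a large but easy cluster from dominating the result, which is one plausible reading of "balancing the dataset toward actions where the generator struggles".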

PhyGDPO Framework: Groupwise Physics-Aligned Preference Optimization

The optimization framework of PhyGDPO leverages several key innovations over standard DPO:

Groupwise Plackett-Luce Modeling: Instead of standard Bradley-Terry pairwise preferences, PhyGDPO uses groupwise modeling, allowing alignment with holistic human feedback and capturing global properties such as motion coherence and multi-object interaction. The winning sample in each group is always a real video, and the remaining members are a ranked set of challenging generations for the same condition, so the model is pushed only toward the most physically plausible sample.
Physics-Guided Rewarding (PGR): Rather than only using the generator's likelihood as a reward signal, PhyGDPO incorporates VLM-based physics and semantics adherence scores. Parameters $\gamma_j$ and $\alpha_j$ are dynamically modulated per-group to upweight learning from physics failures, improving the discrimination and correction of non-physical generations.
LoRA-Switch Reference (LoRA-SR): Standard DPO maintains a full reference model, requiring large memory and incurring instability. PhyGDPO instead freezes the backbone and attaches lightweight LoRA modules, toggling between reference and trainable models with a switch. This method dramatically reduces memory footprint and stabilizes preference-guidance, accelerating convergence and improving practical scalability.

Figure 2: Schematic of PhyGDPO’s groupwise DPO framework with LoRA-SR and physics-guided reward mechanisms.
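
To make the groupwise objective concrete, here is a minimal sketch of a Plackett-Luce loss over DPO-style implicit scores. Under the Plackett-Luce model, the probability of a ranking is a product of softmax terms, each selecting the next-best item from those remaining; with a group of two this reduces to the Bradley-Terry form used by standard DPO. The exact way $\gamma_j$ and $\alpha_j$ enter the PhyGDPO loss is not given in this summary, so the `loser_weights` argument below is only a hypothetical stand-in for physics-guided upweighting, and how the log-probabilities are obtained for a video generator is outside the sketch.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def implicit_scores(logp_policy, logp_ref, beta=1.0):
    """DPO-style implicit reward: beta * (log pi_theta - log pi_ref) per sample."""
    return [beta * (p - r) for p, r in zip(logp_policy, logp_ref)]


def plackett_luce_nll(scores):
    """Negative log-likelihood of a full ranking; scores are ordered best to worst."""
    return sum(log_sum_exp(scores[k:]) - scores[k] for k in range(len(scores) - 1))


def top1_nll(scores, loser_weights=None):
    """Negative log-likelihood that scores[0] (the winner) ranks first in its group.

    loser_weights (all > 0) are a hypothetical stand-in for physics-guided
    upweighting: a larger weight makes that loser count more in the denominator.
    """
    losers = scores[1:]
    if loser_weights is None:
        loser_weights = [1.0] * len(losers)
    terms = [scores[0]] + [s + math.log(w) for s, w in zip(losers, loser_weights)]
    return log_sum_exp(terms) - scores[0]


# Group of one winner (index 0) and two generated samples; numbers are made up.
scores = implicit_scores([-1.0, -3.0, -2.5], [-2.0, -2.0, -2.0], beta=0.5)
loss_plain = top1_nll(scores)
loss_weighted = top1_nll(scores, loser_weights=[2.0, 1.0])
```

Because the winner is fixed to be the real video, the top-1 term is the part of the ranking likelihood that directly contrasts real footage with generations; the full ranking loss additionally orders the generations among themselves.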

Empirical Results and Comparative Evaluation

Benchmarking is performed on VideoPhy2 and PhyGenBench, both designed to evaluate physics-grounded generation. When applied to the Wan2.1-T2V-14B model, PhyGDPO shows significant quantitative improvements on physics-centric tasks over state-of-the-art open- and closed-source alternatives, including OpenAI Sora2 and Google Veo3.1.

Hard action (VideoPhy2): PhyGDPO scores 0.0500, compared with 0.0389 for Sora2, 0.0444 for Veo3.1, and 0.0167 for VideoDPO. This is a 4.5× gain over the base model on hard categories.
Physical phenomena (PhyGenBench): PhyGDPO consistently obtains state-of-the-art or co-leading scores across mechanics, optics, thermal, and material tracks, with a notable advantage in mechanics and thermal.
Human evaluation: User studies show that videos generated with PhyGDPO are preferred by up to 94.2% (vs. VideoCrafter2) and 89.4% (vs. VideoDPO) of annotators in head-to-head comparison for physical realism.

Figure 3: Qualitative demonstration across a spectrum of challenging action-driven video categories, showing enhanced physical plausibility and realistic interactions via PhyGDPO.

Figure 4: Outputs on gymnastics and polo; PhyGDPO enforces deformation-free dynamics and realistic contact interactions, surpassing prior models.

Figure 5: Generalization to arbitrary user-input actions; physically accurate racket-ball and body coordination are observed.

Figure 6: Successful modeling of complex phenomena (e.g., light refraction, flame propagation) not captured by baseline generative models.

Component and Ablation Analysis

Through systematic ablation, each subsystem of PhyGDPO is validated:

PhyAugPipe stages: Removing CoT, clustering, or physics-driven sampling diminishes performance, showing that each stage contributes to acquiring challenging and informative data.
Core PhyGDPO mechanisms: Replacing groupwise modeling or PGR with their standard DPO counterparts, or removing LoRA-SR, leads to substantial drops, particularly on the hard physical tracks.
LoRA-SR impact: Achieves up to 44% less GPU memory usage and 60× storage compression, with consistent or superior numerical scores and visual results, compared to vanilla full-model reference approaches.

Figure 7: Visual progression as LoRA-SR, groupwise loss, and physics-guided rewarding are engaged, with notable improvements in adherence to physics, coherence of body pose, and object-object contact accuracy.
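
The LoRA-SR idea evaluated above, one frozen backbone serving as both reference and trainable model, can be sketched with a toy linear layer. This is a minimal illustration under the assumption that the reference is simply the backbone with its adapters switched off; the class, the `reference_mode` helper, and the initialization values are hypothetical and do not reflect the authors' implementation.

```python
from contextlib import contextmanager


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen base weight W plus a trainable low-rank update scale * B @ A."""

    def __init__(self, weight, rank, scale=1.0):
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                               # frozen backbone (out x in)
        self.A = [[0.01] * in_dim for _ in range(rank)]    # trainable (rank x in)
        self.B = [[0.0] * rank for _ in range(out_dim)]    # trainable, zero init (out x rank)
        self.scale = scale
        self.lora_enabled = True                           # the "switch"

    def forward(self, x):
        y = matvec(self.weight, x)
        if self.lora_enabled:
            delta = matvec(self.B, matvec(self.A, x))
            y = [yi + self.scale * di for yi, di in zip(y, delta)]
        return y


@contextmanager
def reference_mode(layer):
    """Temporarily disable the adapter so the layer behaves as the reference model."""
    previous = layer.lora_enabled
    layer.lora_enabled = False
    try:
        yield layer
    finally:
        layer.lora_enabled = previous


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1)
layer.B = [[1.0], [2.0]]            # pretend training has updated the adapter
x = [1.0, 1.0]
policy_out = layer.forward(x)       # backbone plus adapter
with reference_mode(layer):
    reference_out = layer.forward(x)  # backbone only, no second model copy
```

Only `A` and `B` need to be stored per fine-tuned model in a scheme like this, which is the general reason LoRA-style checkpoints are much smaller than full copies of the backbone.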

Theoretical and Practical Implications

PhyGDPO offers a scalable way to align video foundation models with physical laws through explicit reward guidance and preference supervision. Groupwise preference modeling is a theoretically well-motivated generalization of the widely used pairwise schemes. PGR shows the benefit of integrating physics-aware rewarders and suggests a path toward broader behavioral alignment, beyond aesthetic preference, in generative models. LoRA-SR makes DPO post-training practical and efficient for extremely large models.

Practically, the framework enables physics-grounded video synthesis relevant to simulation, gaming, autonomous systems, robotics, and scientific visualization. The dataset construction strategy, combined with preference-based optimization, can be extended to new physical phenomena and to more complex real-world scenes.

Conclusion

PhyGDPO represents an integrated framework for post-training T2V generators to adhere to physical realism, leveraging curated physics-rich data, groupwise preference models, and dynamic physics-guided rewards, all undergirded by efficient memory management. Its strong empirical performance in both automated and human assessments establishes new baselines in physically consistent video generation. Future directions include expansion to causal intervention tasks, actor-conditioned simulation, and real-time generative agents tightly integrated with downstream physical reasoning engines.
