- The paper introduces PhysChoreo, a two-stage framework that reconstructs per-part physical fields to enable controllable physics in video generation.
- It employs segmentation priors, soft assignment, and hierarchical cross-attention for fine-grained material property prediction from images.
- Experimental results show improved material accuracy and physical commonsense, outperforming state-of-the-art methods in physical commonsense and semantic alignment while maintaining competitive visual fidelity.
PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding
Motivation and Context
Contemporary image-to-video generation frameworks exhibit limitations in explicit physical plausibility and controllability of generated content. Existing approaches are primarily inductive, learning to mimic visible motion patterns from large-scale data but failing to encode causal knowledge of physics, resulting in sequences that lack behavioral realism—especially under nontrivial or counterfactual manipulations. Previous attempts to integrate physical simulation into the generative process either impose coarse predictions lacking part-level granularity or are constrained by insufficiently flexible simulation mechanisms. The PhysChoreo framework is introduced to address these constraints with a unified architecture capable of reconstructing per-part physical fields from images and delivering temporally-instructed, physically grounded, and controllable video generation.
Figure 1: Overview of the PhysChoreo pipeline, illustrating the transition from image and prompt input to trajectory-based conditional generative video synthesis.
Methodology
PhysChoreo is architected as a two-stage pipeline. The first stage performs part-aware physical property reconstruction, while the second stage synthesizes dynamic video sequences via physics-editable simulation guided by temporally-structured user instructions.
Part-Aware Physics Reconstruction
This module infers fine-grained material fields for objects segmented from the input image, leveraging both global and part-level textual prompts. The pipeline utilizes segmentation priors, positional features, and transformer-based encoding to produce interpretable physical properties at point cloud granularity. Information injection proceeds via soft assignment—aligning part text prompts with spatial features—followed by hierarchical cross-attention for coherence across both global and local semantic levels.
Figure 2: Schematic of the model architecture demonstrating fused feature embedding, soft assignment, cross-attention, and transformer-based part-aware material field prediction.
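To make the information flow concrete, the following PyTorch sketch shows one plausible realization of the soft assignment and hierarchical cross-attention described above, assuming pre-extracted per-point features and text embeddings for the global and part-level prompts. The module layout, dimensions, and names (e.g., PartAwareGrounding) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of part-aware semantic grounding: soft assignment of points to part
# prompts, followed by global-then-part (hierarchical) cross-attention.
# All names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAwareGrounding(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.part_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, point_feats, global_text, part_texts):
        # point_feats: (B, N, D) per-point features (geometry + segmentation prior)
        # global_text: (B, 1, D) embedding of the scene-level prompt
        # part_texts:  (B, P, D) embeddings of the part-level prompts

        # Soft assignment: each point gets a distribution over part prompts.
        logits = torch.einsum('bnd,bpd->bnp', point_feats, part_texts) / point_feats.shape[-1] ** 0.5
        assign = F.softmax(logits, dim=-1)                                # (B, N, P)
        # Inject part semantics as a per-point weighted sum of part embeddings.
        x = point_feats + torch.einsum('bnp,bpd->bnd', assign, part_texts)

        # Hierarchical cross-attention: global context first, then part-level detail.
        g, _ = self.global_attn(x, global_text, global_text)
        x = self.norm1(x + g)
        p, _ = self.part_attn(x, part_texts, part_texts)
        x = self.norm2(x + p)
        return x, assign  # refined per-point features + assignment (later supervised)
```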
Multi-component supervision is applied, comprising:
- Point-wise task loss for semantic and continuous parameter regression.
- Wave-continuity constraints enforcing spatial smoothness in the physically derived wave-speed fields, for both the compressional speed c_p and the shear speed c_s.
- Contrastive regularization to enforce part-wise separability in the space of predicted moduli, improving behavior at part interfaces.
- Cross-entropy alignment between soft assignment distributions and ground-truth part labels.
The net effect is robust, interpretable per-point material and dynamic parameter prediction aligned both textually and geometrically.
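A minimal sketch of how these four supervision terms might be combined into a single training objective, assuming per-point tensors; the loss weights, tensor shapes, and exact form of each term are illustrative assumptions rather than the paper's definitions.

```python
# Illustrative composite objective combining the four supervision terms above.
import torch
import torch.nn.functional as F

def physchoreo_loss(pred_logits, gt_material,   # (N, K) material-model logits, (N,) class ids
                    pred_params, gt_params,     # (N, C) continuous fields (e.g. log E, density)
                    cp, cs, nbr_idx,            # (N,) wave speeds, (N, k) k-NN neighbor indices
                    part_embed, assign,         # (N, D) moduli-space embeddings, (N, P) soft assignment
                    part_labels,                # (N,) ground-truth part ids
                    w=(1.0, 0.1, 0.1, 0.5)):    # assumed weights
    # 1) Point-wise task loss: material classification + continuous parameter regression.
    task = F.cross_entropy(pred_logits, gt_material) + F.l1_loss(pred_params, gt_params)

    # 2) Wave-continuity constraint: c_p and c_s should vary smoothly across neighbors.
    smooth = ((cp.unsqueeze(1) - cp[nbr_idx]) ** 2).mean() \
           + ((cs.unsqueeze(1) - cs[nbr_idx]) ** 2).mean()

    # 3) Contrastive regularization: same-part points attract, cross-part points repel.
    sim = part_embed @ part_embed.T
    same = (part_labels.unsqueeze(0) == part_labels.unsqueeze(1)).float()
    contrast = F.binary_cross_entropy_with_logits(sim, same)

    # 4) Cross-entropy alignment of the soft assignment with ground-truth part labels.
    align = F.nll_loss(torch.log(assign.clamp_min(1e-8)), part_labels)

    return w[0] * task + w[1] * smooth + w[2] * contrast + w[3] * align
```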
Physics-Editable Video Generation
The predicted physical fields initialize object states in physics simulators (e.g., MPM, rigid body), allowing temporal control of parameters such as Young's modulus, density, external forces, and constitutive models. Temporal interventions, including transitions such as liquefaction, deformation, and counter-intuitive motion responses, are directly programmable and propagate through the dynamic simulation. This yields physically plausible motion trajectories that are subsequently used as conditioning for state-of-the-art generative video models.
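One way such temporal interventions could be expressed is as a keyframed parameter schedule applied during rollout. The sketch below is hypothetical: the schedule format and the simulator calls (sim.update_parameters, sim.step) are placeholder names, not PhysChoreo's actual interface.

```python
# Hypothetical keyframed edits: frame index -> parameter overrides to apply.
schedule = {
    0:  {"youngs_modulus": 1e6, "material": "elastic"},   # start as a stiff solid
    40: {"youngs_modulus": 1e3},                          # soften at frame 40
    80: {"material": "fluid"},                            # liquefy at frame 80
}

def run_simulation(sim, particles, num_frames, schedule):
    """Advance the simulator, applying scheduled parameter edits before each step."""
    trajectory = []
    for t in range(num_frames):
        if t in schedule:
            sim.update_parameters(particles, **schedule[t])  # hypothetical simulator call
        sim.step(particles)                                  # one MPM / rigid-body step
        trajectory.append(particles.positions.copy())        # record positions per frame
    return trajectory  # later used as the control signal for the video model
```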
Volumetric completion is achieved by interior particle generation from surface points, maintaining seamless property propagation via k-NN interpolation. The simulation-generated trajectory serves as the video model’s control input, ensuring semantic and physical congruence in the output videos.
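A compact sketch of this completion step, assuming a surface point cloud with per-point properties: interior particles are sampled inside the object's bounds and assigned properties by inverse-distance-weighted k-NN interpolation from the surface points. The interior test and sampling heuristic are simplifications for illustration.

```python
# Volumetric completion sketch: sample interior particles, then propagate
# surface properties via k-NN (inverse-distance weighted) interpolation.
import numpy as np
from scipy.spatial import cKDTree

def complete_volume(surface_pts, surface_props, num_interior=20000, k=8):
    # surface_pts:   (N, 3) surface point cloud
    # surface_props: (N, C) per-point properties (e.g. log E, density, Poisson's ratio)
    lo, hi = surface_pts.min(0), surface_pts.max(0)
    candidates = np.random.uniform(lo, hi, size=(num_interior * 4, 3))

    # Crude interior heuristic: keep candidates that are not too close to the surface.
    tree = cKDTree(surface_pts)
    dist, _ = tree.query(candidates, k=1)
    interior = candidates[dist < np.median(dist)][:num_interior]

    # k-NN interpolation: inverse-distance weighted average of surface properties.
    d, idx = tree.query(interior, k=k)
    w = 1.0 / (d + 1e-8)
    w /= w.sum(axis=1, keepdims=True)
    interior_props = (surface_props[idx] * w[..., None]).sum(axis=1)
    return interior, interior_props
```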
Figure 3: Demonstration of part-level physical property prediction controllable through textual conditions.
Dataset Construction
The authors assembled a large-scale point cloud dataset, coupling hierarchical part segmentation, rich textual annotation, and physically grounded material properties (including modulus, density, Poisson’s ratio). Counterfactual labels were deliberately included to test the model’s ability to infer nonstandard physical semantics, enhancing generalization capability and stress-testing the language-physics interface.
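For concreteness, one dataset entry might resemble the record below; the field names, units, and the specific counterfactual annotation are hypothetical, inferred from the description rather than taken from the released schema.

```python
# Hypothetical example of a single dataset record (field names and units assumed).
sample = {
    "points": "path/to/object.ply",          # point cloud with hierarchical part ids
    "global_caption": "a wooden chair",
    "parts": [
        {"name": "legs",
         "caption": "rigid wooden legs",
         "material": "wood",
         "youngs_modulus_log10": 10.0,       # log10 of E in Pa
         "density": 600.0,                   # kg/m^3
         "poisson_ratio": 0.3},
        {"name": "seat",
         "caption": "seat made of soft jelly",  # counterfactual label stressing the language-physics interface
         "material": "jelly",
         "youngs_modulus_log10": 4.0,
         "density": 1000.0,
         "poisson_ratio": 0.45},
    ],
}
```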
Experimental Results
Physical Property Prediction
Quantitative comparison shows PhysChoreo surpasses NeRF2Physics, PUGS, and Pixie across all evaluation metrics:
- Material model prediction accuracy: 0.789 (Ours) vs. 0.628 (NeRF2Physics)
- Error on log-scale Young's modulus (log E): 0.661 (Ours), the lowest among all compared methods.
The model demonstrates explicit controllability at the part level using text-based instructions, enabling downstream physically faithful simulations.
Physics-Controllable Video Generation
PhysChoreo’s generated videos are evaluated using Gemini-2.5-Pro and direct user studies, with metrics covering semantic alignment (SA), physical commonsense (PC), and visual quality (VQ). The model achieves the best scores on PC and SA, outperforming methods like PhysGen3D, Wan2.2-5B, CogVideoX-3, and Veo 3.1.
Figure 4: Qualitative comparison between PhysChoreo and other state-of-the-art image-to-video models, illustrating physical plausibility and complex dynamic behaviors.
PhysChoreo supports diverse temporal manipulations and robustly realizes cinematic and counterfactual physical behaviors (e.g., liquefaction, collapse, abnormal bounces) while maintaining visual fidelity and coherence.
Figure 5: PhysChoreo-generated sequences showing high visual quality and physical realism under fine-grained physical property control.
Ablation Studies
Component-level ablations verify the effectiveness of the soft assignment, hierarchical cross-attention, and segmentation prior in improving prediction accuracy and training convergence. Removing any of the loss terms (assignment alignment, smoothness, contrastive) decreases accuracy and increases error, substantiating their necessity for robust physical field reconstruction and editability.
Figure 6: Qualitative ablation showing differences in control efficacy among alternative trajectory and conditioning strategies.
Implications and Future Directions
PhysChoreo advances the interface between semantic language conditioning and precise physical modeling in generative frameworks, making a decisive step towards causally robust, counterfactually controllable video synthesis. The part-aware semantic grounding introduces new degrees of fine-grained editability and interpretability, with implications for robotics simulation, physics-based animation, and scientific visualization. However, the current restriction to independently behaving objects and limited internal state inference signals opportunities for scaling to larger, multi-object scenes and enhancing interior property modeling. The unification of semantic and physical parameter editing promises scalable extensions to reasoning, planning, and interactive AI agents.
Conclusion
PhysChoreo provides an effective framework for reconstructing part-wise physical fields from images and leveraging them for physics-grounded, controllable video generation. Its explicit integration of physical simulation and textual semantics sets new standards for realism, editability, and alignment in generative visual models. Continued development should focus on scene-scale generalization and internal state estimation to broaden applicability and fidelity.