Physics-Based Video Generation & Editing
- Physics-based video generation and editing is a research area that synthesizes video content whose dynamics are governed by physical principles such as conservation of momentum, gravity, and friction.
- It integrates simulation methods, deep generative models, and advanced vision systems to create temporally coherent and photorealistic video outputs.
- Recent advancements highlight effective simulation-guided and prior-distillation pipelines that enable user-controllable, physically plausible video retargeting.
Physics-based video generation and editing is a research frontier that aims to synthesize video content, or edit existing footage, such that the resulting dynamics respect physical laws (conservation of momentum, elasticity, friction, gravity, and complex material behavior) while simultaneously achieving photorealism, temporal coherence, and user-controllable interactivity. The field combines physical simulation (continuum mechanics, rigid-body and deformable/particle-based solvers), deep generative models (especially diffusion-based video priors), and advanced vision systems for semantic and geometric understanding. The resulting systems can animate objects under user-specified actions, change material properties, or manipulate forces in a physically plausible manner. Below, the state of the art is summarized across methodologies, model architectures, controllability paradigms, editing interfaces, benchmarks, and outstanding challenges.
1. Core Methodologies: Simulation-Grounded and Prior-Distillation Pipelines
Physics-based video generation approaches can be categorized by the tightness of coupling between explicit simulation models and generative video priors:
- Simulation-Guided Generation: Methods like PhysDreamer (Zhang et al., 19 Apr 2024), WonderPlay (Li et al., 23 May 2025), PhysGen3D (Chen et al., 26 Mar 2025), Sync4D (Fu et al., 27 May 2024), PhysMotion (Tan et al., 26 Nov 2024), and Phys4DGen (Lin et al., 25 Nov 2024) reconstruct a 3D scene (typically via multi-view optimization, Gaussian splatting, or mesh reconstruction), infer material parameters, and use physically principled solvers (e.g., Material Point Method, rigid-body ODEs, corotated elasticity) to simulate object motion under user-applied forces. The generated particle or mesh trajectories are then rendered (either via differentiable rasterization or by driving a video diffusion model) into temporally coherent, visually high-fidelity video sequences.
- Prior-Distillation and Hybrid Pipelines: PhysDreamer (Zhang et al., 19 Apr 2024) and similar pipelines invert the process: they extract pseudo ground-truth dynamic references from powerful video diffusion models, then optimize physical parameters (typically a spatially varying Young’s modulus field and initial velocities) of the explicit simulation to match these reference motions, effectively distilling the learned motion priors of the data-driven generator into physical parameter fields. A toy version of this parameter-fitting loop is sketched after this list.
- Physics-Conditioned Generative Models: Methods such as PhysCtrl (Wang et al., 24 Sep 2025), PhysChoreo (Zhang et al., 25 Nov 2025), PhysMaster (Ji et al., 15 Oct 2025), VideoREPA (Zhang et al., 29 May 2025), Force Prompting (Gillman et al., 26 May 2025), and VLIPP (Yang et al., 30 Mar 2025) instead learn to condition deep generative video backbones (DDPM/DiT/etc.) directly on explicit physics parameters or force signals. Supervision is provided via massive synthetic simulation datasets or via distillation of physics knowledge from self-supervised video foundation models, reinforcement learning with human feedback, or iterative self-refinement guided by large vision-LLMs (Liu et al., 25 Nov 2025).
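To make the prior-distillation bullet above concrete, the following is a minimal PyTorch sketch of fitting a physical parameter to a reference motion: a single scalar stiffness stands in for the spatially varying Young's modulus field, a differentiable mass-spring rollout stands in for the MPM simulator and renderer, and a synthetic trajectory stands in for the pseudo ground truth extracted from a video diffusion model. None of the names or numbers come from a specific paper.

```python
import torch

# Toy prior-distillation loop: treat a scalar stiffness k as a learnable
# parameter, roll out a differentiable mass-spring simulation, and fit it to a
# "reference" trajectory standing in for diffusion-derived pseudo ground truth.
# Real systems optimize spatially varying Young's modulus fields through a
# differentiable MPM solver and renderer.

def rollout(k, x0=1.0, v0=0.0, dt=0.02, steps=100):
    x, v = torch.tensor(x0), torch.tensor(v0)
    xs = []
    for _ in range(steps):
        a = -k * x              # undamped spring: F = -k x, unit mass
        v = v + dt * a
        x = x + dt * v          # semi-implicit Euler keeps the rollout stable
        xs.append(x)
    return torch.stack(xs)

k_true = torch.tensor(4.0)
reference = rollout(k_true).detach()        # stand-in for diffusion-derived motion

k = torch.nn.Parameter(torch.tensor(1.0))   # initial stiffness guess
opt = torch.optim.Adam([k], lr=0.05)
for it in range(300):
    opt.zero_grad()
    loss = torch.mean((rollout(k) - reference) ** 2)  # analogue of a photometric loss
    loss.backward()
    opt.step()

print(f"recovered stiffness {k.item():.2f} (true {k_true.item():.2f})")
```

The same pattern extends to per-particle stiffness fields and image-space photometric or D-SSIM losses once the simulator and rasterizer are differentiable.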
2. System Architectures and Mathematical Foundations
State-of-the-art systems instantiate a modular pipeline generally comprising the following:
| Stage | Representative Methods/Tools | Core References |
|---|---|---|
| Scene/Geometry | SAM, Grounded-SAM, InstantMesh, 3DGS, LGM | (Zhang et al., 19 Apr 2024, Tan et al., 26 Nov 2024, Chen et al., 26 Mar 2025) |
| Physical Parameters | Material segmentation, LLM reasoning, neural fields | (Lin et al., 25 Nov 2024, Zhang et al., 25 Nov 2025, Chen et al., 26 Mar 2025) |
| Simulation | MPM, rigid-body ODE, PBD, elastoplasticity | (Tan et al., 26 Nov 2024, Zhang et al., 19 Apr 2024, Fu et al., 27 May 2024) |
| Rendering | Differentiable rasterization, diffusion-based enhancement | (Zhang et al., 19 Apr 2024, Tan et al., 26 Nov 2024, Wang et al., 24 Sep 2025) |
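As a toy end-to-end illustration of how these stages compose in a simulation-guided pipeline, the sketch below pushes a point mass with a user force, integrates it under gravity and friction, and hands the trajectory to a renderer stub; real systems substitute MPM or rigid-body solvers for the integrator and differentiable rasterization or a video diffusion model for the renderer. All function names and constants are illustrative.

```python
import numpy as np

# Toy stand-in for a simulation-guided pipeline: a rigid "object" (a point
# mass) is pushed by a user-specified force, integrated under gravity and
# friction, and the resulting trajectory drives a renderer stub.

def simulate(pos, vel, user_force, mass=1.0, mu=0.3, g=9.81, dt=1/30, steps=60):
    """Explicit-Euler integration of a point mass with crude Coulomb friction."""
    traj = []
    for t in range(steps):
        f = np.array([0.0, -mass * g]) + (user_force if t < 5 else 0.0)
        # crude ground contact and sliding friction at y = 0
        if pos[1] <= 0.0 and vel[1] <= 0.0:
            pos[1], vel[1] = 0.0, 0.0
            f[0] -= mu * mass * g * np.sign(vel[0])
        vel = vel + dt * f / mass
        pos = pos + dt * vel
        traj.append(pos.copy())
    return np.stack(traj)

def render_frame(p):
    # Placeholder for differentiable rasterization or a video-diffusion decoder.
    return f"frame with object at ({p[0]:.2f}, {p[1]:.2f})"

trajectory = simulate(pos=np.array([0.0, 1.0]), vel=np.zeros(2),
                      user_force=np.array([4.0, 0.0]))   # a brief sideways poke
video = [render_frame(p) for p in trajectory]
print(video[0], "...", video[-1])
```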
Key Mathematical Models (selected):
- Fixed-corotated energy for elasticity: $\Psi(\mathbf{F}) = \mu \sum_i (\sigma_i - 1)^2 + \frac{\lambda}{2}(J - 1)^2$, where $\sigma_i$ are the singular values of the deformation gradient $\mathbf{F}$ and $J = \det \mathbf{F}$,
with Lamé parameters $\mu = \frac{E}{2(1+\nu)}$ and $\lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}$ derived from Young's modulus $E$ and Poisson's ratio $\nu$.
- MPM substeps: particle-to-grid transfer, grid update via conservation equations, and grid-to-particle transfer (a stripped-down substep is sketched in code after this list).
- Differentiable rendering and photometric/D-SSIM losses for simulation-to-video alignment.
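A stripped-down 1D illustration of that substep cycle (particle-to-grid scatter, grid-side force update, grid-to-particle gather) follows; stress and constitutive terms, APIC/affine transfers, and boundary conditions are omitted, and the grid resolution and time step are arbitrary choices for this sketch.

```python
import numpy as np

# Minimal 1D sketch of one MPM substep: particle-to-grid (P2G) scatter of mass
# and momentum, grid-side momentum update from external forces, and
# grid-to-particle (G2P) gather with advection.

def mpm_substep(x, v, m, n_cells=16, dx=1.0/16, dt=1e-3, gravity=-9.81):
    grid_m = np.zeros(n_cells + 1)
    grid_p = np.zeros(n_cells + 1)           # grid momentum

    # P2G: linear (hat-function) weights to the two nearest grid nodes
    base = np.floor(x / dx).astype(int)
    frac = x / dx - base
    for node, w in ((base, 1.0 - frac), (base + 1, frac)):
        np.add.at(grid_m, node, w * m)
        np.add.at(grid_p, node, w * m * v)

    # Grid update: apply external force (gravity), convert momentum to velocity
    grid_v = np.zeros_like(grid_p)
    active = grid_m > 0
    grid_v[active] = grid_p[active] / grid_m[active] + dt * gravity

    # G2P: gather updated velocities back to particles and advect positions
    v_new = (1.0 - frac) * grid_v[base] + frac * grid_v[base + 1]
    x_new = x + dt * v_new
    return x_new, v_new

x = np.random.uniform(0.2, 0.8, size=64)     # particle positions in [0, 1]
v = np.zeros_like(x)
m = np.full_like(x, 1.0 / 64)
for _ in range(100):
    x, v = mpm_substep(x, v, m)
print(f"mean particle height after 100 substeps: {x.mean():.3f}")
```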
Generative Model Conditioning:
- Physics parameters (force vectors, material labels, elastic/plastic parameters) embedded as input tokens, spatial maps, or ControlNet branches.
- Additional losses may enforce velocity consistency, proximity to explicit physical constraints, or spatiotemporal token-relation alignment (Zhang et al., 29 May 2025).
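A hedged sketch of this conditioning pattern is shown below: a force vector, a material label, and two elastic scalars are embedded into a handful of tokens that latent video tokens cross-attend to. The module names, token layout, and the tiny backbone stand-in are illustrative assumptions, not any published architecture.

```python
import torch
import torch.nn as nn

# Illustrative physics conditioning: a force vector and a discrete material
# label are embedded into tokens that the video backbone's latent tokens
# attend to via cross-attention.

class PhysicsConditioner(nn.Module):
    def __init__(self, d_model=256, n_materials=8):
        super().__init__()
        self.force_mlp = nn.Sequential(nn.Linear(3, d_model), nn.SiLU(),
                                       nn.Linear(d_model, d_model))
        self.material_emb = nn.Embedding(n_materials, d_model)
        # elastic/plastic scalars (e.g., log Young's modulus, Poisson ratio)
        self.param_mlp = nn.Linear(2, d_model)

    def forward(self, force, material_id, elastic_params):
        tokens = torch.stack([self.force_mlp(force),
                              self.material_emb(material_id),
                              self.param_mlp(elastic_params)], dim=1)
        return tokens                        # (batch, 3, d_model) conditioning tokens

class TinyBackbone(nn.Module):
    """Toy stand-in for a DiT/DDPM block: latents cross-attend to physics tokens."""
    def __init__(self, d_model=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, latents, cond_tokens):
        out, _ = self.attn(latents, cond_tokens, cond_tokens)
        return latents + out

cond = PhysicsConditioner()
backbone = TinyBackbone()
latents = torch.randn(2, 64, 256)                       # (batch, tokens, dim)
tokens = cond(force=torch.randn(2, 3),
              material_id=torch.tensor([1, 3]),
              elastic_params=torch.randn(2, 2))
print(backbone(latents, tokens).shape)                   # torch.Size([2, 64, 256])
```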
3. Controllability, Editing, and User Interaction Paradigms
Physics-based frameworks offer editing flexibility that far exceeds traditional video or text-to-video generation:
- Action-Conditional Generation: User-applied forces (point pokes, wind fields), drag points, or torques can be specified spatially and temporally. Pipelines like Force Prompting (Gillman et al., 26 May 2025) offer both local and global force encoding, composable with text and object selection. A toy spatial force-map encoding is sketched after this list.
- Material and Property Editing: Systems such as PhysChoreo (Zhang et al., 25 Nov 2025), PhysDreamer (Zhang et al., 19 Apr 2024), Phys4DGen (Lin et al., 25 Nov 2024), and PhysGen3D (Chen et al., 26 Mar 2025) enable global or part-based assignment or continuous editing of Young's modulus, density, Poisson's ratio, material class, or even constitutive model (elastic, viscoplastic, rigid, fluid). Editing is achieved via direct manipulation, prompt-based semantic modification, or LLM-instructed parameterization.
- Trajectory- and State-Based Editing: VLIPP (Yang et al., 30 Mar 2025) and VideoREPA (Zhang et al., 29 May 2025) allow for modification of explicit object trajectories, collision constraints, or physics priors in the generative loop, supporting retargeting, interaction specification, or physical “fix-ups” via iterative prompt refinement (Liu et al., 25 Nov 2025).
- Real-Time and Interactive Editing: Fast simulation backends, accelerated particle models, and tight integration with user interfaces permit near-instant preview and looped adjustment of actions and materials (Chen et al., 26 Mar 2025, Zhang et al., 25 Nov 2025, Zhang et al., 19 Apr 2024).
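To make the action-conditional interface from the first item concrete, the sketch below rasterizes a local "poke" (a Gaussian footprint around a clicked pixel) plus a global wind vector into a dense per-pixel force map of the kind that can condition a generator or drive a simulator; the field shape, radius, and magnitudes are arbitrary choices for illustration, not a specific paper's encoding.

```python
import numpy as np

# Encode user force prompts as a dense force map over the image plane:
# a localized poke (Gaussian footprint around a click) plus a global wind field.

def force_map(h=64, w=64, poke_xy=(40, 20), poke_dir=(1.0, -0.5),
              poke_radius=5.0, wind=(0.2, 0.0)):
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - poke_xy[0]) ** 2 + (ys - poke_xy[1]) ** 2
    weight = np.exp(-d2 / (2 * poke_radius ** 2))        # local poke footprint
    field = np.zeros((h, w, 2))
    field[..., 0] = weight * poke_dir[0] + wind[0]        # x-component
    field[..., 1] = weight * poke_dir[1] + wind[1]        # y-component
    return field

F = force_map()
print(F.shape, F[20, 40], F[0, 0])   # strongest at the poke, wind-only far away
```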
4. Evaluation: Benchmarks, Metrics, and Experimental Findings
Assessment of physics-based video generation is multi-faceted:
- Human Studies: Two-alternative-forced-choice (2AFC) and Likert-scale user studies evaluate motion realism, physics plausibility, and visual quality relative to curated baselines and synthetic or real ground truth (Zhang et al., 19 Apr 2024, Li et al., 23 May 2025, Tan et al., 26 Nov 2024, Wang et al., 24 Sep 2025, Chen et al., 26 Mar 2025).
- Automated Metrics:
- Physics IQ: aggregate success rate for test suites covering gravity, collisions, fluid/elastic/rigid dynamics; often VLM-judged (Liu et al., 25 Nov 2025, Zhang et al., 29 May 2025, Yang et al., 30 Mar 2025).
- VBench, VideoPhy, VideoPhy2, PBench-Edit: semantic and physical adherence, motion smoothness, and temporal consistency scoring (Tan et al., 26 Nov 2024, Wu et al., 5 Oct 2025).
- Motion-FID, frame-consistency, Chamfer Distance, IoU of 3D or image-space trajectories (Tan et al., 26 Nov 2024, Aira et al., 22 May 2024, Wang et al., 24 Sep 2025).
- Ablation Studies: These demonstrate, for example, the necessity of force-specific keywords in training data (Gillman et al., 26 May 2025), the benefit of displacement losses and delta-velocity adjustment (Fu et al., 27 May 2024), contrastive part regularization (Zhang et al., 25 Nov 2025), and iterative prompt refinement (Liu et al., 25 Nov 2025).
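As one concrete example of the automated metrics above, a symmetric Chamfer distance between a predicted and a reference trajectory can be computed as follows (toy 2D points; published evaluations typically use 3D particle positions or image-space tracks):

```python
import numpy as np

# Symmetric Chamfer distance between predicted and reference trajectories.

def chamfer(a, b):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

t = np.linspace(0, 1, 50)
reference = np.stack([t, 0.5 * 9.81 * t ** 2], axis=1)      # ballistic arc
predicted = reference + np.random.normal(0, 0.02, reference.shape)
print(f"Chamfer distance: {chamfer(predicted, reference):.4f}")
```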
Selected quantitative results:
| Method | Physical Metric | Visual Metric | User Preference (%) |
|---|---|---|---|
| PhysDreamer | FVD ↓: 146.0 | FID ↓: 189.4 | 66% win (vs PhysGaussian) (Zhang et al., 19 Apr 2024) |
| WonderPlay | MotionFid: highest | VBench: top-2 | 70–80% (2AFC study) (Li et al., 23 May 2025) |
| PhysGen | Physical: 4.14/5 | FID: 105.7 | --- |
| PhysChoreo | PC: 4.67 | VQ: 4.67 | 58% (vs Veo 3.1) |
| PhysCtrl | PC: 4.5 | VQ: 4.3 | 81% (phys), 66% (VQ) (Wang et al., 24 Sep 2025) |
5. Material and Interaction Modeling: From Segmentation to Physics Fields
High-fidelity physical response in video editing demands per-object and per-part material recognition, state reconstruction, and parameter assignment:
- Segmentation and Mesh Reconstruction: SAM/Grounded-SAM yields instance/part segmentation; mesh construction via InstantMesh or Gaussian splats; amodal completion and multi-view optimization for occluded or hidden geometry (Tan et al., 26 Nov 2024, Chen et al., 26 Mar 2025, Lin et al., 25 Nov 2024).
- Semantic and Physical Attributes: GPT-4V, CLIP, and custom vision-language pipelines extract both material categories (e.g., "metal leg," "wooden seat") and parametric priors (Young's modulus, density, friction), either globally or at the mesh/point/region level (Lin et al., 25 Nov 2024, Zhang et al., 25 Nov 2025, Chen et al., 26 Mar 2025). A toy label-to-parameter mapping is sketched after this list.
- Physics Fields: Learned neural fields (e.g., triplane MLPs) encode spatially variant stiffness fields, velocity initializations, and part assignment (Zhang et al., 19 Apr 2024, Fu et al., 27 May 2024, Zhang et al., 25 Nov 2025).
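A minimal sketch of this semantic-to-physics step is given below: predicted material labels per segmented part are mapped to order-of-magnitude parameter priors (Young's modulus, Poisson ratio, density) that initialize a per-part physics field. The label set and numeric values are illustrative placeholders, not calibrated constants from any paper.

```python
import numpy as np

# Map VLM-predicted material labels to rough physical parameter priors that
# initialize a per-part physics field. Values are order-of-magnitude placeholders.

MATERIAL_PRIORS = {
    # label: (Young's modulus E [Pa], Poisson ratio nu, density rho [kg/m^3])
    "rubber": (1e6,  0.47, 1100.0),
    "wood":   (1e10, 0.35,  600.0),
    "metal":  (2e11, 0.30, 7800.0),
    "jelly":  (1e4,  0.45, 1000.0),
}

def init_physics_field(part_labels, n_points_per_part=4):
    """Assign (E, nu, rho) to every simulation point of every segmented part."""
    fields = {}
    for part, label in part_labels.items():
        E, nu, rho = MATERIAL_PRIORS[label]
        # spatially constant prior; learned neural fields would refine this per point
        fields[part] = np.tile([E, nu, rho], (n_points_per_part, 1))
    return fields

fields = init_physics_field({"chair_seat": "wood", "chair_cushion": "rubber"})
print({k: v[0] for k, v in fields.items()})
```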
6. Editing, Control, and Interactive Simulation: Capabilities and Limitations
Physics-based frameworks vastly expand the expressivity of video editing compared to text-prompted or unconditional generation:
- Supported Edits: Arbitrary time-dependent external forces, on-the-fly re-parametrization of the material model (e.g., switching from elastic to viscoplastic), constraint editing (pins, welds, collision toggles), per-part text-to-physics translation (prompt-to-property mapping), and user-guided force-field drawing; a declarative example of such an edit request is sketched at the end of this section.
- UI Integration: Editor control panels for object picking, force vector drawing, material sliders, scripting interfaces, and hybrid text/gesture input (Chen et al., 26 Mar 2025, Zhang et al., 25 Nov 2025, Gillman et al., 26 May 2025).
- Limitations:
- Manual segmentation and boundary condition setup may be required (Zhang et al., 19 Apr 2024).
- Simulation cost is non-negligible (e.g., roughly one minute of compute per second of video on a V100 GPU for full MPM) (Zhang et al., 19 Apr 2024, Tan et al., 26 Nov 2024).
- Handling of complex phenomena (fluids, fracture, adhesive contacts) remains an open area; most models are restricted to elastic or simple granular materials (Tan et al., 26 Nov 2024, Lin et al., 25 Nov 2024).
- Failure cases: geometry mis-estimation, inpainting errors, hallucinated deformation for under-constrained settings, occasional physics artifacts due to simulation-discriminator mismatch.
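As referenced in the supported-edits item above, an edit request in such a system might be expressed declaratively along the following lines; the schema and field names are invented for illustration, and real systems expose these controls through prompts, sliders, or force drawing.

```python
# Hypothetical declarative edit specification combining per-part material
# overrides, time-windowed external forces, and a constraint edit.
edit_request = {
    "target_object": "potted_plant",
    "material_overrides": {
        "stem":   {"model": "elastic",      "youngs_modulus": 5e6},
        "soil":   {"model": "granular"},
        "leaves": {"model": "viscoplastic", "yield_stress": 2e3},
    },
    "forces": [
        {"type": "wind", "direction": [1.0, 0.0, 0.2], "magnitude": 3.0,
         "frames": [0, 48]},                  # active for the first 48 frames
        {"type": "poke", "pixel": [212, 340], "impulse": [0.0, -1.5, 0.0],
         "frames": [10, 12]},
    ],
    "constraints": [
        {"type": "pin", "part": "pot", "enabled": True},   # keep the pot fixed
    ],
}
print(sorted(edit_request))
```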
7. Outlook: Directions, Benchmarks, and Open Challenges
Emerging challenges and next steps include:
- Multi-Material and Composite Dynamics: Methods like Phys4DGen (Lin et al., 25 Nov 2024) are pioneering automated assignment of heterogeneous interior/surface properties and composition-aware simulation, but robust recognition and simulation of multi-component objects remain open problems.
- End-to-End Differentiable or RL-Guided Optimization: PhysDreamer (Zhang et al., 19 Apr 2024), PhysMaster (Ji et al., 15 Oct 2025), and VideoREPA (Zhang et al., 29 May 2025) highlight the potential of reinforcement learning, human-in-the-loop feedback, and token relation distillation to improve or directly optimize physical realism in deep generative models.
- Semantic-to-Physics and Language-Conditioned Control: Integration of robust, open-ended semantic parsing (e.g., with GPT-5) and free-form instruction-to-physics mapping dramatically lowers the barrier to realistic, physically plausible video retargeting (Zhang et al., 25 Nov 2025, Yang et al., 30 Mar 2025, Li et al., 23 May 2025).
- Benchmarks and Metrics: Community-wide adoption of physically calibrated benchmarks (Physics-IQ, VideoPhy, PBench-Edit, VBench) and adoption of VLM-based realism raters are essential for rigorous assessment (Liu et al., 25 Nov 2025, Tan et al., 26 Nov 2024, Wu et al., 5 Oct 2025).
Physics-based video generation and editing—characterized by modular simulation-reconstruction pipelines, deep integration of material and force fields, controllable generative priors, and semantic-to-action translation—is positioned to transform the landscape of interactive animation, digital content creation, and virtual world modeling, enabling editability and realism grounded in physical law rather than mere appearance priors. The continued evolution of hybrid architectures and benchmarks will be critical for advancing both physical fidelity and creative expressivity across domains.