- The paper introduces a novel framework that fuses rigid-body physics simulation with deep generative models for realistic image-to-video generation.
- It employs a three-module system—perception, dynamics simulation, and rendering—to accurately model object properties and interactions.
- Evaluation reveals superior physical realism with high human ratings and improved Motion-FID scores compared to state-of-the-art methods.
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
PhysGen presents an image-to-video generation framework designed to produce realistic, physically plausible videos from a single image. The method integrates model-based physical simulation with data-driven video generation, and comprises three primary modules: perception, dynamics simulation, and rendering. These modules operate as a sequential pipeline, together ensuring both the physical accuracy and the visual coherence of the generated videos.
Approach
Perception Module
The perception module enables the system to understand the physical, geometric, and material properties of objects within the image. This is achieved using large pre-trained models such as GPT-4V and Grounded-SAM for object recognition and segmentation. The module also reasons about physical properties (e.g., mass, friction, and elasticity) using visual prompting techniques. The segmented objects are then converted into vectorized shape primitives, either circles or polygons, for physical simulation, as sketched below. Additionally, albedo and normal maps, as well as scene lighting parameters, are estimated to facilitate realistic rendering.
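To make the shape-vectorization step concrete, here is a minimal sketch that converts a binary segmentation mask (such as one produced by Grounded-SAM) into a simplified polygon using OpenCV contour approximation. This is a plausible implementation of the idea, not the paper's code; the `mask_to_polygon` name and the pixel tolerance are illustrative assumptions.

```python
import cv2
import numpy as np

def mask_to_polygon(mask: np.ndarray, tol: float = 2.0) -> np.ndarray:
    """Vectorize a binary segmentation mask into a simplified polygon.

    Hypothetical sketch of the shape-primitive step; `tol` is the maximum
    contour deviation in pixels, an illustrative choice.
    """
    # Extract outer contours of the mask's nonzero regions.
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Keep the largest region; small fragments are usually segmentation noise.
    contour = max(contours, key=cv2.contourArea)
    # Reduce the contour to a coarse polygon suitable for a physics engine.
    poly = cv2.approxPolyDP(contour, tol, True)
    return poly.reshape(-1, 2)  # (N, 2) array of image-space vertices
```

A near-circular mask could instead be fit with `cv2.minEnclosingCircle` to obtain the circle primitive mentioned above.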
Dynamics Simulation Module
The dynamics simulation module uses a 2D rigid-body physics simulator to model object motion based on the inferred physical properties and user-specified initial conditions (e.g., force and torque). The simulation operates in image space, which keeps computation lightweight while maintaining a high degree of physical realism. Each object's motion is computed by integrating the forces and torques acting on it, and the simulation accounts for interactions such as collisions, friction, and elasticity.
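To illustrate what such an image-space rollout can look like, the sketch below uses the open-source 2D engine Pymunk. The engine choice, the property values, and the impulse magnitude are illustrative assumptions, not a description of PhysGen's exact implementation.

```python
import pymunk

# Hedged sketch of an image-space rigid-body rollout; all values are
# illustrative, not taken from PhysGen.
space = pymunk.Space()
space.gravity = (0.0, 900.0)  # pixels/s^2; image y-axis points down

# One perceived object: a polygon primitive with inferred properties.
vertices = [(0, 0), (80, 0), (80, 40), (0, 40)]
mass = 1.0
body = pymunk.Body(mass, pymunk.moment_for_poly(mass, vertices))
body.position = (100.0, 50.0)
shape = pymunk.Poly(body, vertices)
shape.friction = 0.6    # inferred friction coefficient (assumed value)
shape.elasticity = 0.3  # inferred restitution (assumed value)
space.add(body, shape)

# User-specified initial condition: an impulse at the object's origin.
body.apply_impulse_at_local_point((150.0, 0.0))

# Step the simulation; each step yields an image-space pose.
fps, seconds = 30, 2
trajectory = []
for _ in range(fps * seconds):
    space.step(1.0 / fps)
    trajectory.append((body.position.x, body.position.y, body.angle))
```

The recorded per-frame poses (position and rotation) are exactly the quantities the rendering stage consumes.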
Rendering Module
To create the final video, the rendering module first composites the foreground objects against a static inpainted background. This intermediate video is refined through relighting to account for changes in shading and shadows due to object motion. Finally, a latent diffusion model is employed to enhance the visual realism of the composite video, correcting for any artifacts introduced in earlier steps and ensuring temporal coherence.
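As a minimal sketch of the compositing step only, the function below warps a foreground object (an RGBA cut-out) to its simulated pose and alpha-composites it over the static inpainted background; relighting and the diffusion-based refinement described above are omitted, and all names are illustrative.

```python
import cv2
import numpy as np

def composite_frame(background: np.ndarray, fg_rgba: np.ndarray,
                    x: float, y: float, angle_rad: float) -> np.ndarray:
    """Place one foreground object over the inpainted background.

    Illustrative sketch; relighting and diffusion refinement not shown.
    """
    h, w = fg_rgba.shape[:2]
    # Rotate about the object's center, then translate to the simulated pose.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), np.degrees(angle_rad), 1.0)
    M[:, 2] += (x - w / 2, y - h / 2)
    warped = cv2.warpAffine(fg_rgba, M, background.shape[1::-1])

    # Standard alpha compositing: out = alpha * fg + (1 - alpha) * bg.
    alpha = warped[..., 3:4].astype(np.float32) / 255.0
    rgb = warped[..., :3].astype(np.float32)
    out = alpha * rgb + (1.0 - alpha) * background.astype(np.float32)
    return out.astype(np.uint8)
```

Running this per frame over the simulated trajectory produces the intermediate composite video that the relighting and diffusion stages then refine.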
Results and Evaluation
Visual Comparisons
The generated videos are compared against state-of-the-art image-to-video (I2V) models, including SEINE, DynamiCrafter, and I2VGen-XL. The results demonstrate that PhysGen clearly surpasses these models in generating videos with physically plausible dynamics. Unlike these purely data-driven baselines, PhysGen incorporates explicit physical laws, which yields more accurate and realistic motion sequences.
Human Evaluation
A user study was conducted to evaluate both the physical realism and photorealism of the generated videos. Participants rated the videos on a five-point scale, and PhysGen achieved the highest ratings among all compared methods, with average scores of 4.14 for physical realism and 3.86 for photorealism.
Quantitative Metrics
The system was also evaluated using the Fréchet Inception Distance (FID) and Motion-FID metrics against a corpus of ground-truth videos. PhysGen achieved lower (better) Motion-FID scores than the baseline methods, indicating that it more accurately replicates realistic object motion.
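For context, FID measures the Fréchet distance between Gaussians fitted to deep features of real and generated samples, and Motion-FID applies the same distance to motion features (e.g., optical flow) rather than raw frames; lower is better for both. The sketch below computes the distance from precomputed feature means and covariances, with feature extraction assumed to have been done separately.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1: np.ndarray, sigma1: np.ndarray,
                     mu2: np.ndarray, sigma2: np.ndarray) -> float:
    """Fréchet distance between two Gaussians fitted to feature sets.

    FID / Motion-FID evaluate this on features of generated vs. real
    videos; the feature extractor itself is not shown here.
    """
    diff = mu1 - mu2
    # Matrix square root of the covariance product.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```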
Analysis and Future Directions
PhysGen performs strongly in both human evaluations and standardized quantitative metrics. Its reliance on explicit physical simulation allows it to produce dynamic interactions that purely data-driven generative models fail to replicate accurately.
Implications and Future Work
PhysGen advances both the fidelity and the controllability of image-to-video generation. The approach opens new possibilities for interactive applications, animation, and simulation tasks that require high physical accuracy. Future work could extend PhysGen to handle more complex interactions involving non-rigid bodies and full 3D motion, further broadening its applicability.
Overall, PhysGen represents a significant step towards integrating physics-based reasoning into video generation, providing a robust and flexible framework for creating realistic animations from static images.