- The paper introduces a novel framework that fuses rigid-body physics simulation with deep generative models for realistic image-to-video generation.
- It employs a three-module system—perception, dynamics simulation, and rendering—to accurately model object properties and interactions.
- Evaluation reveals superior physical realism with high human ratings and improved Motion-FID scores compared to state-of-the-art methods.
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation
PhysGen presents an image-to-video generation framework designed to produce realistic, physically plausible videos from a single image. The method integrates model-based physical simulation with data-driven video generation, and comprises three primary modules: perception, dynamics simulation, and rendering. These modules operate as a sequential pipeline, together ensuring both the physical accuracy and the visual coherence of the generated videos.
Approach
Perception Module
The perception module enables the system to understand the physical, geometric, and material properties of objects within the image. This is achieved using large pre-trained models such as GPT-4V and Grounded-SAM for object recognition and segmentation. The module also reasons about physical properties (e.g., mass, friction, and elasticity) using visual prompting techniques. The segmented objects are then converted into vectorized shape primitives, either circles or polygons, for physical simulation, as sketched below. Additionally, albedo and normal maps, as well as scene lighting parameters, are estimated to facilitate realistic rendering.
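To make the shape-vectorization step concrete, here is a minimal sketch that converts a binary segmentation mask (such as one produced by Grounded-SAM) into a simplified polygon using OpenCV contour approximation. This is a plausible implementation of the idea, not the paper's code; the `mask_to_polygon` name and the pixel tolerance are illustrative assumptions.

```python
import cv2
import numpy as np

def mask_to_polygon(mask: np.ndarray, tol: float = 2.0) -> np.ndarray:
    """Vectorize a binary segmentation mask into a simplified polygon.

    Hypothetical sketch of the shape-primitive step; `tol` is the maximum
    contour deviation in pixels, an illustrative choice.
    """
    # Extract outer contours of the mask's nonzero regions.
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Keep the largest region; small fragments are usually segmentation noise.
    contour = max(contours, key=cv2.contourArea)
    # Reduce the contour to a coarse polygon suitable for a physics engine.
    poly = cv2.approxPolyDP(contour, tol, True)
    return poly.reshape(-1, 2)  # (N, 2) array of image-space vertices
```

A near-circular mask could instead be fit with `cv2.minEnclosingCircle` to obtain the circle primitive mentioned above.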
Dynamics Simulation Module
The dynamics simulation module uses a 2D rigid-body physics simulator to model object motion based on the inferred physical properties and user-specified initial conditions (e.g., force and torque). The simulation operates in image space, which keeps computation lightweight while maintaining a high degree of physical realism. Each object's motion is computed by integrating the forces and torques acting on it, and the simulation accounts for interactions such as collisions, friction, and elasticity.
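To illustrate what such an image-space rollout can look like, the sketch below uses the open-source 2D engine Pymunk. The engine choice, the property values, and the impulse magnitude are illustrative assumptions, not a description of PhysGen's exact implementation.

```python
import pymunk

# Hedged sketch of an image-space rigid-body rollout; all values are
# illustrative, not taken from PhysGen.
space = pymunk.Space()
space.gravity = (0.0, 900.0)  # pixels/s^2; image y-axis points down

# One perceived object: a polygon primitive with inferred properties.
vertices = [(0, 0), (80, 0), (80, 40), (0, 40)]
mass = 1.0
body = pymunk.Body(mass, pymunk.moment_for_poly(mass, vertices))
body.position = (100.0, 50.0)
shape = pymunk.Poly(body, vertices)
shape.friction = 0.6    # inferred friction coefficient (assumed value)
shape.elasticity = 0.3  # inferred restitution (assumed value)
space.add(body, shape)

# User-specified initial condition: an impulse at the object's origin.
body.apply_impulse_at_local_point((150.0, 0.0))

# Step the simulation; each step yields an image-space pose.
fps, seconds = 30, 2
trajectory = []
for _ in range(fps * seconds):
    space.step(1.0 / fps)
    trajectory.append((body.position.x, body.position.y, body.angle))
```

The recorded per-frame poses (position and rotation) are exactly the quantities the rendering stage consumes.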
Rendering Module
To create the final video, the rendering module first composites the foreground objects against a static inpainted background. This intermediate video is refined through relighting to account for changes in shading and shadows due to object motion. Finally, a latent diffusion model is employed to enhance the visual realism of the composite video, correcting for any artifacts introduced in earlier steps and ensuring temporal coherence.
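As a minimal sketch of the compositing step only, the function below warps a foreground object (an RGBA cut-out) to its simulated pose and alpha-composites it over the static inpainted background; relighting and the diffusion-based refinement described above are omitted, and all names are illustrative.

```python
import cv2
import numpy as np

def composite_frame(background: np.ndarray, fg_rgba: np.ndarray,
                    x: float, y: float, angle_rad: float) -> np.ndarray:
    """Place one foreground object over the inpainted background.

    Illustrative sketch; relighting and diffusion refinement not shown.
    """
    h, w = fg_rgba.shape[:2]
    # Rotate about the object's center, then translate to the simulated pose.
    M = cv2.getRotationMatrix2D((w / 2, h / 2), np.degrees(angle_rad), 1.0)
    M[:, 2] += (x - w / 2, y - h / 2)
    warped = cv2.warpAffine(fg_rgba, M, background.shape[1::-1])

    # Standard alpha compositing: out = alpha * fg + (1 - alpha) * bg.
    alpha = warped[..., 3:4].astype(np.float32) / 255.0
    rgb = warped[..., :3].astype(np.float32)
    out = alpha * rgb + (1.0 - alpha) * background.astype(np.float32)
    return out.astype(np.uint8)
```

Running this per frame over the simulated trajectory produces the intermediate composite video that the relighting and diffusion stages then refine.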
Results and Evaluation
Visual Comparisons
The generated videos are compared against state-of-the-art image-to-video (I2V) models, including SEINE, DynamiCrafter, and I2VGen-XL. The results demonstrate that PhysGen clearly surpasses these models in generating videos with physically plausible dynamics. Unlike these purely data-driven baselines, PhysGen incorporates explicit physical laws, which yields more accurate and realistic motion sequences.
Human Evaluation
A user study was conducted to evaluate both the physical realism and photorealism of the generated videos. Participants rated the videos on a five-point scale, and PhysGen achieved the highest ratings among all compared methods, with average scores of 4.14 for physical realism and 3.86 for photorealism.
Quantitative Metrics
The system was also evaluated using the Fréchet Inception Distance (FID) and Motion-FID metrics against a corpus of ground-truth videos. PhysGen achieved lower (better) Motion-FID scores than the baseline methods, indicating that it more accurately replicates realistic object motion.
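For context, FID measures the Fréchet distance between Gaussians fitted to deep features of real and generated samples, and Motion-FID applies the same distance to motion features (e.g., optical flow) rather than raw frames; lower is better for both. The sketch below computes the distance from precomputed feature means and covariances, with feature extraction assumed to have been done separately.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1: np.ndarray, sigma1: np.ndarray,
                     mu2: np.ndarray, sigma2: np.ndarray) -> float:
    """Fréchet distance between two Gaussians fitted to feature sets.

    FID / Motion-FID evaluate this on features of generated vs. real
    videos; the feature extractor itself is not shown here.
    """
    diff = mu1 - mu2
    # Matrix square root of the covariance product.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```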
Analysis and Future Directions
PhysGen performs strongly in both human evaluations and standardized quantitative metrics. Its reliance on explicit physical simulation allows it to produce dynamic interactions that purely data-driven generative models fail to replicate accurately.
Implications and Future Work
PhysGen advances both the fidelity and the controllability of image-to-video generation. The approach opens new possibilities for interactive applications, animation, and simulation tasks that require high physical accuracy. Future work could extend PhysGen to handle more complex interactions involving non-rigid bodies and full 3D motion, further broadening its applicability.
Overall, PhysGen represents a significant step towards integrating physics-based reasoning into video generation, providing a robust and flexible framework for creating realistic animations from static images.