Learning Effective Physical Representations for Video Generation

Develop a principled method for learning a physical representation that can condition video generation models, given that no well-established definition of such a representation exists and no straightforward supervision signal is available for training it.

Background

The paper aims to inject physics-awareness into video generation by learning a physical representation from the input image and using it to guide a diffusion-based image-to-video model. The authors emphasize that the input image contains both explicit physical states (materials, spatial distributions) and implicit physical laws (e.g., gravity), making representation learning a key bridge between physical knowledge and generated dynamics.

However, the authors state that there is no well-established definition of a physical representation for this purpose, which prevents straightforward supervision or the use of off-the-shelf extractors. To address this challenge, they propose a top-down optimization strategy: training a dedicated encoder (PhysEncoder) via reinforcement learning with human feedback, using Direct Preference Optimization (DPO), with the physical plausibility of generated videos as the optimization signal. The authors present this approach as a step toward resolving the open question of how to learn such representations effectively.
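To make the preference-based signal concrete, the following is a minimal sketch of the standard DPO loss for a single pair of samples, here imagined as a more physically plausible video (the "winner") and a less plausible one (the "loser"). It is an illustration under stated assumptions, not the paper's implementation: the function name `dpo_loss`, its arguments, and the default `beta` are hypothetical, and a diffusion-based model would typically replace the exact log-likelihoods with a surrogate.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    logp_win / logp_lose: log-likelihoods of the preferred and rejected
    samples under the model being trained (hypothetical inputs).
    ref_logp_win / ref_logp_lose: the same quantities under a frozen
    reference model.
    beta: strength of the implicit constraint toward the reference model
    (assumed value, not taken from the paper).
    """
    # How much more the trained model favors the winner over the loser,
    # relative to the reference model.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# With no preference relative to the reference, the loss equals log(2).
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
# When the model shifts probability toward the winner, the loss drops.
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)
```

In this framing, gradients of the loss would flow back into the encoder that produces the conditioning representation, so the representation is shaped by which generated videos are judged more physically plausible rather than by direct labels.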

References

However, how to learn an effective physical representation for video generation remains an open question.

— PhysMaster: Mastering Physical Representation for Video Generation via Reinforcement Learning (2510.13809 - Ji et al., 15 Oct 2025) in Section 1 (Introduction)
