ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction (2504.21855v1)

Published 30 Apr 2025 in cs.CV

Abstract: In recent years, video generation has seen significant advancements. However, challenges still persist in generating complex motions and interactions. To address these challenges, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. Specifically, ReVision consists of three stages. First, a video diffusion model is used to generate a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized physical prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos, even in scenarios involving complex actions and interactions. We validate the effectiveness of our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it even outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D physical knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.

Summary

ReVision: Enhancing Video Generation with Explicit 3D Physics Modeling

The paper "ReVision: High-Quality, Low-Cost Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction" introduces ReVision, a plug-and-play framework for video generation in scenarios involving complex motions and interactions. It augments a pretrained conditional video generation model with parameterized 3D physical knowledge, improving the realism and controllability of the generated videos.

ReVision is a three-stage pipeline that injects 3D physical knowledge into an existing model without significant retraining. First, a video diffusion model generates a coarse video from the input conditions. Second, a set of 2D and 3D features is extracted from the coarse video to build a 3D object-centric representation, which a parameterized physical prior model (PPPM) refines into an accurate 3D motion sequence. Third, the refined sequence is fed back into the same video diffusion model as additional conditioning, yielding motion-consistent videos.
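The control flow of the three stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: `revision_pipeline`, the callable signatures, and the toy stand-ins (including the moving-average "refiner" used in place of the learned PPPM) are all hypothetical names and simplifications introduced here.

```python
from typing import Callable, List, Optional

Frame = List[float]         # hypothetical stand-in for an image
Motion = List[List[float]]  # hypothetical: one parameter vector per frame


def revision_pipeline(
    condition: str,
    generate: Callable[[str, Optional[Motion]], List[Frame]],
    extract_motion: Callable[[List[Frame]], Motion],
    refine_motion: Callable[[Motion, str], Motion],
) -> List[Frame]:
    """Coarse video -> refined 3D motion -> regenerated video."""
    coarse = generate(condition, None)              # Stage 1: coarse video
    raw_motion = extract_motion(coarse)             # Stage 2a: object-centric motion
    refined = refine_motion(raw_motion, condition)  # Stage 2b: physical prior (PPPM role)
    return generate(condition, refined)             # Stage 3: same model, extra conditioning


# Toy stand-ins so the sketch runs end to end (all assumptions).
def toy_generate(condition: str, motion: Optional[Motion] = None) -> List[Frame]:
    if motion is None:
        return [[0.0], [0.9], [0.1], [1.1]]  # jittery coarse "video"
    return [list(p) for p in motion]         # follow the conditioning motion


def toy_extract(frames: List[Frame]) -> Motion:
    return [list(f) for f in frames]


def toy_refine(motion: Motion, condition: str) -> Motion:
    # 3-tap moving average as a placeholder for the learned prior.
    n = len(motion)
    out = []
    for t in range(n):
        window = motion[max(0, t - 1): min(n, t + 2)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


final_video = revision_pipeline(
    "a person waving", toy_generate, toy_extract, toy_refine
)
```

The point of the sketch is that the same `generate` callable is invoked twice, once without and once with motion conditioning, which mirrors the plug-and-play design described above.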

ReVision is validated on Stable Video Diffusion (SVD), where it significantly improves motion fidelity and coherence. With only 1.5 billion parameters, it outperforms a state-of-the-art video generation model with over 13 billion parameters on complex video generation. This suggests that incorporating 3D physical knowledge can effectively augment video diffusion models, enabling them to generate videos that more accurately reflect complex real-world dynamics.

Moreover, the framework conditions video generation on explicit 3D motion information. Unlike traditional methods that rely on 2D keypoint trajectories, ReVision uses parameterized 3D models, such as SMPL-X for humans and SMAL for animals, which offer a robust way to model realistic physical interactions. This object-centric representation supports complicated scenes where maintaining physical plausibility is crucial. The PPPM plays a pivotal role by optimizing these 3D features, using motion strength derived from parameter differences across frames and text embeddings for further refinement.
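To make "motion strength derived from parameter differences across frames" concrete, here is one plausible reading, offered as an assumption rather than the paper's definition: the mean L2 norm of the difference between consecutive per-frame parameter vectors. The function name `motion_strength` and the choice of norm are hypothetical.

```python
import math
from typing import Sequence


def motion_strength(params: Sequence[Sequence[float]]) -> float:
    """Mean L2 norm of consecutive-frame parameter differences.

    `params` holds one parameter vector per frame (for example, pose
    parameters of a 3D body model). The exact formula is an assumption.
    """
    if len(params) < 2:
        return 0.0  # a single frame has no motion to measure
    total = 0.0
    for prev, curr in zip(params, params[1:]):
        total += math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    return total / (len(params) - 1)


static = [[0.0, 0.0, 0.0]] * 4
moving = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [3.0, 4.0, 0.0]]

print(motion_strength(static))  # 0.0
print(motion_strength(moving))  # 2.5
```

A scalar like this gives a refinement model a simple signal for how much movement a sequence contains; static sequences score zero and larger frame-to-frame changes score higher.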

In practical terms, ReVision addresses several limitations of current video generation techniques, such as morphological failures and inconsistent object structures. The framework may prove valuable in applications ranging from entertainment to more specialized areas that require high-fidelity simulation of human and animal movement.

Besides practical gains, ReVision also contributes to the understanding of how physical knowledge can be integrated with generative models. By showing that a model with far fewer parameters can outperform a much larger one on complex video generation, it opens new avenues for research in efficient model architectures. This points to a promising direction for future work, in which model size need not determine output quality when physical priors are incorporated effectively.

Looking ahead, challenges remain. The reliance on off-the-shelf 3D mesh models and the potential difficulty of generating high-resolution details, such as fingers or complex object shapes, call for further research. Nevertheless, ReVision is a substantial step toward high-quality, cost-effective video generation and lays the groundwork for future advances in this domain.

Authors (5)
