Overview of "Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics"
In the paper titled "Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics," the authors from the Visual Geometry Group at the University of Oxford present an approach to interactive video generation. The key contribution is Puppet-Master, a motion-conditioned video generative model that synthesizes realistic part-level dynamics from a single image and a sparse set of motion trajectories ("drags"). This is achieved by fine-tuning a large-scale pre-trained video diffusion model and introducing new conditioning mechanisms.
Key Contributions
The contributions of this work can be summarized as follows:
- New Conditioning Architecture:
  - The authors propose an adaptive layer normalization mechanism and drag tokens within cross-attention modules to effectively incorporate motion controls (drags) into the video generation pipeline.
  - An all-to-first attention mechanism creates a shortcut that lets appearance information propagate directly from the conditioning frame to every subsequent frame, mitigating the appearance drift and background degradation otherwise seen during fine-tuning and markedly improving generation quality.
- Curated Dataset for Training:
  - A new dataset, Objaverse-Animation-HQ, containing high-quality part-level motion clips, is curated from the animated assets in Objaverse, a large-scale repository of 3D models, using automatic filtering followed by GPT-4V verification to retain only realistic and meaningful animations.
- Empirical Validation:
  - Puppet-Master is evaluated against existing drag-conditioned methods such as DragNUWA and DragAnything, demonstrating superior performance on multiple benchmarks, including Drag-a-Move and Human3.6M, even in zero-shot settings.
  - It also generalizes well to real images across various categories, outperforming prior methods by generating videos with physically plausible part-level motion that respects the input drags.
Methodology
Overview of Stable Video Diffusion (SVD)
Stable Video Diffusion is the foundation on which Puppet-Master builds. SVD is a pre-trained image-to-video generator that operates in a latent space, synthesizing a short video conditioned on a single reference image. It is trained with a diffusion objective on large-scale Internet video, which equips it with a broad motion prior.
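To make the generation process concrete, here is a minimal, self-contained sketch of EDM-style Euler sampling for a latent image-to-video model. The `denoiser` interface, tensor shapes, and schedule constants are hypothetical stand-ins chosen for illustration, not the actual SVD implementation.

```python
import math
import torch

def euler_sample(denoiser, ref_latent, num_frames=14, steps=25,
                 sigma_max=80.0, sigma_min=0.02):
    """Illustrative Euler sampler over a log-spaced noise schedule."""
    b, c, h, w = ref_latent.shape
    # Every frame starts as pure noise at the highest noise level.
    x = torch.randn(b, num_frames, c, h, w) * sigma_max
    # Log-spaced noise schedule from sigma_max down to sigma_min.
    sigmas = torch.logspace(math.log10(sigma_max), math.log10(sigma_min), steps)
    for i in range(steps - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # The denoiser sees the noisy video, the noise level, and the clean
        # reference frame (SVD injects the reference via concatenation and
        # cross-attention; here it is just an extra argument).
        denoised = denoiser(x, sigma, ref_latent)
        d = (x - denoised) / sigma           # direction toward the data
        x = x + d * (sigma_next - sigma)     # Euler step to the next level
    return x

# Toy usage with a dummy denoiser that simply repeats the reference frame:
denoiser = lambda x, sigma, ref: ref.unsqueeze(1).expand_as(x)
video_latents = euler_sample(denoiser, torch.randn(1, 4, 32, 32))
```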
Drag-Based Conditioning
The core advancement in Puppet-Master is incorporating drag-based control into SVD:
- Drag Encoding:
  - Each drag is encoded with a multi-resolution encoding that captures its spatial and temporal characteristics across frames.
- Drag Modulation:
  - Drag conditioning is introduced via adaptive layer normalization, whose predicted per-channel scales and shifts modulate intermediate features during generation.
- Drag Tokens:
  - Drag tokens appended to the cross-attention context enable spatially aware conditioning, letting the model attend to each drag individually.
- All-to-First Attention:
  - Every frame attends back to the spatial features of the first (conditioning) frame, so high-quality appearance details propagate consistently through the sequence. A minimal sketch of these three mechanisms follows.
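The PyTorch sketch below shows toy versions of the three mechanisms above: adaptive layer norm modulated by a drag embedding, cross-attention with appended drag tokens, and all-to-first attention. Module names, dimensions, and the drag-embedding details are hypothetical; the paper's actual modules are wired into SVD's backbone and differ in scale.

```python
import torch
import torch.nn as nn

class DragAdaLN(nn.Module):
    """Adaptive layer norm: a drag embedding predicts per-channel
    scale and shift that modulate the normalized features."""
    def __init__(self, dim, drag_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(drag_dim, 2 * dim)

    def forward(self, x, drag_emb):              # x: (B, N, dim)
        scale, shift = self.to_scale_shift(drag_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class DragCrossAttention(nn.Module):
    """Cross-attention whose context is extended with one token per drag,
    so the model can attend to each drag individually."""
    def __init__(self, dim, drag_dim, num_heads=8):
        super().__init__()
        self.drag_to_token = nn.Linear(drag_dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, context, drag_embs):    # drag_embs: (B, K, drag_dim)
        context = torch.cat([context, self.drag_to_token(drag_embs)], dim=1)
        out, _ = self.attn(x, context, context)
        return out

def all_to_first_attention(frames, attn):
    """Every frame queries the *first* frame's spatial features, creating a
    shortcut for the conditioning appearance to reach later frames.
    `attn` is an nn.MultiheadAttention with batch_first=True."""
    B, T, N, D = frames.shape
    q = frames.reshape(B * T, N, D)
    first = frames[:, :1].expand(B, T, N, D).reshape(B * T, N, D)
    out, _ = attn(q, first, first)               # keys/values from frame 0
    return out.reshape(B, T, N, D)
```

In Puppet-Master these mechanisms are integrated into the attention blocks of the pre-trained video diffusion backbone; the stand-alone modules above only convey the shapes and data flow.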
Data Curation and Training
Data Curation Strategy
The authors employ a two-stage filtering pipeline:
- Automatic Filtering:
  - Metrics derived from the motion trajectories of animated assets are used to remove trivial or unrealistic animations, resulting in the Objaverse-Animation dataset (a rough sketch of this idea follows the list).
- Verification with GPT-4V:
  - The remaining animations are further screened with GPT-4V so that only realistic and plausible motions are kept, yielding Objaverse-Animation-HQ.
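As a rough illustration of the automatic-filtering stage, the sketch below scores an animation by the motion of tracked 3D points and rejects clips that are nearly static or implausibly erratic. The specific statistics and thresholds are invented for illustration; the paper's filtering criteria differ in detail.

```python
import numpy as np

def keep_animation(trajectories, min_mean_motion=0.01, max_peak_motion=0.5):
    """trajectories: (T, P, 3) array of P tracked 3D points over T frames,
    assumed normalized to the object's bounding box. Thresholds are
    illustrative, not the paper's actual values."""
    step = np.linalg.norm(np.diff(trajectories, axis=0), axis=-1)  # (T-1, P)
    mean_motion = step.mean()   # near zero => static / trivial animation
    peak_motion = step.max()    # very large => erratic / unrealistic motion
    return mean_motion > min_mean_motion and peak_motion < max_peak_motion
```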
Training Framework
Puppet-Master is obtained by fine-tuning SVD on the curated dataset: the new conditioning modules are trained jointly with the video diffusion backbone so that generated videos both look realistic and faithfully follow the input drags. A sketch of one training step appears below.
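This is a minimal sketch of one fine-tuning step, assuming a `model` that takes noisy video latents, a noise level, the reference frame, and the encoded drags. The log-normal noise sampling and plain MSE loss are simplifications of SVD's EDM-style weighted denoising objective.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, video_latents, ref_latent, drags):
    # Sample a per-example noise level (log-normal, EDM-style heuristic).
    b = video_latents.shape[0]
    sigma = torch.exp(torch.randn(b) * 1.2 - 0.5).view(b, 1, 1, 1, 1)
    noise = torch.randn_like(video_latents)
    noisy = video_latents + noise * sigma
    # The model predicts the clean latents given all conditioning signals.
    pred = model(noisy, sigma, ref_latent, drags)
    loss = F.mse_loss(pred, video_latents)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```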
Experimental Evaluation
Quantitative and Qualitative Results
Both quantitative metrics (PSNR, SSIM, LPIPS, FVD, and flow error) and qualitative analyses demonstrate Puppet-Master's superiority over existing models; a short reference implementation of PSNR follows the list below. Notably:
- Higher Fidelity:
  - Puppet-Master shows improved metrics across in-domain and out-of-domain datasets, including higher PSNR and lower FVD, indicating better visual quality and temporal coherence.
- Real-World Generalization:
  - The model exhibits strong zero-shot generalization, performing well on diverse real-world data without being fine-tuned on any real videos.
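For reference, PSNR (the simplest of the reported metrics) can be computed as below; SSIM, LPIPS, and FVD require dedicated implementations, and flow error compares optical flow extracted from generated and ground-truth videos.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB, assuming pixels in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)
```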
Future Implications
The advancements introduced pave the way for further developments in AI-driven video generation:
- Practical Applications:
  - Potential applications span domains such as animation, virtual reality, and robotics, where precise motion control is essential.
- Theoretical Developments:
  - Future research may explore more sophisticated motion priors, refined conditioning mechanisms, and broader generalization to more complex and chaotic motion patterns.
Conclusion
"Puppet-Master" marks a substantial step in interactive video generation, particularly in capturing and propagating fine-grained part-level dynamics in generated videos. The approaches developed and the resulting datasets provide a robust foundation for future research, promising enhanced capabilities in AI-powered animation and motion synthesis. The paper's contributions lie not just in improved generative quality but also in pushing the boundaries of what such models can achieve in understanding and replicating intricate motion dynamics.