- The paper presents a unified framework that combines Stable Video Diffusion with MOFA-Adapters for enhanced controllable image animation.
- It introduces a novel sparse-to-dense motion generation method that improves video quality and temporal consistency.
- Extensive experiments show lower LPIPS and FID scores than existing animation methods, demonstrating superior perceptual quality and motion fidelity.
Overview of MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptations
The MOFA-Video framework is a significant step forward in controllable image animation, generating videos from still images via generative motion field adaptations. Unlike previous approaches, which typically operated within a single motion domain and offered limited control, MOFA-Video provides a unified framework that combines the strengths of in-domain image animation methods with those of open-domain image-to-video generation models.
Main Contributions
- Unified Framework for Animation: MOFA-Video's core contribution is a unified framework for controllable image animation built on the Stable Video Diffusion (SVD) model, enabling control across diverse motion domains, including manual trajectories and human facial landmarks.
- MOFA-Adapter: The paper introduces the MOFA-Adapter, a network module that uses sparse motion hints to guide the video diffusion process. Each adapter is designed for a specific motion control signal, supporting domain-specific animation tasks, and multiple adapters can be combined for more complex control scenarios (see the feature-warping sketch after this list).
- Sparse-to-Dense Motion Generation: Each MOFA-Adapter incorporates sparse-to-dense (S2D) motion generation, a crucial step that interpolates sparse motion hints into dense motion fields. This balances sparse user guidance against learned motion synthesis, improving the quality and temporal consistency of the generated videos (a minimal sketch of the S2D data flow follows this list).
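To make the sparse-to-dense step concrete, here is a minimal PyTorch sketch of the data flow: sparse hints (pixel positions plus displacement vectors) are densified into a full flow field. The paper's S2D module is a learned network; the distance-weighted interpolation below is only an illustrative stand-in, and the function name is hypothetical.

```python
import torch

def sparse_to_dense_flow(points, vectors, h, w, sigma=20.0):
    """Densify sparse motion hints into a dense flow field with Gaussian
    distance weighting. MOFA-Video's actual S2D module is a trained
    network; this closed-form stand-in only illustrates the shapes.

    points:  (N, 2) float tensor of hint coordinates (x, y) in pixels
    vectors: (N, 2) float tensor of displacements (dx, dy) at each hint
    returns: (2, h, w) dense flow field
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()               # (h, w, 2)
    # Squared distance from every pixel to every sparse hint: (h, w, N)
    d2 = ((grid[..., None, :] - points[None, None]) ** 2).sum(-1)
    weights = torch.softmax(-d2 / (2.0 * sigma**2), dim=-1)    # (h, w, N)
    dense = weights @ vectors                                  # (h, w, 2)
    return dense.permute(2, 0, 1)                              # (2, h, w)

# Example: one rightward drag hint at pixel (64, 64) on a 128x128 image.
flow = sparse_to_dense_flow(torch.tensor([[64.0, 64.0]]),
                            torch.tensor([[10.0, 0.0]]), 128, 128)
```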
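Once a dense flow field exists, an adapter can use it to warp features of the reference image and supply them as guidance to the frozen video diffusion U-Net. The sketch below shows only the warping step, using standard backward warping via `grid_sample`; it is an assumption-laden illustration of the idea, not the paper's released code, and how the warped features are injected into SVD is not reproduced here.

```python
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Backward-warp reference features with a dense flow field (pixels).
    feat: (B, C, H, W); flow: (B, 2, H, W). Returns warped (B, C, H, W).
    A MOFA-Adapter applies warping like this at multiple feature scales
    before conditioning the frozen SVD U-Net (sketch, not released code).
    """
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(feat)       # (2, H, W)
    # Source coordinates to sample for each output pixel.
    coords = base.unsqueeze(0) + flow                          # (B, 2, H, W)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)                       # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)
```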
Experimental Results
The paper supports its claims with extensive experiments demonstrating that MOFA-Video outperforms existing state-of-the-art methods across several applications:
- Trajectories and Facial Animation: Detailed experiments show that MOFA-Video effectively animates images from trajectories and facial landmarks, surpassing previous methods such as DragNUWA and achieving better visual quality with fewer artifacts.
- Metrics: Quantitative results show that MOFA-Video achieves lower LPIPS and FID scores than previous methods, indicating that its generated frames are perceptually closer to the ground truth and more realistic in distribution (a sketch of how these metrics are typically computed follows).
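Both metrics are usually computed with off-the-shelf packages. Below is a hedged sketch using the `lpips` and `torchmetrics` libraries, not MOFA-Video's own evaluation scripts; frame loading and pairing are elided, and the helper names are hypothetical.

```python
import torch
import lpips                                         # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance

lpips_fn = lpips.LPIPS(net="alex")                   # AlexNet-backed LPIPS

def video_lpips(gen, real):
    """Mean per-frame LPIPS; gen/real are (T, 3, H, W) floats in [-1, 1].
    Lower means generated frames are perceptually closer to ground truth."""
    with torch.no_grad():
        return lpips_fn(gen, real).mean().item()

def video_fid(gen_u8, real_u8):
    """FID between pooled generated and real frames; inputs are uint8
    tensors of shape (N, 3, H, W). Lower means the generated frame
    distribution is closer to the real one."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_u8, real=True)
    fid.update(gen_u8, real=False)
    return fid.compute().item()
```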
Practical Implications
MOFA-Video opens up several practical avenues in animation applications. By supporting fine-grained control across multiple motion domains and synthesizing coherent video from a single static image, the framework could be particularly useful in digital media, entertainment, and educational content creation, where customizable animations from still images are desirable.
Future Directions
Potential future directions include extending the framework to handle larger motions with greater precision and robustness. The fusion of multiple MOFA-Adapters also points toward more complex, multi-modal animation models that could integrate input types beyond static images, such as dynamic textures or volumetric data.
Conclusion
MOFA-Video exemplifies a comprehensive approach to controllable image-to-video animation, combining theoretical innovation with practical engineering. By providing an adaptable and user-friendly toolkit for animating images, the framework sets a strong benchmark for controllable animation. Extending these techniques to more complex motion models, non-rigid transformations, and other image modalities remains a promising avenue for research.