Overview of "DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors"
The paper presents DynamiCrafter, a method that leverages the motion priors of text-to-video (T2V) diffusion models to animate still images in open-domain settings. This addresses a key limitation of existing image-animation approaches, which are typically constrained to domain-specific motions, by generating dynamic video content from a wide range of input images. The core of the methodology is a dual-stream image injection paradigm that keeps the generated video visually faithful to the input image while still introducing sufficient motion dynamics.
Methodology
DynamiCrafter operates by injecting the input image into the video generation process through two complementary streams: text-aligned context representation and visual detail guidance.
- Text-aligned Context Representation: This stream uses a learnable query transformer to project the input image into a text-aligned, rich context representation space, which helps the model digest the image content semantically. Image features are extracted with CLIP's image encoder and projected into the same representation space as the text features that condition the diffusion model.
- Visual Detail Guidance: This stream preserves fine visual details by concatenating the conditional image with the per-frame initial noise used in the diffusion process, which keeps the resulting video strongly similar in appearance to the input image.
The dual-stream image injection paradigm thus balances semantic context alignment against the preservation of visual detail, so that the generated dynamics are both plausible and faithful to the appearance of the input image.
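A minimal PyTorch-style sketch of this dual-stream idea is given below. The module names (`ImageContextProjector`, `prepare_unet_inputs`), tensor shapes, layer counts, and the simple concatenation of image and text context are illustrative assumptions, not the paper's exact architecture: learnable queries attend to CLIP visual tokens to produce text-aligned context tokens for cross-attention, while the image latent is repeated over time and concatenated with the noisy video latents along the channel dimension.

```python
# Schematic sketch of dual-stream image injection (illustrative, not the
# authors' code). Assumed shapes: CLIP visual tokens (B, N, clip_dim),
# video latents (B, C, T, H, W).
import torch
import torch.nn as nn

class ImageContextProjector(nn.Module):
    """Learnable query transformer: CLIP visual tokens -> text-aligned context tokens."""
    def __init__(self, clip_dim=1024, ctx_dim=1024, num_queries=16, num_layers=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, ctx_dim))
        self.in_proj = nn.Linear(clip_dim, ctx_dim)
        self.blocks = nn.ModuleList([
            nn.TransformerDecoderLayer(d_model=ctx_dim, nhead=8, batch_first=True)
            for _ in range(num_layers)
        ])

    def forward(self, clip_tokens):                  # (B, N, clip_dim)
        kv = self.in_proj(clip_tokens)               # (B, N, ctx_dim)
        q = self.queries.unsqueeze(0).expand(clip_tokens.size(0), -1, -1)
        for blk in self.blocks:
            q = blk(q, kv)                           # queries attend to image tokens
        return q                                     # (B, num_queries, ctx_dim)


def prepare_unet_inputs(noisy_latents, image_latent, clip_tokens, text_ctx, projector):
    """Combine both streams for one denoising step.

    noisy_latents: (B, C, T, H, W) per-frame noisy video latents
    image_latent : (B, C, H, W)    VAE latent of the conditional image
    text_ctx     : (B, M, ctx_dim) text-prompt embeddings
    """
    # Stream 1: text-aligned context tokens for cross-attention. Plain token
    # concatenation stands in for however the model actually fuses image and
    # text conditions.
    img_ctx = projector(clip_tokens)
    context = torch.cat([text_ctx, img_ctx], dim=1)

    # Stream 2: visual detail guidance. The image latent is repeated over time
    # and concatenated with the noisy latents along the channel dimension.
    T = noisy_latents.shape[2]
    img_rep = image_latent.unsqueeze(2).expand(-1, -1, T, -1, -1)
    unet_input = torch.cat([noisy_latents, img_rep], dim=1)   # (B, 2C, T, H, W)
    return unet_input, context
```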
Experimental Results
The experimental evaluation demonstrates DynamiCrafter's superior performance over existing methods such as VideoComposer and I2VGen-XL. The method shows notable improvements on metrics including Fréchet Video Distance (FVD), Kernel Video Distance (KVD), and Perceptual Input Conformity (PIC), confirming its ability to produce temporally coherent and visually faithful video animations from still images.
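For reference, FVD and KVD compare feature statistics of real and generated videos extracted by a pretrained video classifier (commonly an I3D network). The snippet below sketches only the final Fréchet-distance step between two feature sets, assuming the features have already been extracted; the feature extractor and the paper's exact evaluation protocol are not reproduced here.

```python
# Sketch of the Fréchet distance underlying FVD, computed between two sets of
# video features (e.g., embeddings of real and generated clips). Arrays are
# shaped (num_videos, feature_dim).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake, eps=1e-6):
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)

    # Matrix square root of the covariance product; add a small ridge if the
    # product is numerically singular.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if not np.isfinite(covmean).all():
        offset = np.eye(cov_r.shape[0]) * eps
        covmean, _ = linalg.sqrtm((cov_r + offset) @ (cov_f + offset), disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error

    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```

A lower Fréchet distance indicates that the generated videos' feature distribution is closer to that of real videos; KVD replaces the Gaussian assumption with a kernel-based (MMD-style) estimate over the same features.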
Qualitative results further illustrate the model's ability to handle complex input images and diverse text prompts, achieving performance competitive with state-of-the-art commercial solutions such as PikaLabs and Gen-2 while producing more detailed and coherent animation.
Implications and Future Work
The paper introduces a novel direction in open-domain image animation, significantly extending still-image animation beyond predefined motions or categories. This development opens new possibilities for creative content generation, such as storytelling applications or dynamic web content.
Future work could further investigate enhancing motion control through text prompts and refining the training paradigm for better model adaptation at lower resource cost. Another potential avenue is higher-resolution outputs and longer animation sequences, which would likely require refining the underlying video diffusion models to capture more nuanced visual detail and motion.
In conclusion, this research marks a significant advance in the use of video diffusion priors for image animation and sets the stage for further exploration and applications in automated dynamic content generation.