Human image animation, which aims to bring static images to life by making them perform realistic movements, has significant potential in various industries, including social media, entertainment, and film. A critical challenge for current animation techniques is generating videos that not only move convincingly but also maintain the identity and details of the original static image, a problem that is amplified when motion spans many frames.
To address this challenge, researchers have introduced a new diffusion-based framework named MagicAnimate, which produces temporally consistent human animations while preserving the intricate details of the reference image, such as appearance and background. MagicAnimate combines three key components: a video diffusion model that encodes temporal information, a novel appearance encoder that retains the reference image's details, and a simple video fusion technique that ensures smooth transitions in long video sequences.
The video diffusion model lies at the heart of MagicAnimate and incorporates temporal attention blocks within the network, which let the model draw on information from neighboring frames and generate a sequence that flows naturally over time. Whereas many earlier methods process each frame independently, this integration of temporal information is what enables continuous, flicker-free motion.
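The sketch below illustrates the general temporal-attention idea in PyTorch: video features are reshaped so that self-attention runs along the frame axis at every spatial location. The module name, shapes, and layer choices are illustrative assumptions, not the released MagicAnimate code.

```python
# Minimal sketch of a temporal attention block (hypothetical, simplified
# illustration of the temporal-attention idea; not the actual implementation).
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention applied along the time axis of a video feature map."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        # Treat every spatial location independently and attend over frames,
        # so each position's feature can borrow context from neighboring frames.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        normed = self.norm(tokens)
        out, _ = self.attn(normed, normed, normed)
        tokens = tokens + out  # residual connection
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)


# Usage: a clip of 16 frames with 64-channel features at 32x32 resolution.
features = torch.randn(1, 16, 64, 32, 32)
smoothed = TemporalAttention(64)(features)
print(smoothed.shape)  # torch.Size([1, 16, 64, 32, 32])
```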
In a traditional animation pipeline, the source image is warped or otherwise manipulated according to target motion signals derived from 3D meshes or 2D flow fields. MagicAnimate instead encodes motion as DensePose sequences, a dense and robust body-surface representation that provides detailed motion information and improves animation accuracy.
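To make the conditioning concrete, here is a minimal sketch of how per-frame DensePose maps could be encoded to latent resolution and fed to a diffusion denoiser. The small convolutional encoder and the injection strategy are assumptions for illustration; MagicAnimate's actual pose-conditioning pathway may differ.

```python
# Hedged sketch: turning per-frame DensePose maps into conditioning features.
# The encoder and how its output is injected are illustrative assumptions.
import torch
import torch.nn as nn


class DensePoseEncoder(nn.Module):
    """Maps a 3-channel DensePose IUV map down to latent-sized features."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(32, latent_channels, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, densepose: torch.Tensor) -> torch.Tensor:
        # densepose: (frames, 3, H, W) -> (frames, latent_channels, H/4, W/4)
        return self.net(densepose)


# A 16-frame DensePose sequence rendered at 256x256.
pose_seq = torch.rand(16, 3, 256, 256)
pose_features = DensePoseEncoder()(pose_seq)
print(pose_features.shape)  # torch.Size([16, 4, 64, 64])
# These pose features would then be added to (or concatenated with) the noisy
# video latents at each denoising step so the motion follows the DensePose sequence.
```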
MagicAnimate also introduces a novel appearance encoder. Unlike previous approaches that conditioned on sparse, high-level semantic features, it extracts dense visual features from the reference image to guide the animation, preserving not only the identity of the person being animated but also the distinct characteristics of the background, clothing, and accessories.
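A rough sketch of the dense-conditioning idea is shown below: a spatial grid of reference-image features is extracted, and the denoiser's frame features cross-attend to that grid rather than to a single pooled embedding. The backbone and attention wiring here are hypothetical stand-ins, not the paper's appearance encoder.

```python
# Illustrative sketch of dense appearance conditioning via cross-attention.
# Layer choices and shapes are assumptions, not the released encoder.
import torch
import torch.nn as nn


class AppearanceConditioning(nn.Module):
    def __init__(self, channels: int = 64, num_heads: int = 8):
        super().__init__()
        # Dense feature extractor: keeps a spatial grid instead of a single
        # pooled vector, so background and clothing detail can survive.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1),
        )
        self.cross_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, tokens, channels) from the denoising network
        # reference:   (batch, 3, H, W) reference image
        ref = self.encoder(reference)                # (batch, C, H/4, W/4)
        ref_tokens = ref.flatten(2).transpose(1, 2)  # (batch, H*W/16, C)
        out, _ = self.cross_attn(frame_feats, ref_tokens, ref_tokens)
        return frame_feats + out


frame_feats = torch.randn(2, 1024, 64)   # e.g. a 32x32 latent grid per frame
reference = torch.rand(2, 3, 256, 256)
print(AppearanceConditioning()(frame_feats, reference).shape)  # (2, 1024, 64)
```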
To further improve animation quality over longer sequences, MagicAnimate employs a video fusion technique at inference time. It divides a long motion sequence into overlapping segments, generates each segment, and blends the overlapping predictions to smooth out discontinuities, so the resulting animation transitions seamlessly from one segment to the next.
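The following sketch captures the core of the overlapping-segment idea using a toy frame generator: windows are generated with some overlap, and predictions are averaged wherever windows overlap. The window length, overlap, and the blending of finished frames (rather than intermediate denoising predictions inside the diffusion loop) are simplifying assumptions.

```python
# Minimal sketch of overlapping-segment fusion for long sequences.
import numpy as np


def fuse_segments(num_frames: int, window: int, overlap: int, generate):
    """generate(start, end) -> array of shape (end - start, H, W, C)."""
    accum = None
    weight = np.zeros(num_frames)
    stride = window - overlap
    for start in range(0, max(num_frames - overlap, 1), stride):
        end = min(start + window, num_frames)
        segment = generate(start, end)
        if accum is None:
            accum = np.zeros((num_frames,) + segment.shape[1:])
        accum[start:end] += segment   # sum overlapping predictions
        weight[start:end] += 1.0      # count contributions per frame
        if end == num_frames:
            break
    return accum / weight[:, None, None, None]  # average in overlapped regions


# Toy generator: each "frame" is a constant image equal to its index.
frames = fuse_segments(
    num_frames=40, window=16, overlap=4,
    generate=lambda s, e: np.stack([np.full((8, 8, 3), i, float) for i in range(s, e)]),
)
print(frames.shape, frames[20].mean())  # (40, 8, 8, 3) 20.0
```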
Empirical evidence for MagicAnimate's effectiveness comes from its performance on two challenging benchmarks. The framework improves substantially over previous state-of-the-art methods, including a gain of more than 38% in video fidelity on the challenging TikTok dancing dataset. MagicAnimate also animates reference images with motion from different identities and generalizes to unseen domains, suggesting robustness and versatility.
In conclusion, MagicAnimate is a significant step forward in the domain of human image animation. It addresses long-standing challenges of temporal consistency and detail preservation and opens up new possibilities for high-fidelity human avatar generation across a variety of applications. The code and model for MagicAnimate will be made available for use and further research, potentially spurring advancements in the field.