- The paper introduces Animate-X, a universal framework that leverages a novel Pose Indicator combining implicit and explicit components to address misalignment in character animation.
- The method integrates a 3D-UNet diffusion model with cross-attention and temporal modules, achieving significant improvements in PSNR, SSIM, LPIPS, and FID across both human and anthropomorphic datasets.
- Experiments on the new A²Bench benchmark demonstrate that Animate-X preserves character identity and delivers consistent, expressive motion for non-human characters where previous methods often fail.
This paper introduces Animate-X, a framework for animating a wide variety of character images, including the anthropomorphic characters common in games and entertainment that existing methods often struggle with. The core problem addressed is the difficulty of generalizing character animation models trained primarily on human datasets to non-human characters with different body structures (e.g., disproportionate heads, missing limbs). Current methods often fail because their motion representations, typically derived solely from pose keypoints, are insufficient: they force a poor trade-off between preserving the character's identity and accurately following the driving motion, often producing distortions or inappropriately imposing human-like features.
To overcome these limitations, Animate-X proposes an enhanced motion representation strategy called the Pose Indicator, which operates within a Latent Diffusion Model (LDM) framework using a 3D-UNet architecture. The Pose Indicator has two components:
- Implicit Pose Indicator (IPI): This component aims to capture the "gist" of the motion from the driving video beyond simple keypoints. It uses a pre-trained CLIP image encoder to extract visual features (f_φ^d) from the driving video frames. These features, which implicitly encode motion patterns, dynamics, and temporal relationships, are processed by a lightweight extractor module composed of cross-attention and FFN layers. The query (Q) for the cross-attention combines embeddings of the detected DWPose keypoints (q_p) with a learnable query vector (q_l), allowing it to capture both explicit pose information and implicit motion cues from the CLIP features, which serve as the keys and values (K, V). The output (f_i) serves as an implicit motion condition for the diffusion model (a minimal sketch of this extractor is given after this list).
- Explicit Pose Indicator (EPI): This component tackles the issue of misalignment between the reference character's shape and the driving pose sequence, which often occurs during inference, especially with non-human characters. EPI enhances the pose encoder's robustness by simulating such misalignments during training. It uses two main techniques:
- Pose Realignment: A driving pose (I^p) is aligned to an anchor pose (I^p_anchor) randomly sampled from a pool, resulting in a realigned pose (I^p_realign) that retains the original motion but adopts the anchor's body-shape characteristics.
- Pose Rescale: Further transformations (e.g., altering limb lengths, resizing the head, removing or adding body parts) are randomly applied to the realigned pose (I^p_realign) with a high probability (λ > 98%) to create the final transformed pose (I^p_n). This explicitly trains the model to handle diverse body shapes and potential inaccuracies in pose estimation for non-human characters. The transformed pose is encoded into an explicit pose feature (f_e); a simplified augmentation sketch also follows this list.
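To make the IPI concrete, below is a minimal PyTorch sketch of the extractor described above: cross-attention whose query mixes a pose-keypoint embedding with learnable tokens, over CLIP frame features as keys and values, followed by an FFN. The dimensions, number of learnable queries, and single-block depth are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an IPI-style extractor (illustrative; dimensions and
# single-block depth are assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn

class ImplicitPoseIndicator(nn.Module):
    def __init__(self, clip_dim=1024, pose_dim=266, d_model=512, n_queries=4, n_heads=8):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, d_model)   # embed flattened DWPose keypoints -> q_p
        self.learnable_q = nn.Parameter(torch.randn(n_queries, d_model))  # learnable query q_l
        self.clip_proj = nn.Linear(clip_dim, d_model)   # project CLIP features into the K/V space
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, clip_feats, pose_kpts):
        # clip_feats: (B, N_tokens, clip_dim) CLIP features of a driving frame
        # pose_kpts:  (B, pose_dim) flattened DWPose keypoints of the same frame
        q_p = self.pose_proj(pose_kpts).unsqueeze(1)                      # (B, 1, d_model)
        q_l = self.learnable_q.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        q = torch.cat([q_p, q_l], dim=1)       # query combines pose embedding and learnable tokens
        kv = self.clip_proj(clip_feats)        # keys/values come from the CLIP features
        attn_out, _ = self.cross_attn(q, kv, kv)
        x = self.norm1(q + attn_out)
        return self.norm2(x + self.ffn(x))     # implicit motion condition f_i
```

The resulting tokens would then be fed, as the implicit condition f_i, into the Motion Attention cross-attention described in the framework overview below.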
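The Pose Rescale step can be read as keypoint-level data augmentation applied to the realigned pose during training. A minimal sketch under assumed keypoint indices, scale ranges, and drop probability follows; the paper's actual transform set is richer.

```python
# Sketch of EPI-style pose rescaling as training-time augmentation
# (indices, scale ranges, and the drop probability are assumptions).
import numpy as np

def rescale_pose(kpts, head_idx, limb_idx, root_idx, lam=0.98, rng=None):
    """kpts: (K, 2) array of 2D keypoints, already realigned to an anchor pose."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() > lam:                 # with probability 1 - lambda, leave the pose untouched
        return kpts
    out = kpts.copy()
    root = out[root_idx].copy()            # scaling origin, e.g. a neck/pelvis joint
    # enlarge or shrink the head region relative to the root
    out[head_idx] = root + (out[head_idx] - root) * rng.uniform(0.5, 2.0)
    # stretch or shorten limb keypoints relative to the root (a crude stand-in
    # for per-bone limb-length changes)
    out[limb_idx] = root + (out[limb_idx] - root) * rng.uniform(0.7, 1.3)
    # occasionally drop some limb joints to mimic missing body parts
    if rng.random() < 0.3:
        drop = rng.choice(limb_idx, size=max(1, len(limb_idx) // 4), replace=False)
        out[drop] = np.nan                 # marked as undetected; the pose renderer skips NaN joints
    return out
```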
The overall Animate-X framework integrates these components:
- A reference image I^r provides appearance features via a CLIP encoder (f_φ^r) and latent features via a VAE encoder (f_e^r).
- A driving video I_d^{1:F} provides motion information, processed by the IPI into f_i and by the EPI into f_e.
- The 3D-UNet diffusion model (ε_θ) takes the noised latent variable, the explicit pose feature f_e, and the reference latent feature f_e^r as input.
- Appearance conditioning (f_φ^r) and implicit motion conditioning (f_i) are injected via cross-attention within the UNet's Spatial Attention and a dedicated Motion Attention module, respectively.
- Temporal consistency is handled by Mamba-based Temporal Attention modules.
- The final denoised latent z_0 is decoded by the VAE decoder D to produce the output video.
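As a rough illustration of how these conditions come together, here is a simplified single-step sketch. The channel concatenation of latents, the `spatial_context`/`motion_context` argument names, and the overall interface are assumptions made for readability; the actual 3D-UNet blocks and Mamba-based temporal layers are not reproduced.

```python
# Simplified sketch of one conditioned denoising step (interface and the
# channel-concatenation choice are assumptions, not the released code).
import torch

def denoise_step(unet, z_t, t, f_e, f_e_r, f_phi_r, f_i):
    """
    z_t:     noised video latent              (B, C, F, H, W)
    f_e:     explicit pose feature from EPI   (B, C_p, F, H, W)
    f_e_r:   reference VAE latent             (B, C, 1, H, W)
    f_phi_r: reference CLIP appearance tokens (B, N_r, D)
    f_i:     implicit motion tokens from IPI  (B, N_i, D)
    """
    # pose and reference latents enter through the UNet input
    # (channel concat, a common design choice for latent-space conditioning)
    ref = f_e_r.expand(-1, -1, z_t.size(2), -1, -1)   # repeat the reference latent over frames
    x = torch.cat([z_t, f_e, ref], dim=1)
    # appearance (Spatial Attention) and implicit motion (Motion Attention)
    # conditions are injected via cross-attention inside the UNet blocks
    return unet(x, t, spatial_context=f_phi_r, motion_context=f_i)   # predicted noise eps_theta
```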
To evaluate performance, particularly on anthropomorphic characters, the authors introduce a new benchmark, A²Bench (Animated Anthropomorphic Benchmark). This benchmark contains 500 anthropomorphic character images and corresponding dance videos, generated using GPT-4 for prompts and KLing AI for image/video synthesis.
Experiments demonstrate Animate-X's superiority over state-of-the-art methods (Animate Anyone, UniAnimate, MimicMotion, ControlNeXt, MusePose) on both A²Bench and standard human datasets (TikTok, Fashion). Quantitative results show better performance across metrics such as PSNR, SSIM, LPIPS, FID, FID-VID, and FVD, especially in a challenging cross-driven setting designed for A²Bench in which poses are intentionally misaligned. Qualitative results and a user study further confirm that Animate-X excels at preserving character identity while achieving consistent and expressive motion, particularly for non-human characters where other methods often fail or introduce artifacts. Ablation studies validate the effectiveness of both the IPI and EPI components.
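For reference, the per-frame metrics reported above can be computed with torchmetrics as sketched below (a tooling assumption, since the paper does not publish its evaluation script); video-level FID-VID and FVD require dedicated video feature extractors and are omitted.

```python
# Sketch of frame-level metric computation with torchmetrics
# (assumes frames are float tensors in [0, 1] of shape (N, 3, H, W), N >= 2).
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance

def evaluate_frames(pred, gt):
    psnr = PeakSignalNoiseRatio(data_range=1.0)
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
    fid = FrechetInceptionDistance(feature=2048, normalize=True)
    fid.update(gt, real=True)
    fid.update(pred, real=False)
    return {
        "PSNR": psnr(pred, gt).item(),
        "SSIM": ssim(pred, gt).item(),
        "LPIPS": lpips(pred, gt).item(),
        "FID": fid.compute().item(),
    }
```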
The main contributions are:
- Animate-X, a universal animation framework generalizing to diverse characters (X), including anthropomorphic ones.
- The Pose Indicator (IPI + EPI) for enhanced, robust motion representation handling misalignment.
- The A²Bench dataset for evaluating anthropomorphic character animation.
Limitations include insufficient modeling of hands and faces and the lack of real-time capability. Future work aims to address these issues and to explore character-environment interactions. Ethical considerations regarding potential misuse for generating misleading content are also acknowledged.