
Animate-X: Universal Character Image Animation with Enhanced Motion Representation (2410.10306v2)

Published 14 Oct 2024 in cs.CV

Abstract: Character image animation, which generates high-quality videos from a reference image and target pose sequence, has seen significant progress in recent years. However, most existing methods apply only to human figures and usually do not generalize well to anthropomorphic characters commonly used in industries like gaming and entertainment. Our in-depth analysis attributes this limitation to their insufficient modeling of motion, which fails to comprehend the movement pattern of the driving video and thus imposes a pose sequence rigidly onto the target character. To this end, this paper proposes Animate-X, a universal animation framework based on LDM for various character types (collectively named X), including anthropomorphic characters. To enhance motion representation, we introduce the Pose Indicator, which captures comprehensive motion patterns from the driving video in both an implicit and an explicit manner. The former leverages CLIP visual features of a driving video to extract the gist of its motion, such as the overall movement pattern and temporal relations among motions, while the latter strengthens the generalization of the LDM by simulating in advance possible inputs that may arise during inference. Moreover, we introduce a new Animated Anthropomorphic Benchmark (A²Bench) to evaluate the performance of Animate-X on universal and widely applicable animation images. Extensive experiments demonstrate the superiority and effectiveness of Animate-X compared to state-of-the-art methods.

Authors (9)
  1. Shuai Tan (15 papers)
  2. Biao Gong (33 papers)
  3. Xiang Wang (279 papers)
  4. Shiwei Zhang (180 papers)
  5. DanDan Zheng (16 papers)
  6. Ruobing Zheng (11 papers)
  7. Kecheng Zheng (49 papers)
  8. Jingdong Chen (62 papers)
  9. Ming Yang (289 papers)
Citations (3)

Summary

  • The paper introduces Animate-X, a universal framework that leverages a novel Pose Indicator combining implicit and explicit components to address misalignment in character animation.
  • The method integrates a 3D-UNet diffusion model with cross-attention and temporal modules, achieving significant improvements in PSNR, SSIM, LPIPS, and FID across both human and anthropomorphic datasets.
  • Experiments on the new A²Bench benchmark demonstrate that Animate-X preserves character identity and delivers consistent, expressive motion for non-human characters where previous methods often fail.

This paper introduces Animate-X, a framework for animating a wide variety of character images, including the anthropomorphic characters common in games and entertainment that existing methods often struggle with. The core problem is that character animation models trained primarily on human datasets generalize poorly to non-human characters with different body structures (e.g., disproportionate heads or missing limbs). Current methods often fail because their motion representations, typically derived solely from pose keypoints, are insufficient: they force a poor trade-off between preserving the character's identity and accurately following the driving motion, often producing distortions or inappropriately imposing human-like features.

To overcome these limitations, Animate-X proposes an enhanced motion representation strategy called the Pose Indicator, which operates within a Latent Diffusion Model (LDM) framework using a 3D-UNet architecture. The Pose Indicator has two components:

  1. Implicit Pose Indicator (IPI): This component aims to capture the "gist" of the motion from the driving video beyond simple keypoints. It uses a pre-trained CLIP image encoder to extract visual features ($f^d_\varphi$) from the driving video frames. These features, which implicitly encode motion patterns, dynamics, and temporal relationships, are processed by a lightweight extractor module composed of cross-attention and FFN layers. The query ($Q$) for the cross-attention combines embeddings from the detected DWPose keypoints ($q_p$) and a learnable query vector ($q_l$), allowing it to capture both explicit pose information and implicit motion cues from the CLIP features, which serve as keys and values ($K, V$). The output ($f_i$) serves as an implicit motion condition for the diffusion model (a sketch of this extractor appears after the list).
  2. Explicit Pose Indicator (EPI): This component tackles the issue of misalignment between the reference character's shape and the driving pose sequence, which often occurs during inference, especially with non-human characters. EPI enhances the pose encoder's robustness by simulating such misalignments during training. It uses two main techniques:
    • Pose Realignment: A driving pose ($I^p$) is aligned to a randomly sampled anchor pose ($I^p_{anchor}$) from a pool, resulting in an aligned pose ($I^p_{realign}$) that retains the original motion but adopts the anchor's body shape characteristics.
    • Pose Rescale: Further transformations (such as altering limb lengths, resizing the head, or removing/adding parts) are randomly applied to the realigned pose ($I^p_{realign}$) with a high probability ($\lambda > 98\%$) to create the final transformed pose ($I^p_n$). This explicitly trains the model to handle diverse body shapes and potential inaccuracies in pose estimation for non-human characters. The transformed pose is encoded into an explicit pose feature ($f_e$); a sketch of these augmentations follows the list.
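A minimal sketch of how an IPI-style extractor could look in PyTorch is given below. The module name `ImplicitPoseIndicator`, the layer sizes, and the number of learnable queries are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ImplicitPoseIndicator(nn.Module):
    """Sketch of an IPI-style extractor (hypothetical, not the authors' code).

    Query tokens mix DWPose keypoint embeddings (q_p) with learnable queries (q_l);
    keys/values come from CLIP features of the driving frames (f^d_phi).
    """

    def __init__(self, dim=1024, num_heads=8, num_learnable_queries=16):
        super().__init__()
        self.learnable_queries = nn.Parameter(torch.randn(num_learnable_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, pose_tokens, clip_tokens):
        # pose_tokens: (B, N_p, dim) embeddings of detected DWPose keypoints (q_p)
        # clip_tokens: (B, N_c, dim) CLIP visual features of the driving frames (f^d_phi)
        b = pose_tokens.size(0)
        q_l = self.learnable_queries.unsqueeze(0).expand(b, -1, -1)
        query = torch.cat([pose_tokens, q_l], dim=1)                     # Q = [q_p ; q_l]
        attn_out, _ = self.cross_attn(query, clip_tokens, clip_tokens)   # K, V = CLIP features
        f_i = attn_out + self.ffn(attn_out)                              # implicit motion condition f_i
        return f_i
```

The learnable queries give the extractor capacity to pull motion cues out of the CLIP features that are not anchored to any detected keypoint, which is the intuition behind combining $q_p$ and $q_l$ in the query.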

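For the EPI, the training-time augmentations can be pictured as simple geometric transforms on 2D keypoint arrays. The sketch below is a hedged illustration under the assumption that a pose is a NumPy array of (x, y) joint coordinates; the helper names and the specific transform ranges are hypothetical, not taken from the paper.

```python
import random
import numpy as np

def realign(pose, anchor_pose):
    """Pose Realignment (sketch): rescale/offset the driving pose so its bounding box
    matches the anchor's, keeping the driving motion but adopting the anchor's proportions."""
    p_min, p_max = pose.min(axis=0), pose.max(axis=0)
    a_min, a_max = anchor_pose.min(axis=0), anchor_pose.max(axis=0)
    scale = (a_max - a_min) / np.maximum(p_max - p_min, 1e-6)
    return (pose - p_min) * scale + a_min

def rescale(pose, head_ids, limb_ids, drop_prob=0.1):
    """Pose Rescale (sketch): randomly enlarge the head, stretch limbs, or drop parts."""
    pose = pose.copy()
    head_center = pose[head_ids].mean(axis=0)
    pose[head_ids] = head_center + (pose[head_ids] - head_center) * random.uniform(0.8, 2.0)
    body_center = pose.mean(axis=0)
    pose[limb_ids] = body_center + (pose[limb_ids] - body_center) * random.uniform(0.7, 1.3)
    if random.random() < drop_prob:                      # simulate a missing limb joint
        pose[random.choice(limb_ids)] = np.nan
    return pose

def epi_transform(pose, anchor_pool, head_ids, limb_ids, lam=0.98):
    """EPI-style augmentation: realign to a random anchor, then rescale with probability lambda."""
    realigned = realign(pose, random.choice(anchor_pool))
    return rescale(realigned, head_ids, limb_ids) if random.random() < lam else realigned
```
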
The overall Animate-X framework integrates these components (a schematic training-step sketch follows the list below):

  • A reference image $I^r$ provides appearance features via a CLIP encoder ($f^r_\varphi$) and latent features via a VAE encoder ($f^r_e$).
  • A driving video $I^d_{1:F}$ provides motion information, processed by the IPI ($f_i$) and the EPI ($f_e$).
  • The 3D-UNet diffusion model ($\epsilon_\theta$) takes the noised latent variable, the explicit pose feature $f_e$, and the reference latent feature $f^r_e$ as input.
  • Appearance conditioning ($f^r_\varphi$) and implicit motion conditioning ($f_i$) are injected via cross-attention within the UNet's Spatial Attention and a dedicated Motion Attention module, respectively.
  • Temporal consistency is handled by Mamba-based Temporal Attention modules.
  • The final denoised latent $z_0$ is decoded by the VAE decoder $\mathcal{D}$ to produce the output video.
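Putting these pieces together, one training step can be outlined as below. This is a schematic sketch under the standard LDM epsilon-prediction objective; all callables (`vae`, `clip_encoder`, `pose_embed`, `pose_encoder`, `ipi`, `unet3d`) and the toy noise schedule are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def training_step(batch, vae, clip_encoder, pose_embed, pose_encoder, ipi, unet3d):
    """One schematic Animate-X training step (assumed callables, not the released code).

    Uses the standard LDM epsilon-prediction objective, conditioned on the reference
    appearance (f^r_phi, f^r_e), the explicit pose feature f_e, and the implicit
    motion feature f_i described above.
    """
    ref_img, driving_frames, driving_poses, target_video = batch

    # Reference image: CLIP appearance features and VAE latents.
    f_r_phi = clip_encoder(ref_img)        # injected via spatial cross-attention
    f_r_e = vae.encode(ref_img)            # reference latent feature

    # Driving video: explicit pose feature (EPI-augmented at train time) and
    # implicit motion feature from the IPI extractor.
    f_e = pose_encoder(driving_poses)
    f_i = ipi(pose_embed(driving_poses), clip_encoder(driving_frames))

    # Forward diffusion: noise the target-video latents at a random timestep t.
    z0 = vae.encode(target_video)
    t = torch.randint(0, 1000, (z0.size(0),), device=z0.device)               # 1000-step schedule
    alphas_cumprod = torch.linspace(0.9999, 1e-4, 1000, device=z0.device)     # toy schedule
    a_t = alphas_cumprod[t].view(-1, *([1] * (z0.dim() - 1)))
    noise = torch.randn_like(z0)
    zt = a_t.sqrt() * z0 + (1 - a_t).sqrt() * noise

    # 3D-UNet predicts the added noise given all conditions; epsilon-prediction loss.
    noise_pred = unet3d(zt, t, pose=f_e, ref_latent=f_r_e, appearance=f_r_phi, motion=f_i)
    return F.mse_loss(noise_pred, noise)
```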

To evaluate performance, particularly on anthropomorphic characters, the authors introduce a new benchmark, A²Bench (Animated Anthropomorphic Benchmark). This benchmark contains 500 anthropomorphic character images and corresponding dance videos, generated using GPT-4 for prompts and KLing AI for image/video synthesis.

Experiments demonstrate Animate-X's superiority over state-of-the-art methods (such as Animate Anyone, UniAnimate, MimicMotion, ControlNeXt, and MusePose) on both A²Bench and standard human datasets (TikTok, Fashion). Quantitative results show better performance across metrics like PSNR, SSIM, LPIPS, FID, FID-VID, and FVD, especially in a challenging cross-driven setting designed for A²Bench where poses are intentionally misaligned. Qualitative results and a user study further confirm that Animate-X excels at preserving character identity while achieving consistent and expressive motion, particularly for non-human characters where other methods often fail or introduce artifacts. Ablation studies validate the effectiveness of both the IPI and EPI components.

The main contributions are:

  • Animate-X, a universal animation framework generalizing to diverse characters (X), including anthropomorphic ones.
  • The Pose Indicator (IPI + EPI) for enhanced, robust motion representation handling misalignment.
  • The A²Bench dataset for evaluating anthropomorphic character animation.

Limitations include insufficient modeling of hands and faces and the lack of real-time capability. Future work aims to address these and to study character-environment interactions. Ethical considerations regarding potential misuse for generating misleading content are also acknowledged.
