- The paper introduces Animate-X, a universal framework that leverages a novel Pose Indicator combining implicit and explicit components to address misalignment in character animation.
- The method integrates a 3D-UNet diffusion model with cross-attention and temporal modules, achieving significant improvements in PSNR, SSIM, LPIPS, and FID across both human and anthropomorphic datasets.
- Experiments on the new A²Bench benchmark demonstrate that Animate-X preserves character identity and delivers consistent, expressive motion for non-human characters where previous methods often fail.
This paper introduces Animate-X, a framework for animating a wide variety of character images, including the anthropomorphic characters common in games and entertainment that existing methods often struggle with. The core problem addressed is the difficulty of generalizing character animation models trained primarily on human datasets to non-human characters with different body structures (e.g., disproportionate heads, missing limbs). Current methods often fail because their motion representations, typically derived solely from pose keypoints, are insufficient: they force a poor trade-off between preserving the character's identity and accurately following the driving motion, often producing distortions or inappropriately imposing human-like features.
To overcome these limitations, Animate-X proposes an enhanced motion representation strategy called the Pose Indicator, which operates within a Latent Diffusion Model (LDM) framework using a 3D-UNet architecture. The Pose Indicator has two components:
- Implicit Pose Indicator (IPI): This component aims to capture the "gist" of the motion from the driving video beyond simple keypoints. It uses a pre-trained CLIP image encoder to extract visual features (f_φ^d) from the driving video frames. These features, which implicitly encode motion patterns, dynamics, and temporal relationships, are processed by a lightweight extractor module composed of cross-attention and FFN layers. The query (Q) for the cross-attention combines embeddings of the detected DWPose keypoints (q_p) with a learnable query vector (q_l), allowing it to capture both explicit pose information and implicit motion cues from the CLIP features, which serve as the keys and values (K, V). The output (f_i) serves as an implicit motion condition for the diffusion model (a minimal sketch of this extractor is given after this list).
- Explicit Pose Indicator (EPI): This component tackles the issue of misalignment between the reference character's shape and the driving pose sequence, which often occurs during inference, especially with non-human characters. EPI enhances the pose encoder's robustness by simulating such misalignments during training. It uses two main techniques:
- Pose Realignment: A driving pose (I^p) is aligned to an anchor pose (I^p_anchor) randomly sampled from a pool, resulting in a realigned pose (I^p_realign) that retains the original motion but adopts the anchor's body-shape characteristics.
- Pose Rescale: Further transformations (e.g., altering limb lengths, resizing the head, removing or adding body parts) are randomly applied to the realigned pose (I^p_realign) with a high probability (λ > 98%) to create the final transformed pose (I^p_n). This explicitly trains the model to handle diverse body shapes and potential inaccuracies in pose estimation for non-human characters. The transformed pose is encoded into an explicit pose feature (f_e); a simplified augmentation sketch also follows this list.
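To make the IPI concrete, below is a minimal PyTorch sketch of the extractor described above: cross-attention whose query mixes a pose-keypoint embedding with learnable tokens, over CLIP frame features as keys and values, followed by an FFN. The dimensions, number of learnable queries, and single-block depth are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an IPI-style extractor (illustrative; dimensions and
# single-block depth are assumptions, not the paper's exact architecture).
import torch
import torch.nn as nn

class ImplicitPoseIndicator(nn.Module):
    def __init__(self, clip_dim=1024, pose_dim=266, d_model=512, n_queries=4, n_heads=8):
        super().__init__()
        self.pose_proj = nn.Linear(pose_dim, d_model)   # embed flattened DWPose keypoints -> q_p
        self.learnable_q = nn.Parameter(torch.randn(n_queries, d_model))  # learnable query q_l
        self.clip_proj = nn.Linear(clip_dim, d_model)   # project CLIP features into the K/V space
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, clip_feats, pose_kpts):
        # clip_feats: (B, N_tokens, clip_dim) CLIP features of a driving frame
        # pose_kpts:  (B, pose_dim) flattened DWPose keypoints of the same frame
        q_p = self.pose_proj(pose_kpts).unsqueeze(1)                      # (B, 1, d_model)
        q_l = self.learnable_q.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        q = torch.cat([q_p, q_l], dim=1)       # query combines pose embedding and learnable tokens
        kv = self.clip_proj(clip_feats)        # keys/values come from the CLIP features
        attn_out, _ = self.cross_attn(q, kv, kv)
        x = self.norm1(q + attn_out)
        return self.norm2(x + self.ffn(x))     # implicit motion condition f_i
```

The resulting tokens would then be fed, as the implicit condition f_i, into the Motion Attention cross-attention described in the framework overview below.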
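The Pose Rescale step can be read as keypoint-level data augmentation applied to the realigned pose during training. A minimal sketch under assumed keypoint indices, scale ranges, and drop probability follows; the paper's actual transform set is richer.

```python
# Sketch of EPI-style pose rescaling as training-time augmentation
# (indices, scale ranges, and the drop probability are assumptions).
import numpy as np

def rescale_pose(kpts, head_idx, limb_idx, root_idx, lam=0.98, rng=None):
    """kpts: (K, 2) array of 2D keypoints, already realigned to an anchor pose."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() > lam:                 # with probability 1 - lambda, leave the pose untouched
        return kpts
    out = kpts.copy()
    root = out[root_idx].copy()            # scaling origin, e.g. a neck/pelvis joint
    # enlarge or shrink the head region relative to the root
    out[head_idx] = root + (out[head_idx] - root) * rng.uniform(0.5, 2.0)
    # stretch or shorten limb keypoints relative to the root (a crude stand-in
    # for per-bone limb-length changes)
    out[limb_idx] = root + (out[limb_idx] - root) * rng.uniform(0.7, 1.3)
    # occasionally drop some limb joints to mimic missing body parts
    if rng.random() < 0.3:
        drop = rng.choice(limb_idx, size=max(1, len(limb_idx) // 4), replace=False)
        out[drop] = np.nan                 # marked as undetected; the pose renderer skips NaN joints
    return out
```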
The overall Animate-X framework integrates these components:
- A reference image I^r provides appearance features via a CLIP encoder (f_φ^r) and latent features via a VAE encoder (f_e^r).
- A driving video I_d^{1:F} provides motion information, processed by the IPI into f_i and by the EPI into f_e.
- The 3D-UNet diffusion model (ε_θ) takes the noised latent variable, the explicit pose feature f_e, and the reference latent feature f_e^r as input.
- Appearance conditioning (f_φ^r) and implicit motion conditioning (f_i) are injected via cross-attention within the UNet's Spatial Attention and a dedicated Motion Attention module, respectively.
- Temporal consistency is handled by Mamba-based Temporal Attention modules.
- The final denoised latent z_0 is decoded by the VAE decoder D to produce the output video.
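As a rough illustration of how these conditions come together, here is a simplified single-step sketch. The channel concatenation of latents, the `spatial_context`/`motion_context` argument names, and the overall interface are assumptions made for readability; the actual 3D-UNet blocks and Mamba-based temporal layers are not reproduced.

```python
# Simplified sketch of one conditioned denoising step (interface and the
# channel-concatenation choice are assumptions, not the released code).
import torch

def denoise_step(unet, z_t, t, f_e, f_e_r, f_phi_r, f_i):
    """
    z_t:     noised video latent              (B, C, F, H, W)
    f_e:     explicit pose feature from EPI   (B, C_p, F, H, W)
    f_e_r:   reference VAE latent             (B, C, 1, H, W)
    f_phi_r: reference CLIP appearance tokens (B, N_r, D)
    f_i:     implicit motion tokens from IPI  (B, N_i, D)
    """
    # pose and reference latents enter through the UNet input
    # (channel concat, a common design choice for latent-space conditioning)
    ref = f_e_r.expand(-1, -1, z_t.size(2), -1, -1)   # repeat the reference latent over frames
    x = torch.cat([z_t, f_e, ref], dim=1)
    # appearance (Spatial Attention) and implicit motion (Motion Attention)
    # conditions are injected via cross-attention inside the UNet blocks
    return unet(x, t, spatial_context=f_phi_r, motion_context=f_i)   # predicted noise eps_theta
```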
To evaluate performance, particularly on anthropomorphic characters, the authors introduce a new benchmark, A²Bench (Animated Anthropomorphic Benchmark). This benchmark contains 500 anthropomorphic character images and corresponding dance videos, generated using GPT-4 for prompts and KLing AI for image/video synthesis.
Experiments demonstrate Animate-X's superiority over state-of-the-art methods (Animate Anyone, UniAnimate, MimicMotion, ControlNeXt, MusePose) on both A²Bench and standard human datasets (TikTok, Fashion). Quantitative results show better performance across metrics such as PSNR, SSIM, LPIPS, FID, FID-VID, and FVD, especially in a challenging cross-driven setting designed for A²Bench in which poses are intentionally misaligned. Qualitative results and a user study further confirm that Animate-X excels at preserving character identity while achieving consistent and expressive motion, particularly for non-human characters where other methods often fail or introduce artifacts. Ablation studies validate the effectiveness of both the IPI and EPI components.
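For reference, the per-frame metrics reported above can be computed with torchmetrics as sketched below (a tooling assumption, since the paper does not publish its evaluation script); video-level FID-VID and FVD require dedicated video feature extractors and are omitted.

```python
# Sketch of frame-level metric computation with torchmetrics
# (assumes frames are float tensors in [0, 1] of shape (N, 3, H, W), N >= 2).
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity
from torchmetrics.image.fid import FrechetInceptionDistance

def evaluate_frames(pred, gt):
    psnr = PeakSignalNoiseRatio(data_range=1.0)
    ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
    lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
    fid = FrechetInceptionDistance(feature=2048, normalize=True)
    fid.update(gt, real=True)
    fid.update(pred, real=False)
    return {
        "PSNR": psnr(pred, gt).item(),
        "SSIM": ssim(pred, gt).item(),
        "LPIPS": lpips(pred, gt).item(),
        "FID": fid.compute().item(),
    }
```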
The main contributions are:
- Animate-X, a universal animation framework generalizing to diverse characters (X), including anthropomorphic ones.
- The Pose Indicator (IPI + EPI) for enhanced, robust motion representation handling misalignment.
- The A²Bench dataset for evaluating anthropomorphic character animation.
Limitations include insufficient modeling of hands and faces and the lack of real-time capability. Future work aims to address these issues and to explore character-environment interactions. Ethical considerations regarding potential misuse for generating misleading content are also acknowledged.