Champ: 3D-Guided Human Image Animation
- The paper presents a novel latent diffusion framework that integrates SMPL-based 3D parametric guidance for accurate, shape-preserving human image animation.
- It employs multi-modal rendering and spatial fusion techniques to encode depth, normals, semantic maps, and skeleton cues for enhanced motion control and temporal consistency.
- Champ achieves state-of-the-art performance on standard benchmarks while demonstrating strong zero-shot generalization across diverse animation scenarios.
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance is a latent diffusion-based human image animation framework that employs a 3D human parametric model (SMPL) in conjunction with multi-modal conditioning and a dedicated spatial fusion module to achieve high-fidelity, shape-aligned, and temporally consistent video synthesis from a single reference image and a driving video. By explicitly encoding shape and pose signals from SMPL depth/normal/semantic renderings and auxiliary skeleton keypoints, Champ outperforms prior art on standard benchmarks and demonstrates strong zero-shot generalization across diverse domains (Zhu et al., 2024).
1. Core Objectives and Problem Formulation
Champ addresses the task of synthesizing a video in which the subject of a single reference image $I_{\text{ref}}$ is animated to follow the motion of a driving sequence $I^{1:N}_{\text{drv}}$, while strictly preserving appearance, identity, and detailed body shape. The system aims to ensure reliable animation control (faithful pose transfer), temporal consistency, and robustness to shape discrepancies between reference and driver. This is accomplished by integrating 3D model-based guidance with a powerful latent diffusion model, enhancing motion control well beyond what is achievable with 2D poses or dense control signals alone.
The full Champ animation pipeline consists of the following stages (sketched in code after the list):
- SMPL fitting: a 3D body model is fitted to $I_{\text{ref}}$ and to each frame of the driving sequence.
- Parametric shape alignment: all driving frames adopt the reference shape for consistent geometry, i.e., each frame's SMPL parameters $(\beta_{\text{drv}}, \theta_t)$ are replaced by $(\beta_{\text{ref}}, \theta_t)$.
- Multi-modal rendering: for every aligned mesh, Champ renders a depth map $D_t$, a normal map $N_t$, and a semantic part map $S_t$, and extracts 2D skeleton keypoints $K_t$.
- Guidance encoding and multi-modal fusion: each modality is encoded with a lightweight CNN and merged via spatial self-attention, producing fused frame-level guidance $F_t$.
- Latent diffusion-based synthesis: a UNet with temporal and cross-attention layers is conditioned on $I_{\text{ref}}$ (encoded by ReferenceNet), the fused guidance $F_t$, and (optionally) a CLIP embedding of the reference image, with frames decoded via a VAE.
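The stages above can be summarized in a short orchestration sketch. This is not the authors' code: the SMPL fitter, pose detector, renderer, guidance encoder, diffusion model, and VAE are passed in as abstract components with hypothetical interfaces, purely to show the data flow.

```python
def animate(reference_image, driving_frames, smpl_fitter, pose_detector,
            renderer, guidance_encoder, diffusion_model, vae):
    """End-to-end data flow of the Champ pipeline (hypothetical component interfaces)."""
    # 1. SMPL fitting: recover (shape, pose) for the reference and for every driving frame.
    beta_ref, _theta_ref = smpl_fitter(reference_image)
    driving_params = [smpl_fitter(frame) for frame in driving_frames]

    per_frame_guidance = []
    for frame, (_beta_drv, theta_t) in zip(driving_frames, driving_params):
        # 2. Parametric shape alignment: keep the driver's pose, adopt the reference shape.
        mesh_t = renderer.build_mesh(beta_ref, theta_t)

        # 3. Multi-modal rendering from the aligned mesh, plus 2D skeleton keypoints
        #    extracted from the driving frame (e.g., with a pose detector such as DWPose).
        depth_t, normal_t, semantic_t = renderer.render(mesh_t)
        skeleton_t = pose_detector(frame)

        # 4. Guidance encoding and fusion into a single frame-level feature map F_t.
        per_frame_guidance.append(
            guidance_encoder(depth_t, normal_t, semantic_t, skeleton_t)
        )

    # 5. Latent diffusion synthesis conditioned on the reference image and fused guidance,
    #    then decoding of each latent frame back to pixel space with the VAE.
    latents = diffusion_model.sample(reference_image, per_frame_guidance)
    return [vae.decode(z) for z in latents]
```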
2. Latent Diffusion Model Formulation
Champ’s backbone follows a latent diffusion paradigm:
- Forward process (latent noise injection):
  $$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
  Here, $z_0$ is the VAE latent of the target frame, and $c_{\text{ref}} = \mathcal{E}(I_{\text{ref}})$ is the VAE-encoded reference image (with $c_{\text{ref}}$ controlling the output's identity and appearance).
- Reverse/denoising: synthesis is performed by a UNet-based noise predictor $\epsilon_\theta(z_t, t, c_{\text{ref}}, F_t)$, which estimates and removes the injected noise, with explicit cross-attention to the conditioning signals.
- Training Objective: the loss is the weighted diffusion MSE
  $$\mathcal{L} = \mathbb{E}_{z_0,\, c,\, \epsilon,\, t}\left[\lambda_t \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2\right]$$
  where $c$ collects the reference and guidance conditions and $\lambda_t$ is a per-timestep weight.
- Sampling and conditioning: synthesis in deployment uses SDE/ODE solvers (e.g., DPM-Solver), fusing the guidance $F_t$ at each layer of the temporal UNet alongside the reference CLIP embedding and temporal features.
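As a concrete illustration of the objective above, here is a minimal PyTorch sketch of one training step: it forms the noisy latent $z_t$ and computes the noise-prediction MSE. The `unet` callable and the `cond` argument are assumed interfaces for illustration, not the paper's actual module signatures.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(unet, z0, cond, alphas_cumprod):
    """One epsilon-prediction training step for the latent diffusion objective.

    unet:            callable (z_t, t, cond) -> predicted noise, same shape as z0
    z0:              VAE latents of the target frames, shape (B, C, H, W)
    cond:            conditioning (reference features, fused guidance F_t, ...)
    alphas_cumprod:  cumulative product of the noise-schedule alphas, shape (T,)
    """
    alphas_cumprod = alphas_cumprod.to(z0.device)
    B, T = z0.shape[0], alphas_cumprod.shape[0]

    # Sample a random timestep per example and Gaussian noise.
    t = torch.randint(0, T, (B,), device=z0.device)
    eps = torch.randn_like(z0)

    # Forward process: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps
    abar_t = alphas_cumprod[t].view(B, 1, 1, 1)
    z_t = abar_t.sqrt() * z0 + (1.0 - abar_t).sqrt() * eps

    # The UNet predicts the injected noise from the noisy latent, timestep, and conditions.
    eps_pred = unet(z_t, t, cond)

    # Diffusion MSE (uniform weighting shown; a per-timestep weight can be folded in here).
    return F.mse_loss(eps_pred, eps)
```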
3. Multi-Modal SMPL-Based Motion Guidance
Champ leverages the SMPL parametric human body model, parameterized by shape $\beta$ and pose $\theta$, to produce full-body 3D meshes that serve as a consistent, geometry-aware motion guide. To transfer the driver's motion while retaining the reference's body shape, Champ enforces "parametric shape alignment," re-posing the reference shape $\beta_{\text{ref}}$ with each driving pose $\theta_t$ and rendering, for each frame:
- Depth map $D_t$
- Normal map $N_t$
- Semantic part map $S_t$ (e.g., body part segmentation)
- Auxiliary skeleton keypoints $K_t$ (for detail in hands and face)
All signals are independently encoded (via lightweight guidance encoders) and passed through spatial self-attention layers, producing per-frame features $\hat{F}^i_t$ for each modality $i$. These are fused by summation:
$$F_t = \sum_{i \in \{\text{depth},\, \text{normal},\, \text{semantic},\, \text{skeleton}\}} \hat{F}^i_t$$
The fused guidance $F_t$ steers the denoising network and enforces faithful geometry/motion transfer, while identity and detailed shape remain strictly referenced from $I_{\text{ref}}$.
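To make the shape-alignment step concrete, here is a minimal sketch assuming a generic SMPL layer callable (for example, a thin wrapper around the `smplx` package); fitting and rendering are left abstract.

```python
import torch

def align_and_pose(beta_ref: torch.Tensor, driving_poses, smpl_layer):
    """Parametric shape alignment: re-pose the reference body shape with the driver's motion.

    beta_ref:       reference shape parameters, e.g. shape (1, 10)
    driving_poses:  per-frame pose parameters theta_t, each e.g. shape (1, 72)
    smpl_layer:     callable mapping (beta, theta) -> mesh vertices
    """
    meshes = []
    for theta_t in driving_poses:
        # The driver's own shape is discarded; every frame uses (beta_ref, theta_t),
        # so the rendered depth/normal/semantic maps carry the reference geometry.
        meshes.append(smpl_layer(beta_ref, theta_t))
    return meshes
```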
4. Multi-Layer Motion Fusion and UNet Integration
To maximize information integration and spatial selectivity, Champ employs a motion fusion module at multiple resolutions:
- Each guidance signal is encoded with zero-initialized convolutions, followed by spatial self-attention of the standard form
  $$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
  applied over the spatial locations of the modality's feature map, yielding attended features $\hat{F}^i_t$.
- All encoded modalities are summed to obtain $F_t = \sum_i \hat{F}^i_t$, which is added to the noisy latents $z_t$ at every scale, prior to each UNet denoising block.
This design allows the model to dynamically reweight the influence of each guidance modality for every spatial location, yielding robustness against signal ambiguities and maximizing controllability.
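The following PyTorch sketch illustrates one plausible reading of this fusion module: a zero-initialized projection per modality, spatial self-attention over flattened locations, and summation across modalities. Layer sizes and the exact attention placement are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class GuidanceFusion(nn.Module):
    """Fuse per-modality guidance feature maps into a single map F_t (illustrative)."""

    def __init__(self, channels: int, num_modalities: int, num_heads: int = 4):
        super().__init__()
        # One zero-initialized 1x1 projection per modality, so guidance starts as a no-op.
        self.zero_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_modalities)
        )
        for conv in self.zero_convs:
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)
        # Spatial self-attention over flattened H*W locations, shared across modalities.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, modality_feats):
        """modality_feats: list of tensors, each of shape (B, C, H, W)."""
        fused = 0.0
        for conv, feat in zip(self.zero_convs, modality_feats):
            x = conv(feat)                                   # zero-initialized projection
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C) spatial tokens
            attended, _ = self.attn(tokens, tokens, tokens)  # spatial self-attention
            fused = fused + attended.transpose(1, 2).reshape(b, c, h, w)
        return fused  # F_t, added to the noisy latents before each UNet block
```

Because the per-modality projections start at zero, the guidance branch initially contributes nothing and is learned gradually, a common device (as in ControlNet-style conditioning) for attaching new control signals to a pretrained diffusion UNet.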
5. Training Protocol, Data, and Losses
Champ is trained in two stages:
Stage 1 (image-level):
- VAE and CLIP encoders are kept frozen.
- Train guidance encoder, UNet, and ReferenceNet using random pairs from the same video.
- Resolution: 768×768.
Stage 2 (video-level):
- All Stage 1 weights are frozen except the temporal layers, which are trained on 24-frame clips (temporal aggregation).
- Temporal layers initialized from AnimateDiff.
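A rough sketch of the Stage 2 setup under these constraints, assuming the temporal modules can be identified by a name substring (illustrative, not the repository's actual attribute names):

```python
import torch

def configure_stage2(model: torch.nn.Module, lr: float = 1e-5):
    """Stage 2: freeze all Stage 1 weights; train only the temporal (motion) layers."""
    for name, param in model.named_parameters():
        # Keep only temporal-attention parameters trainable (name filter is illustrative).
        param.requires_grad = "temporal" in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)  # learning rate is illustrative
```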
Training relies solely on the weighted diffusion MSE loss; no adversarial, reconstruction, or auxiliary losses are used, with the per-timestep weighting $\lambda_t$ balancing the contribution of different noise levels.
Dataset: 5,000 real-world videos collected from five video sources (Bilibili, TikTok, Kuaishou, YouTube, Xiaohongshu), with extensive variation in age, pose, body type, and background.
6. Quantitative Results, Ablations, and Efficiency
Champ achieves state-of-the-art results on TikTok and a custom “in-the-wild” test set, according to standard metrics:
| Method | L1↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID-VID↓ | FVD↓ |
|---|---|---|---|---|---|---|
| MRAA | 3.21e-4 | 29.39 | 0.672 | 0.296 | 54.47 | 284.82 |
| DisCo | 3.78e-4 | 29.03 | 0.668 | 0.292 | 59.90 | 292.80 |
| MagicAnimate | 3.13e-4 | 29.16 | 0.714 | 0.239 | 21.75 | 179.07 |
| AnimateAnyone | – | 29.56 | 0.718 | 0.285 | – | 171.90 |
| Champ | 3.02e-4 | 29.84 | 0.773 | 0.235 | 26.14 | 170.20 |
| Champ* (fine-tuned) | 2.94e-4 | 29.91 | 0.802 | 0.234 | 21.07 | 160.82 |
Key ablations illustrate:
- Removing SMPL and relying solely on skeletons degrades both fidelity and consistency (FVD rises from 170.20 to 192.34).
- Omitting geometric guidance also drops metrics by a substantial margin.
- The spatial self-attention module on fused guidance yields a 16% FID-VID improvement.
- Without parametric shape alignment, Champ cannot correctly preserve extreme reference shapes under highly dissimilar driving motion.
Efficiency: shape transfer requires 3.24 GB of GPU memory and 0.06 s per frame; multi-modal rendering, 2.86 GB and 0.07 s per frame; diffusion inference, 19.83 GB and 0.52 s per frame.
7. Generalization, Limitations, and Directions
Generalization: Champ attains strong zero-shot transfer, outperforming baselines (MagicAnimate, AnimateAnyone) on both the standard benchmark and the in-the-wild test set.
Limitations:
- The SMPL model lacks granularity for hands and face, so DWPose skeletons are needed for fine articulation.
- Mismatches may occur between SMPL and auxiliary cues, particularly when body/limb occlusions or non-standard garments are present.
Future work:
- Incorporate higher-fidelity parametric models for hand and face.
- Co-train SMPL and skeleton guidance to minimize inconsistencies.
- Explore differentiable rendering pipelines and end-to-end learning for the 3D guidance encoder.
8. Context within Animation Research
Champ represents a paradigm shift in human image animation by fusing the explicit control and geometry-awareness of 3D parametric models with the flexibility and expressivity of latent diffusion generative modeling. Relative to earlier approaches relying exclusively on 2D keypoints or dense control signals, Champ's shape alignment procedure avoids appearance distortion under body shape mismatch. However, methods such as DisPose (Li et al., 2024) critique Champ’s reliance on dense geometry for its risk of over-constraining results when reference and driver differ substantially in shape. Subsequent advances explore plug-and-play hybrid architectures and self-supervised dense correspondence, but Champ remains a canonical example of the benefits of 3D parametric guidance in achieving controllable, consistent, and high-fidelity human image animation (Zhu et al., 2024).