Champ: 3D-Guided Human Image Animation
- The paper presents a novel latent diffusion framework that integrates SMPL-based 3D parametric guidance for accurate, shape-preserving human image animation.
- It employs multi-modal rendering and spatial fusion techniques to encode depth, normals, semantic maps, and skeleton cues for enhanced motion control and temporal consistency.
- Champ achieves state-of-the-art performance on standard benchmarks while demonstrating strong zero-shot generalization across diverse animation scenarios.
Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance is a latent diffusion-based human image animation framework that employs a 3D human parametric model (SMPL) in conjunction with multi-modal conditioning and a dedicated spatial fusion module to achieve high-fidelity, shape-aligned, and temporally consistent video synthesis from a single reference image and a driving video. By explicitly encoding shape and pose signals from SMPL depth/normal/semantic renderings and auxiliary skeleton keypoints, Champ outperforms prior art on standard benchmarks and demonstrates strong zero-shot generalization across diverse domains (Zhu et al., 2024).
1. Core Objectives and Problem Formulation
Champ addresses the task of synthesizing a video in which the subject of a single reference image $I_{\text{ref}}$ is animated to follow the motion of a driving sequence $I^{1:N}_{\text{drv}}$, while strictly preserving appearance, identity, and detailed body shape. The system aims to ensure reliable animation control (faithful pose transfer), temporal consistency, and robustness to shape discrepancies between reference and driver. This is accomplished by integrating 3D model-based guidance with a powerful latent diffusion model, enhancing motion control well beyond what is achievable with 2D poses or dense control signals alone.
The full Champ animation pipeline consists of the following stages (sketched in code after the list):
- SMPL fitting: a 3D body model is fitted to $I_{\text{ref}}$ and to each frame of the driving sequence.
- Parametric shape alignment: all driving frames adopt the reference shape for consistent geometry, i.e., each frame's SMPL parameters $(\beta_{\text{drv}}, \theta_t)$ are replaced by $(\beta_{\text{ref}}, \theta_t)$.
- Multi-modal rendering: for every aligned mesh, Champ renders a depth map $D_t$, a normal map $N_t$, and a semantic part map $S_t$, and extracts 2D skeleton keypoints $K_t$.
- Guidance encoding and multi-modal fusion: each modality is encoded with a lightweight CNN and merged via spatial self-attention, producing fused frame-level guidance $F_t$.
- Latent diffusion-based synthesis: a UNet with temporal and cross-attention layers is conditioned on $I_{\text{ref}}$ (encoded by ReferenceNet), the fused guidance $F_t$, and (optionally) a CLIP embedding of the reference image, with frames decoded via a VAE.
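The stages above can be summarized in a short orchestration sketch. This is not the authors' code: the SMPL fitter, pose detector, renderer, guidance encoder, diffusion model, and VAE are passed in as abstract components with hypothetical interfaces, purely to show the data flow.

```python
def animate(reference_image, driving_frames, smpl_fitter, pose_detector,
            renderer, guidance_encoder, diffusion_model, vae):
    """End-to-end data flow of the Champ pipeline (hypothetical component interfaces)."""
    # 1. SMPL fitting: recover (shape, pose) for the reference and for every driving frame.
    beta_ref, _theta_ref = smpl_fitter(reference_image)
    driving_params = [smpl_fitter(frame) for frame in driving_frames]

    per_frame_guidance = []
    for frame, (_beta_drv, theta_t) in zip(driving_frames, driving_params):
        # 2. Parametric shape alignment: keep the driver's pose, adopt the reference shape.
        mesh_t = renderer.build_mesh(beta_ref, theta_t)

        # 3. Multi-modal rendering from the aligned mesh, plus 2D skeleton keypoints
        #    extracted from the driving frame (e.g., with a pose detector such as DWPose).
        depth_t, normal_t, semantic_t = renderer.render(mesh_t)
        skeleton_t = pose_detector(frame)

        # 4. Guidance encoding and fusion into a single frame-level feature map F_t.
        per_frame_guidance.append(
            guidance_encoder(depth_t, normal_t, semantic_t, skeleton_t)
        )

    # 5. Latent diffusion synthesis conditioned on the reference image and fused guidance,
    #    then decoding of each latent frame back to pixel space with the VAE.
    latents = diffusion_model.sample(reference_image, per_frame_guidance)
    return [vae.decode(z) for z in latents]
```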
2. Latent Diffusion Model Formulation
Champ’s backbone follows a latent diffusion paradigm:
- Forward process (latent noise injection):
  $$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
  Here, $z_0$ is the VAE latent of the target frame, and $c_{\text{ref}} = \mathcal{E}(I_{\text{ref}})$ is the VAE-encoded reference image (with $c_{\text{ref}}$ controlling the output's identity and appearance).
- Reverse/denoising: synthesis is performed by a UNet-based noise predictor $\epsilon_\theta(z_t, t, c_{\text{ref}}, F_t)$, which estimates and removes the injected noise, with explicit cross-attention to the conditioning signals.
- Training Objective: the loss is the weighted diffusion MSE
  $$\mathcal{L} = \mathbb{E}_{z_0,\, c,\, \epsilon,\, t}\left[\lambda_t \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2\right]$$
  where $c$ collects the reference and guidance conditions and $\lambda_t$ is a per-timestep weight.
- Sampling and conditioning: synthesis in deployment uses SDE/ODE solvers (e.g., DPM-Solver), fusing the guidance $F_t$ at each layer of the temporal UNet alongside the reference CLIP embedding and temporal features.
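As a concrete illustration of the objective above, here is a minimal PyTorch sketch of one training step: it forms the noisy latent $z_t$ and computes the noise-prediction MSE. The `unet` callable and the `cond` argument are assumed interfaces for illustration, not the paper's actual module signatures.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(unet, z0, cond, alphas_cumprod):
    """One epsilon-prediction training step for the latent diffusion objective.

    unet:            callable (z_t, t, cond) -> predicted noise, same shape as z0
    z0:              VAE latents of the target frames, shape (B, C, H, W)
    cond:            conditioning (reference features, fused guidance F_t, ...)
    alphas_cumprod:  cumulative product of the noise-schedule alphas, shape (T,)
    """
    alphas_cumprod = alphas_cumprod.to(z0.device)
    B, T = z0.shape[0], alphas_cumprod.shape[0]

    # Sample a random timestep per example and Gaussian noise.
    t = torch.randint(0, T, (B,), device=z0.device)
    eps = torch.randn_like(z0)

    # Forward process: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps
    abar_t = alphas_cumprod[t].view(B, 1, 1, 1)
    z_t = abar_t.sqrt() * z0 + (1.0 - abar_t).sqrt() * eps

    # The UNet predicts the injected noise from the noisy latent, timestep, and conditions.
    eps_pred = unet(z_t, t, cond)

    # Diffusion MSE (uniform weighting shown; a per-timestep weight can be folded in here).
    return F.mse_loss(eps_pred, eps)
```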
3. Multi-Modal SMPL-Based Motion Guidance
Champ leverages the SMPL parametric human body model, parameterized by shape $\beta$ and pose $\theta$, to produce full-body 3D meshes that serve as a consistent, geometry-aware motion guide. To transfer the driver's motion while retaining the reference's body shape, Champ enforces "parametric shape alignment," re-posing the reference shape $\beta_{\text{ref}}$ with each driving pose $\theta_t$ and rendering, for each frame:
- Depth map $D_t$
- Normal map $N_t$
- Semantic part map $S_t$ (e.g., body part segmentation)
- Auxiliary skeleton keypoints $K_t$ (for detail in hands and face)
All signals are independently encoded (via lightweight guidance encoders) and passed through spatial self-attention layers, producing per-frame features $\hat{F}^i_t$ for each modality $i$. These are fused by summation:
$$F_t = \sum_{i \in \{\text{depth},\, \text{normal},\, \text{semantic},\, \text{skeleton}\}} \hat{F}^i_t$$
The fused guidance $F_t$ steers the denoising network and enforces faithful geometry/motion transfer, while identity and detailed shape remain strictly referenced from $I_{\text{ref}}$.
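To make the shape-alignment step concrete, here is a minimal sketch assuming a generic SMPL layer callable (for example, a thin wrapper around the `smplx` package); fitting and rendering are left abstract.

```python
import torch

def align_and_pose(beta_ref: torch.Tensor, driving_poses, smpl_layer):
    """Parametric shape alignment: re-pose the reference body shape with the driver's motion.

    beta_ref:       reference shape parameters, e.g. shape (1, 10)
    driving_poses:  per-frame pose parameters theta_t, each e.g. shape (1, 72)
    smpl_layer:     callable mapping (beta, theta) -> mesh vertices
    """
    meshes = []
    for theta_t in driving_poses:
        # The driver's own shape is discarded; every frame uses (beta_ref, theta_t),
        # so the rendered depth/normal/semantic maps carry the reference geometry.
        meshes.append(smpl_layer(beta_ref, theta_t))
    return meshes
```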
4. Multi-Layer Motion Fusion and UNet Integration
To maximize information integration and spatial selectivity, Champ employs a motion fusion module at multiple resolutions:
- Each guidance signal is encoded with zero-initialized convolutions, followed by spatial self-attention of the standard form
  $$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V$$
  applied over the spatial locations of the modality's feature map, yielding attended features $\hat{F}^i_t$.
- All encoded modalities are summed to obtain $F_t = \sum_i \hat{F}^i_t$, which is added to the noisy latents $z_t$ at every scale, prior to each UNet denoising block.
This design allows the model to dynamically reweight the influence of each guidance modality for every spatial location, yielding robustness against signal ambiguities and maximizing controllability.
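The following PyTorch sketch illustrates one plausible reading of this fusion module: a zero-initialized projection per modality, spatial self-attention over flattened locations, and summation across modalities. Layer sizes and the exact attention placement are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class GuidanceFusion(nn.Module):
    """Fuse per-modality guidance feature maps into a single map F_t (illustrative)."""

    def __init__(self, channels: int, num_modalities: int, num_heads: int = 4):
        super().__init__()
        # One zero-initialized 1x1 projection per modality, so guidance starts as a no-op.
        self.zero_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1) for _ in range(num_modalities)
        )
        for conv in self.zero_convs:
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)
        # Spatial self-attention over flattened H*W locations, shared across modalities.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, modality_feats):
        """modality_feats: list of tensors, each of shape (B, C, H, W)."""
        fused = 0.0
        for conv, feat in zip(self.zero_convs, modality_feats):
            x = conv(feat)                                   # zero-initialized projection
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C) spatial tokens
            attended, _ = self.attn(tokens, tokens, tokens)  # spatial self-attention
            fused = fused + attended.transpose(1, 2).reshape(b, c, h, w)
        return fused  # F_t, added to the noisy latents before each UNet block
```

Because the per-modality projections start at zero, the guidance branch initially contributes nothing and is learned gradually, a common device (as in ControlNet-style conditioning) for attaching new control signals to a pretrained diffusion UNet.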
5. Training Protocol, Data, and Losses
Champ is trained in two stages:
Stage 1 (image-level):
- VAE and CLIP encoders are kept frozen.
- Train guidance encoder, UNet, and ReferenceNet using random pairs from the same video.
- Resolution: 768×768.
Stage 2 (video-level):
- All Stage 1 weights are frozen except the temporal layers, which are trained on 24-frame clips (temporal aggregation).
- Temporal layers initialized from AnimateDiff.
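A rough sketch of the Stage 2 setup under these constraints, assuming the temporal modules can be identified by a name substring (illustrative, not the repository's actual attribute names):

```python
import torch

def configure_stage2(model: torch.nn.Module, lr: float = 1e-5):
    """Stage 2: freeze all Stage 1 weights; train only the temporal (motion) layers."""
    for name, param in model.named_parameters():
        # Keep only temporal-attention parameters trainable (name filter is illustrative).
        param.requires_grad = "temporal" in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)  # learning rate is illustrative
```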
Training relies solely on the weighted diffusion MSE loss; no adversarial, reconstruction, or auxiliary losses are used, with the per-timestep weighting $\lambda_t$ balancing the contribution of different noise levels.
Dataset: 5,000 real-world videos collected from five video sources (Bilibili, TikTok, Kuaishou, YouTube, Xiaohongshu), with extensive variation in age, pose, body type, and background.
6. Quantitative Results, Ablations, and Efficiency
Champ achieves state-of-the-art results on TikTok and a custom “in-the-wild” test set, according to standard metrics:
| Method | L1↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID-VID↓ | FVD↓ |
|---|---|---|---|---|---|---|
| MRAA | 3.21e-4 | 29.39 | 0.672 | 0.296 | 54.47 | 284.82 |
| DisCo | 3.78e-4 | 29.03 | 0.668 | 0.292 | 59.90 | 292.80 |
| MagicAnimate | 3.13e-4 | 29.16 | 0.714 | 0.239 | 21.75 | 179.07 |
| AnimateAnyone | – | 29.56 | 0.718 | 0.285 | – | 171.90 |
| Champ | 3.02e-4 | 29.84 | 0.773 | 0.235 | 26.14 | 170.20 |
| Champ* (fine-tuned) | 2.94e-4 | 29.91 | 0.802 | 0.234 | 21.07 | 160.82 |
Key ablations illustrate:
- Removing SMPL and relying solely on skeletons degrades both fidelity and consistency (FVD rises from 170.20 to 192.34).
- Omitting geometric guidance also drops metrics by a substantial margin.
- The spatial self-attention module on fused guidance yields a 16% FID-VID improvement.
- Without parametric shape alignment, Champ cannot correctly preserve extreme reference shapes under highly dissimilar driving motion.
Efficiency: shape transfer requires 3.24 GB of GPU memory and 0.06 s per frame; multi-modal rendering, 2.86 GB and 0.07 s per frame; diffusion inference, 19.83 GB and 0.52 s per frame.
7. Generalization, Limitations, and Directions
Generalization: Champ attains strong zero-shot transfer, outperforming baselines (MagicAnimate, AnimateAnyone) on both the standard benchmark and the in-the-wild test set.
Limitations:
- The SMPL model lacks granularity for hands and face, so DWPose skeletons are needed for fine articulation.
- Mismatches may occur between SMPL and auxiliary cues, particularly when body/limb occlusions or non-standard garments are present.
Future work:
- Incorporate higher-fidelity parametric models for hand and face.
- Co-train SMPL and skeleton guidance to minimize inconsistencies.
- Explore differentiable rendering pipelines and end-to-end learning for the 3D guidance encoder.
8. Context within Animation Research
Champ represents a paradigm shift in human image animation by fusing the explicit control and geometry-awareness of 3D parametric models with the flexibility and expressivity of latent diffusion generative modeling. Relative to earlier approaches relying exclusively on 2D keypoints or dense control signals, Champ's shape alignment procedure avoids appearance distortion under body shape mismatch. However, methods such as DisPose (Li et al., 2024) critique Champ’s reliance on dense geometry for its risk of over-constraining results when reference and driver differ substantially in shape. Subsequent advances explore plug-and-play hybrid architectures and self-supervised dense correspondence, but Champ remains a canonical example of the benefits of 3D parametric guidance in achieving controllable, consistent, and high-fidelity human image animation (Zhu et al., 2024).