
Champ: 3D-Guided Human Image Animation

Updated 8 January 2026
  • The paper presents a novel latent diffusion framework that integrates SMPL-based 3D parametric guidance for accurate, shape-preserving human image animation.
  • It employs multi-modal rendering and spatial fusion techniques to encode depth, normals, semantic maps, and skeleton cues for enhanced motion control and temporal consistency.
  • Champ achieves state-of-the-art performance on standard benchmarks while demonstrating strong zero-shot generalization across diverse animation scenarios.

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance is a latent diffusion-based human image animation framework that employs a 3D human parametric model (SMPL) in conjunction with multi-modal conditioning and a dedicated spatial fusion module to achieve high-fidelity, shape-aligned, and temporally consistent video synthesis from a single reference image and a driving video. By explicitly encoding shape and pose signals from SMPL depth/normal/semantic renderings and auxiliary skeleton keypoints, Champ outperforms prior art on standard benchmarks and demonstrates strong zero-shot generalization across diverse domains (Zhu et al., 2024).

1. Core Objectives and Problem Formulation

Champ addresses the task of synthesizing a video $\{\hat I^i\}_{i=1}^N$ in which a subject depicted in a single reference image $I_\mathrm{ref}$ is animated to follow the motion of a driving sequence $I^{1:N}$, with strict preservation of appearance, identity, and detailed shape. The system aims to ensure reliable animation control (faithful pose transfer), temporal consistency, and robustness to shape discrepancies between reference and driver. This is accomplished by integrating 3D model-based guidance with a powerful latent diffusion model, enhancing motion control far beyond what is achievable with 2D poses or dense control signals alone.

The full Champ animation pipeline consists of:

  1. SMPL fitting to $I_\mathrm{ref}$ and each $I^i$ in the driving sequence.
  2. Parametric shape alignment: all driving frames adopt the reference shape for consistent geometry, i.e., $H_\mathrm{trans}^i = \mathrm{SMPL}(\beta_\mathrm{ref}, \theta_\mathrm{m}^i)$.
  3. Multi-modal rendering: for every aligned mesh, Champ renders a depth map $D^i$, a normal map $N^i$, a semantic part map $S^i$, and extracts 2D skeleton keypoints $K^i$.
  4. Guidance encoding and multi-modal fusion: each modality is encoded with a lightweight CNN and merged via spatial self-attention, producing fused frame-level guidance $y^i$.
  5. Latent diffusion-based synthesis: a temporal cross-attention UNet is conditioned on the encoded $I_\mathrm{ref}$, $y^i$, and (optionally) CLIP text features, with frames decoded via a VAE (a minimal code sketch of the full pipeline follows this list).
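
The following sketch makes the data flow concrete. It is a hypothetical, minimal rendering of the pipeline in PyTorch-style Python; every callable (fit_smpl, smpl_model, render_guidance, guidance_encoder, reference_net, sample_video, vae) is a placeholder standing in for the corresponding Champ component, not the released implementation:

```python
import torch

def animate(ref_image, driving_frames, fit_smpl, smpl_model, render_guidance,
            guidance_encoder, reference_net, sample_video, vae):
    """Hypothetical Champ-style pipeline; every callable is a placeholder."""
    # 1. SMPL fitting: body shape beta and per-frame pose theta.
    beta_ref, _ = fit_smpl(ref_image)
    thetas = [fit_smpl(frame)[1] for frame in driving_frames]

    # 2. Parametric shape alignment: the reference shape drives every frame's mesh.
    meshes = [smpl_model(beta_ref, theta) for theta in thetas]

    # 3. Multi-modal rendering: depth, normal, semantic map, and skeleton per frame.
    guidance_maps = [render_guidance(mesh) for mesh in meshes]

    # 4. Encode each modality and fuse into frame-level guidance y^i.
    y = [guidance_encoder(g) for g in guidance_maps]

    # 5. Conditional latent diffusion, then per-frame VAE decoding.
    ref_features = reference_net(vae.encode(ref_image))
    latents = sample_video(y, ref_features)
    return torch.stack([vae.decode(z) for z in latents])
```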

2. Latent Diffusion Model Formulation

Champ’s backbone follows a latent diffusion paradigm:

  • Forward process (latent noise injection):

$$\mathbf z_t = \sqrt{\bar\alpha_t}\,\mathbf z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal N(\mathbf 0, \mathbf I).$$

Here, $\mathbf z_0$ is the VAE-encoded latent of the clean target frame; the reference image $I_\mathrm{ref}$ is supplied as a separate condition that controls the output's identity and appearance.
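
As a concrete illustration, the forward noising step can be written in a few lines of PyTorch; the linear $\beta$ schedule below is a generic DDPM-style assumption, not a detail taken from the paper:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed linear DDPM beta schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product \bar{alpha}_t

def q_sample(z0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Draw z_t ~ q(z_t | z_0) for a batch of latents z0 at integer timesteps t."""
    a = alpha_bar.to(z0.device)[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * noise
```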

  • Reverse/denoising: Synthesis is performed by a UNet-based predictor

$$\epsilon_\theta(\mathbf z_t,\, t,\, c_\mathrm{text},\, y)$$

which estimates and removes noise, with explicit cross-attention to conditioning signals.

  • Training objective: The loss is the weighted diffusion MSE (a code sketch follows this list)

$$\mathcal L_\mathrm{diff} = \mathbb{E}_{\mathbf z_0,\, c_\mathrm{text},\, \epsilon,\, t} \left[ \omega(t)\,\|\epsilon - \epsilon_\theta(\mathbf z_t, t, c_\mathrm{text}, y)\|^2 \right].$$

  • Sampling and conditioning: Synthesis in deployment uses SDE/ODE solvers (e.g., DPM-Solver), fusing the guidance $y^i$ at each layer of the temporal UNet alongside text and temporal features.
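
A minimal sketch of the corresponding training step, reusing `q_sample` and `T` from the snippet above; `eps_theta` stands in for the conditional UNet, and the uniform weighting $\omega(t)=1$ is an assumption for simplicity:

```python
def diffusion_loss(eps_theta, z0, y, c_text):
    """Epsilon-prediction MSE; uniform timestep weighting omega(t)=1 assumed."""
    B = z0.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)   # random timestep per sample
    noise = torch.randn_like(z0)                      # epsilon ~ N(0, I)
    z_t = q_sample(z0, t, noise)                      # forward process from above
    pred = eps_theta(z_t, t, c_text, y)               # UNet noise prediction
    return torch.mean((noise - pred) ** 2)            # MSE over batch and pixels
```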

3. Multi-Modal SMPL-Based Motion Guidance

Champ leverages the SMPL parametric human body model, parameterized by shape $\beta \in \mathbb R^{10}$ and pose $\theta \in \mathbb R^{24\times3\times3}$, to produce full-body 3D meshes that serve as a consistent, geometry-aware motion guide. To transfer the driver’s motion but retain the reference’s body shape, Champ enforces “parametric shape alignment,” rendering for each frame:

  • Depth map $D^i(u, v)$
  • Normal map $N^i(u, v)$
  • Semantic part map $S^i(u, v)$ (e.g., body-part segmentation)
  • Auxiliary skeleton keypoints $K^i$ (for detail in hands and face)

All signals are independently encoded (via $\mathcal{F}_m$) and passed through spatial self-attention layers, producing per-frame features $F_m^i$ for each modality. These are fused:

$$y^i = \sum_{m \in \{\text{depth},\,\text{normal},\,\text{sem},\,\text{skel}\}} F_m^i.$$

The fused $y^i$ guides the denoising network and enforces faithful geometry/motion transfer, while keeping identity and detailed shape strictly referenced from $I_\mathrm{ref}$.
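
For illustration, the parametric shape alignment underlying this guidance can be approximated with the publicly available smplx package; the model path, tensor shapes, and the split into body_pose/global_orient are assumptions about a typical setup, not the paper's exact fitting pipeline:

```python
import torch
import smplx

# Illustrative local path to the SMPL model files (an assumption, not from the paper).
smpl = smplx.create("models/smpl", model_type="smpl")

def aligned_mesh(beta_ref: torch.Tensor, body_pose_i: torch.Tensor,
                 global_orient_i: torch.Tensor) -> torch.Tensor:
    """H_trans^i: reference body shape combined with the driving pose of frame i."""
    out = smpl(betas=beta_ref,                 # (1, 10) shape coefficients of the reference
               body_pose=body_pose_i,          # (1, 69) axis-angle body pose of driving frame i
               global_orient=global_orient_i)  # (1, 3) global rotation of driving frame i
    return out.vertices                        # (1, 6890, 3) posed mesh vertices
```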

4. Multi-Layer Motion Fusion and UNet Integration

To maximize information integration and spatial selectivity, Champ employs a motion fusion module at multiple resolutions:

  • Each guidance signal is encoded with zero-initialized convolutions, followed by spatial self-attention of the form:

$$\mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V, \qquad Q, K, V = XW_{q,k,v}$$

  • All encoded modalities are summed to obtain $y^i$, which is added to the noisy latents $\mathbf z_t$ at every scale, prior to each UNet denoising block.

This design allows the model to dynamically reweight the influence of each guidance modality for every spatial location, yielding robustness against signal ambiguities and maximizing controllability.
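
A minimal PyTorch sketch of such a fusion module is given below; the single-head attention, channel widths, and the zero-initialized output projection are simplifying assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class GuidanceFusion(nn.Module):
    """Encode each guidance modality, apply spatial self-attention, and sum."""
    def __init__(self, in_channels=(1, 3, 3, 3), latent_channels=320):
        super().__init__()
        # One lightweight convolutional encoder per modality (depth, normal, sem, skel).
        self.encoders = nn.ModuleList(
            nn.Conv2d(c, latent_channels, 3, padding=1) for c in in_channels)
        self.attn = nn.MultiheadAttention(latent_channels, num_heads=1,
                                          batch_first=True)
        # Zero-initialized projection: guidance starts as a no-op at the beginning of training.
        self.out = nn.Conv2d(latent_channels, latent_channels, 1)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, modalities):                    # list of (B, C_m, H, W) maps
        fused = 0
        for enc, x in zip(self.encoders, modalities):
            f = enc(x)                                # lightweight CNN encoding F_m
            B, C, H, W = f.shape
            tokens = f.flatten(2).transpose(1, 2)     # (B, H*W, C) spatial tokens
            attn_out, _ = self.attn(tokens, tokens, tokens)
            f = attn_out.transpose(1, 2).view(B, C, H, W)
            fused = fused + f                         # y^i = sum_m F_m^i
        return self.out(fused)                        # added to z_t before each UNet block
```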

5. Training Protocol, Data, and Losses

Champ is trained in two stages:

Stage 1 (image-level):

  • VAE and CLIP encoders are kept frozen.
  • Train the guidance encoder, UNet, and ReferenceNet using random $(I_\mathrm{ref}, I_\mathrm{target})$ pairs from the same video.
  • Resolution: $768 \times 768$.

Stage 2 (video-level):

  • All Stage 1 weights are frozen; only the temporal layers are trained, on 24-frame clips, for temporal aggregation.
  • Temporal layers initialized from AnimateDiff.

Training relies solely on the weighted diffusion MSE loss; no adversarial, reconstruction, or auxiliary losses are used. The timestep weighting $\omega(t)$ balances the contribution of different noise levels.
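
The staged freezing can be expressed as a short parameter-selection sketch; module names such as guidance_encoder, reference_net, unet, and temporal_layers are placeholders, not identifiers from the released code:

```python
def stage_parameters(stage, vae, clip_encoder, guidance_encoder,
                     reference_net, unet, temporal_layers):
    """Return the parameters optimized in the given training stage."""
    # VAE and CLIP encoders stay frozen in both stages.
    for frozen in (vae, clip_encoder):
        frozen.requires_grad_(False)
    if stage == 1:
        # Image-level stage: spatial modules are trained.
        trainable = (guidance_encoder, reference_net, unet)
    else:
        # Video-level stage: Stage 1 modules are frozen; only the
        # AnimateDiff-initialized temporal layers are updated.
        for frozen in (guidance_encoder, reference_net, unet):
            frozen.requires_grad_(False)
        trainable = (temporal_layers,)
    return [p for module in trainable for p in module.parameters()]
```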

Dataset: 5,000 real-world videos collected from five video sources (Bilibili, TikTok, Kuaishou, YouTube, Xiaohongshu), with extensive variation in age, pose, body type, and background.

6. Quantitative Results, Ablations, and Efficiency

Champ achieves state-of-the-art results on TikTok and a custom “in-the-wild” test set, according to standard metrics:

| Method | L1↓ | PSNR↑ | SSIM↑ | LPIPS↓ | FID-VID↓ | FVD↓ |
|---|---|---|---|---|---|---|
| MRAA | 3.21e-4 | 29.39 | 0.672 | 0.296 | 54.47 | 284.82 |
| DisCo | 3.78e-4 | 29.03 | 0.668 | 0.292 | 59.90 | 292.80 |
| MagicAnimate | 3.13e-4 | 29.16 | 0.714 | 0.239 | 21.75 | 179.07 |
| AnimateAnyone | – | 29.56 | 0.718 | 0.285 | – | 171.90 |
| Champ | 3.02e-4 | 29.84 | 0.773 | 0.235 | 26.14 | 170.20 |
| Champ* (fine-tuned) | 2.94e-4 | 29.91 | 0.802 | 0.234 | 21.07 | 160.82 |

Key ablations illustrate:

  • Removing SMPL and relying solely on skeletons degrades both fidelity and consistency (FVD rises from 170.20 to 192.34).
  • Omitting geometric guidance also drops metrics by a substantial margin.
  • The spatial self-attention module on fused guidance yields a 16% FID-VID improvement.
  • Without parametric shape alignment, Champ cannot correctly preserve extreme reference shapes under highly dissimilar driving motion.

Efficiency (GPU memory and per-frame runtime): parametric shape transfer requires 3.24 GB and 0.06 s/frame; multi-modal rendering, 2.86 GB and 0.07 s/frame; diffusion inference, 19.83 GB and 0.52 s/frame.

7. Generalization, Limitations, and Directions

Generalization: Champ attains strong zero-shot transfer, outperforming baselines (MagicAnimate, AnimateAnyone) on standard and wild datasets.

Limitations:

  • The SMPL model lacks granularity for hands and face, so DWPose skeletons are needed for fine articulation.
  • Mismatches may occur between SMPL and auxiliary cues, particularly when body/limb occlusions or non-standard garments are present.

Future work:

  • Incorporate higher-fidelity parametric models for hand and face.
  • Co-train SMPL and skeleton guidance to minimize inconsistencies.
  • Explore differentiable rendering pipelines and end-to-end learning for the 3D guidance encoder.

8. Context within Animation Research

Champ represents a paradigm shift in human image animation by fusing the explicit control and geometry-awareness of 3D parametric models with the flexibility and expressivity of latent diffusion generative modeling. Relative to earlier approaches relying exclusively on 2D keypoints or dense control signals, Champ's shape alignment procedure avoids appearance distortion under body shape mismatch. However, methods such as DisPose (Li et al., 2024) critique Champ’s reliance on dense geometry for its risk of over-constraining results when reference and driver differ substantially in shape. Subsequent advances explore plug-and-play hybrid architectures and self-supervised dense correspondence, but Champ remains a canonical example of the benefits of 3D parametric guidance in achieving controllable, consistent, and high-fidelity human image animation (Zhu et al., 2024).
