
StableAnimator++: ID-Preserving Video Diffusion

Updated 20 March 2026
  • StableAnimator++ is a video diffusion framework that integrates a learnable pose alignment module and advanced face-content fusion to ensure robust identity preservation.
  • It employs a latent-space denoising diffusion model with transformer-based refinement, achieving high-quality results measured by PSNR, SSIM, LPIPS, CSIM, and FVD.
  • The system incorporates an inner HJB-inspired optimization loop during inference to enhance face fidelity, eliminating the need for post-processing face refinement.

StableAnimator++ is an ID-preserving video diffusion framework for human image animation. It introduces a fully differentiable, learnable pose alignment module and face-content fusion mechanisms to keep identity (ID) consistent, especially when the reference image and driving pose differ substantially in scale, rotation, or translation. Conditioned on a still reference image and a sequence of target poses, the system generates high-fidelity, temporally coherent animation videos without any post-processing or third-party face refinement (Tu et al., 20 Jul 2025).

1. Video Diffusion Backbone and Conditioning Scheme

StableAnimator++ is built on a latent-space video denoising diffusion probabilistic model (DDPM) that operates similarly to Stable Video Diffusion. Animation targets a sequence of $T$ frames, with corresponding latent representations $z_0 \in \mathbb{R}^{T \times C \times H \times W}$ obtained via a frozen VAE encoder. The driving pose sequence $P \in \mathbb{R}^{T \times 2 \times N}$ (with $N$ keypoints per frame) is processed by a dedicated PoseNet and injected into a U-Net $U_\theta$.

The diffusion process follows the standard Gaussian forward and reverse kernels:

  • Forward (noising):

$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I\right)$$

where $\{\beta_t\}$ is a fixed noise schedule.

  • Reverse (denoising):

$$p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\left(z_{t-1}; \mu_\theta(z_t, t), \Sigma_\theta(t)\right)$$

The model minimizes the standard denoising loss:

$$\mathcal{L}_{\text{denoise}} = \mathbb{E}_{z_0,\, \epsilon \sim \mathcal{N}(0, I),\, t} \left\|\epsilon - \epsilon_\theta(z_t, t, \text{cond})\right\|^2$$

where the conditioning signals (“cond”) include pose data, CLIP embeddings, and face embeddings.
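The forward kernel and the denoising loss above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the linear schedule values, the toy latent shape, and the function names are assumptions, and the "network" is a placeholder that predicts zero noise. The closed-form jump to step $t$ uses the standard DDPM identity $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s \le t} (1-\beta_s)$, and the loss is averaged over elements rather than summed.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical fixed noise schedule {beta_t}; the values are illustrative."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(z_prev, beta_t, rng):
    """One draw from q(z_t | z_{t-1}) = N(sqrt(1 - beta_t) z_{t-1}, beta_t I)."""
    noise = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * noise

def noise_to_step(z0, t, betas, rng):
    """Sample z_t directly from z_0 by composing the first t forward steps."""
    alpha_bar = np.prod(1.0 - betas[:t])
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    return z_t, eps

def denoise_loss(eps, eps_pred):
    """Squared error between true and predicted noise, averaged over elements."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
betas = linear_beta_schedule(1000)
z0 = rng.standard_normal((4, 4, 8, 8))  # toy latent with layout T x C x H x W
z_t, eps = noise_to_step(z0, t=500, betas=betas, rng=rng)

# Placeholder for epsilon_theta(z_t, t, cond): predicts zero noise everywhere.
eps_pred = np.zeros_like(eps)
loss = denoise_loss(eps, eps_pred)
```

In a real model, `eps_pred` would come from the U-Net evaluated on `z_t`, the timestep, and the conditioning signals.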

2. Learnable Pose Alignment Module

To address pose misalignment, where the driving pose and reference image differ notably in size, position, or rotation, StableAnimator++ introduces a differentiable alignment module. For each frame, the aim is to predict a 2×2 rotation-scale matrix $s' R'$ and a 2×1 translation $t'$ (an SE(2) motion augmented with a uniform scale) that align the driven pose to the reference.

Alignment proceeds in two stages:

  • SVD Guidance: Keypoints are centered and an optimal rotation $R$ and scaling factor $s$ are extracted via a one-shot SVD-based procedure (akin to Iterative Closest Point):
    • Compute keypoint means $\mu_r$, $\mu_d$ for reference and driven pose.
    • Calculate covariance $K$, perform SVD ($K = U \Sigma V^\top$, $R = V U^\top$), compute scale $s$ and translation $t$.
    • Obtain an initially aligned skeleton $\bar{P}_d = s R P_d + t$.
  • Learnable Refinement: The raw and SVD-aligned keypoints are concatenated and passed through a Transformer encoder, followed by PoseFusion attention and an MLP predicting $R'$, $s'$, $t'$. The final aligned pose is $P_d^{\rm align} = s' R' P_d + t'$. The module is trained with alignment supervision to robustly correct arbitrary geometric misalignments.
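The SVD guidance stage can be sketched as a closed-form similarity fit between two 2×N keypoint sets. The sketch below is an illustration under assumptions, not the paper's code: the function name `svd_align` is hypothetical, the covariance is taken as $K = \sum_i (p_{d,i} - \mu_d)(p_{r,i} - \mu_r)^\top$ so that $R = V U^\top$ matches the convention above, and a determinant check is added as a standard safeguard against reflections (the source does not mention it). The learned refinement stage is not covered.

```python
import numpy as np

def svd_align(p_d, p_r):
    """Fit (s, R, t) so that s * R @ p_d + t approximates p_r (both 2 x N)."""
    mu_d = p_d.mean(axis=1, keepdims=True)
    mu_r = p_r.mean(axis=1, keepdims=True)
    x_d, x_r = p_d - mu_d, p_r - mu_r          # centered keypoints
    k = x_d @ x_r.T                             # 2 x 2 cross-covariance (unnormalized)
    u, sing, vt = np.linalg.svd(k)              # k = u @ diag(sing) @ vt
    v = vt.T
    # Assumed safeguard: flip the last axis if V U^T would be a reflection.
    d = 1.0 if np.linalg.det(v @ u.T) >= 0 else -1.0
    corr = np.diag([1.0, d])
    rot = v @ corr @ u.T
    scale = float((sing * np.diag(corr)).sum() / (x_d ** 2).sum())
    trans = mu_r - scale * rot @ mu_d
    return scale, rot, trans

# Toy check: build a reference pose from a known transform and recover it.
rng = np.random.default_rng(0)
theta, s_true = 0.6, 1.7
r_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([[3.0], [-2.0]])
p_d = rng.standard_normal((2, 18))              # 18 keypoints, illustrative count
p_r = s_true * r_true @ p_d + t_true

scale, rot, trans = svd_align(p_d, p_r)
p_aligned = scale * rot @ p_d + trans           # the initially aligned skeleton
```

On noise-free data the fit recovers the generating transform exactly; with real detections it gives the least-squares initial alignment that the learned stage would then refine.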

3. Face and Image Embedding Strategies

Identity preservation requires robust embedding and fusion of face and global image information. StableAnimator++ extracts three parallel conditioning streams from the reference image $I_{\text{ref}}$:

  • Latent Image Content ($z_{\text{img}}$): Encoded by VAE; broadcast to all frames.
  • CLIP Image Embeddings ($f_{\text{img}}$): Tokenized by a CLIP image encoder ($E_{\text{img}}$), enabling semantic global content injection via cross-attention in every U-Net block.
  • Face Encoding ($f_{\text{face}}$): Computed with ArcFace; refined by a custom Global Content-Aware Face Encoder stacking $L_f$ cross-attention layers, thus allowing the face vector to integrate information about overall appearance (such as hair and lighting) prior to diffusion conditioning.
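To make the face-refinement idea concrete, the following sketch stacks a few single-head cross-attention layers in which face tokens act as queries and image tokens act as keys and values. Everything here is an assumption for illustration: the token counts, the feature width, the random projection weights, and the residual connection are not taken from the source, and no real ArcFace or CLIP model is called.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: queries from one stream, keys/values from another."""
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def refine_face_tokens(face_tokens, image_tokens, layers):
    """Apply the stacked layers; the residual connection is an assumption."""
    out = face_tokens
    for w_q, w_k, w_v in layers:
        attended, _ = cross_attention(out, image_tokens, w_q, w_k, w_v)
        out = out + attended
    return out

rng = np.random.default_rng(0)
dim, num_layers = 8, 2                            # illustrative sizes (L_f = 2)
face_tokens = rng.standard_normal((1, dim))       # stand-in for f_face
image_tokens = rng.standard_normal((5, dim))      # stand-in for f_img
layers = [tuple(0.1 * rng.standard_normal((dim, dim)) for _ in range(3))
          for _ in range(num_layers)]

refined = refine_face_tokens(face_tokens, image_tokens, layers)
_, attn = cross_attention(face_tokens, image_tokens, *layers[0])
```

Each row of `attn` shows how much the face token draws from each image token, which is the mechanism by which global appearance can enter the face representation.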

4. Distribution-Aware ID Adapter

Fusion of face identity vectors can be disrupted by temporal processing in the U-Net (such as 1D convolutions or temporal attention), which shifts feature distributions and undermines the semantic integrity of face information. To prevent this, an ID Adapter operates at every spatial block:

  • Compute cross-attended image and face features $z_i^{\text{img}}$ and $z_i^{\text{face}}$.
  • Calculate per-channel means and standard deviations ($\mu_{\text{img}}, \sigma_{\text{img}}$; $\mu_{\text{face}}, \sigma_{\text{face}}$).
  • Normalize face features to match the image distribution:

$$\bar{z}_i^{\text{face}} = \frac{z_i^{\text{face}} - \mu_{\text{face}}}{\sigma_{\text{face}}}\, \sigma_{\text{img}} + \mu_{\text{img}}$$

  • Fuse ($z_i^{\text{img}} + \bar{z}_i^{\text{face}}$) and propagate through temporal layers.
  • Optionally, an ID matching loss ($\mathcal{L}_{\text{ID}}$) aligns the distributions by minimizing KL divergence.
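The normalization and fusion steps amount to matching first and second moments per channel. Below is a minimal NumPy sketch under stated assumptions: features are laid out as (tokens, channels), statistics are taken over the token axis, and a small `eps` is added for numerical safety. None of these layout choices are confirmed by the source, and the function name is hypothetical.

```python
import numpy as np

def align_face_to_image(z_img, z_face, eps=1e-6):
    """Shift and rescale face features so their per-channel stats match the image's."""
    mu_img, sigma_img = z_img.mean(axis=0), z_img.std(axis=0)
    mu_face, sigma_face = z_face.mean(axis=0), z_face.std(axis=0)
    return (z_face - mu_face) / (sigma_face + eps) * sigma_img + mu_img

rng = np.random.default_rng(0)
# Toy features with deliberately different distributions (tokens x channels).
z_img = 2.0 * rng.standard_normal((64, 8)) + 1.0
z_face = 0.3 * rng.standard_normal((64, 8)) - 5.0

z_face_bar = align_face_to_image(z_img, z_face)
fused = z_img + z_face_bar   # fused features passed on to the temporal layers
```

After alignment, the face stream has the same per-channel mean and spread as the image stream, so the sum is not dominated by whichever stream happens to have the larger scale.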

5. HJB-Based Face Optimization in Inference

To further boost face fidelity and avoid post-processing face-swapping, StableAnimator++ employs an inner optimization loop inspired by the Hamilton-Jacobi-Bellman (HJB) optimal control principle, interleaved in the denoising steps during inference:

  • At each denoising iteration $i$, after computing the predicted latent $x_{\text{pred}}$, an inner $K$-step gradient descent is run to minimize the identity loss between the synthesized face and the reference (using ArcFace cosine similarity). The update is performed directly in latent space and the result replaces $x_{\text{pred}}$ in the usual denoising formula.
  • This approach formally parallels optimal trajectory steering with control $u_t = x_{\text{op}} - x_{\text{pred}}$, minimizing quadratic control effort and terminal ID mismatch.
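The inner loop can be illustrated with a toy stand-in for the face pipeline. In this sketch the latent is a flat vector and the "face encoder" is a fixed random linear map `w_face`, which replaces the real decode-and-ArcFace path purely so the gradient can be written analytically; the step count, step size, and all names are assumptions. It shows only the mechanics: a few gradient steps on the loss $1 - \cos(\cdot, \cdot)$, then the control $u = x_{\text{op}} - x_{\text{pred}}$.

```python
import numpy as np

def id_loss_and_grad(x, w_face, e_ref):
    """ID loss 1 - cos(W x, e_ref) and its gradient with respect to x."""
    f = w_face @ x
    f_norm, e_norm = np.linalg.norm(f), np.linalg.norm(e_ref)
    cos = f @ e_ref / (f_norm * e_norm)
    # Derivative of the cosine with respect to f, holding e_ref fixed.
    dcos_df = e_ref / (f_norm * e_norm) - cos * f / f_norm ** 2
    grad_x = -(w_face.T @ dcos_df)      # chain rule through the linear encoder
    return float(1.0 - cos), grad_x

def refine_prediction(x_pred, w_face, e_ref, num_steps=10, step_size=0.5):
    """Run K gradient-descent steps in latent space starting from x_pred."""
    x_op = x_pred.copy()
    for _ in range(num_steps):
        _, grad = id_loss_and_grad(x_op, w_face, e_ref)
        x_op = x_op - step_size * grad
    control = x_op - x_pred             # u = x_op - x_pred
    return x_op, control

rng = np.random.default_rng(0)
w_face = rng.standard_normal((16, 32))  # hypothetical linear face encoder
e_ref = rng.standard_normal(16)         # stand-in for the reference face embedding
x_pred = rng.standard_normal(32)        # stand-in for the predicted latent

loss_before, _ = id_loss_and_grad(x_pred, w_face, e_ref)
x_op, control = refine_prediction(x_pred, w_face, e_ref)
loss_after, _ = id_loss_and_grad(x_op, w_face, e_ref)
```

In the full sampler, `x_op` would replace the predicted latent before the next denoising update, so the identity correction accumulates across iterations rather than being applied once at the end.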

6. Training Objectives and Loss Functions

The system employs a compound loss:

  • Pose Alignment Pretraining:

$$\mathcal{L}_{\text{align}} = \mathbb{E}\left[\left\|P_d^{\text{gt}} - (s' R' P_d + t')\right\|_2\right]$$

  • Diffusion Reconstruction: Places increased emphasis on face pixels:

$$\mathcal{L}_{\text{rec}} = \mathbb{E}_{\epsilon, t}\left\|(z_0 - z_\epsilon)\odot(1 + M_{\text{face}})\right\|^2$$

  • (Optional) Distribution Alignment: KL divergence for ID distributions,

$$\mathcal{L}_{\text{ID}} = D_{\text{KL}}\big(\mathcal{N}(\mu_{\text{img}}, \sigma_{\text{img}}^2)\;\|\; \mathcal{N}(\mu_{\text{face}}, \sigma_{\text{face}}^2)\big)$$

The total objective is

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{align}} \mathcal{L}_{\text{align}} + \lambda_{\text{ID}} \mathcal{L}_{\text{ID}}$$

with recommended $\lambda_{\text{align}} \approx 1$, $\lambda_{\text{ID}} \approx 0.1$ (the ID term is sometimes omitted).
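The three terms and their weighted sum can be written out directly. The sketch below is a hedged illustration: the function names and toy shapes are assumptions, the reconstruction term is averaged over elements rather than summed, and the KL term uses the standard closed form for diagonal Gaussians, $\log(\sigma_q/\sigma_p) + \frac{\sigma_p^2 + (\mu_p - \mu_q)^2}{2\sigma_q^2} - \frac{1}{2}$, summed over channels.

```python
import numpy as np

def align_loss(p_gt, p_d, scale, rot, trans):
    """L2 norm of the residual between ground-truth and transformed keypoints (2 x N)."""
    residual = p_gt - (scale * rot @ p_d + trans)
    return float(np.linalg.norm(residual))

def rec_loss(z0, z_eps, face_mask):
    """Reconstruction error with face positions up-weighted by (1 + M_face)."""
    return float(np.mean(((z0 - z_eps) * (1.0 + face_mask)) ** 2))

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ), summed over channels."""
    return float(np.sum(
        np.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    ))

def total_loss(l_rec, l_align, l_id, lam_align=1.0, lam_id=0.1):
    """Weighted sum using the weights quoted in the text as defaults."""
    return l_rec + lam_align * l_align + lam_id * l_id

rng = np.random.default_rng(0)
z0 = rng.standard_normal((2, 4, 8, 8))
z_eps = z0 + 0.1 * rng.standard_normal(z0.shape)
face_mask = np.zeros_like(z0)
face_mask[..., 2:5, 2:5] = 1.0                      # toy face region

p_d = rng.standard_normal((2, 18))
p_gt = p_d + 0.05                                   # toy ground truth, slightly shifted
l_align = align_loss(p_gt, p_d, 1.0, np.eye(2), np.zeros((2, 1)))
l_rec = rec_loss(z0, z_eps, face_mask)
l_id = gaussian_kl(np.zeros(8), np.ones(8), 0.2 * np.ones(8), 1.5 * np.ones(8))
loss = total_loss(l_rec, l_align, l_id)
```

Setting `lam_id=0.0` reproduces the variant in which the distribution-alignment term is omitted.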

7. Empirical Results and Significance

Extensive evaluation compares StableAnimator++ to prior works such as Animate-X across metrics including PSNR, SSIM, LPIPS, CSIM (face ID similarity), and FVD. Experiments span the TikTok test set (about 10k frames) and the challenging MisAlign100 (100 misaligned video clips):

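| Metric | StableAnimator++ (TikTok) | StableAnimator++ (MisAlign100) | Animate-X (TikTok / MisAlign100) |
|---|---|---|---|
| PSNR ↑ | 30.8 | 30.2 | 30.8 / 26.8 |
| SSIM ↑ | 0.816 | 0.709 | 0.806 / 0.512 |
| LPIPS ↓ | 0.230 | 0.375 | 0.232 / 0.429 |
| CSIM (ID) ↑ | 0.831 | 0.802 | 0.475 / 0.391 |
| FVD ↓ | 122.5 | 384.3 | 139.0 / 675.3 |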

StableAnimator++ demonstrates significant improvements in identity preservation (CSIM roughly doubles on MisAlign100 relative to Animate-X, from 0.391 to 0.802), robustness to pose discrepancies, and overall quality. Ablation studies confirm that each architectural module contributes. On MisAlign100, omitting pose alignment reduces CSIM from 0.802 to 0.448; removing the Face Encoder yields a CSIM of 0.572; removing the ID Adapter increases FVD to 587; and eliminating HJB optimization both decreases CSIM and increases FVD. Memory usage averages 11.4 GB, with inference runtimes of approximately 84 seconds for 16 frames using 8×A100 GPUs.

User studies indicate preference for StableAnimator++ in 88–96% (motion alignment), 92–99% (appearance fidelity), and 90–93% (background alignment) of severely misaligned cases.

In summary, StableAnimator++ constitutes the first fully end-to-end, ID-preserving animation diffusion framework to integrate learnable geometric alignment, deep identity fusion, and direct face refinement within the generation process, establishing new empirical benchmarks for difficult human animation scenarios marked by strong pose misalignment and appearance shifts (Tu et al., 20 Jul 2025).

References (1)
