StableAnimator++: ID-Preserving Video Diffusion

Updated 20 March 2026

StableAnimator++ is a video diffusion framework that integrates a learnable pose alignment module and advanced face-content fusion to ensure robust identity preservation.
It employs a latent-space denoising diffusion model with transformer-based refinement, achieving high-quality results measured by PSNR, SSIM, LPIPS, CSIM, and FVD.
The system incorporates an inner HJB-inspired optimization loop during inference to enhance face fidelity, eliminating the need for post-processing face refinement.

StableAnimator++ is an ID-preserving video diffusion framework for human image animation that introduces a fully differentiable, learnable pose alignment module and advanced face-content fusion mechanisms to overcome longstanding challenges in identity (ID) consistency, especially under severe misalignments in scale, rotation, and translation between the reference image and driving pose. The system integrates a series of novel architectural and algorithmic innovations to generate high-fidelity, temporally coherent animation videos conditioned on a still reference image and a sequence of target poses, without resorting to any post-processing or third-party face refinement steps (Tu et al., 20 Jul 2025).

1. Video Diffusion Backbone and Conditioning Scheme

StableAnimator++ is built on a latent-space video denoising diffusion probabilistic model (DDPM) that operates similarly to Stable Video Diffusion. Animation targets a sequence of $T$ frames, with corresponding latent representations $z_0 \in \mathbb{R}^{T \times C \times H \times W}$ obtained via a frozen VAE encoder. The driving pose sequence $P \in \mathbb{R}^{T \times 2 \times N}$ (with $N$ keypoints per frame) is processed by a dedicated PoseNet and injected into a U-Net $U_\theta$.

The diffusion process follows the standard Gaussian forward and reverse kernels:

Forward (noising):

$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t; \sqrt{1-\beta_t}z_{t-1}, \beta_t I\right)$

where $\{\beta_t\}$ is a fixed noise schedule.

Reverse (denoising):

$p_\theta(z_{t-1}|z_t)=\mathcal{N}\left(z_{t-1}; \mu_\theta(z_t, t), \Sigma_\theta(t)\right)$

The model minimizes the standard denoising loss:

$\mathcal{L}_{\text{denoise}} = \mathbb{E}_{z_0, \epsilon \sim \mathcal{N}(0, I), t} \left\|\epsilon - \epsilon_\theta(z_t, t, \text{cond})\right\|^2$

where the conditioning signals (“cond”) include pose data, CLIP embeddings, and face embeddings.
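
As a minimal sketch of the noising step and the denoising loss, the snippet below uses the closed form $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$, which follows from composing the per-step forward kernel. The linear schedule, the tensor sizes, and the function names are illustrative assumptions rather than values from the paper, and the loss is averaged per element instead of summed.

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative product of (1 - beta_t), i.e. alpha_bar_t."""
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, eps):
    """Draw z_t from q(z_t | z_0) in closed form for a given noise sample eps."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def denoise_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear schedule
alpha_bar = make_alpha_bar(betas)
z0 = rng.standard_normal((4, 4, 8, 8))   # toy latent of shape (T, C, H, W)
eps = rng.standard_normal(z0.shape)
z_t = q_sample(z0, 500, alpha_bar, eps)  # noised latent at step 500
```

In training, `eps_pred` would come from the conditioned U-Net $\epsilon_\theta(z_t, t, \text{cond})$; here it is left as an input so the sketch stays self-contained.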

2. Learnable Pose Alignment Module

To address pose misalignment, where the driving pose and reference image differ notably in size, position, or rotation, StableAnimator++ introduces a differentiable alignment module. For each frame, the aim is to predict a 2×2 rotation $R'$, a scale $s'$, and a 2×1 translation $t'$, which together form a 2D similarity transform (SE(2) extended with uniform scaling) that aligns the driven pose to the reference.

Alignment proceeds in two stages:

SVD Guidance: Keypoints are centered and an optimal rotation $R$ and scaling factor $s$ are extracted via a one-shot SVD-based procedure (akin to Iterative Closest Point):
- Compute keypoint means $\mu_r$, $\mu_d$ for the reference and driven pose.
- Calculate the covariance $K$, perform SVD ($K = U \Sigma V^\top$, $R = V U^\top$), and compute the scale $s$ and translation $t$.
- Obtain an initially aligned skeleton $\bar{P}_d = s R P_d + t$.

Learnable Refinement: The raw and SVD-aligned keypoints are concatenated and passed through a Transformer encoder, followed by PoseFusion attention and an MLP predicting $R'$, $s'$, $t'$. The final aligned pose is $P_d^{\rm align} = s' R' P_d + t'$. The module is trained with alignment supervision to robustly correct arbitrary geometric misalignments.
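
The SVD guidance stage can be sketched as a closed-form similarity alignment (a Procrustes/Umeyama-style solve). This is a sketch under assumptions, not the paper's code: the function name `svd_align`, the unnormalized covariance $K = X_d X_r^\top$, the least-squares scale formula, and the reflection guard are choices made here for illustration. Keypoints are stored as $2 \times N$ arrays to match the pose tensor layout.

```python
import numpy as np

def svd_align(P_d, P_r):
    """Estimate s, R, t such that s * R @ P_d + t approximates P_r (both 2 x N)."""
    mu_d = P_d.mean(axis=1, keepdims=True)   # driven-pose centroid
    mu_r = P_r.mean(axis=1, keepdims=True)   # reference centroid
    X_d, X_r = P_d - mu_d, P_r - mu_r        # centered keypoints
    K = X_d @ X_r.T                          # 2 x 2 cross-covariance (unnormalized)
    U, S, Vt = np.linalg.svd(K)              # K = U diag(S) Vt
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against a reflection (assumption)
    D = np.diag([1.0, d])
    R = Vt.T @ D @ U.T                       # R = V U^T, as in the text
    s = (S * np.diag(D)).sum() / (X_d ** 2).sum()  # least-squares scale
    t = mu_r - s * R @ mu_d                  # translation mapping centroids together
    return s, R, t

# Toy check: recover a known transform from 18 synthetic keypoints.
rng = np.random.default_rng(0)
theta, s0 = 0.7, 1.8
R0 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])
t0 = np.array([[3.0], [-2.0]])
P_d = rng.standard_normal((2, 18))
P_r = s0 * R0 @ P_d + t0
s, R, t = svd_align(P_d, P_r)
P_bar = s * R @ P_d + t                      # initially aligned skeleton
```

In the full method this estimate only serves as guidance: the learnable refinement stage then predicts $R'$, $s'$, $t'$ from the raw and SVD-aligned keypoints.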

3. Face and Image Embedding Strategies

Identity preservation requires robust embedding and fusion of face and global image information. StableAnimator++ extracts three parallel conditioning streams from the reference image $I_{\text{ref}}$ :

- Latent Image Content ($z_{\text{img}}$): Encoded by the VAE; broadcast to all frames.
- CLIP Image Embeddings ($f_{\text{img}}$): Tokenized by a CLIP image encoder ($E_{\text{img}}$), enabling semantic global content injection via cross-attention in every U-Net block.
- Face Encoding ($f_{\text{face}}$): Computed with ArcFace; refined by a custom Global Content-Aware Face Encoder stacking $L_f$ cross-attention layers, thus allowing the face vector to integrate information about overall appearance (such as hair and lighting) prior to diffusion conditioning.

4. Distribution-Aware ID Adapter

Fusion of face identity vectors can be disrupted by temporal processing in the U-Net (such as 1D convolutions or temporal attention), which shifts feature distributions and undermines the semantic integrity of face information. To prevent this, an ID Adapter operates at every spatial block:

- Compute cross-attended image and face features ($z_i^{\text{img}}$, $z_i^{\text{face}}$).
- Calculate per-channel means and standard deviations ($\mu_{\text{img}}, \sigma_{\text{img}}$; $\mu_{\text{face}}, \sigma_{\text{face}}$).
- Normalize face features to match the image distribution:

$\bar{z}_i^{\text{face}} = \frac{z_i^{\text{face}} - \mu_{\text{face}}}{\sigma_{\text{face}}}\sigma_{\text{img}} + \mu_{\text{img}}$

- Fuse ($z_i^{\text{img}} + \bar{z}_i^{\text{face}}$) and propagate through temporal layers.
- Optionally, an ID matching loss ($\mathcal{L}_{\text{ID}}$) aligns the distributions by minimizing KL divergence.
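
The normalize-and-fuse steps above amount to a per-channel statistic transfer, which can be sketched as follows. The (tokens, channels) layout, the small stabilizing constant `eps`, and the function name `align_face_to_image` are assumptions for illustration; the actual feature shapes inside the U-Net are not specified here.

```python
import numpy as np

def align_face_to_image(z_face, z_img, eps=1e-6):
    """Shift and scale face features so each channel matches the image statistics."""
    mu_f = z_face.mean(axis=0, keepdims=True)
    sd_f = z_face.std(axis=0, keepdims=True)
    mu_i = z_img.mean(axis=0, keepdims=True)
    sd_i = z_img.std(axis=0, keepdims=True)
    z_face_bar = (z_face - mu_f) / (sd_f + eps) * sd_i + mu_i
    fused = z_img + z_face_bar               # fusion passed on to temporal layers
    return fused, z_face_bar

# Toy features with deliberately different distributions: 64 tokens, 8 channels.
rng = np.random.default_rng(0)
z_img = 2.0 + 3.0 * rng.standard_normal((64, 8))
z_face = -1.0 + 0.5 * rng.standard_normal((64, 8))
fused, z_face_bar = align_face_to_image(z_face, z_img)
```

After the transfer, `z_face_bar` has the same per-channel mean and (up to `eps`) the same standard deviation as `z_img`, so the sum is not dominated by either stream.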

5. HJB-Based Face Optimization in Inference

To further boost face fidelity and avoid post-processing face-swapping, StableAnimator++ employs an inner optimization loop inspired by the Hamilton-Jacobi-Bellman (HJB) optimal control principle, interleaved in the denoising steps during inference:

- At each denoising iteration $i$, after computing the predicted latent $x_{\text{pred}}$, an inner $K$-step gradient descent is run to minimize the identity loss between the synthesized face and the reference (using ArcFace cosine similarity). The update is performed directly in latent space, and the optimized latent $x_{\text{op}}$ replaces $x_{\text{pred}}$ in the usual denoising formula.
- This approach formally parallels optimal trajectory steering with control $u_t = x_{\text{op}} - x_{\text{pred}}$, minimizing quadratic control effort and terminal ID mismatch.
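
A toy version of the inner loop is sketched below. It is only an illustration of $K$-step gradient descent on a cosine identity loss in latent space: the fixed linear map `W` is a hypothetical stand-in for decoding the latent and running ArcFace, the step size and $K$ are arbitrary, and the gradient is written out analytically instead of using automatic differentiation.

```python
import numpy as np

def id_loss_and_grad(x, W, e_ref):
    """Return 1 - cos(W x, e_ref) and its gradient with respect to x."""
    f = W @ x                                       # stand-in face embedding
    nf, ne = np.linalg.norm(f), np.linalg.norm(e_ref)
    cos = float(f @ e_ref / (nf * ne))
    dcos_df = e_ref / (nf * ne) - cos * f / nf**2   # derivative of cosine w.r.t. f
    return 1.0 - cos, -(W.T @ dcos_df)              # chain rule back to the latent

def refine_latent(x_pred, W, e_ref, K=20, lr=0.05):
    """Run K gradient steps from x_pred; return x_op and the control u."""
    x = x_pred.copy()
    for _ in range(K):
        _, g = id_loss_and_grad(x, W, e_ref)
        x -= lr * g
    return x, x - x_pred                            # u = x_op - x_pred

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))                    # hypothetical encoder weights
e_ref = rng.standard_normal(8)                      # reference identity embedding
x_pred = rng.standard_normal(16)                    # predicted latent (flattened toy)
x_op, u = refine_latent(x_pred, W, e_ref)
loss_before, _ = id_loss_and_grad(x_pred, W, e_ref)
loss_after, _ = id_loss_and_grad(x_op, W, e_ref)
```

In the actual sampler, `x_op` would be substituted for `x_pred` before the next denoising update; the sketch stops at producing `x_op` and `u`.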

6. Training Objectives and Loss Functions

The system employs a compound loss:

Pose Alignment Pretraining:

$\mathcal{L}_{\text{align}} = \mathbb{E}\left[\|P_d^{\text{gt}} - (s' R' P_d + t')\|_2\right]$

Diffusion Reconstruction: Places increased emphasis on face pixels:

$\mathcal{L}_{\text{rec}} = \mathbb{E}_{\epsilon, t}\left\|(z_0 - z_\epsilon)\odot(1 + M_{\text{face}})\right\|^2$

(Optional) Distribution Alignment: KL divergence for ID distributions,

$\mathcal{L}_{\text{ID}} = D_{\text{KL}}\big(\mathcal{N}(\mu_{\text{img}}, \sigma_{\text{img}}^2)\;||\; \mathcal{N}(\mu_{\text{face}}, \sigma_{\text{face}}^2)\big)$

The total objective is

$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{align}} \mathcal{L}_{\text{align}} + \lambda_{\text{ID}} \mathcal{L}_{\text{ID}}$

with recommended $\lambda_{\text{align}} \approx 1$ , $\lambda_{\text{ID}} \approx 0.1$ (sometimes omitted).
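
For concreteness, the face-weighted reconstruction term, the Gaussian KL term, and the weighted sum can be sketched as below. The KL uses the standard closed form for univariate Gaussians, summed over channels. The function names, the name `z_hat` for the denoised latent, and the per-element averaging of the reconstruction term are assumptions made for this illustration.

```python
import numpy as np

def face_weighted_rec_loss(z0, z_hat, face_mask):
    """Squared error with face positions (mask = 1) weighted by a factor of 2."""
    return float(np.mean(((z0 - z_hat) * (1.0 + face_mask)) ** 2))

def gaussian_kl(mu_p, sd_p, mu_q, sd_q):
    """KL(N(mu_p, sd_p^2) || N(mu_q, sd_q^2)), summed over channels."""
    kl = (np.log(sd_q / sd_p)
          + (sd_p**2 + (mu_p - mu_q) ** 2) / (2.0 * sd_q**2)
          - 0.5)
    return float(np.sum(kl))

def total_loss(l_rec, l_align, l_id, lam_align=1.0, lam_id=0.1):
    """Weighted sum of the three terms with the weights quoted in the text."""
    return l_rec + lam_align * l_align + lam_id * l_id
```

Because the mask enters as $(1 + M_{\text{face}})$ before squaring, an error at a face position contributes four times as much as the same error elsewhere.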

7. Empirical Results and Significance

Extensive evaluation compares StableAnimator++ to prior works such as Animate-X across metrics including PSNR, SSIM, LPIPS, CSIM (face ID similarity), and FVD. Experiments span the TikTok test set ($\sim$10k frames) and the challenging MisAlign100 (100 misaligned video clips):

Metric	StableAnimator++ (TikTok)	StableAnimator++ (MisAlign100)	Animate-X (TikTok / MisAlign100)
PSNR ↑	30.8	30.2	30.8 / 26.8
SSIM ↑	0.816	0.709	0.806 / 0.512
LPIPS ↓	0.230	0.375	0.232 / 0.429
CSIM (ID) ↑	0.831	0.802	0.475 / 0.391
FVD ↓	122.5	384.3	139.0 / 675.3

StableAnimator++ demonstrates significant improvements in identity preservation (on MisAlign100, CSIM reaches 0.802 versus 0.391 for Animate-X, roughly a twofold gain), robustness to pose discrepancies, and overall quality. Ablation studies confirm the crucial role of each architectural module. For example, on MisAlign100, omitting pose alignment reduces CSIM from 0.802 to 0.448; removing the Face Encoder yields a CSIM of 0.572; removing the ID Adapter increases FVD to 587; and eliminating HJB optimization both decreases CSIM and increases FVD. Memory usage averages 11.4 GB, with inference runtimes of approximately 84 seconds for 16 frames using 8×A100 GPUs.

User studies indicate preference for StableAnimator++ in 88–96% (motion alignment), 92–99% (appearance fidelity), and 90–93% (background alignment) of severely misaligned cases.

In summary, StableAnimator++ constitutes the first fully end-to-end, ID-preserving animation diffusion framework to integrate learnable geometric alignment, deep identity fusion, and direct face refinement within the generation process, establishing new empirical benchmarks for difficult human animation scenarios marked by strong pose misalignment and appearance shifts (Tu et al., 20 Jul 2025).

References (1)

StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation (2025)
