
FactorPortrait: Controllable Portrait Animation

Updated 19 December 2025
  • FactorPortrait is a video diffusion-based system for portrait animation that disentangles control signals of facial expression, head pose, and camera viewpoint.
  • It employs a transformer-based architecture with explicit identity, pose, and camera encoding and an AdaLN-based expression controller to maintain temporal coherence and photorealism.
  • The approach leverages large-scale real and synthetic video datasets with a progressive training curriculum to achieve state-of-the-art performance and robustness in animation control.

FactorPortrait is a video diffusion-based system for highly controllable portrait animation, enabling frame-accurate synthesis of human face and head dynamics from a single reference image, jointly driven by disentangled control signals for facial expression, head pose, and camera viewpoint. The method employs a transformer-based video diffusion architecture that integrates explicit identity, pose, and camera encodings with a novel expression-conditioning mechanism, and it is trained on large-scale real and synthetic video datasets to achieve state-of-the-art realism, view consistency, and controllability (Tang et al., 12 Dec 2025).

1. Disentanglement Framework for Portrait Animation

FactorPortrait addresses the task of animating a reference portrait image $I$ by transferring dynamic facial expressions and head poses from a driving video $D$, along with a user-specified camera trajectory $C$. The principal technical challenge is to disentangle and independently control three factors:

  • Expression: Fine facial dynamics, such as micro-expressions, mouth interior, and wrinkles.
  • Head pose: 3D head and upper-body orientation.
  • Camera viewpoint: Explicit novel view synthesis.

This disentanglement is explicitly operationalized via learned latents (for expression), rendered normal maps (for 3D pose), and Plücker ray maps (for camera view), ensuring that manipulation of one factor does not inadvertently alter others or corrupt identity. The generative process maintains high-frequency detail and temporal coherence across the output video (Tang et al., 12 Dec 2025).

2. Model Architecture and Conditioning Mechanisms

2.1 Video Diffusion Backbone

The architecture centers on a transformer-based video diffusion network, specifically a Wan-DiT backbone, which encodes an input video $V \in \mathbb{R}^{T \times H \times W \times 3}$ into a latent tensor $z \in \mathbb{R}^{l \times h \times w \times c}$ via a causal VAE encoder $E_\text{VAE}$, where $l = \lfloor (T+3)/4 \rfloor$, $h = \lfloor H/8 \rfloor$, $w = \lfloor W/8 \rfloor$, and $c = 16$. The training procedure perturbs $z$ with Gaussian noise and optimizes a flow-matching or DDPM-style denoising objective in latent space.
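As a quick check of the latent-grid arithmetic, the sketch below evaluates the compression factors for an 81-frame clip (the longest curriculum stage); the 480 × 832 spatial resolution is an illustrative assumption, not a figure from the paper.

```python
# Worked example of the causal-VAE compression factors quoted above.
def latent_shape(T: int, H: int, W: int, c: int = 16):
    l = (T + 3) // 4   # temporal compression: roughly 4 input frames per latent step
    h = H // 8         # 8x spatial downsampling
    w = W // 8
    return (l, h, w, c)

# 81-frame clip at an assumed 480 x 832 resolution -> (21, 60, 104, 16)
print(latent_shape(81, 480, 832))
```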

2.2 Disentangled Control Signals

The network receives four distinct conditioning sources:

  • Identity: $z_I = E_\text{VAE}(I)$, encoding the static reference image.
  • Head pose: A per-frame 3D mesh $\mathcal{M}_{D_i}$ is estimated from $D_i$ and rendered as a normal map $N_{D_i} \in \mathbb{R}^{H \times W \times 3}$.
  • Camera view: The relative pose $\pi_i$ yields a per-pixel Plücker ray map $R_i \in \mathbb{R}^{H \times W \times 6}$ (a minimal construction is sketched after this list).
  • Expression: A pre-trained ResNet34-based encoder produces dense 128-dimensional latent codes $E_{D_i}$, which capture expression information while remaining invariant to identity and rigid head pose.
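The Plücker ray parameterization referenced in the camera bullet can be made concrete. Assuming a pinhole camera with intrinsics $K$, camera-to-world rotation $R$, and camera center $t$ (conventions chosen here for illustration), each pixel is encoded by its unit ray direction and its moment about the origin, yielding the 6-channel map; the sketch below is not the paper's implementation.

```python
import torch

def plucker_ray_map(K: torch.Tensor, R: torch.Tensor, t: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Minimal sketch of a per-pixel Pluecker ray map in R^{H x W x 6}.

    K: (3, 3) pinhole intrinsics; R: (3, 3) camera-to-world rotation; t: (3,) camera center.
    Each pixel is encoded by its unit ray direction d and moment o x d (o = camera center).
    """
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)   # (H, W, 3) homogeneous pixels
    dirs_cam = pix @ torch.linalg.inv(K).T                                 # back-project pixels to camera rays
    dirs_world = dirs_cam @ R.T                                            # rotate ray directions into world frame
    d = dirs_world / dirs_world.norm(dim=-1, keepdim=True)                 # unit direction per pixel
    o = t.expand_as(d)                                                     # camera center broadcast per pixel
    m = torch.cross(o, d, dim=-1)                                          # moment vector o x d
    return torch.cat([d, m], dim=-1)                                       # (H, W, 6)
```

Because the moment depends on the camera center while the direction depends on orientation, the 6-channel map encodes the full camera pose densely at every pixel.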

A specialized condition fusion layer concatenates these sources temporally and channel-wise. During synthesis, spatial maps for pose/view are downsampled via the VAE encoder to align with the latent grid. Reference frames for dynamic conditions are zeroed during inference.
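A minimal sketch of the channel-wise fusion just described, assuming all streams have already been mapped onto the latent grid (identity latent, VAE-encoded normal maps, and resampled ray maps); the exact channel layout and the reference-frame masking below are assumptions consistent with the text, not the released implementation.

```python
import torch

def fuse_conditions(z_noisy, z_identity, z_normals, z_rays, is_reference):
    """Sketch: concatenate denoising latents with conditioning latents channel-wise.

    z_noisy:      (l, c, h, w)   noisy video latents
    z_identity:   (1, c, h, w)   VAE latent of the reference image, broadcast over time
    z_normals:    (l, c, h, w)   VAE-encoded normal-map renderings (head pose)
    z_rays:       (l, c_r, h, w) Pluecker ray maps resampled to the latent grid
    is_reference: (l,) bool      frames whose dynamic conditions are zeroed at inference
    """
    mask = (~is_reference).float().view(-1, 1, 1, 1)          # zero dynamic conditions on reference frames
    ident = z_identity.expand(z_noisy.shape[0], -1, -1, -1)   # repeat identity latent over time
    return torch.cat([z_noisy, ident, z_normals * mask, z_rays * mask], dim=1)
```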

2.3 Expression Controller and AdaLN Integration

A two-layer self-attention mechanism aggregates temporal dependencies from per-frame expression codes, forming embeddings $e_j$ that are grouped and added to the DiT transformer's shared timestep embedding $t \in \mathbb{R}^d$, resulting in a frame-specific timestep $t_j = t + e_j$. Each DiT transformer block employs an Adaptive LayerNorm (AdaLN) parameterized by small MLPs on $t_j$:

$$\text{AdaLN}(z_j, t_j) = \gamma(t_j) \odot \frac{z_j - \mu(z_j)}{\sigma(z_j)} + \beta(t_j)$$

This mechanism enables precise per-frame modulation of features according to the expression latents, with temporal consistency enforced by the self-attention layers.
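The expression pathway described in this subsection can be summarized with a compact PyTorch sketch: per-frame codes are temporally aggregated by a small self-attention stack, added to the shared timestep embedding, and mapped to the AdaLN scale and shift. Layer widths, head counts, and the treatment of each 128-dimensional code as a single vector per frame are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ExpressionAdaLN(nn.Module):
    """Sketch of the AdaLN-based expression controller (sizes are assumptions)."""

    def __init__(self, d_model: int = 1024, d_expr: int = 128, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_expr, d_model)
        # two self-attention layers aggregating temporal context over per-frame codes
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 2 * d_model))

    def forward(self, z, t_emb, expr_codes):
        """z: (B, L, N, d) per-frame token features; t_emb: (B, d) shared timestep embedding;
        expr_codes: (B, L, 128) per-frame expression latents."""
        e = self.temporal(self.proj(expr_codes))      # (B, L, d) temporally aggregated embeddings e_j
        t_j = t_emb.unsqueeze(1) + e                  # frame-specific timestep t_j = t + e_j
        gamma, beta = self.to_scale_shift(t_j).chunk(2, dim=-1)
        # AdaLN(z_j, t_j) = gamma(t_j) * normalize(z_j) + beta(t_j), applied per frame j
        return gamma.unsqueeze(2) * self.norm(z) + beta.unsqueeze(2)
```

Because the expression signal enters only through the AdaLN scale and shift, the controller adds relatively few parameters on top of the DiT blocks, consistent with the parameter-efficiency claim in Section 7.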

2.4 Training Objective and Curriculum

The loss is the standard denoising diffusion loss:

$$L = \mathbb{E}_{z, \epsilon, t} \left\| v_\theta(z_t, \text{cond}) - v_\text{true}(z_t) \right\|^2,$$

where $z_t = z + \sqrt{\sigma_t^2}\,\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, and $\text{cond}$ aggregates all control inputs. Training follows a progressive four-stage curriculum that gradually increases video length and supervision diversity: monocular phone data $\rightarrow$ multi-view studio data $\rightarrow$ synthetic ViewSweep $\rightarrow$ synthetic DynamicSweep, with $T$ growing from 13 to 81 frames.
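For illustration, the sketch below instantiates a velocity-prediction objective of this general form with a rectified-flow schedule; the paper's exact noise schedule, velocity target, and model signature are not specified here, so all three are assumptions.

```python
import torch

def flow_matching_loss(model, z, cond):
    """Illustrative velocity-prediction loss (rectified-flow instantiation, stated as an assumption).

    z: (B, l, c, h, w) clean video latents; cond: conditioning inputs passed through to the model.
    """
    B = z.shape[0]
    t = torch.rand(B, device=z.device).view(B, 1, 1, 1, 1)   # random timestep per sample
    eps = torch.randn_like(z)                                 # Gaussian noise
    z_t = (1.0 - t) * z + t * eps                             # noisy latent on the straight path
    v_target = eps - z                                        # velocity pointing from data to noise
    v_pred = model(z_t, t.flatten(), cond)                    # hypothetical model signature
    return ((v_pred - v_target) ** 2).mean()
```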

3. Large-Scale Dataset Construction for Disentangled Control

The effectiveness of FactorPortrait is underpinned by a comprehensive data curation pipeline composed of:

  • PhoneCapture: Real monocular video; 11,976 identities, ∼2,000 frames per identity, 1440 × 1080.
  • StudioCapture: Real multi-view capture; 78 cameras, 612 identities, 2048 × 1334.
  • ViewSweep: Synthetic; 802 identities, 128 trajectories per identity, 100 frames per trajectory, 1024 × 1024; static expressions with dynamic camera trajectories.
  • DynamicSweep: Synthetic; 802 identities, 32 sequences per identity, 128 frames per sequence; joint viewpoint and expression/head-pose changes.

Animatable Gaussian splatting avatars are fit to high-quality multi-view captures, supporting latent factor manipulation in the renderings. Across all splits, each video exhibits substantial diversity in expressions, head pose, and viewpoint, with over 50 changes per 100-128 frames.

4. Comparative Evaluation and Quantitative Results

Evaluation leverages both standard low-level and high-level measures:

  • Image metrics: PSNR (↑), SSIM (↑), LPIPS (↓); a minimal computation sketch follows this list.
  • Identity preservation: CSIM (ArcFace, ↑)
  • Expression accuracy: AED (3DMM expression coefficient error, ↓)
  • Video metrics: FID-VID (↓), FVD (↓), and IQA (visual quality score, ↑).
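For the low-level image metrics, here is a minimal evaluation sketch using widely available libraries (scikit-image for PSNR/SSIM and the lpips package for LPIPS); the identity, expression, and video-level metrics require additional pretrained models and are omitted.

```python
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def image_metrics(pred, gt):
    """pred, gt: uint8 HxWx3 numpy arrays in [0, 255]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)

    # LPIPS expects float tensors in [-1, 1], shaped (N, 3, H, W)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).float() / 127.5 - 1.0
    lpips_fn = lpips.LPIPS(net="alex")
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```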

Results on held-out sequences from each dataset show state-of-the-art performance. Table 1 summarizes FactorPortrait's results on the main datasets:

| Dataset | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CSIM ↑ | AED ↓ | IQA ↑ | FID ↓ | FVD ↓ |
|---|---|---|---|---|---|---|---|---|
| PhoneCapture | 24.68 | 82.85 | 0.071 | 86.15 | 0.203 | 71.16 | 21.49 | 0.007 |
| StudioCapture | 24.45 | 83.80 | 0.118 | 85.15 | 0.137 | 66.81 | 45.28 | 0.025 |
| ViewSweep | 23.25 | 84.55 | 0.133 | 81.62 | 0.136 | 60.77 | 19.51 | 0.011 |
| DynamicSweep | 22.95 | 83.58 | 0.137 | 79.98 | 0.207 | 61.00 | 20.68 | 0.008 |

FactorPortrait outperforms comparative baselines, including GAGAvatar [Chu et al., 2024], CAP4D [Taubner et al., 2024], and HunyuanPortrait [Xu et al., 2025], across all settings, especially in identity preservation, sharpness, and view consistency.

5. Ablation Analysis and Design Validations

Ablation studies validate key architectural choices:

  • Removal of DynamicSweep data leads to a substantial drop across PSNR, SSIM, CSIM, and FID, confirming the requirement for synthetic joint control data.
  • Replacing normal maps (pose) with 2D landmarks severely reduces identity and pose consistency (e.g., CSIM drops from 78.82 to 65.16).
  • Substituting expression latents with 2D landmarks deteriorates fine expression accuracy and appearance details (AED degrades from 0.212 to 0.290, FID increases from 20.68 to 53.78).

These results confirm that full disentangled control—synthetic training, rendered normals for pose, implicit expression latents, and AdaLN-based modulation—is essential for achieving precise and realistic animation.

6. Limitations and Prospects

Present limitations include:

  • Upper-body only: Modeling is confined to head and upper torso; hands and full-body articulation are not addressed.
  • Inference speed: DiT-based backbone is computationally intensive and unsuitable for real-time applications.
  • Lighting disentanglement: The system does not currently provide explicit control or modeling of scene illumination or relighting.

Future research aims include extending control to full-body avatar animation, integrating illumination control for relighting, and optimizing for real-time inference (Tang et al., 12 Dec 2025).

7. Summary and Context

FactorPortrait is the first video diffusion framework to simultaneously and fully disentangle the control of facial expression, head pose, and arbitrary camera viewpoints for single-image-driven portrait animation. The method introduces a parameter-efficient expression controller using AdaLN, a sophisticated fusion of control signals, and a uniquely constructed large-scale mixed real-synthetic training corpus. Comprehensive quantitative and qualitative evaluation demonstrates substantial improvements over established baselines, particularly in realism, identity consistency, accurate expression reproduction, and novel view rendering. These contributions establish FactorPortrait as a state-of-the-art method and a foundation for further controlled generative modeling in portrait animation (Tang et al., 12 Dec 2025).
