V-Express: Portrait Video Generation

Updated 2 June 2026

V-Express is a diffusion-based training paradigm that progressively balances strong and weak control signals for portrait video generation.
It employs conditional dropout to mask dominant controls, ensuring that weaker modalities like audio effectively guide lip synchronization.
A multi-stage training protocol with specialized attention and adapter modules yields state-of-the-art fidelity, pose alignment, and identity preservation.

V-Express (Conditional Dropout for Progressive Training of Portrait Video Generation) is a training paradigm designed to address the challenge of control imbalance in diffusion-based portrait video generation. It specifically targets the scenario where a single reference image, combined with various control signals of heterogeneous strength (such as pose/keypoints, reference identity, and audio), governs the generation of realistic, temporally consistent talking-head videos. The method systematically mitigates the dominance of strong signals—such as facial pose or reference images—over weak signals like audio, which are critical for achieving accurate lip synchronization. V-Express achieves this by utilizing a combination of progressive multi-stage training and a novel conditional dropout strategy, thereby enabling the simultaneous and effective use of multiple control modalities (Wang et al., 2024).

1. Motivation and Signal Dominance in Portrait Video Generation

In single-image portrait video generation with auxiliary controls (pose, audio, etc.), diffusion models often encounter a pronounced imbalance among conditioning signals. In typical workflows, strong signals (e.g., a reference face image or a dense pose trajectory via “V-Kps”) can dominate the conditional generative process. The model may satisfy these conditions by trivial mechanisms (copy-paste, warping) and consequently disregard the weak signal, such as the audio track that informs lip dynamics.

Direct joint training with all signal types fails to provide useful gradient flow to weak controls, resulting in trivial solutions where lip motion is unsynchronized with audio, and facial identity or pose overshadow subtler articulatory cues. This undermines the practical value of controlled generation, where fine-grained synchrony between modalities is essential for realism.

2. Core Method: Progressive Training and Conditional Dropout

V-Express explicitly counteracts control signal imbalance by (a) progressive training and (b) conditional dropout:

Progressive Training: weak controls are introduced in stages. The generative backbone is first trained on strong signals (reference, pose), and weaker ones (audio, temporal context) are added afterwards. The model therefore gains reliable spatial and identity grounding before it learns to incorporate weak-modality signals.
Conditional Dropout: the embeddings of strong control signals are randomly masked (i.e., zeroed out) on selected frames during multi-frame training. The model must then reconstruct the target using only the remaining signals, so the previously ignored weaker modalities (e.g., audio) receive useful gradient and gain representational capacity.

Architectural Composition

The backbone is a latent diffusion model based on the Stable Diffusion 1.5 U-Net, extended with four distinct attention mechanisms per block:
- Self-attention (spatial)
- Reference image cross-attention
- Audio cross-attention
- Motion (temporal) attention

Three adapters encode the control streams:
- ReferenceNet: a static encoder for the reference image
- V-Kps Guider: a CNN mapping pose keypoints
- Audio Projection: Wav2Vec2 and Q-Former for audio window embeddings

3. Mathematical Formulation

The training objective and key modules are governed by the following core formulations:

Diffusion Denoising Loss for each timestep:

$\mathcal{L}_{\rm denoise} = \mathbb{E}_{\mathbf{z}_0,\boldsymbol\epsilon,t} \|\epsilon_\theta(\mathbf{z}_t,\mathbf{c},t) - \boldsymbol\epsilon\|^2$

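where $\mathbf{z}_t$ is the noise-corrupted latent at timestep $t$, $\boldsymbol\epsilon$ is the sampled noise, $\epsilon_\theta$ is the denoising network, and $\mathbf{c}$ encapsulates all conditions.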
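The objective above can be illustrated with a minimal NumPy sketch. It assumes the standard DDPM forward process $\mathbf{z}_t = \sqrt{\bar\alpha_t}\,\mathbf{z}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol\epsilon$, which the text does not spell out, and it averages over elements rather than summing, which only rescales the loss. The function names `add_noise` and `denoise_loss` are illustrative and do not come from the V-Express code.

```python
import numpy as np

def add_noise(z0, eps, alpha_bar_t):
    """Assumed DDPM forward process: corrupt the clean latent z0 with noise eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def denoise_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise (element-wise mean)."""
    return float(np.mean((eps_pred - eps) ** 2))
```

In a real training step, `eps_pred` would be the U-Net output $\epsilon_\theta(\mathbf{z}_t,\mathbf{c},t)$; here it is simply an array of the same shape as `eps`.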

Conditional Dropout: For each frame $i$,

$\tilde{c}_{\rm ref}^{(i)} = \begin{cases} 0, & \text{with probability } p_{\rm ref}(t) \\ c_{\rm ref}, & \text{otherwise} \end{cases} \qquad \tilde{c}_{\rm kps}^{(i)} = \begin{cases} 0, & \text{with probability } p_{\rm kps}(t) \\ c_{\rm kps}^{(i)}, & \text{otherwise} \end{cases}$
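A minimal NumPy sketch of this per-frame masking follows. It assumes that the reference and keypoint embeddings are dropped independently of each other and that the reference embedding is a single vector shared across frames. Neither detail is stated above, and the function name, argument names, and shapes are hypothetical.

```python
import numpy as np

def conditional_dropout(c_ref, c_kps, p_ref, p_kps, rng):
    """Zero out strong-control embeddings on randomly selected frames.

    c_ref: (D_ref,) reference embedding shared by all frames (assumed layout).
    c_kps: (F, D_kps) per-frame keypoint embeddings (assumed layout).
    Returns the masked per-frame embeddings and the boolean drop masks.
    """
    num_frames = c_kps.shape[0]
    # Independent Bernoulli draws per frame (an assumption of this sketch).
    drop_ref = rng.random(num_frames) < p_ref
    drop_kps = rng.random(num_frames) < p_kps
    # Broadcast the shared reference embedding to one copy per frame.
    ref_per_frame = np.repeat(c_ref[None, :], num_frames, axis=0)
    ref_tilde = np.where(drop_ref[:, None], 0.0, ref_per_frame)
    kps_tilde = np.where(drop_kps[:, None], 0.0, c_kps)
    return ref_tilde, kps_tilde, drop_ref, drop_kps
```

On frames where both masks are set, the only remaining conditioning is audio and temporal context, which is the situation that lets the weak signal receive gradient.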

Mouth-Weighted Loss:

$\mathcal{L}_{\rm mouth} = \mathbb{E} \|M_{\rm mouth} \odot [\epsilon_\theta(\mathbf{z}_t,\mathbf{c},t)-\boldsymbol\epsilon]\|^2$

emphasizing error within the mouth region for enhanced lip-audio synchrony.

Full Training Loss:

$\mathcal{L} = \mathcal{L}_{\rm denoise} + \lambda_{\rm mouth} \mathcal{L}_{\rm mouth}$

with $\lambda_{\rm mouth}=100$. No adversarial or perceptual losses are employed.
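The combined loss can be sketched in NumPy as follows. The sketch assumes a binary mouth mask of shape (H, W) that broadcasts over the channels of a (C, H, W) noise tensor, and it uses element-wise means in place of the expectations. Both choices, and the name `total_loss`, are assumptions for illustration.

```python
import numpy as np

def total_loss(eps_pred, eps, mouth_mask, lambda_mouth=100.0):
    """Denoising loss plus mouth-weighted loss, as in the formulas above.

    eps_pred, eps: arrays of shape (C, H, W) (assumed layout).
    mouth_mask: array of shape (H, W), 1 inside the mouth region, 0 elsewhere.
    """
    err = eps_pred - eps
    l_denoise = np.mean(err ** 2)
    # The mask broadcasts over the channel axis, so only mouth pixels contribute.
    l_mouth = np.mean((mouth_mask * err) ** 2)
    return float(l_denoise + lambda_mouth * l_mouth)
```

With $\lambda_{\rm mouth}=100$, an error inside the mouth region is penalized far more heavily than the same error elsewhere in the frame.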

4. Training Protocol and Implementation

Training proceeds in three stages:

Stage I (Single-Frame Training):

Inputs: a randomly sampled frame, reference image, and pose keypoints.
Only spatial modules (ReferenceNet, V-Kps Guider, U-Net spatial attention) are trained; audio/motion layers are zero-initialized and frozen.

Stage II (Multi-Frame Training Without Global Fine-Tuning):

Inputs: video clips (12 frames), full control suite (reference, keypoints, audio).
Only audio modules and motion attention are updated; others are frozen.
Conditional dropout is introduced (e.g., $p_{\rm kps}=0.5$, $p_{\rm ref}=0.2$) to encourage audio reliance.

Stage III (Global Fine-Tuning):

All modules are unfrozen for end-to-end optimization; conditional dropout persists.
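The three stages can be summarized as a small configuration table. The sketch below is one hypothetical way to encode which module groups receive gradient updates in each stage. The module names are illustrative labels rather than identifiers from the authors' code. The clip length and dropout rates for Stage III are assumed to match Stage II, since the text only says that dropout persists.

```python
from dataclasses import dataclass

# Illustrative labels for the module groups described in the text.
SPATIAL = frozenset({"reference_net", "kps_guider", "unet_spatial_attention"})
AUDIO_MOTION = frozenset({"audio_projection", "audio_cross_attention", "motion_attention"})

@dataclass(frozen=True)
class StageConfig:
    name: str
    num_frames: int        # frames per training sample
    trainable: frozenset   # module groups that receive gradient updates
    p_kps: float           # keypoint dropout probability
    p_ref: float           # reference dropout probability

STAGES = {
    "I": StageConfig("single-frame", 1, SPATIAL, 0.0, 0.0),
    "II": StageConfig("multi-frame", 12, AUDIO_MOTION, 0.5, 0.2),
    # Stage III values below are assumptions carried over from Stage II.
    "III": StageConfig("global fine-tuning", 12, SPATIAL | AUDIO_MOTION, 0.5, 0.2),
}

def is_trainable(stage: str, module: str) -> bool:
    """Return True if the module group is updated in the given stage."""
    return module in STAGES[stage].trainable
```

A training loop could consult `is_trainable` to decide which parameters to freeze before each stage begins.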

Pseudo-code for Stages II/III is provided, detailing per-frame dropout masking, condition encoding, forward pass, and loss computation. The Adam optimizer is used with batch size 4 and a learning rate schedule with linear warm-up and cosine decay.
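A linear warm-up followed by cosine decay can be sketched as below. This is a generic form of that schedule, not necessarily the exact variant used for V-Express. The base learning rate, warm-up length, and total step count are placeholder arguments, not values reported for the method.

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr):
    """Learning rate at a given step: linear warm-up, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly so the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))
```

For example, `lr_at(step, total_steps=10_000, warmup_steps=500, base_lr=1e-5)` would be evaluated once per optimizer step, with all three numbers chosen here purely for illustration.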

5. Quantitative Evaluation and Comparative Performance

Experiments use the HDTF (300 hours) and VFHQ datasets for training (resolution $512 \times 512$). Evaluation is performed on the TalkingHead-1KH and AVSpeech test splits (100 videos each), with the following metrics:

FID: image quality (lower is better)
FVD: video quality (lower is better)
ΔFaceSim: identity preservation gap (lower is better)
KpsDis: mean pose keypoint distance (lower is better)
SyncNet: lip-audio synchrony (higher is better)
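As a rough illustration of the pose metric, the sketch below computes a mean Euclidean distance between predicted and target keypoints. The exact definition used in the evaluation (keypoint detector, normalization, units) is not given here, so the function name and the (frames, keypoints, 2) array layout are assumptions.

```python
import numpy as np

def kps_distance(pred_kps, target_kps):
    """Mean Euclidean distance over all frames and keypoints.

    pred_kps, target_kps: arrays of shape (F, K, 2) holding (x, y) coordinates.
    """
    per_point = np.linalg.norm(pred_kps - target_kps, axis=-1)
    return float(np.mean(per_point))
```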

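| Dataset | Method | FID ↓ | FVD ↓ | ΔFaceSim ↓ | KpsDis ↓ | Sync ↑ |
|---|---|---|---|---|---|---|
| TalkingHead-1KH | Wav2Lip | 29.06 | 250.95 | 10.77 | 41.60 | 6.89 |
| TalkingHead-1KH | DiffusedHeads | 115.41 | 344.69 | – | – | – |
| TalkingHead-1KH | V-Express | 25.81 | 135.82 | 3.63 | 3.28 | 3.48 |
| AVSpeech | Wav2Lip | 26.71 | 251.40 | 7.28 | 42.93 | 6.62 |
| AVSpeech | DiffusedHeads | 104.61 | 363.10 | – | – | – |
| AVSpeech | V-Express | 23.38 | 117.93 | 0.28 | 2.78 | 3.79 |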

V-Express attains state-of-the-art image and video fidelity as well as pose and identity alignment. SyncNet scores remain below Wav2Lip as no specialist lip-sync expert module is used.

6. Ablation Analysis: Conditional Dropout and Progressive Training

Ablation studies dissect the contribution of the two core innovations:

Effect of Conditional Dropout: With conditional dropout enabled, performance improves markedly (KpsDis from 30 px to 3.3 px; FVD from 300 to 136; SyncNet from 2.1 to 3.4), indicating that it forces the model to depend on audio and temporal context for frames whose strong controls are dropped.
Effect of Progressive Training: Three protocols are compared:
- Single-stage joint training (A): 8.2/28.4/2.0 (ΔFaceSim/KpsDis/Sync)
- Two-stage with skipped Stage II (B): 5.1/10.2/2.5
- Full multi-stage (C, V-Express): 3.6/3.3/3.5

Sequentially adding and then fine-tuning audio/motion modules produces the best modality balance.

7. Limitations and Prospective Extensions

Identified limitations of V-Express include:

Generation speed is constrained by iterative diffusion denoising, impeding real-time deployment.
Audio encoder is tuned for English; cross-lingual performance (e.g., Chinese) is degraded.
Explicit control of facial attributes—such as expression—remains unsupported.

Potential directions for future work:

Incorporation of multilingual encoders (e.g., Whisper+) to improve language coverage.
Use of latent consistency models (LCM, LCM-LoRA) for faster sampling.
Adding cross-attention adapters trained with attribute-annotated data to control expression or lighting directly.

In sum, V-Express demonstrates that the progressive activation of weak controls (such as audio) and the strategic dropout of dominant controls (such as pose/reference) enable diffusion networks to generate high-fidelity, identity- and pose-consistent, lip-synchronized portrait videos without adversarial losses or expert lip-sync discriminators (Wang et al., 2024).


References

Wang et al. (2024). V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation.
