UniAVGen: Unified Audio-Video Generation

Updated 9 November 2025
  • UniAVGen is a unified audio–video generation framework that employs a symmetric dual Diffusion Transformer architecture to represent and generate cross-modal signals.
  • It implements asymmetric cross-modal interaction, face-aware modulation, and modality-aware classifier-free guidance to enhance temporal alignment and semantic consistency.
  • The framework achieves high-fidelity synchronization and quality with sample efficiency, using only 1.3M paired AV samples compared to models requiring up to 30.7M samples.

UniAVGen is a unified audio–video generation framework that addresses the persistent challenge of cross-modal synchronization and semantic consistency in open-source generative models. Built on a structurally symmetric dual Diffusion Transformer (DiT) backbone, UniAVGen implements Asymmetric Cross-Modal Interaction for temporally aligned bidirectional attention, Face-Aware Modulation for spatially selective fusion, and Modality-Aware Classifier-Free Guidance to explicitly amplify cross-modal signals during inference. Its joint synthesis paradigm enables a single model to perform joint audio–video generation, cross-modal continuation, video-to-audio dubbing, and audio-driven video synthesis with markedly fewer paired training examples than prior solutions.

1. Dual-Branch Joint Synthesis Architecture

UniAVGen comprises two parallel DiT streams: one dedicated to video and one to audio. Both streams are constructed on transformer backbones with matched depth, attention heads, and feature dimensionality (Wan 2.2–5B for video; Wan 2.1–1.3B for audio), ensuring efficient representation of a shared cross-modal latent space.

At each diffusion timestep $t$, both branches receive:

  • a reference latent (from a reference frame or audio segment),
  • a conditional latent (enabling continuation and controllability),
  • the current noisy latent,
  • and modality-specific textual embeddings.

Video Stream

Input video $X^v$ is downsampled to 16 fps and encoded via a pre-trained VAE into latents $z^v \in \mathbb{R}^{L^v \times D}$, with additional reference and conditional latents $z^{v_{ref}}, z^{v_{cond}}$. These are concatenated temporally:

$$z^{\hat v}_t = [z^{v_{ref}}_0,\,z^{v_{cond}}_0,\,z^v_t]$$

A umT5-encoded prompt $e^v$ (“desired motion/expression”) is injected into every DiT block by cross-attention. The training objective follows flow matching:

$$L^v = \|v_t(z^v_t) - u_{\theta_v}(z^{\hat v}_t, t, e^v)\|^2$$

Audio Stream

Audio $X^a$ is resampled at 24 kHz and converted into Mel-spectrogram latents $z^a \in \mathbb{R}^{L^a \times D}$ using a VAE. Reference and conditional audio are similarly encoded. The DiT branch input for audio is:

$$z^{\hat a}_t = [z^{a_{ref}}_0,\,z^{a_{cond}}_0,\,z^a_t]$$

Textual prompts $T^a$ (“the text to be spoken”) are processed via a ConvNeXt stack yielding features $e^a$. The audio loss mirrors the video objective:

$$L^a = \|v_t(z^a_t) - u_{\theta_a}(z^{\hat a}_t, t, e^a)\|^2$$
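
To make the objective concrete, here is a minimal PyTorch-style sketch of a per-branch flow-matching loss. It assumes a linear interpolation path between clean latents and Gaussian noise; the `dit_branch` callable, tensor shapes, and time sampling are illustrative assumptions rather than the released implementation.

```python
import torch

def flow_matching_loss(dit_branch, z0, z_ref, z_cond, text_emb):
    """Per-branch flow-matching loss in the spirit of L^v / L^a (sketch).

    dit_branch: callable predicting a velocity for the noisy segment from the
                concatenated latents, the timestep, and the text embedding.
    z0:         clean latents of the current clip, shape (B, L, D)
    z_ref, z_cond: reference / conditional latents, shapes (B, L_r, D), (B, L_c, D)
    text_emb:   modality-specific text features (umT5 or ConvNeXt output)
    """
    B = z0.shape[0]
    t = torch.rand(B, device=z0.device)          # diffusion time sampled in [0, 1]
    noise = torch.randn_like(z0)

    # Assumed linear path z_t = (1 - t) * z0 + t * noise, whose target velocity
    # is (noise - z0); other flow-matching schedules are possible.
    t_ = t.view(B, 1, 1)
    z_t = (1.0 - t_) * z0 + t_ * noise
    target_velocity = noise - z0

    # Temporal concatenation [z_ref, z_cond, z_t], matching the branch input above.
    z_hat_t = torch.cat([z_ref, z_cond, z_t], dim=1)

    pred_velocity = dit_branch(z_hat_t, t, text_emb)   # shape (B, L, D)
    return ((pred_velocity - target_velocity) ** 2).mean()
```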

Both video and audio DiT streams are governed by standard latent diffusion forward processes:

$$q(z_t \mid z_0) = \mathcal{N}(z_t;\,\alpha_t z_0,\,\sigma_t^2 I)$$

At inference, the learned velocity field $u_\theta$ is integrated numerically (e.g., with an Euler-type solver) from $t = T$ to $t = 0$; the resulting latents are decoded into video or audio by their respective decoders.
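
A matching sampling loop, sketched under the assumptions of a unit time interval and a plain explicit-Euler discretization (step count and decoding are illustrative), could look like this:

```python
import torch

@torch.no_grad()
def sample_branch(dit_branch, z_ref, z_cond, text_emb, shape, steps=50, device="cpu"):
    """Integrate the learned velocity field from t = 1 (noise) down to t = 0 (sketch)."""
    z_t = torch.randn(shape, device=device)                 # start from Gaussian noise
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)

    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        z_hat_t = torch.cat([z_ref, z_cond, z_t], dim=1)    # [z_ref, z_cond, z_t]
        v = dit_branch(z_hat_t, t.expand(shape[0]), text_emb)
        z_t = z_t + (t_next - t) * v                        # explicit Euler step
    return z_t   # pass the final latents to the branch's VAE decoder afterwards
```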

2. Asymmetric Cross-Modal Interaction

UniAVGen’s core synchronization mechanism consists of two modality-specific aligner modules. These modules establish bidirectional, temporally aligned cross-attention between modalities.

Audio→Video Aligner (A2V)

Given video features $\hat H^v \in \mathbb{R}^{T \times N^v \times D}$ and audio features $\hat H^a \in \mathbb{R}^{T \times N^a \times D}$, for each frame $i$ (a code sketch follows these steps):

  1. Aggregate an audio context window of width $w$ around $i$:

$$C^a_i = [\hat H^a_{i-w}, \ldots, \hat H^a_i, \ldots, \hat H^a_{i+w}]$$

  2. Perform cross-attention:
    • $Q = W_q^v \hat H^v_i$
    • $K, V = W_k^a C^a_i,\ W_v^a C^a_i$
    • $\bar H^v_i = W_o^v\,[\hat H^v_i + \mathrm{CrossAttn}(Q, K, V)]$
  3. Aggregate and add residually to $H^v$:

$$H^v \leftarrow H^v + \bar H^v$$
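
The following simplified PyTorch sketch illustrates the A2V aligner; the single-head attention, boundary handling of the window, and module layout are assumptions made for clarity, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioToVideoAligner(nn.Module):
    """A2V aligner: each video frame attends to a local window of audio tokens (sketch)."""

    def __init__(self, dim, window=2):
        super().__init__()
        self.window = window
        self.w_q = nn.Linear(dim, dim)   # W_q^v
        self.w_k = nn.Linear(dim, dim)   # W_k^a
        self.w_v = nn.Linear(dim, dim)   # W_v^a
        self.w_o = nn.Linear(dim, dim)   # W_o^v, zero-initialized (see below)
        nn.init.zeros_(self.w_o.weight)
        nn.init.zeros_(self.w_o.bias)

    def forward(self, h_video, h_audio):
        # h_video: (T, Nv, D) video tokens per frame; h_audio: (T, Na, D) audio tokens per frame
        T, _, D = h_video.shape
        updates = []
        for i in range(T):
            lo, hi = max(0, i - self.window), min(T, i + self.window + 1)
            ctx = h_audio[lo:hi].reshape(-1, D)              # C_i^a: audio window around frame i
            q = self.w_q(h_video[i])                         # (Nv, D)
            k, v = self.w_k(ctx), self.w_v(ctx)              # (window tokens, D)
            attended = F.scaled_dot_product_attention(
                q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0)
            ).squeeze(0)
            updates.append(self.w_o(h_video[i] + attended))  # \bar H_i^v
        return h_video + torch.stack(updates, dim=0)         # residual update of the video stream
```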

Video→Audio Aligner (V2A)

For audio token $j$ mapping to video frames $(i, i+1)$ with interpolation weight $\alpha$:

  1. Interpolate the video context:

$$C^v_j = (1-\alpha)\hat H^v_i + \alpha\hat H^v_{i+1}$$

  2. Compute cross-attention:
    • $Q = W_q^a \hat H^a_j$
    • $K, V = W_k^v C^v_j,\ W_v^v C^v_j$
    • $\bar H^a_j = W_o^a\,[\hat H^a_j + \mathrm{CrossAttn}(Q, K, V)]$
  3. Aggregate and add residually to $H^a$:

$$H^a \leftarrow H^a + \bar H^a$$

To prevent destabilization after single-modality pretraining, the $W_o^v$ and $W_o^a$ output projections are zero-initialized, so cross-modal flow begins with a zero residual.
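
The V2A direction mirrors the A2V structure; the short sketch below only illustrates its distinctive step, mapping an audio token onto the video timeline and interpolating its frame context (the rate arguments and helper name are hypothetical).

```python
import torch

def video_context_for_audio_token(h_video, j, audio_tokens_per_sec, video_fps):
    """Interpolated video context C_j^v for audio token j (sketch).

    h_video: (T, Nv, D) per-frame video tokens.
    audio_tokens_per_sec, video_fps: stream rates used only to place token j on
    the video timeline; the actual values depend on the tokenizers.
    """
    pos = j * video_fps / audio_tokens_per_sec          # continuous frame position
    i = int(pos)
    alpha = pos - i
    i_next = min(i + 1, h_video.shape[0] - 1)
    # C_j^v = (1 - alpha) * H_i^v + alpha * H_{i+1}^v
    return (1.0 - alpha) * h_video[i] + alpha * h_video[i_next]
```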

3. Face-Aware Modulation

To further boost lip synchronization and semantic coupling—especially in speech settings—Face-Aware Modulation (FAM) adaptively gates cross-modal interaction according to facial saliency.

FAM Mechanism

  1. Compute per-layer mask $M^l$ (shape $T \times N^v$), using video features $H^{v_l}$:

$$M^l = \sigma \left( W_m \big(\gamma \odot \mathrm{LayerNorm}(H^{v_l}) + \beta \big) + b_m \right)$$

with learned affine parameters $\gamma, \beta$; $\sigma$ is the sigmoid; $\odot$ is elementwise scaling.

  2. Masks are supervised:

$$L^m = \sum_l \|M^l - M^{gt}\|^2$$

where $M^{gt}$ is the face region from RetinaFace; weighting $\lambda^m$ decays from 0.1 to 0 during training.

  3. During cross-modal attention (see the sketch after this list):
    • A2V: Only face tokens are updated: $H^{v_l} \leftarrow H^{v_l} + M^l \odot \bar H^{v_l}$
    • V2A: Only facial features attend back: $\hat H^{v_l} \leftarrow M^l \odot \hat H^{v_l}$
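
A minimal sketch of the mask computation and its use, assuming per-frame token tensors and a single mask head (shapes, normalization, and the per-layer mean reduction are illustrative), is given below.

```python
import torch
import torch.nn as nn

class FaceAwareModulation(nn.Module):
    """Per-layer face-saliency mask used to gate cross-modal updates (sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.gamma = nn.Parameter(torch.ones(dim))    # learned affine scale
        self.beta = nn.Parameter(torch.zeros(dim))    # learned affine shift
        self.w_m = nn.Linear(dim, 1)                  # W_m, b_m

    def forward(self, h_video):
        # h_video: (T, Nv, D) -> mask M^l of shape (T, Nv), values in (0, 1)
        x = self.gamma * self.norm(h_video) + self.beta
        return torch.sigmoid(self.w_m(x)).squeeze(-1)


def gate_a2v_update(h_video, cross_modal_update, mask):
    # A2V: only face tokens receive the cross-modal residual.
    return h_video + mask.unsqueeze(-1) * cross_modal_update


def mask_supervision_loss(masks, face_mask_gt):
    # L^m: per-layer squared error against the detector-derived face mask
    # (mean reduction used here; the paper writes it as a squared norm).
    return sum(((m - face_mask_gt) ** 2).mean() for m in masks)
```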

A plausible implication is that face-focused modulation efficiently allocates model capacity to the primary driver of audio–video synchronization—oral-facial motion during speech.

4. Modality-Aware Classifier-Free Guidance

Standard classifier-free guidance (CFG) applies a fixed rescaling to conditional and unconditional predictions per modality, potentially muting cross-modal coupling. UniAVGen implements Modality-Aware CFG (MA-CFG) that uses a joint unconditional pass (with cross-modal attention disabled) as a baseline for both streams, explicitly amplifying the cross-modal contribution.

Let

  • $u_{\theta_v}$: video unconditional score (only text prompt),
  • $u_{\theta_a}$: audio unconditional score,
  • $u_{\theta_{a,v}}$: joint model score (full cross-modal interactions).

Final inference scores per modality:

$$\hat u_v = u_{\theta_v} + s_v \left( u_{\theta_{a,v}} - u_{\theta_v} \right)$$
$$\hat u_a = u_{\theta_a} + s_a \left( u_{\theta_{a,v}} - u_{\theta_a} \right)$$

for guidance scales $s_v, s_a > 1$, directly boosting cross-modal information flow and thus strengthening the correlation of emotional/motion cues between audio and video.
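
Because MA-CFG is a simple recombination of three score estimates, it reduces to a few lines; the sketch below assumes the three predictions are already available as tensors, with illustrative guidance-scale values.

```python
def modality_aware_cfg(u_video_uncond, u_audio_uncond,
                       u_joint_video, u_joint_audio,
                       s_v=2.0, s_a=2.0):
    """Modality-Aware CFG: amplify the cross-modal component of each score (sketch).

    u_*_uncond: per-modality predictions with cross-modal attention disabled.
    u_joint_*:  predictions from the joint pass with full cross-modal interaction.
    s_v, s_a:   guidance scales > 1 (the values here are placeholders).
    """
    u_hat_video = u_video_uncond + s_v * (u_joint_video - u_video_uncond)
    u_hat_audio = u_audio_uncond + s_a * (u_joint_audio - u_audio_uncond)
    return u_hat_video, u_hat_audio
```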

5. Training Regime and Data Efficiency

UniAVGen’s training proceeds in three sequential stages:

  1. Audio-only pretraining: 160K steps, batch size 256, learning rate $2 \times 10^{-5}$; optimizes only $L^a$.
  2. End-to-end joint training: 30K steps, batch size 32, learning rate $5 \times 10^{-6}$; jointly optimizes $L^{\mathrm{joint}} = L^v + L^a + \lambda^m L^m$ on approximately 1.3M real-human AV video samples.
  3. Multi-task fine-tuning: 10K steps, same hyperparameters, sampling five tasks (joint generation : generation with audio reference : continuation : video→audio : audio→video) in the ratio 4:1:1:2:2.

This regime demonstrates marked sample efficiency: UniAVGen requires only ~1.3M paired AV samples, versus 30.7M for Ovi and 6.4M for UniVerse-1.
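
For illustration, the stage-3 task mixture described above can be reproduced with a simple weighted draw; the task labels below are descriptive placeholders, not identifiers from the paper.

```python
import random

# Stage-3 multi-task fine-tuning: five tasks sampled in the stated 4:1:1:2:2 ratio.
TASKS = [
    "joint_generation",            # joint audio-video generation
    "joint_generation_audio_ref",  # joint generation with an audio reference
    "cross_modal_continuation",    # continuation of a given audio-video clip
    "video_to_audio",              # video-to-audio dubbing
    "audio_to_video",              # audio-driven video synthesis
]
WEIGHTS = [4, 1, 1, 2, 2]

def sample_task() -> str:
    """Draw the task for the next fine-tuning batch."""
    return random.choices(TASKS, weights=WEIGHTS, k=1)[0]
```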

6. Empirical Performance and Comparative Results

Experimental evaluation is conducted using AudioBox-Aesthetics (Production Quality PQ, Content Usefulness CU), Whisper-large (WER), VBench (Subject Consistency SC, Dynamic Degree DD, Imaging Quality IQ), SyncNet (Lip Sync LS), and Gemini LLM for Timbre and Emotion Consistency (TC/EC). Key results are summarized in the following table:

| Model | PQ | CU | WER | SC | DD | IQ | LS | TC | EC | Training Samples |
|---|---|---|---|---|---|---|---|---|---|---|
| UniAVGen | 7.00 | 6.62 | 0.151 | 0.973 | 0.410 | 0.779 | 5.95 | 0.832 | 0.573 | 1.3M |
| Ovi | 6.03 | 6.01 | 0.216 | 0.972 | 0.360 | 0.774 | 6.48 | 0.828 | 0.558 | 30.7M |
| UniVerse-1 | 4.56 | 4.29 | 0.296 | 0.985 | 0.08 | 0.733 | 1.21 | 0.573 | 0.300 | 6.4M |

UniAVGen outperforms all open-source joint-generation models in audio quality (PQ, CU) and timbre and emotion consistency (TC, EC), and achieves competitive lip synchronization (LS) with an order of magnitude less training data. Qualitative analysis notes that UniAVGen and Ovi yield high-fidelity results for in-distribution human video; where Ovi and UniVerse-1 struggle with stylized or out-of-domain scenarios, UniAVGen maintains alignment and audio–visual coherence.

Ablations indicate:

  • Asymmetric, temporally aligned cross-modal interactions (ATI) yield the highest consistency early in training.
  • Supervised face-aware masks with decaying $\lambda^m$ substantially improve LS, TC, and EC compared to unsupervised or absent FAM.
  • MA-CFG enhances emotional and motion correlation.
  • The pretrain–joint–multi-task schedule achieves the fastest and highest convergence in consistency metrics.

7. Limitations and Prospective Directions

Despite its strengths, UniAVGen reveals a number of limitations and opportunities for future improvement:

  • All face-awareness is based on 2D spatial masks from supervised detection; occlusions or extreme poses may not be optimally captured.
  • The backbone size, while competitive, remains smaller than some proprietary models; further scaling and architectural refinements (dynamic depth, advanced regularization) are needed for continued improvement.
  • Current modeling does not exploit multi-speaker scenarios, fine-grained scene context outside facial regions, or longer-form cross-modal temporal dependencies.

A plausible implication is that with further data expansion and semi-supervised mask learning, as well as architectural adaptation for broader genre and speaker coverage, UniAVGen could extend its performance margin and approach the few remaining specialty strengths of closed commercial baselines.
