UniAVGen: Unified Audio-Video Generation

Updated 9 November 2025
  • UniAVGen is a unified audio–video generation framework that employs a symmetric dual Diffusion Transformer architecture to represent and generate cross-modal signals.
  • It implements asymmetric cross-modal interaction, face-aware modulation, and modality-aware classifier-free guidance to enhance temporal alignment and semantic consistency.
  • The framework achieves high-fidelity synchronization and quality with sample efficiency, using only 1.3M paired AV samples compared to models requiring up to 30.7M samples.

UniAVGen is a unified audio–video generation framework that addresses the persistent challenge of cross-modal synchronization and semantic consistency in open-source generative models. Built on a structurally symmetric dual Diffusion Transformer (DiT) backbone, UniAVGen implements Asymmetric Cross-Modal Interaction for temporally aligned bidirectional attention, Face-Aware Modulation for spatially selective fusion, and Modality-Aware Classifier-Free Guidance to explicitly amplify cross-modal signals during inference. Its joint synthesis paradigm enables a single model to perform joint audio–video generation, cross-modal continuation, video-to-audio dubbing, and audio-driven video synthesis with markedly fewer paired training examples than prior solutions.

1. Dual-Branch Joint Synthesis Architecture

UniAVGen comprises two parallel DiT streams: one dedicated to video and one to audio. Both streams are constructed on transformer backbones with matched depth, attention heads, and feature dimensionality (Wan 2.2–5B for video; Wan 2.1–1.3B for audio), ensuring efficient representation of a shared cross-modal latent space.

At each diffusion timestep $t$, both branches receive:

  • a reference latent (from a reference frame or audio segment),
  • a conditional latent (enabling continuation and controllability),
  • the current noisy latent,
  • and modality-specific textual embeddings.

Video Stream

Input video $X^v$ is downsampled to 16 fps and encoded via a pre-trained VAE into latents $z^v \in \mathbb{R}^{L^v \times D}$, with additional reference and conditional latents $z^{v_{ref}}, z^{v_{cond}}$. These are concatenated temporally:

$$z^{\hat v}_t = [z^{v_{ref}}_0,\,z^{v_{cond}}_0,\,z^v_t]$$

A umT5-encoded prompt $e^v$ (“desired motion/expression”) is injected into every DiT block by cross-attention. The training objective follows flow matching:

$$L^v = \|v_t(z^v_t) - u_{\theta_v}(z^{\hat v}_t, t, e^v)\|^2$$

Audio Stream

Audio $X^a$ is resampled at 24 kHz and converted into Mel-spectrogram latents $z^a \in \mathbb{R}^{L^a \times D}$ using a VAE. Reference and conditional audio are similarly encoded. The DiT branch input for audio is:

$$z^{\hat a}_t = [z^{a_{ref}}_0,\,z^{a_{cond}}_0,\,z^a_t]$$

Textual prompts $T^a$ (“the text to be spoken”) are processed via a ConvNeXt stack yielding features $e^a$. The audio loss mirrors the video objective:

$$L^a = \|v_t(z^a_t) - u_{\theta_a}(z^{\hat a}_t, t, e^a)\|^2$$
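
To make the objective concrete, here is a minimal PyTorch-style sketch of a per-branch flow-matching loss. It assumes a linear interpolation path between clean latents and Gaussian noise; the `dit_branch` callable, tensor shapes, and time sampling are illustrative assumptions rather than the released implementation.

```python
import torch

def flow_matching_loss(dit_branch, z0, z_ref, z_cond, text_emb):
    """Per-branch flow-matching loss in the spirit of L^v / L^a (sketch).

    dit_branch: callable predicting a velocity for the noisy segment from the
                concatenated latents, the timestep, and the text embedding.
    z0:         clean latents of the current clip, shape (B, L, D)
    z_ref, z_cond: reference / conditional latents, shapes (B, L_r, D), (B, L_c, D)
    text_emb:   modality-specific text features (umT5 or ConvNeXt output)
    """
    B = z0.shape[0]
    t = torch.rand(B, device=z0.device)          # diffusion time sampled in [0, 1]
    noise = torch.randn_like(z0)

    # Assumed linear path z_t = (1 - t) * z0 + t * noise, whose target velocity
    # is (noise - z0); other flow-matching schedules are possible.
    t_ = t.view(B, 1, 1)
    z_t = (1.0 - t_) * z0 + t_ * noise
    target_velocity = noise - z0

    # Temporal concatenation [z_ref, z_cond, z_t], matching the branch input above.
    z_hat_t = torch.cat([z_ref, z_cond, z_t], dim=1)

    pred_velocity = dit_branch(z_hat_t, t, text_emb)   # shape (B, L, D)
    return ((pred_velocity - target_velocity) ** 2).mean()
```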

Both video and audio DiT streams are governed by standard latent diffusion forward processes:

$$q(z_t \mid z_0) = \mathcal{N}(z_t;\,\alpha_t z_0,\,\sigma_t^2 I)$$

At inference, the learned velocity field $u_\theta$ is integrated numerically (e.g., with an Euler-type solver) from $t = T$ to $t = 0$; the resulting latents are decoded into video or audio by their respective decoders.
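
A matching sampling loop, sketched under the assumptions of a unit time interval and a plain explicit-Euler discretization (step count and decoding are illustrative), could look like this:

```python
import torch

@torch.no_grad()
def sample_branch(dit_branch, z_ref, z_cond, text_emb, shape, steps=50, device="cpu"):
    """Integrate the learned velocity field from t = 1 (noise) down to t = 0 (sketch)."""
    z_t = torch.randn(shape, device=device)                 # start from Gaussian noise
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)

    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        z_hat_t = torch.cat([z_ref, z_cond, z_t], dim=1)    # [z_ref, z_cond, z_t]
        v = dit_branch(z_hat_t, t.expand(shape[0]), text_emb)
        z_t = z_t + (t_next - t) * v                        # explicit Euler step
    return z_t   # pass the final latents to the branch's VAE decoder afterwards
```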

2. Asymmetric Cross-Modal Interaction

UniAVGen’s core synchronization mechanism consists of two modality-specific aligner modules. These modules establish bidirectional, temporally aligned cross-attention between modalities.

Audio→Video Aligner (A2V)

Given video features $\hat H^v \in \mathbb{R}^{T \times N^v \times D}$ and audio features $\hat H^a \in \mathbb{R}^{T \times N^a \times D}$, for each frame $i$ (a code sketch follows these steps):

  1. Aggregate an audio context window of width $w$ around $i$:

$$C^a_i = [\hat H^a_{i-w}, \ldots, \hat H^a_i, \ldots, \hat H^a_{i+w}]$$

  2. Perform cross-attention:
    • $Q = W_q^v \hat H^v_i$
    • $K, V = W_k^a C^a_i,\ W_v^a C^a_i$
    • $\bar H^v_i = W_o^v\,[\hat H^v_i + \mathrm{CrossAttn}(Q, K, V)]$
  3. Aggregate and add residually to $H^v$:

$$H^v \leftarrow H^v + \bar H^v$$
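
The following simplified PyTorch sketch illustrates the A2V aligner; the single-head attention, boundary handling of the window, and module layout are assumptions made for clarity, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioToVideoAligner(nn.Module):
    """A2V aligner: each video frame attends to a local window of audio tokens (sketch)."""

    def __init__(self, dim, window=2):
        super().__init__()
        self.window = window
        self.w_q = nn.Linear(dim, dim)   # W_q^v
        self.w_k = nn.Linear(dim, dim)   # W_k^a
        self.w_v = nn.Linear(dim, dim)   # W_v^a
        self.w_o = nn.Linear(dim, dim)   # W_o^v, zero-initialized (see below)
        nn.init.zeros_(self.w_o.weight)
        nn.init.zeros_(self.w_o.bias)

    def forward(self, h_video, h_audio):
        # h_video: (T, Nv, D) video tokens per frame; h_audio: (T, Na, D) audio tokens per frame
        T, _, D = h_video.shape
        updates = []
        for i in range(T):
            lo, hi = max(0, i - self.window), min(T, i + self.window + 1)
            ctx = h_audio[lo:hi].reshape(-1, D)              # C_i^a: audio window around frame i
            q = self.w_q(h_video[i])                         # (Nv, D)
            k, v = self.w_k(ctx), self.w_v(ctx)              # (window tokens, D)
            attended = F.scaled_dot_product_attention(
                q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0)
            ).squeeze(0)
            updates.append(self.w_o(h_video[i] + attended))  # \bar H_i^v
        return h_video + torch.stack(updates, dim=0)         # residual update of the video stream
```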

Video→Audio Aligner (V2A)

For audio token $j$ mapping to video frames $(i, i+1)$ with interpolation weight $\alpha$:

  1. Interpolate the video context:

$$C^v_j = (1-\alpha)\hat H^v_i + \alpha\hat H^v_{i+1}$$

  2. Compute cross-attention:
    • $Q = W_q^a \hat H^a_j$
    • $K, V = W_k^v C^v_j,\ W_v^v C^v_j$
    • $\bar H^a_j = W_o^a\,[\hat H^a_j + \mathrm{CrossAttn}(Q, K, V)]$
  3. Aggregate and add residually to $H^a$:

$$H^a \leftarrow H^a + \bar H^a$$

To prevent destabilization after single-modality pretraining, the $W_o^v$ and $W_o^a$ output projections are zero-initialized, so cross-modal flow begins with a zero residual.
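
The V2A direction mirrors the A2V structure; the short sketch below only illustrates its distinctive step, mapping an audio token onto the video timeline and interpolating its frame context (the rate arguments and helper name are hypothetical).

```python
import torch

def video_context_for_audio_token(h_video, j, audio_tokens_per_sec, video_fps):
    """Interpolated video context C_j^v for audio token j (sketch).

    h_video: (T, Nv, D) per-frame video tokens.
    audio_tokens_per_sec, video_fps: stream rates used only to place token j on
    the video timeline; the actual values depend on the tokenizers.
    """
    pos = j * video_fps / audio_tokens_per_sec          # continuous frame position
    i = int(pos)
    alpha = pos - i
    i_next = min(i + 1, h_video.shape[0] - 1)
    # C_j^v = (1 - alpha) * H_i^v + alpha * H_{i+1}^v
    return (1.0 - alpha) * h_video[i] + alpha * h_video[i_next]
```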

3. Face-Aware Modulation

To further boost lip synchronization and semantic coupling—especially in speech settings—Face-Aware Modulation (FAM) adaptively gates cross-modal interaction according to facial saliency.

FAM Mechanism

  1. Compute per-layer mask $M^l$ (shape $T \times N^v$), using video features $H^{v_l}$:

$$M^l = \sigma \left( W_m \big(\gamma \odot \mathrm{LayerNorm}(H^{v_l}) + \beta \big) + b_m \right)$$

with learned affine parameters $\gamma, \beta$; $\sigma$ is the sigmoid; $\odot$ is elementwise scaling.

  2. Masks are supervised:

$$L^m = \sum_l \|M^l - M^{gt}\|^2$$

where $M^{gt}$ is the face region from RetinaFace; weighting $\lambda^m$ decays from 0.1 to 0 during training.

  3. During cross-modal attention (see the sketch after this list):
    • A2V: Only face tokens are updated: $H^{v_l} \leftarrow H^{v_l} + M^l \odot \bar H^{v_l}$
    • V2A: Only facial features attend back: $\hat H^{v_l} \leftarrow M^l \odot \hat H^{v_l}$
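
A minimal sketch of the mask computation and its use, assuming per-frame token tensors and a single mask head (shapes, normalization, and the per-layer mean reduction are illustrative), is given below.

```python
import torch
import torch.nn as nn

class FaceAwareModulation(nn.Module):
    """Per-layer face-saliency mask used to gate cross-modal updates (sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.gamma = nn.Parameter(torch.ones(dim))    # learned affine scale
        self.beta = nn.Parameter(torch.zeros(dim))    # learned affine shift
        self.w_m = nn.Linear(dim, 1)                  # W_m, b_m

    def forward(self, h_video):
        # h_video: (T, Nv, D) -> mask M^l of shape (T, Nv), values in (0, 1)
        x = self.gamma * self.norm(h_video) + self.beta
        return torch.sigmoid(self.w_m(x)).squeeze(-1)


def gate_a2v_update(h_video, cross_modal_update, mask):
    # A2V: only face tokens receive the cross-modal residual.
    return h_video + mask.unsqueeze(-1) * cross_modal_update


def mask_supervision_loss(masks, face_mask_gt):
    # L^m: per-layer squared error against the detector-derived face mask
    # (mean reduction used here; the paper writes it as a squared norm).
    return sum(((m - face_mask_gt) ** 2).mean() for m in masks)
```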

A plausible implication is that face-focused modulation efficiently allocates model capacity to the primary driver of audio–video synchronization—oral-facial motion during speech.

4. Modality-Aware Classifier-Free Guidance

Standard classifier-free guidance (CFG) applies a fixed rescaling to conditional and unconditional predictions per modality, potentially muting cross-modal coupling. UniAVGen implements Modality-Aware CFG (MA-CFG) that uses a joint unconditional pass (with cross-modal attention disabled) as a baseline for both streams, explicitly amplifying the cross-modal contribution.

Let

  • $u_{\theta_v}$: video unconditional score (only text prompt),
  • $u_{\theta_a}$: audio unconditional score,
  • $u_{\theta_{a,v}}$: joint model score (full cross-modal interactions).

Final inference scores per modality:

$$\hat u_v = u_{\theta_v} + s_v \left( u_{\theta_{a,v}} - u_{\theta_v} \right)$$
$$\hat u_a = u_{\theta_a} + s_a \left( u_{\theta_{a,v}} - u_{\theta_a} \right)$$

for guidance scales $s_v, s_a > 1$, directly boosting cross-modal information flow and thus strengthening the correlation of emotional/motion cues between audio and video.
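
Because MA-CFG is a simple recombination of three score estimates, it reduces to a few lines; the sketch below assumes the three predictions are already available as tensors, with illustrative guidance-scale values.

```python
def modality_aware_cfg(u_video_uncond, u_audio_uncond,
                       u_joint_video, u_joint_audio,
                       s_v=2.0, s_a=2.0):
    """Modality-Aware CFG: amplify the cross-modal component of each score (sketch).

    u_*_uncond: per-modality predictions with cross-modal attention disabled.
    u_joint_*:  predictions from the joint pass with full cross-modal interaction.
    s_v, s_a:   guidance scales > 1 (the values here are placeholders).
    """
    u_hat_video = u_video_uncond + s_v * (u_joint_video - u_video_uncond)
    u_hat_audio = u_audio_uncond + s_a * (u_joint_audio - u_audio_uncond)
    return u_hat_video, u_hat_audio
```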

5. Training Regime and Data Efficiency

UniAVGen’s training proceeds in three sequential stages:

  1. Audio-only pretraining: 160K steps, batch size 256, learning rate $2 \times 10^{-5}$; optimizes only $L^a$.
  2. End-to-end joint training: 30K steps, batch size 32, learning rate $5 \times 10^{-6}$; jointly optimizes $L^{\mathrm{joint}} = L^v + L^a + \lambda^m L^m$ on approximately 1.3M real-human AV video samples.
  3. Multi-task fine-tuning: 10K steps, same hyperparameters, sampling five tasks (joint generation : generation with audio reference : continuation : video→audio : audio→video) in the ratio 4:1:1:2:2.

This regime demonstrates marked sample efficiency: UniAVGen requires only ~1.3M paired AV samples, versus 30.7M for Ovi and 6.4M for UniVerse-1.
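
For illustration, the stage-3 task mixture described above can be reproduced with a simple weighted draw; the task labels below are descriptive placeholders, not identifiers from the paper.

```python
import random

# Stage-3 multi-task fine-tuning: five tasks sampled in the stated 4:1:1:2:2 ratio.
TASKS = [
    "joint_generation",            # joint audio-video generation
    "joint_generation_audio_ref",  # joint generation with an audio reference
    "cross_modal_continuation",    # continuation of a given audio-video clip
    "video_to_audio",              # video-to-audio dubbing
    "audio_to_video",              # audio-driven video synthesis
]
WEIGHTS = [4, 1, 1, 2, 2]

def sample_task() -> str:
    """Draw the task for the next fine-tuning batch."""
    return random.choices(TASKS, weights=WEIGHTS, k=1)[0]
```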

6. Empirical Performance and Comparative Results

Experimental evaluation is conducted using AudioBox-Aesthetics (Production Quality PQ, Content Usefulness CU), Whisper-large (WER), VBench (Subject Consistency SC, Dynamic Degree DD, Imaging Quality IQ), SyncNet (Lip Sync LS), and Gemini LLM for Timbre and Emotion Consistency (TC/EC). Key results are summarized in the following table:

| Model | PQ | CU | WER | SC | DD | IQ | LS | TC | EC | Training Samples |
|---|---|---|---|---|---|---|---|---|---|---|
| UniAVGen | 7.00 | 6.62 | 0.151 | 0.973 | 0.410 | 0.779 | 5.95 | 0.832 | 0.573 | 1.3M |
| Ovi | 6.03 | 6.01 | 0.216 | 0.972 | 0.360 | 0.774 | 6.48 | 0.828 | 0.558 | 30.7M |
| UniVerse-1 | 4.56 | 4.29 | 0.296 | 0.985 | 0.08 | 0.733 | 1.21 | 0.573 | 0.300 | 6.4M |

UniAVGen outperforms all open-source joint-generation models in audio quality (PQ, CU) and timbre and emotion consistency (TC, EC), and achieves competitive lip synchronization (LS) with an order of magnitude less training data. Qualitative analysis notes that UniAVGen and Ovi yield high-fidelity results for in-distribution human video; where Ovi and UniVerse-1 struggle with stylized or out-of-domain scenarios, UniAVGen maintains alignment and audio–visual coherence.

Ablations indicate:

  • Asymmetric, temporally aligned cross-modal interactions (ATI) yield the highest consistency early in training.
  • Supervised face-aware masks with decaying $\lambda^m$ substantially improve LS, TC, and EC compared to unsupervised or absent FAM.
  • MA-CFG enhances emotional and motion correlation.
  • The pretrain–joint–multi-task schedule achieves the fastest and highest convergence in consistency metrics.

7. Limitations and Prospective Directions

Despite its strengths, UniAVGen reveals a number of limitations and opportunities for future improvement:

  • All face-awareness is based on 2D spatial masks from supervised detection; occlusions or extreme poses may not be optimally captured.
  • The backbone size, while competitive, remains smaller than some proprietary models; further scaling and architectural refinements (dynamic depth, advanced regularization) are needed for continued improvement.
  • Current modeling does not exploit multi-speaker scenarios, fine-grained scene context outside facial regions, or longer-form cross-modal temporal dependencies.

A plausible implication is that with further data expansion and semi-supervised mask learning, as well as architectural adaptation for broader genre and speaker coverage, UniAVGen could extend its performance margin and approach the few remaining specialty strengths of closed commercial baselines.
