Joint Audio-Visual Diffusion Models
- Joint audio-visual diffusion is a multimodal framework that employs denoising diffusion to synchronize and generate audio and visual signals.
- It leverages architectures like dual-stream transformers and shared backbones with cross-modal attention to enhance feature fusion and temporal alignment.
- Training objectives combine noise-prediction, contrastive, and synchrony losses to achieve high-fidelity synthesis and efficient multimodal pre-training.
Joint audio-visual diffusion refers to a class of generative and representation learning frameworks that utilize diffusion processes to jointly model, synthesize, or reconstruct audio and visual modalities in a synchronized and semantically coherent manner. These approaches extend denoising diffusion probabilistic models—originally developed for single-modality image or audio synthesis—so as to handle the complexity of correlated spatiotemporal and acoustic signals. Recent work demonstrates that joint audio-visual diffusion frameworks not only enhance generative fidelity and alignment but also enable unified multi-task learning and efficient pre-training for downstream multimodal tasks.
1. Foundations of Joint Audio-Visual Diffusion
Joint audio-visual diffusion models generalize the Markovian forward/reverse diffusion mechanism, decomposing the multimodal generative process into repeated denoising (or velocity-matching) steps over shared or coordinated latent representations for both modalities. The core principle is to treat a coupled state $\mathbf{x} = (\mathbf{x}^a, \mathbf{x}^v)$, where $\mathbf{x}^a$ denotes the audio latents and $\mathbf{x}^v$ the video latents, as the target of a joint forward corruption process. A trainable denoising model (often parameterized by a multi-modal transformer or U-Net variant) then learns to invert this process by predicting either the original data, the noise, or a velocity field.
Several foundational mathematical formalizations are prominent. MM-Diffusion (Ruan et al., 2022), CMMD (Yang et al., 2023), AV-DiT (Wang et al., 2024), LTX-2 (HaCohen et al., 6 Jan 2026), and 3MDiT (Li et al., 26 Nov 2025) parameterize joint diffusion either via stacked latents, concatenated state, or coordinated velocity fields, always synchronizing the temporal axis to ensure fine-grained alignment (e.g., for lip motion and speech). The loss is most often a sum of per-modality mean squared errors on predicted noise, although more recent frameworks incorporate joint contrastive (Yang et al., 2023), flow-matching (Li et al., 26 Nov 2025, Chen et al., 29 Jan 2026), or synchrony terms.
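The shared-timestep forward corruption described above can be sketched in a few lines. This is an illustrative toy, not code from any cited model: the linear beta schedule and helper names are assumptions, and the key point is only that both modalities are corrupted at the same timestep so their noise levels stay synchronized.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM beta schedule; returns cumulative products alpha-bar_t."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def joint_forward(x_audio, x_video, t, alpha_bars, rng):
    """Corrupt the coupled state (x^a, x^v) with a single shared timestep t,
    so both modalities sit at the same noise level."""
    ab = alpha_bars[t]
    eps_a = rng.standard_normal(x_audio.shape)
    eps_v = rng.standard_normal(x_video.shape)
    xt_a = np.sqrt(ab) * x_audio + np.sqrt(1.0 - ab) * eps_a
    xt_v = np.sqrt(ab) * x_video + np.sqrt(1.0 - ab) * eps_v
    return (xt_a, xt_v), (eps_a, eps_v)
```

At small $t$ the latents are barely perturbed; at the final step they are essentially pure Gaussian noise, which is what the reverse-time denoiser starts from.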
2. Model Architectures and Feature Fusion Strategies
Recent models leverage advanced neural architectures for joint denoising. Early frameworks used two separate but coupled U-Nets with cross-modal attention points (MM-Diffusion (Ruan et al., 2022)) or random-shift attention blocks for efficient alignment. This has evolved into several dominant strategies:
- Dual-stream Transformers: Models such as LTX-2 (HaCohen et al., 6 Jan 2026) and Seedance 1.5 pro (Chen et al., 15 Dec 2025) employ asymmetric dual-stream diffusion transformers, coupling a high-capacity video stream with a lower-capacity audio stream, synchronizing them via bidirectional cross-attention and adaptive normalization, and injecting cross-modal information at every block.
- Shared or Fusion-augmented Backbones: AV-DiT (Wang et al., 2024) freezes a large ViT-based DiT backbone, inserting lightweight modality-specific adapters and fusion blocks that use LoRA and temporal attention for joint feature modeling, providing parameter efficiency while maintaining cross-modal alignment.
- Omni-blocks and Gating Units: 3MDiT (Li et al., 26 Nov 2025) introduces tri-modal omni-blocks that perform joint self-attention over video, audio, and textual embeddings, coupled with dynamic text conditioning.
- Cross-Modal Attention and Fusion: Frameworks such as CMMD (Yang et al., 2023) (Easy Fusion), MM-Diffusion (Ruan et al., 2022) (RS-MMA), and AV-Edit (Guo et al., 26 Nov 2025) (correlation-based gating) align temporally and spatially co-located audio and video features before or within attention modules to enforce synchrony.
- Latent-space Interoperability: Approaches as in Seeing and Hearing (Xing et al., 2024) leverage pretrained encoders (e.g., ImageBind) to embed both modalities into a shared space, enabling latent-level alignment and gradient-based steering for improved cross-modal alignment.
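The bidirectional cross-attention pattern common to these designs can be reduced to a minimal sketch. This assumes single-head attention without learned projections or normalization; the cited models use full multi-head attention with per-block adaptive normalization, so this only illustrates the information flow, not any published architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

def bidirectional_fusion(audio_tokens, video_tokens):
    """One fusion step in the spirit of dual-stream blocks: each stream
    queries the other, and the result is added residually."""
    audio_out = audio_tokens + cross_attend(audio_tokens, video_tokens, video_tokens)
    video_out = video_tokens + cross_attend(video_tokens, audio_tokens, audio_tokens)
    return audio_out, video_out
```

Note the sequence lengths of the two streams need not match: audio and video token grids attend across different temporal resolutions, which is why the models above emphasize explicit temporal-axis synchronization.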
3. Diffusion Process and Training Objectives
The core generative dynamic consists of a forward process $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\right)$, where $\mathbf{x}_0 = (\mathbf{x}_0^a, \mathbf{x}_0^v)$ denotes the clean joint audio-visual latent and $t$ advances along a fixed or learnable noise schedule. The reverse process uses a parameterized denoiser to learn either noise predictions (DDPM-style), the original signal, or, in newer models, the velocity under a probability-flow ODE (flow-matching).
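For the flow-matching parameterization, a common form (the standard rectified-flow objective, not necessarily the exact variant of any one cited paper) interpolates linearly between data and noise and regresses the velocity:

$$
\mathbf{x}_t = (1 - t)\,\mathbf{x}_0 + t\,\boldsymbol{\epsilon},
\qquad
\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{\mathbf{x}_0,\, \boldsymbol{\epsilon},\, t}
\big\| v_\theta(\mathbf{x}_t, t) - (\boldsymbol{\epsilon} - \mathbf{x}_0) \big\|^2 ,
$$

where $\mathbf{x}_0 = (\mathbf{x}_0^a, \mathbf{x}_0^v)$ is the clean joint latent and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.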
Loss construction varies:
- Reconstruction/Noise Prediction Loss: Mean squared error between predicted and actual noise, either jointly on audio and video or with task-adaptive masking (UniForm (Zhao et al., 6 Feb 2025)).
- Contrastive and Synchrony Losses: Many models (DiffMAViL (Nunez et al., 2023), CMMD (Yang et al., 2023), Seedance 1.5 pro (Chen et al., 15 Dec 2025), AV-Edit (Guo et al., 26 Nov 2025)) include InfoNCE-style losses or explicit AV synchronization losses to ensure that the denoiser learns high-level temporal and semantic alignment.
- Guidance and Regularization: Classifier-free or modality-aware guidance (e.g., modality-CFG in LTX-2 (HaCohen et al., 6 Jan 2026)) can be used during sampling. Some models introduce curriculum schedules, masking ratio annealing, batch size adaptation, or feature gating for optimization stability and compute efficiency (Nunez et al., 2023, Guo et al., 26 Nov 2025).
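A toy composition of these objectives, combining per-modality noise-prediction MSE with a symmetric InfoNCE term, might look like the following. The contrastive weight, temperature, and the assumption that paired embeddings arrive per batch row are all illustrative choices, not taken from any cited paper.

```python
import numpy as np

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def info_nce(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (row-aligned) embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature                 # (B, B); positives on diagonal
    idx = np.arange(len(a))
    lp_av = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_va = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -0.5 * (np.mean(lp_av[idx, idx]) + np.mean(lp_va[idx, idx]))

def joint_loss(eps_pred_a, eps_a, eps_pred_v, eps_v,
               audio_emb, video_emb, lambda_contrastive=0.1):
    """Per-modality noise-prediction MSE plus a contrastive alignment term."""
    return (mse(eps_pred_a, eps_a) + mse(eps_pred_v, eps_v)
            + lambda_contrastive * info_nce(audio_emb, video_emb))
```

The reconstruction terms train the denoiser itself, while the contrastive term shapes intermediate features so that temporally paired audio and video land close together in embedding space.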
4. Conditioning, Multi-task Learning, and Generation Regimes
Joint audio-visual diffusion systems are engineered for broad conditional flexibility, supporting:
- Unconditional Joint Generation: Both audio and video are sampled from pure noise, capturing natural distributional correlates (Ruan et al., 2022).
- Conditional Generation: Video-to-audio (V2A), audio-to-video (A2V), and text-to-audio-video (T2AV) tasks are supported through task tokens, per-modality noise schedules, or special conditioning architectures (e.g., UniForm (Zhao et al., 6 Feb 2025), AVLDM (Kim et al., 2024), JUST-DUB-IT (Chen et al., 29 Jan 2026)).
- Editing and Inpainting: Architectures such as AV-Edit (Guo et al., 26 Nov 2025) and Language-Guided Joint Audio-Visual Editing (Liang et al., 2024) use a combination of masking, cross-modal attention modulation, and feature gating to support precise sound effect editing and scene-level audio-visual edits.
- Flexible Inference via Mixture-of-Noise: AVLDM (Kim et al., 2024) parameterizes the forward noising schedule per modality and time segment, enabling arbitrary conditioning (cross-modal, temporal inpainting, or continuation) at inference by selectively fixing or corrupting different subsets of the latent state.
- Discriminator/Guidance-Augmented Generation: MMDisCo (Hayakawa et al., 2024) introduces a lightweight joint discriminator that adjusts the scores of pre-trained single-modal diffusion models, enabling them to sample from a well-aligned joint distribution without full retraining.
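The per-modality noising idea can be sketched as a small dispatch over tasks: conditioning modalities stay clean while generated ones follow the sampler's schedule. The task names and the convention of pinning conditioning modalities at $t = 0$ are deliberate simplifications of AVLDM's mixture-of-noise formulation, used here only to show the mechanism.

```python
import numpy as np

def task_timesteps(task, t, batch_size):
    """Assign per-modality diffusion timesteps for one denoising step.
    Conditioning modalities are kept clean (t = 0); generated modalities
    follow the shared schedule. Task names are illustrative."""
    if task == "joint":      # generate both modalities from noise
        t_audio, t_video = t, t
    elif task == "v2a":      # video is the condition, audio is generated
        t_audio, t_video = t, 0
    elif task == "a2v":      # audio is the condition, video is generated
        t_audio, t_video = 0, t
    else:
        raise ValueError(f"unknown task: {task}")
    return np.full(batch_size, t_audio), np.full(batch_size, t_video)
```

Because each modality carries its own timestep, a single trained model can switch between joint generation, V2A, and A2V at inference simply by changing which latents are re-noised.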
5. Empirical Performance and Benchmarking
Empirical studies consistently demonstrate the superiority of joint diffusion models in unconditional and conditional audio-visual synthesis, on both modality-specific and alignment metrics.
| Model | FVD (↓) | FAD (↓) | AV-Align (↑) | Params (M) | Inference Speed (samp/sec) |
|---|---|---|---|---|---|
| MM-Diffusion (Ruan et al., 2022) | 98.69 | 10.58 | — | 426 | 0.009 |
| AV-DiT (Wang et al., 2024) | 68.88 | 10.17 | — | 159.91 | 0.032 |
| 3MDiT (SD3-adapt) (Li et al., 26 Nov 2025) | 424.1 | 2.03 | 0.627 | — | — |
| Seedance 1.5 pro (Chen et al., 15 Dec 2025) | — | — | 82% GSB | — | — |
Objective measures such as Fréchet Video Distance (FVD), Fréchet Audio Distance (FAD), and AV-Align (alignment between modality peaks) are standard. Recent models, particularly AV-DiT and LTX-2, demonstrate large gains in sample fidelity and efficiency, while 3MDiT reports substantial improvements in AV-Align over previous joint models. Subjective metrics such as MOS, lip-sync MAE, and human preference ratings are routinely applied to validate cross-modal synchrony and naturalness.
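As a rough illustration of peak-based alignment scoring, one simplified scheme (not the published AV-Align implementation, whose peak detection and matching details differ) computes an IoU between audio-energy peaks and visual-motion peaks within a small frame tolerance:

```python
import numpy as np

def find_peaks(signal, window=1):
    """Indices that are local maxima within +/- window frames and above the mean."""
    peaks = set()
    for i in range(len(signal)):
        lo, hi = max(0, i - window), min(len(signal), i + window + 1)
        if signal[i] == max(signal[lo:hi]) and signal[i] > np.mean(signal):
            peaks.add(i)
    return peaks

def av_align_iou(audio_energy, visual_motion, tol=1):
    """Simplified alignment score: IoU of audio peaks and visual peaks,
    counting a pair as matched if the peaks fall within tol frames."""
    pa = find_peaks(audio_energy)
    pv = find_peaks(visual_motion)
    matched = {a for a in pa if any(abs(a - v) <= tol for v in pv)}
    union = len(pa) + len(pv) - len(matched)
    return len(matched) / union if union else 1.0
```

A perfectly synchronized pair scores 1.0; extra or missing peaks in either modality push the score toward 0.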
Efficiency and scalability are core design criteria. For instance, DiffMAViL (Nunez et al., 2023) achieves a 32% reduction in pre-training FLOPs and an 18% decrease in wall-clock time through masking curricula and batch adaptation, without performance degradation. Seedance 1.5 pro (Chen et al., 15 Dec 2025) achieves faster inference via sampler distillation and quantization.
6. Applications and Extensions
Joint audio-visual diffusion frameworks are being deployed in a wide range of scenarios:
- Foundational Joint Generation: Seedance 1.5 pro (Chen et al., 15 Dec 2025), LTX-2 (HaCohen et al., 6 Jan 2026), and 3MDiT (Li et al., 26 Nov 2025) serve as backbone models for open-domain generation of synchronized, semantically rich audio-visual content, including support for multilingual text prompts, complex foley soundscapes, and narrative coherence.
- Conditional Editing and Dubbing: JUST-DUB-IT (Chen et al., 29 Jan 2026) demonstrates end-to-end video dubbing with LoRA-based low-rank adaptation to new languages, robust identity and lip-synchronization preservation, and resilience against complex dynamics and occlusions. AV-Edit (Guo et al., 26 Nov 2025) enables fine-grained sound effect insertion and removal in the context of visual content.
- Pre-training and Representation Learning: DiffMAViL (Nunez et al., 2023) unites contrastive, masked reconstruction and diffusion, yielding a powerful pre-training pipeline for both audio and video tasks with efficient transfer to downstream scenarios.
- Saliency and Semantic Modeling: DiffSal (Xiong et al., 2024) extends the paradigm to conditional generative tasks such as saliency prediction, where the objective is pixelwise heatmap inference given full audio-visual context.
- Speech Separation: AVDiffuSS (Lee et al., 2023) employs audio-visual diffusion with cross-attention fusion to surpass state-of-the-art source separation, especially in challenging multichannel conversational contexts.
7. Limitations and Prospects
Despite these advances, several limitations remain:
- Resolution and Fine Detail: Many models rely on VAE bottlenecks or low-resolution latents, impeding transfer to high-fidelity video or complex acoustic domains (Zhao et al., 6 Feb 2025, Wang et al., 2024).
- Efficiency at Scale: Vanilla DDPM samplers require on the order of a thousand sequential denoising steps per sample; although recent progress with DDIM, DPM-Solver, and distillation mitigates this, true real-time synthesis at high resolution remains an open target (Chen et al., 15 Dec 2025).
- Alignment and Control: While explicit synchrony metrics and guidance mechanisms have improved, robust control over AV alignment in arbitrary scenes—particularly in zero-shot or long-form content—remains an active research direction (Xing et al., 2024, Li et al., 26 Nov 2025).
- Modality and Task Generalization: Extending existing frameworks to cover further modalities (e.g., text, motion capture, scene graphs) or new tasks (editing, inpainting, continuation) brings with it challenges in model scaling, training data diversity, and inference flexibility (Kim et al., 2024).
- Data Bias and Robustness: Model performance still depends heavily on the scale and bias of the pre-training data, which can manifest as failure modes in cross-lingual or open-domain scenarios (Chen et al., 29 Jan 2026, HaCohen et al., 6 Jan 2026).
A plausible implication is that future research will expand upon dynamic fusion mechanisms, adaptive guidance, higher-fidelity conditional modeling, and architectural efficiency to bridge remaining quality and scalability gaps.
Key references: (Ruan et al., 2022, Nunez et al., 2023, Yang et al., 2023, Xing et al., 2024, Xiong et al., 2024, Kim et al., 2024, Hayakawa et al., 2024, Wang et al., 2024, Zhao et al., 6 Feb 2025, Guo et al., 26 Nov 2025, Li et al., 26 Nov 2025, Chen et al., 15 Dec 2025, HaCohen et al., 6 Jan 2026, Chen et al., 29 Jan 2026)