AVFullDiT: Joint Audio-Video Diffusion Model
- AVFullDiT is a class of parameter-efficient transformer models that jointly denoise and synchronize audio and video modalities.
- It integrates pre-trained Diffusion Transformers using specialized fusion mechanisms like AVFull-Attention and aligned positional encodings for robust cross-modal alignment.
- Empirical results indicate improved video realism, audio fidelity, and temporal synchrony, often outperforming state-of-the-art unimodal systems.
Audio-Video Full DiT (AVFullDiT) refers to a class of parameter-efficient, transformer-based latent diffusion models that jointly model, denoise, and synchronize both video and audio modalities. These architectures leverage pre-trained Diffusion Transformers (DiTs) for text-to-video (T2V) and text-to-audio (T2A) generation, integrating them through specialized fusion mechanisms—such as AVFull-Attention and aligned positional encodings—to enable joint cross-modal denoising and generation. AVFullDiT systems are positioned as a unified alternative to traditional cascaded or weakly coupled pipelines, establishing improved cross-modal temporal and semantic coherence while often maintaining or exceeding state-of-the-art unimodal quality for both audio and video.
1. Architectural Foundations and Variants
AVFullDiT architectures universally fuse pre-trained or symmetrically co-designed Diffusion Transformers for audio and video. The canonical architecture (Wu et al., 2 Dec 2025) retains two “unimodal towers” (T2V and T2A DiTs) initialized from their respective pre-trained weights, followed by a sequence of joint “AVFull-Attention” blocks in which audio and video tokens are concatenated and fused within a single multi-head self-attention (MHSA) module. Modality alignment is handled by inserting small adapter layers (e.g., audio-adapter projections) so that channel dimensions match, avoiding any need to re-initialize the multi-billion-parameter backbones.
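This parameter-efficiency pattern can be made concrete with a brief sketch that treats the towers as opaque modules and trains only the new adapter projections; the channel widths and module names below are illustrative assumptions, not the published configuration.

```python
import torch.nn as nn

def attach_audio_adapter(video_dit: nn.Module, audio_dit: nn.Module,
                         d_audio: int = 512, d_video: int = 1024) -> nn.ModuleDict:
    """Keep both pre-trained towers intact and frozen; train only small
    adapter projections that match the audio channel width to the video one."""
    for tower in (video_dit, audio_dit):
        for p in tower.parameters():
            p.requires_grad_(False)                     # no re-initialization, no full fine-tune
    adapter = nn.ModuleDict({
        "audio_to_joint": nn.Linear(d_audio, d_video),  # before the joint attention blocks
        "joint_to_audio": nn.Linear(d_video, d_audio),  # after the joint attention blocks
    })
    model = nn.ModuleDict({"video": video_dit, "audio": audio_dit, "adapter": adapter})
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,}")
    return model
```

In this sketch only the adapter projections (and, in a full system, the newly added joint AVFull-Attention blocks) contribute trainable parameters, while the towers retain their pre-trained weights.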
Other models implement distinct fusion strategies:
- Twin-Backbone Cross-Modal Fusion: Ovi initializes two identical DiT towers for audio and video, jointly trained with dense, bidirectional cross-attention and temporal alignment via scaled rotary positional embeddings (RoPE). Audio representations are adapted to the length of video sequences by scaling positional encodings, enabling temporally coherent cross-attention (Low et al., 30 Sep 2025).
- Single-Backbone Parameter-Efficient Design: AVFullDiT as in AV-DiT (Wang et al., 11 Jun 2024) employs a frozen image DiT backbone for both modalities, with lightweight adapters for temporal audio/video specificity and cross-modal fusion, drastically reducing trainable parameter counts.
- Hierarchical Priors for Synchrony: JavisDiT introduces hierarchical spatial–temporal prior tokens modulating attention at multiple scales, while employing a shared DiT stack and bi-directional cross-modal attention throughout the model (Liu et al., 30 Mar 2025).
- Frozen Modular Backbones and Minimal Cross-Attention: Foley Control maintains fully frozen T2A and video encoders, inserting only a compact video-to-audio cross-attention “bridge” after each text-to-audio attention sublayer, using per-frame pooled video tokens to condition the audio transformer (Rowles et al., 24 Oct 2025).
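Foley Control's frozen-backbone pattern, for instance, can be illustrated with a compact trainable bridge; the dimensions and module names in the following sketch are assumptions for illustration, not the paper's actual API.

```python
import torch
import torch.nn as nn

class VideoToAudioBridge(nn.Module):
    """Compact trainable cross-attention inserted after a text-to-audio
    attention sublayer; the T2A backbone and video encoder stay frozen."""
    def __init__(self, d_audio: int, d_video: int, n_heads: int = 8):
        super().__init__()
        self.video_proj = nn.Linear(d_video, d_audio)   # match channel widths
        self.cross_attn = nn.MultiheadAttention(d_audio, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_audio)

    def forward(self, audio_tokens: torch.Tensor, video_frames: torch.Tensor) -> torch.Tensor:
        # video_frames: (B, T_frames, N_patches, d_video) -> per-frame pooled tokens
        pooled = video_frames.mean(dim=2)               # (B, T_frames, d_video)
        ctx = self.video_proj(pooled)                   # (B, T_frames, d_audio)
        attn_out, _ = self.cross_attn(self.norm(audio_tokens), ctx, ctx)
        return audio_tokens + attn_out                  # residual conditioning of the audio stream
```

Because only the bridge parameters are trained, the underlying T2A backbone and video encoder can in principle be swapped without retraining the other side.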
2. AVFull-Attention and Cross-Modal Alignment
The critical innovation in AVFullDiT is the AVFull-Attention mechanism, which unifies cross-modal information flow within each transformer block. Rather than employing separate cross-modal attention layers, AVFull-Attention concatenates audio and video tokens, projects them into a common feature space (via learned adapters for modality dimensionality matching), and performs standard MHSA jointly. This provides symmetric conditioning: video queries attend to audio keys/values and vice versa in the same computation. The absence of separate cross-modal blocks preserves the structure and initialization advantage of large pre-trained unimodal towers.
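A minimal sketch of this joint attention step follows, assuming illustrative dimensions and an adapter that lifts audio tokens to the shared width; module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFullAttention(nn.Module):
    """Single joint MHSA over concatenated audio and video tokens:
    video queries attend to audio keys/values and vice versa."""
    def __init__(self, d_model: int = 1024, d_audio: int = 512, n_heads: int = 16):
        super().__init__()
        self.n_heads = n_heads
        self.audio_in = nn.Linear(d_audio, d_model)    # adapter: audio -> shared width
        self.audio_out = nn.Linear(d_model, d_audio)   # adapter: shared width -> audio
        self.qkv = nn.Linear(d_model, 3 * d_model)     # one shared projection, as in standard MHSA
        self.out = nn.Linear(d_model, d_model)

    def forward(self, v_tok: torch.Tensor, a_tok: torch.Tensor):
        B, Nv, D = v_tok.shape
        a = self.audio_in(a_tok)                       # (B, Na, D)
        x = torch.cat([v_tok, a], dim=1)               # (B, Nv + Na, D) joint token sequence
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):                                  # (B, N, D) -> (B, heads, N, D/heads)
            return t.view(B, -1, self.n_heads, D // self.n_heads).transpose(1, 2)

        y = F.scaled_dot_product_attention(split(q), split(k), split(v))
        y = self.out(y.transpose(1, 2).reshape(B, -1, D))
        v_out, a_out = y[:, :Nv], y[:, Nv:]            # split back into modalities
        return v_tok + v_out, a_tok + self.audio_out(a_out)   # per-modality residual update
```

A toy check such as `AVFullAttention()(torch.randn(1, 8, 1024), torch.randn(1, 16, 512))` returns updated video and audio tokens in their original shapes, with each modality having attended to both.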
Temporal alignment between modalities, necessary given their divergent frame and sample rates, is addressed using RoPE embeddings with modality-specific scaling of token positions, e.g. $p_a' = \alpha\, p_a$, where $\alpha$ is the alignment factor matching the audio token rate to the video frame rate. This ensures real-time events are represented at matched phases between modalities (Wu et al., 2 Dec 2025, Low et al., 30 Sep 2025).
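The effect of the scaling can be illustrated with a short sketch in which the alignment factor is taken as the ratio of video to audio token counts over the same clip (an assumption for illustration):

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE rotation angles for a batch of (possibly fractional) positions."""
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return positions[:, None] * inv_freq[None, :]         # (N, dim/2)

# Example: 16 video frames and 64 audio latent steps covering the same clip.
n_video, n_audio, dim = 16, 64, 64
alpha = n_video / n_audio                                  # alignment factor
video_angles = rope_angles(torch.arange(n_video, dtype=torch.float32), dim)
audio_angles = rope_angles(alpha * torch.arange(n_audio, dtype=torch.float32), dim)

# Tokens at the same physical time now share the same rotary phase, e.g.
# audio step 4 aligns with video frame 1 (4 * alpha == 1.0).
assert torch.allclose(audio_angles[4], video_angles[1])
```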
Bidirectional cross-attention, as in Ovi and JavisDiT, is often inserted at each block, allowing each modality’s queries to attend across the entirety of both self and cross-modality key/value spaces. This enables iterative multi-hop fusion of semantics and timing.
3. Diffusion Process and Losses
Most AVFullDiT frameworks employ either DDPM-style discrete-time Gaussian diffusion or continuous-time flow-matching/rectified-flow modeling. Forward noising is performed independently for each modality, e.g. $x_t^{(m)} = (1-t)\,x_0^{(m)} + t\,\epsilon^{(m)}$ under rectified flow, with per-modality latent velocities $v^{(m)} = \epsilon^{(m)} - x_0^{(m)}$ as supervision for the network's outputs. The loss function is typically the sum of squared errors to the true velocity for each modality, $\mathcal{L} = \lambda_v \|\hat{v}^{(v)} - v^{(v)}\|^2 + \lambda_a \|\hat{v}^{(a)} - v^{(a)}\|^2$, with tunable weights $\lambda_v$ and $\lambda_a$; equal weighting is standard (Wu et al., 2 Dec 2025). Some variants employ a slightly modality-biased weighting (Low et al., 30 Sep 2025).
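A minimal training-step sketch of this joint objective, assuming a denoiser with a two-stream interface like the earlier sketches (the `model(xt_v, xt_a, t)` signature is illustrative):

```python
import torch
import torch.nn.functional as F

def joint_flow_matching_loss(model, x0_video, x0_audio, lambda_v=1.0, lambda_a=1.0):
    """Shared timestep schedule, independent noising per modality, summed velocity losses."""
    B = x0_video.shape[0]
    t = torch.rand(B, device=x0_video.device)             # one time schedule for both streams
    tv = t.view(B, *([1] * (x0_video.dim() - 1)))
    ta = t.view(B, *([1] * (x0_audio.dim() - 1)))

    eps_v, eps_a = torch.randn_like(x0_video), torch.randn_like(x0_audio)
    xt_v = (1 - tv) * x0_video + tv * eps_v                # forward noising (video)
    xt_a = (1 - ta) * x0_audio + ta * eps_a                # forward noising (audio)

    target_v, target_a = eps_v - x0_video, eps_a - x0_audio  # rectified-flow velocity targets
    pred_v, pred_a = model(xt_v, xt_a, t)                     # joint denoiser, two outputs

    return lambda_v * F.mse_loss(pred_v, target_v) + lambda_a * F.mse_loss(pred_a, target_a)
```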
No additional explicit alignment or adversarial losses are required: cross-modal synchrony emerges from shared time schedules in the denoising process combined with tightly-coupled attention (Wang et al., 11 Jun 2024, Rowles et al., 24 Oct 2025).
At inference, both modalities are sampled jointly—either via DDPM-style stepwise sampling or ODE solvers (for continuous-time flow-matching)—ensuring synchronized generation of the video and audio streams. Classifier-free guidance is commonly used for conditional sampling, with separate guidance strengths for audio and video.
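A sketch of joint sampling with a simple Euler ODE integrator and per-modality classifier-free guidance follows; the conditioning interface and guidance scales are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample_joint(model, shape_v, shape_a, cond, steps=50, cfg_v=6.0, cfg_a=4.0, device="cpu"):
    """Integrate the learned velocity field from t=1 (noise) to t=0 for both streams."""
    xv, xa = torch.randn(shape_v, device=device), torch.randn(shape_a, device=device)
    ts = torch.linspace(1.0, 0.0, steps + 1, device=device)
    for i in range(steps):
        t, dt = ts[i], ts[i + 1] - ts[i]                   # dt is negative (noise -> data)
        t_b = t.expand(shape_v[0])
        vv_c, va_c = model(xv, xa, t_b, cond)              # conditional velocities
        vv_u, va_u = model(xv, xa, t_b, None)              # unconditional velocities
        vv = vv_u + cfg_v * (vv_c - vv_u)                  # separate guidance per modality
        va = va_u + cfg_a * (va_c - va_u)
        xv, xa = xv + dt * vv, xa + dt * va                # joint Euler step
    return xv, xa
```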
4. Training Paradigms, Parameterization, and Datasets
AVFullDiT systems leverage several strategies for training efficiency:
- Pre-training & Freezing: Leveraging pre-trained, frozen backbones (as in AV-DiT and Foley Control) or initializing both towers identically promotes rapid convergence and high-fidelity marginal distributions for each modality.
- Adapter-Only or Partial Fine-Tuning: Only small adapter matrices (audio adapters, fusion LoRA, bridge cross-attention) are newly trained, often comprising only a small fraction of total model parameters. For example, AV-DiT (Wang et al., 11 Jun 2024) introduces only 160M new trainable parameters, far fewer than fully fine-tuned baselines that retrain entire backbones.
- Mixed and Large-Scale Data: Training regimes use multi-source datasets—AVSync15, Landscape, GreatestHits, VGGSound, AudioSet, and in some cases, curated multi-task benchmarks—separating small-scale ALT-Merge-like regimes from large-scale (e.g., millions of paired AV clips) for scale ablation (Wu et al., 2 Dec 2025, Low et al., 30 Sep 2025).
- Optimization: Most employ AdamW with low learning rates and large batch sizes, and perform inference with $50$–$250$ denoising steps.
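A minimal optimizer loop in this spirit, reusing the `joint_flow_matching_loss` sketch above; the learning rate and other values are illustrative placeholders, not reported hyperparameters.

```python
import torch

# Illustrative only: train just the adapter / fusion parameters at a low learning rate.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-2, betas=(0.9, 0.95))

for x0_video, x0_audio in dataloader:                      # paired audio-video latents
    loss = joint_flow_matching_loss(model, x0_video, x0_audio)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(trainable, 1.0)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```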
A plausible implication is that AVFullDiT variants can be flexibly adapted to different task scales and hardware capabilities due to their modular training and parameter efficiency.
5. Empirical Results, Metrics, and Interpretation
Empirical validation of AVFullDiT centers on both unimodal and cross-modal benchmarks:
- Video-Only Quality Boost from Joint Training: AVFullDiT improves video metrics such as ImageQual, Subject Consistency, Text Consistency, and Physics (physical plausibility) scores, with the most pronounced improvements (+2–3%) for subsets involving object contact or “AV-tight” motion. Joint audio-video denoising provides higher perceived video realism and temporally coherent dynamics, even when only video quality is evaluated (Wu et al., 2 Dec 2025).
- Audio Synchrony and Fidelity: In models such as Ovi and AV-DiT, human preference rates for audiovisual realism and synchrony reach 75–85% vs. strong baselines, with audio FAD and text-to-audio alignment metrics matching or exceeding those of dedicated unimodal models (Wang et al., 11 Jun 2024, Low et al., 30 Sep 2025).
- Cross-Modal Synchronization Metrics: Datasets and scores such as JavisBench/JavisScore (Liu et al., 30 Mar 2025) and MovieGenBench (Rowles et al., 24 Oct 2025) have been proposed for fine-grained AV alignment evaluation, focusing on both chunkwise temporal synchrony and overall semantic coherence.
- User Studies and Qualitative Results: Users consistently prefer AVFullDiT-generated sequences over cascaded or contrastive baselines for audio-visual consistency, with qualitative samples showing tight correspondence between impact events in video and corresponding sounds, or speech and lip motion.
Some architectures, such as Foley Control, demonstrate competitive semantic alignment using frozen backbones and minimal training via cross-attention bridges, suggesting that parameter-efficient approaches retain strong alignment performance (Rowles et al., 24 Oct 2025).
6. Theoretical Insights: Audio as Privileged Causal Signal
A defining hypothesis emerging from AVFullDiT research is that audio functions as a privileged supervisory signal for discovering physical-causal structure in generative models. By jointly predicting audio and video, models are pressured to internalize the physical relationships linking motion and sound (e.g., collisions leading to impact noise), regularizing the spatiotemporal dynamics of video generation and mitigating pathologies such as “contact avoidance” or exaggerated/freezing motion (Wu et al., 2 Dec 2025). This insight is backed by improved commonsense metrics (e.g., Videophy-2 physics score) and ablation studies showing that removing joint audio training degrades these physical and perceptual attributes, even when the video-only loss marginally improves.
A plausible implication is that future world models should incorporate cross-modal causal prediction to achieve physically plausible and generalizable generative performance.
7. Outlook, Limitations, and Extensions
AVFullDiT architectures represent a shift towards truly joint modeling of temporally dense, multi-sensory world data. However, several limitations persist:
- Scalability and Efficiency: Dense all-to-all cross-attention scales quadratically with the combined token count, leading to slow inference and training for high-resolution or long-duration content (Liu et al., 30 Mar 2025, Low et al., 30 Sep 2025).
- Duration and Resolution Limits: Current systems are typically validated on short clips (roughly 2–5 s) at modest resolutions with corresponding audio; efficient extensions to minute-scale, variable-length content require further research.
- Synchronization Metrics and Benchmarks: Though metrics like JavisScore and MovieGenBench address AV synchrony, their accuracy leaves room for improvement, especially in diverse, naturalistic scenarios.
Modularity—exemplified by models like Foley Control—is a promising axis for progress: frozen, swappable backbones and minimal “bridges” enable rapid development and deployment of AVFullDiT-style systems on top of ever-advancing unimodal generative models.
In sum, AVFullDiT frameworks enable robust, physically consistent, and scalable audio-video generation through tightly coupled, parameter-efficient diffusion transformers, validated across challenging AV tasks and benchmarks (Wu et al., 2 Dec 2025, Liu et al., 30 Mar 2025, Low et al., 30 Sep 2025, Wang et al., 11 Jun 2024, Rowles et al., 24 Oct 2025).