
AV-LLMs: Audio-Video Transformer Models

Updated 22 November 2025
  • Audio-Video LLMs are transformer models that integrate synchronized audio and visual data to facilitate multimodal understanding and cross-modal reasoning.
  • They employ advanced fusion strategies such as interleaved merging and cross-attention modules to efficiently align modality-specific features.
  • Empirical evaluations reveal that optimized architectures enhance performance and mitigate hallucinations across diverse benchmarks such as AVQA and Music-AVQA.

Audio-Video LLMs (AV-LLMs) are a class of transformer-based architectures that ingest synchronized audio and visual streams—typically video frames and raw or processed audio—enabling multimodal language understanding, open-ended generation, and cross-modal reasoning over temporally extended, richly annotated datasets. AV-LLMs have become foundational for tasks requiring deep, fine-grained integration of dynamic vision and audio, including video captioning, audio-visual question answering (AVQA), speech-augmented comprehension, audio-visual speech recognition (AVSR), and reasoning-intensive multi-step QA grounded in complex or adversarial scenarios. This article systematically details (1) core model architectures and alignment strategies, (2) benchmark design and evaluation, (3) advances in reasoning, preference optimization, and hallucination mitigation, (4) key empirical results and ablation findings, and (5) current efficiency and scalability challenges.

1. Architectural Design and Multimodal Fusion Strategies

AV-LLMs are generally constructed by bridging frozen modality-specific encoders (vision and audio) to an LLM through one or more alignment modules. Canonical architectures include Video-LLaMA, video-SALMONN, Dolphin, C3LLM, and Zero-AVSR, featuring combinations of Q-Formers, adapters, and sophisticated token fusion policies (Zhang et al., 2023, Sun et al., 22 Jun 2024, Guo et al., 2 Apr 2025, Wang et al., 25 May 2024, Yeo et al., 8 Mar 2025).

  • Encoders: Visual streams are processed by ViT-style image encoders (e.g., EVA-CLIP, InstructBLIP), while audio uses models such as Whisper, BEATs, or ImageBind-AST, often operating on mel-spectrograms and synchronized at frame or segment-level granularity.
  • Alignment modules: Cross-modal alignment is achieved through several strategies. Q-Formers (cross-attention modules) learn to aggregate temporally or semantically meaningful feature sets from each modality and project them into the LLM hidden space. Adapter-based approaches—including LoRA adapters—allow efficient fine-tuning with a small parameter footprint.
  • Temporal and spatial fusion: Audio and visual tokens are fused via concatenation, interleaved merging, or fine-grained cross-attention at various stages. Dolphin (Guo et al., 2 Apr 2025) utilizes a multi-scale adapter for spatial alignment and interleaved merging for temporal synchronization, while video-SALMONN’s MRC Q-Former pools features at multiple resolutions and enforces causal self-attention across windows (Sun et al., 22 Jun 2024). The LLaVA-AV-SSM model compresses audio tokens using a Mamba-based SSM, achieving scalable attention across long temporal contexts (Kim et al., 22 Sep 2025).
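
A minimal sketch of this bridging-and-fusion pattern is given below, assuming hypothetical module names and tensor shapes; the Q-Former, adapter, and merging implementations in Video-LLaMA, Dolphin, and video-SALMONN differ in depth, pooling, and windowing. Learned query tokens cross-attend to frozen encoder features and are projected into the LLM embedding space, and per-segment audio and video token groups are interleaved along the time axis.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Hypothetical Q-Former-style bridge: learned queries cross-attend to
    frozen encoder features and are projected into the LLM hidden space."""
    def __init__(self, enc_dim=1024, llm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, enc_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)   # project into the LLM embedding space

    def forward(self, enc_feats):                 # (B, T_enc, enc_dim), frozen encoder output
        q = self.queries.unsqueeze(0).expand(enc_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, enc_feats, enc_feats)
        return self.proj(fused)                   # (B, n_queries, llm_dim)

def interleave_av(video_tokens, audio_tokens):
    """Interleaved temporal merging: alternate per-segment video and audio
    token groups so the LLM receives them in synchronized temporal order.

    video_tokens, audio_tokens: lists of (B, n_i, llm_dim) tensors, one per segment.
    """
    merged = []
    for v, a in zip(video_tokens, audio_tokens):
        merged.extend([v, a])
    return torch.cat(merged, dim=1)               # (B, total_tokens, llm_dim)
```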

2. Benchmarking, Datasets, and Evaluation Protocols

AV-LLM development is coupled with benchmark design to probe multi-modal reasoning, robustness, and compositional capability:

  • Generic AVQA and music-focused tasks: Datasets such as AVQA, Music-AVQA, and the more challenging AVQA-Hard and Music-AVQA-Hard support benchmarking under conditions where visual shortcuts are removed (Kim et al., 22 Sep 2025). Metrics are typically accuracy, answer-exact match, and n-gram-based semantic scores.
  • Instruction and alignment datasets: AVU (Guo et al., 2 Apr 2025) and SAVEn-Vid (Li et al., 25 Nov 2024) provide millions of paired video-audio-caption tuples for instruction tuning, including negative/rejection and modality-focused subsets.
  • Reasoning and robustness benchmarks: RivaBench (Sun et al., 17 Feb 2025) and AVTrustBench (Chowdhury et al., 3 Jan 2025) assess compositional reasoning, adversarial perturbation, and modality-specific dependency using circular evaluation and calibrated preference metrics. A scoring sketch of circular evaluation appears after this list.
  • Specialized emotion and AVSR tasks: AV-EMO-Reasoning (Patel et al., 8 Oct 2025) utilizes continuous emotion regression from jointly annotated audio-video dialog, while Zero-AVSR (Yeo et al., 8 Mar 2025) benchmarks zero-shot speech recognition in low-resource and unseen languages.
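
As an illustration of circular evaluation, the sketch below assumes a hypothetical predict wrapper around the model and a common formulation of the protocol (the exact RivaBench and AVTrustBench procedures may differ): a multiple-choice item counts as correct only if the model answers correctly under every rotation of its answer options.

```python
from typing import Callable, List

def circular_accuracy(
    questions: List[str],
    options: List[List[str]],                   # answer choices per question
    answers: List[int],                         # index of the correct choice
    predict: Callable[[str, List[str]], int],   # hypothetical model wrapper: returns chosen index
) -> float:
    """Circular evaluation (sketch): an item is correct only if the model
    picks the right answer under every rotation of its answer options."""
    correct = 0
    for q, opts, ans in zip(questions, options, answers):
        ok = True
        for shift in range(len(opts)):
            rotated = opts[shift:] + opts[:shift]   # rotate the option order
            pred = predict(q, rotated)              # model's chosen index under this rotation
            if rotated[pred] != opts[ans]:          # compare by option text, not position
                ok = False
                break
        correct += ok
    return correct / len(questions)
```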

3. Reasoning, Preference Optimization, and Hallucination Mitigation

Advanced AV-LLMs integrate explicit training or inference-time techniques to address the logical complexity and reliability of multi-modal generation:

  • Preference optimization: Multi-round DPO (MrDPO) (Tang et al., 18 Jun 2025) and process DPO (pDPO) (Sun et al., 17 Feb 2025) optimize LoRA-adapted policies on preference pairs decomposed into event-level or step-level feedback, leading to substantial reductions in missing/hallucinated content and higher QA accuracy. CAVPref extends DPO with explicit audio/video calibration terms and distributional robustness to counter bias toward a dominant modality (Chowdhury et al., 3 Jan 2025). A sketch of the underlying pairwise DPO objective appears after this list.
  • Hallucination suppression: AVCD (Jung et al., 27 May 2025) introduces trimodal contrastive decoding, masking non-dominant modalities based on attention statistics and combining logits from multiple corrupted/uncorrupted modality passes to penalize hallucinated generations. This yields significant gains on hallucination-focused benchmarks such as AVHBench. A logit-level sketch of this style of contrastive decoding appears after this list.
  • Modality bias correction: Fork-Merge Decoding (FMD) (Jung et al., 27 May 2025) processes audio-centric and video-centric token streams separately through the early decoder layers, then merges the hidden states via a weighted sum for joint reasoning in the upper layers, rectifying modality under-utilization. Dolphin applies unpaired mixed training and diversity loss, while AVU integrates rejection-tuning to further limit hallucination (Guo et al., 2 Apr 2025).
  • Efficiency and scaling: AccKV (Jiang et al., 14 Nov 2025) replaces naive selective KV caching with layer-adaptive focusing and cross-calibration, minimizing memory/FLOP requirements by evicting or merging less-attended tokens dynamically per layer.
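
For context, the pairwise objective that MrDPO and pDPO build on can be sketched as below. This is a minimal version of the standard DPO loss with illustrative tensor names, not the multi-round or step-level variants described in the cited papers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard pairwise DPO objective over a batch of preference pairs (sketch).

    logp_*:     summed log-probabilities of chosen/rejected responses under the policy.
    ref_logp_*: the same quantities under the frozen reference model.
    beta:       inverse-temperature controlling deviation from the reference.
    """
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Maximize the policy's preference margin relative to the reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```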
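
A minimal sketch of decoding-time contrastive logit combination is shown below. It assumes hypothetical inputs (logits_full from a pass with all modalities intact, logits_masked from a pass with the non-dominant modality masked or corrupted) and a generic contrastive-decoding formulation; the actual AVCD procedure, which selects the non-dominant modality from attention statistics, differs in detail.

```python
import torch

def contrastive_decode_step(logits_full, logits_masked, alpha=1.0, beta=0.1):
    """One greedy decoding step of audio-visual contrastive decoding (sketch).

    logits_full:   (vocab,) logits from the pass with all modalities intact.
    logits_masked: (vocab,) logits from a pass with the non-dominant modality
                   masked or corrupted.
    alpha:         strength of the contrastive penalty.
    beta:          plausibility cutoff relative to the most likely token.
    """
    log_p_full = torch.log_softmax(logits_full, dim=-1)
    log_p_masked = torch.log_softmax(logits_masked, dim=-1)

    # Reward tokens supported by the full AV context; penalize tokens the
    # model would emit anyway without that evidence.
    scores = (1 + alpha) * log_p_full - alpha * log_p_masked

    # Keep only tokens reasonably likely under the full-context distribution.
    cutoff = log_p_full.max() + torch.log(torch.tensor(beta))
    scores = scores.masked_fill(log_p_full < cutoff, float("-inf"))

    return scores.argmax()  # next-token id
```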

4. Key Empirical Findings and Ablation Insights

Extensive head-to-head comparisons across benchmarks and ablation studies clarify the incremental and synergistic gains of architectural and training advances:

  • LLaVA-AV-SSM with audio + Mamba compression (Kim et al., 22 Sep 2025): 71.6% accuracy on AVQA-Hard (+4.5 pp over base) and 36.8% on Music-AVQA-Hard (+1.2 pp).
  • AVCD contrastive decoding (Jung et al., 27 May 2025): +6–11% gain over the base model on AVHBench.
  • video-SALMONN 2 + MrDPO (Tang et al., 18 Jun 2025): 28% reduction in video captioning error versus the base model.
  • video-SALMONN-o1 + pDPO (Sun et al., 17 Feb 2025): +3–8% improvement over the base model.
  • CAVPref (Chowdhury et al., 3 Jan 2025): +20–30% improvement across tasks.
  • Ablation on fusion strategies: Interleaved merging and bidirectional cross-attention outperform naive concatenation (Dolphin), while fork-merge decoding and Mamba-based compression outperform non-forked or uncompressed token sequences.
  • Failure modes: Audio hallucinations are prevalent when audio signals are ignored (a 32% hallucination rate; Nishimura et al., 18 Jan 2024), and bag-of-words shortcutting persists in compositional QA tasks. Modality bias in attention (e.g., roughly 70% of attention mass on video) is empirically corrected by FMD or AccKV.
  • Emergent abilities and robust gains: video-SALMONN family models show emergent speech-visual co-reasoning (e.g., lip-reading, cross-modal identification). Dolphin and AVTrustBench studies reveal that robust preference optimization yields large gains on compositional and adversarial subtasks.

5. Efficiency, Scalability, and Practical Model Deployment

Addressing the computational cost of long audio-video streams, especially in long-form video or conversational settings, has driven technical optimization:

  • Token compression: Causal SSM (Mamba) in LLaVA-AV-SSM compresses 25 Hz audio into 1 Hz token streams, permitting hour-long video inference within feasible memory bounds (Kim et al., 22 Sep 2025).
  • KV cache management: AccKV’s layer-adaptive focusing and cross-calibration reduce KV cache memory by up to 90% and total inference latency by ~600 ms per 1k tokens, with <2% accuracy loss on standard benchmarks (Jiang et al., 14 Nov 2025). An illustrative attention-guided eviction heuristic is sketched after this list.
  • Plug-and-play strategies: Both AVCD and Fork-Merge Decoding are training-free and can be applied to existing architectures without modification, supporting efficient inference and experimentation with proprietary or frozen model weights (Jung et al., 27 May 2025).
  • Modality-agnostic extensions: Techniques for dynamic fusion, caching, and preference calibration generalize to additional modalities and tri-modal or higher scenarios, including speech, sensor, and text channels (Jiang et al., 14 Nov 2025, Guo et al., 2 Apr 2025).
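
As a rough illustration of attention-guided cache reduction, the sketch below keeps only the most-attended cached tokens in a layer. It is a simple top-k heuristic with assumed tensor shapes, not the actual AccKV layer-adaptive focusing and cross-calibration procedure.

```python
import torch

def evict_kv_by_attention(keys, values, attn_mass, keep_ratio=0.5):
    """Attention-guided KV cache eviction (sketch, not the AccKV algorithm).

    keys, values: (T, d) cached key/value tensors for one layer/head.
    attn_mass:    (T,) accumulated attention mass each cached token has received.
    keep_ratio:   fraction of cached tokens to retain.
    """
    T = keys.size(0)
    k = max(1, int(T * keep_ratio))
    # Keep the top-k most-attended tokens, preserving their temporal order.
    keep = torch.topk(attn_mass, k).indices.sort().values
    return keys[keep], values[keep]
```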

6. Open Challenges, Limitations, and Future Directions

Despite robust advances, AV-LLMs face critical bottlenecks in their path to human-level multimodal comprehension:

  • Benchmark limitations: Most “video understanding” benchmarks can still be solved by visual-only inference; genuine audio sensitivity is only exposed by filtering out items answerable from a single frame (AVQA-Hard, Music-AVQA-Hard) (Kim et al., 22 Sep 2025). AVTrustBench and RivaBench set a new standard by directly targeting adversarial, compositional, and modality-ablation settings (Chowdhury et al., 3 Jan 2025, Sun et al., 17 Feb 2025).
  • Hallucination and modality entanglement: Audio hallucination rates remain high in AV-LLMs (Nishimura et al., 18 Jan 2024); even advanced models must rely on contrastive decoding, attention re-weighting, or explicit calibration losses to avoid text-only shortcuts (Jung et al., 27 May 2025, Chowdhury et al., 3 Jan 2025).
  • Data diversity and supervision: Most instruction datasets are English-focused and limited in real-world conversational or non-Western contexts. Speech-heavy tasks (e.g., AVSR) still lag behind specialized speech models unless large-scale, language-agnostic corpora and adaptable romanizer-LLM pipelines are introduced (Yeo et al., 8 Mar 2025).
  • Scalability and real-time adaptation: End-to-end deployment requires more scalable fusion and streaming strategies, variable context windowing, and real-time processing of AV streams (as in Dolphin’s planned real-time extension) (Guo et al., 2 Apr 2025).

A central implication is that further progress will rely on advances in dataset scope, compositional evaluation, dynamic fusion algorithms, and robust, modality-calibrated training objectives, complemented by scalable, modular architectures that can flexibly ingest and align rich, temporally extended audio-visual streams.
