
Audio-Visual Multi-Scale Adapters

Updated 4 September 2025
  • Audio-Visual Multi-Scale Adapters are architectural modules designed to align and fuse audio and visual data across varied spatial and temporal scales.
  • They employ cross-modal alignment, multi-scale feature aggregation, and efficient fusion strategies to improve tasks like event localization and speech separation.
  • Parameter-efficient designs allow these adapters to provide robust, adaptable multimodal processing for applications such as video understanding and AV-LLMs.

An Audio-Visual Multi-Scale Adapter is a principled architectural module, or suite of modules, designed to enable deep neural networks to align, fuse, and modulate information from audio and visual modalities across multiple temporal or spatial scales. Such adapters are foundational in audio-visual event localization, segmentation, video understanding, speech separation, and multi-modal continual learning. Architectures employing these adapters consistently incorporate mechanisms for both intra-modal and cross-modal correlation exploitation, aggregation of features at variable spatial or temporal resolutions, and parameter-efficient fusion strategies.

1. Definitions and Core Design Patterns

Audio-Visual Multi-Scale Adapters are inserted into deep networks as plug-in modules to facilitate effective interaction between auditory and visual features, typically at three levels:

  • Cross-modal alignment: Leveraging attention or normalization mechanisms so that features in one modality are semantically conditioned by those in the other (e.g., Cross-Modal Normalization, cross-attention).
  • Multi-scale integration: Aggregating features over different temporal or spatial windows to capture both local and global dependencies (e.g., Multi-Scale Proposal Modulating Module in M2N (Wang et al., 2021), multi-window attention in AVE-CLIP (Mahmud et al., 2022), Branchformer-based encoding in AVFSNet (Zhang et al., 17 Jul 2025)).
  • Downstream-ready fusion: Producing unified multimodal representations that can be consumed directly by downstream heads for segmentation, event localization, or audio-visual LLMs.

Key designs such as M2N’s normalization stack, LAVISH’s latent bottleneck adapters (Lin et al., 2022), Dolphin’s fine-grained spatial and temporal modules (Guo et al., 2 Apr 2025), and cross-domain LoAA modules (Yeo et al., 8 Dec 2024) exemplify these patterns.
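
The pattern shared by these designs can be summarised in a small amount of code. The sketch below is illustrative only, not the implementation of any cited architecture: a bottleneck adapter that projects visual tokens down, conditions them on audio tokens via cross-attention, and adds the result back residually. Module names and dimensions (`CrossModalAdapter`, `d_model`, `bottleneck`) are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    """Illustrative bottleneck adapter: visual tokens attend to audio tokens.

    A generic sketch of the plug-in pattern described above, not the exact
    module from LAVISH, M2N, or Dolphin.
    """
    def __init__(self, d_model: int = 768, bottleneck: int = 64, n_heads: int = 4):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)       # parameter-efficient projection
        self.cross_attn = nn.MultiheadAttention(bottleneck, n_heads, batch_first=True)
        self.up = nn.Linear(bottleneck, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, Nv, d_model), audio: (B, Na, d_model)
        q = self.down(self.norm(visual))
        kv = self.down(self.norm(audio))
        fused, _ = self.cross_attn(q, kv, kv)            # audio-conditioned visual tokens
        return visual + self.up(fused)                   # residual update; backbone can stay frozen


# Usage: insert between frozen transformer blocks
adapter = CrossModalAdapter()
v = torch.randn(2, 196, 768)   # visual patch tokens
a = torch.randn(2, 64, 768)    # audio spectrogram tokens
out = adapter(v, a)            # (2, 196, 768)
```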

2. Multi-Scale Feature Encoding and Fusion

Multi-scale processing is central to adapter effectiveness:

  • Temporal multi-scale: AVE-CLIP (Mahmud et al., 2022) deploys the Multi-window Temporal Fusion (MWTF) module, splitting sequences into blocks of differing lengths to target both short-term and long-term dependencies. The Multi-Scale Proposal Modulating Module (MSPM) in M2N (Wang et al., 2021) aggregates event proposals over arbitrary-length temporal spans, constructing an event proposal map F_ms[i, j] over all intervals [i, j].
  • Spatial multi-scale: Dolphin (Guo et al., 2 Apr 2025) uses a multi-scale pyramid to produce features at 1/8, 1/16, and 1/32 downsampling ratios, aligning audio and visual modalities at each scale. SAVE (Nguyen et al., 2 Jul 2024) enhances SAM via per-block adapters that operate over both channel and spatial dimensions, embedding dataset-specific and audio-informed adjustments throughout the encoder.
  • Cross-modality multi-scale: WS-AVS (Mo et al., 2023) and M2VSL (Mo et al., 31 Aug 2024) introduce multi-instance, multi-scale contrastive alignment losses, ensuring that audio features are drawn to the correct visual regions over multiple resolutions, not just the global map.

These design principles allow for robust representation of long-range dependencies (e.g., musical phrases or object trajectories), fine-grained localization (e.g., lip movement transitions), and resilience to varying event or object durations.
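
As a concrete illustration of the temporal multi-scale idea described in the list above, the sketch below pools a fused audio-visual sequence over several window lengths and concatenates the results. It is a hedged simplification of the MWTF-style pattern, not AVE-CLIP's released code; window sizes and feature dimensions are placeholders.

```python
import torch
import torch.nn.functional as F

def multi_window_pool(x: torch.Tensor, window_sizes=(1, 2, 4)) -> torch.Tensor:
    """Aggregate a temporal sequence at several scales.

    x: (B, T, D) fused audio-visual features.
    Returns (B, T, D * len(window_sizes)): per-step features enriched with
    short- and long-range context, an illustrative stand-in for multi-window fusion.
    """
    B, T, D = x.shape
    outputs = []
    for w in window_sizes:
        # average-pool over windows of length w, then upsample back to T steps
        pooled = F.avg_pool1d(x.transpose(1, 2), kernel_size=w, stride=w, ceil_mode=True)
        restored = F.interpolate(pooled, size=T, mode="nearest").transpose(1, 2)
        outputs.append(restored)
    return torch.cat(outputs, dim=-1)

feats = torch.randn(2, 10, 256)           # 10 one-second audio-visual segments
multi_scale = multi_window_pool(feats)    # (2, 10, 768)
```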

3. Adapter Mechanisms and Modulation Schemes

Audio-Visual Multi-Scale Adapters typically rely on efficient and expressive parameterizations:

  • Attention Bottlenecking: LAVISH (Lin et al., 2022) utilizes a small set of latent “summary” tokens in each transformer layer to bridge modalities, avoiding the quadratic complexity of full cross-attention.
  • Normalization-based Modulation: Cross-Modal Normalization (CMN) and Intra-Modal Normalization (IMN) in M2N (Wang et al., 2021) use multi-head attention to generate per-segment scaling and shifting (see the sketch after this list), yielding:

$$
f_v^c = \gamma_v^c \cdot \frac{f_v - \mu(f_v)}{\sigma(f_v)} + \beta_v^c
$$

where the scaling $\gamma_v^c$ and shifting $\beta_v^c$ parameters are computed from attention over the other modality.

  • Convolutional/Pooling Adapters: LoAA (Yeo et al., 8 Dec 2024) replaces classic 1×1 adapter projections with 1D convolutions along time or frequency, enabling direct aggregation of neighboring time-frequency bins in spectrograms.
  • Plug-and-Play Temporal Fusers: Deception detection work (Li et al., 2023) introduces adapters with 1D convolutions over temporal tokens, preserving robustness even if one modality is missing at inference.
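
The normalization-based modulation above maps directly onto code. The following is a minimal sketch of the CMN formula as written, assuming the scale and shift are predicted from attention over the other modality; it is not the released M2N implementation, and all names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CrossModalNorm(nn.Module):
    """Minimal sketch of Cross-Modal Normalization (CMN).

    Normalizes visual features f_v and re-modulates them with a scale gamma
    and shift beta predicted by attending over audio features, following
    f_v^c = gamma * (f_v - mu) / sigma + beta.  Not the official M2N code.
    """
    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.to_gamma = nn.Linear(dim, dim)
        self.to_beta = nn.Linear(dim, dim)

    def forward(self, f_v: torch.Tensor, f_a: torch.Tensor) -> torch.Tensor:
        # f_v, f_a: (B, T, dim) per-segment visual / audio features
        ctx, _ = self.attn(f_v, f_a, f_a)                # audio context per visual segment
        gamma, beta = self.to_gamma(ctx), self.to_beta(ctx)
        mu = f_v.mean(dim=-1, keepdim=True)
        sigma = f_v.std(dim=-1, keepdim=True) + 1e-6     # avoid division by zero
        return gamma * (f_v - mu) / sigma + beta
```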

Adapters may operate in parallel within transformer attention and MLP blocks, or as cross-layer or block-stack modules, as in AVS-Mamba’s selective state space model (Gong et al., 14 Jan 2025) and Dolphin’s multi-stage cross-attention (Guo et al., 2 Apr 2025).
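
To illustrate the convolutional adapter idea, the sketch below replaces the usual pointwise adapter projection with a depthwise 1D convolution along one spectrogram axis, in the spirit of LoAA-style time/frequency mixing; the kernel size, layout, and names are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TimeConvAdapter(nn.Module):
    """Illustrative time-axis convolutional adapter for spectrogram tokens.

    Mixes neighbouring time steps with a depthwise 1D convolution instead of a
    pointwise (1x1) projection, in the spirit of LoAA-style adapters.  Kernel
    size and layout are illustrative assumptions.
    """
    def __init__(self, dim: int = 768, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.act = nn.GELU()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) tokens along the time axis of a spectrogram
        mixed = self.conv(x.transpose(1, 2)).transpose(1, 2)   # aggregate neighbouring bins
        return x + self.proj(self.act(mixed))                  # residual adapter output
```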

4. Applications: Localization, Segmentation, Separation, and AV-LLMs

Multi-scale adapters underpin state-of-the-art performance in diverse tasks:

  • Event Localization: M2N (Wang et al., 2021) and AVE-CLIP (Mahmud et al., 2022) report accuracies of 79.5% and 83.7%, respectively, on AVE localization, with the former also improving cross-modality localization metrics (A2V/V2A) over prior baselines.
  • Segmentation: SAVE (Nguyen et al., 2 Jul 2024) surpasses previous SOTA on AVSBench-S4 (up to 86.16 mIoU) with a low-resolution variant. MoCA (Bhosale et al., 21 Mar 2024) achieves +17.24% mIoU over baselines in unsupervised settings via foundation-model-based audio-visual adapters and pixel-level matching.
  • Speech Separation: AVFSNet (Zhang et al., 17 Jul 2025) achieves ~14.34 dB SI-SDRi on LRS2-2mix through its Branchformer-based multi-scale encoder and parallel separation architecture.
  • Multi-modal Continual Learning: PHP (Yin et al., 29 Jul 2025) employs a three-stage adapter suite—universal task-shared adapters, prompt-based dynamic adapters, and task-modal-specific deep prompts—to achieve SOTA mean accuracy and lowest forgetting on AVE, AVVP, AVS, and AVQA tasks, with explicit multi-scale reasoning available via progressive design.
  • LLMs: Dolphin (Guo et al., 2 Apr 2025), equipped with multi-scale adapters for fine-grained spatial and temporal alignment, outperforms previous video-LLMs and reduces audio-visual hallucinations, crucial for robust open-domain audio-visual language processing.

5. Parameter Efficiency, Modality Robustness, and Scalability

A consistent finding is that parameter-efficient adapters are effective even without full audio or multi-modal pretraining:

  • Frozen Backbone with Selective Tuning: LAVISH (Lin et al., 2022) and LoAA (Yeo et al., 8 Dec 2024) show that training only a small set of adapter parameters (often just 2%–10% of the model total) yields competitive or superior results versus full-model fine-tuning; a minimal sketch of this recipe appears at the end of this section.
  • Dynamic Adapter Selection: Adapter-Based AVSR (Simic et al., 3 Feb 2025) employs multiple LoRA-based adapter sets, each tailored to a specific noise type or level and selected at inference by a noise classifier, shrinking parameter usage by up to 88.5% compared to full SOTA ASR.
  • Flexible Modality Handling: The AVA framework (Li et al., 2023) and similar architectures are robust to missing modalities, thanks to their design choices in token fusion and dropout-style redundancy.
  • Large-Scale Training: AVSiam (Lin et al., 28 Mar 2024) demonstrates that a single shared ViT, paired with contrastive and reconstruction objectives, can match or outperform larger two-stream models at a much lower compute/memory cost.

These approaches ensure practical deployment in varied and large-scale applications, as well as ease of adaptation for evolving downstream tasks.
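
The frozen-backbone recipe referenced above amounts to disabling gradients on the pretrained encoder and optimising only the adapter parameters; the fraction reported by this snippet is the quantity the 2%–10% figures refer to. This is a generic sketch, with the module-name keyword and the surrounding model as assumptions rather than any specific paper's code.

```python
import torch
import torch.nn as nn

def freeze_backbone_except_adapters(model: nn.Module, adapter_keyword: str = "adapter"):
    """Freeze every parameter whose name does not contain `adapter_keyword`.

    Illustrative recipe for parameter-efficient tuning: the pretrained backbone
    stays fixed and only adapter modules receive gradients.
    """
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = adapter_keyword in name
        total += param.numel()
        trainable += param.numel() if param.requires_grad else 0
    print(f"trainable params: {trainable}/{total} ({100 * trainable / total:.1f}%)")
    return [p for p in model.parameters() if p.requires_grad]

# Usage (hypothetical audio-visual model whose adapter modules are named "...adapter..."):
# params = freeze_backbone_except_adapters(av_model)
# optimizer = torch.optim.AdamW(params, lr=1e-4)
```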

6. Broader Implications and Future Directions

The development and deployment of Audio-Visual Multi-Scale Adapters reflect key trends:

  • Unified Multi-modal Processing: SOTA results across event localization, segmentation, and multimodal LLMs illustrate convergence towards architectures capable of joint spatial-temporal multi-scale reasoning.
  • Granular Fusion and Avoidance of Hallucination: Dolphin (Guo et al., 2 Apr 2025) and PHP (Yin et al., 29 Jul 2025) incorporate adapters to facilitate reliable, grounded, and interpretable audio-visual understanding, with direct impact on the global adoption of AV-LLMs.
  • Extensibility to Other Modalities: Trimodal adapters in LSTTA (Liu et al., 2023), handling audio, visual, and language cues, suggest that multi-scale adapter schemes are a general mechanism for efficient cross-modal understanding.
  • Open Research Questions: Open issues include optimal adapter insertion depth and placement, real-time efficiency for streaming applications, improved temporal/spatial alignment for highly dynamic scenes, and robust unsupervised learning (e.g., MoCA (Bhosale et al., 21 Mar 2024)).

7. Representative Implementations and Key Formulations

A tabular summary of principal module types:

| Adapter Mechanism | Representative Architecture | Parameterization / Key Operation |
|---|---|---|
| Latent Bottleneck | LAVISH (Lin et al., 2022) | Attention over m ≪ N summary tokens; bottleneck MLP |
| Temporal Conv Adapter | LoAA (Yeo et al., 8 Dec 2024) | 1D convs (1×3, 3×1 kernels) in adapters; time/frequency mixing |
| Multi-Scale Proposal Mod. | M2N (Wang et al., 2021) | 2D event proposal map, average pooling, 2D conv, softmax weighting |
| AV-Guided Cross-Attention | Dolphin (Guo et al., 2 Apr 2025) | Audio guides visual features via cross-attention at all scales |
| Flexible Plug-In | AVA (Li et al., 2023), SAVE (Nguyen et al., 2 Jul 2024) | Residual adapters; 1D/2D convs plus MLP; per-block injection |

Rigorous implementations involve multi-head attention; channel, spatial, and temporal pooling; contrastive learning objectives (e.g., multi-instance or InfoNCE); and recurrent or bidirectional temporal refinement (as in EGTA (Mahmud et al., 2022)).
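
As an example of the contrastive objectives mentioned above, a standard InfoNCE loss over paired audio and visual clip embeddings can be written as below. This is the generic symmetric formulation, not any specific paper's multi-instance or multi-scale variant, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb: torch.Tensor, visual_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired audio/visual embeddings.

    audio_emb, visual_emb: (B, D), where row i of each tensor comes from the
    same clip.  Generic formulation for illustration only.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    loss_a2v = F.cross_entropy(logits, targets)      # audio queries, visual keys
    loss_v2a = F.cross_entropy(logits.t(), targets)  # visual queries, audio keys
    return 0.5 * (loss_a2v + loss_v2a)
```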


Audio-Visual Multi-Scale Adapters are now fundamental for enabling efficient, robust, and interpretable fusion of heterogeneous temporal and spatial cues across modalities. Their modular design, capacity for fine-grained multi-scale correlation, and flexibility in deployment render them central to current and emerging multi-modal AI systems.