AV-Learning: Multimodal Audio-Visual Methods

Updated 4 July 2026

AV-learning is a multimodal approach that combines audio and visual signals to learn shared representations and enhance task performance.
It employs methods such as co-occurrence retrieval, masked prediction, and data-centric alignment to fuse heterogeneous modalities effectively.
Research in AV-learning addresses challenges in modality synchronization, cross-modal alignment, and application-specific adaptations in speech and biomedical imaging.

In the literature surveyed here, AV-learning most often denotes audio-visual learning: the learning of shared representations, fusion operators, or generation mechanisms from paired acoustic and visual streams. Current work spans speech perception, retrieval from instructional videos, event localization, segmentation, question answering, physically grounded acoustic inference, educational interaction, and anti-UAV sensing. A separate biomedical usage applies the same abbreviation to artery-vein classification in retinal imaging, so the term is context-sensitive rather than semantically uniform (Han et al., 2024, Rouditchenko et al., 2020, Alam et al., 2020).

1. Scope and task families

AV-learning couples signals that are synchronized, partially aligned, or only weakly corresponding across modalities. In speech settings, the pair is typically waveform or log-filterbank audio plus mouth-region video; in scene and acoustics settings, it can be mono or binaural audio plus RGB, panoramic, or depth-aware imagery; in educational systems, it can include lecture video, transcripts, timestamps, voice queries, and avatar-rendered responses. The objective may be a shared embedding, a fused contextual representation, a mask or waveform generator, a localization or segmentation map, or a downstream symbolic output such as text or answers (Han et al., 2024, Bhosale et al., 2024, Islam et al., 24 Dec 2025, Li et al., 6 Mar 2026).

Area	Representative tasks	Example systems
Speech-centered AV-learning	AVSR, AVS2TT, AVSE, speaker verification, speech separation	XLAVS-R, AV-data2vec, FAVA, AVLIT, LR-AVSE
Retrieval and representation learning	Audio-video retrieval, spatial alignment, cross-modal retrieval	AVLnet, AVSA-based learning, ILI-guided embedding learning
Scene understanding	AVE, AVVP, AVS, AVQA, sound source localization	AV-Unified, CAE-AV
Physical acoustics and sensing	NVAS, RIR estimation, anti-UAV trajectory estimation	AV-GS, AV-RIR, AV-DTEC
Educational systems	Lecture-aware retrieval and avatar response generation	ALIVE
Biomedical alternate usage	Artery-vein classification in OCT/OCTA	AV-Net

A unifying property across these settings is that the visual stream plays one of three roles: complementary evidence when audio is degraded, a geometric or material prior for acoustics, or a semantic anchor for disambiguating weakly supervised audio labels. This suggests that AV-learning is not a single architecture class but a family of multimodal inference strategies shaped by the kind of cross-modal structure a task exposes.

2. Learning objectives and representation strategies

Three representation-learning patterns recur. The first is co-occurrence- or retrieval-based learning. AVLnet learns a shared embedding space directly from raw instructional videos, using randomly segmented video clips and raw audio waveforms rather than captions or ASR transcripts, and optimizes a bidirectional retrieval objective with Masked Margin Softmax (Rouditchenko et al., 2020). Spatially grounded variants strengthen the supervision signal: AVSA replaces generic correspondence with audio-visual spatial alignment on 360° video and first-order ambisonics (FOA) audio, using object-centric crops and direction-aware audio processing. With FOA-IV, the reported AVSA result rises from 71.75% to 81.06%, which suggests that explicit spatial audio features are especially compatible with spatially discriminative self-supervision (Wang et al., 2022).
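The bidirectional retrieval idea can be illustrated with a small sketch. This is not AVLnet's implementation: it is a simplified margin-softmax loss over a batch similarity matrix, in which matched audio-video pairs sit on the diagonal and every other pair in the batch acts as a negative. The function name, the `margin` default, and the use of plain dot-product similarity are assumptions made for illustration, and the negative-masking step of the full Masked Margin Softmax is omitted.

```python
import numpy as np

def margin_softmax_retrieval_loss(audio_emb, video_emb, margin=0.001):
    """Simplified bidirectional retrieval loss (illustrative, hypothetical names).

    audio_emb, video_emb: arrays of shape (B, D); row i of each is a matched pair.
    """
    sim = audio_emb @ video_emb.T                 # (B, B) pairwise similarities
    batch = sim.shape[0]
    logits = sim - margin * np.eye(batch)         # subtract the margin from positives only

    def diagonal_nll(l):
        # Numerically stable log-softmax over each row, then pick the positives.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Audio-to-video retrieval uses rows; video-to-audio uses columns.
    return diagonal_nll(logits) + diagonal_nll(logits.T)
```

Under this sketch, a batch whose matched pairs are more similar to each other than to any other item yields a loss near zero, while a batch with shuffled pairings yields a large loss.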

The second pattern is masked prediction with contextual targets. AV-data2vec uses a shared transformer encoder for audio, video, and joint AV input, and regresses masked student representations to teacher targets averaged over upper transformer layers. Its pretraining objective is

$L_{\text{pretrain}} = \alpha \sum_{t \in I} \|z_t - y_t\|_2^2 + \beta \sum_{t \notin I} \|z_t - y_t\|_2^2,$

with audio-only teacher targets and scheduled mixtures of AV, audio-only, and video-only student inputs (Lian et al., 2023). XLAVS-R adopts masked multimodal cluster prediction but simplifies AV-HuBERT-style iterative target refresh by quantizing contextualized audio-only multilingual representations from the 36th layer of XLS-R 2B into 2000 k-means units, then training with

$L = - \sum_{t \in M} \log p_t(z_t) - \alpha \sum_{t \notin M} \log p_t(z_t).$

The method couples masked prediction, modality dropout, and multilingual audio-only initialization to create a shared speech space that later absorbs visual evidence (Han et al., 2024).
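The two objectives above share a structure: a per-frame loss is summed separately over masked and unmasked positions, with a weight controlling how much the unmasked frames contribute. The following NumPy sketch mirrors that structure only. Array shapes, function names, and default weights are assumptions for illustration, and neither function reproduces the teacher network, the quantizer, or the training loop of the cited systems.

```python
import numpy as np

def masked_regression_loss(student, target, mask, alpha=1.0, beta=0.0):
    """Weighted squared error over masked and unmasked frames.

    student, target: (T, D) arrays of frame representations.
    mask: boolean (T,) array, True where the frame was masked.
    """
    sq_err = ((student - target) ** 2).sum(axis=-1)      # squared L2 norm per frame
    return alpha * sq_err[mask].sum() + beta * sq_err[~mask].sum()

def masked_cluster_loss(logits, units, mask, alpha=0.0):
    """Cross-entropy against discrete target units, weighted by mask status.

    logits: (T, K) unnormalized scores over K cluster units.
    units: integer (T,) array of target unit ids.
    mask: boolean (T,) array, True where the frame was masked.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(units)), units]
    return nll[mask].sum() + alpha * nll[~mask].sum()
```

Setting the unmasked weight to zero recovers a purely masked objective; a nonzero weight also asks the model to predict targets at positions it can see.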

A third pattern is data-centric or structure-aware alignment. One line infers latent class dependencies rather than treating all unlabeled co-occurrences as negatives. Audio-Visual Semantic Alignment Loss, the Inferred Latent Interaction Graph, and the Latent Interaction Regularizer use teacher soft labels, GRaSP-based graph inference, and graph-weighted embedding regularization to improve AV cross-modal retrieval on AVE and VEGAS (Zeng et al., 17 Jan 2026). Another line operates on the data itself: an agentic workflow captions audio and video separately, reasons over mismatches, edits the audio with predefined denoising or coordination actions, and iterates under alignment and synchronization scores before standard AV learners consume the data (Mo et al., 2024). A related frozen-backbone strategy uses agreement- and caption-guided enrichment rather than raw fusion: CAE-AV estimates frame-level agreement, balances spatial and temporal enrichment, then injects caption-aligned semantic guidance only at salient token positions (Hu et al., 9 Feb 2026).

3. Speech-centered AV-learning

Speech is the most developed AV-learning domain in the surveyed literature. Here the visual stream, especially lip motion, is used to improve robustness to acoustic noise, overlapped speech, or limited supervision. XLAVS-R formalizes this at multilingual scale: it first trains an audio-only multilingual SSL model on 436K hours in 128 languages, then injects vision through continual AV pretraining on 1.2K hours in 9 languages, with evaluation on MuAViC showing strong gains for both AVSR and AVS2TT under noisy inputs (Han et al., 2024). AV-data2vec reaches a similar design goal from a different angle, using a shared transformer with contextualized latent targets and strong low-resource AVSR/VSR/ASR transfer (Lian et al., 2023).

Problem	Reported result	System
Noisy multilingual AVSR	37.3 WER in noisy AV mode	XLAVS-R 2B
Noisy multilingual AVS2TT	18.7 BLEU in noisy AV mode	XLAVS-R 2B
LRS3-TED AV-ASR	1.7% clean and 6.6% noisy WER	FAVA
Speaker verification	1.0% VC1 clean EER and 2.5% VC1 noisy EER	AV-HuBERT AV
In-the-wild speech separation	12.42 dB SI-SDRi on LRS3+WHAM!	AVLIT-8

A distinct question is whether AV pretraining is necessary at all. FAVA answers this by pretraining only on audio with BEST-RQ, then performing supervised AV fine-tuning. On LRS3-TED it reaches 1.7% clean and 6.6% noisy WER, within 0.5% absolute WER of state-of-the-art AV-SSL systems while being 12–30x faster to pre-train; the same recipe also converts a large audio-only USM model into a competitive AV model without any AV data during pretraining (May et al., 2023). This suggests that in some speech regimes, AV-learning can be decomposed into strong audio-only representation learning plus later visual injection, rather than requiring fully paired AV SSL from the start.

Other speech applications emphasize different aspects of multimodality. Multichannel AV-wav2vec2 extends wav2vec-style SSL to six far-field microphones plus video, using intra- and inter-channel contrastive losses and auxiliary single-channel audio; its multichannel representation improves far-field AVSR, ASR, VSR, and AVSD in realistic home TV rooms (Zhu et al., 2024). AVLIT addresses audio-visual speech separation with a lightweight iterative architecture in which visual lip embeddings are injected early into a progressive audio refinement loop; on LRS3+WHAM!, AVLIT-8 reaches PESQ 1.52, ESTOI 0.68, and SI-SDRi 12.42 dB while remaining much smaller than prior AV baselines (Martel et al., 2023).

Speaker-centered and forensic settings show that AV-learning is not restricted to phonetic content. AV-HuBERT transfers effectively to speaker representation learning: the paper’s headline claim is that incorporating visual information from the lip area reduces EER by 38% in clean conditions and 75% in noisy conditions, and with full labeled VoxCeleb2 fine-tuning the AV model reaches 1.0% VC1 clean EER and 2.5% VC1 noisy EER (Shi et al., 2022). AV-Lip-Sync+ uses AV-HuBERT to model multimodal inconsistency for video deepfake detection, combining joint AV embeddings with an explicit lip–audio discrepancy feature and a temporal CNN; with an added full-face ViViT branch it reaches 0.9996 LQ AUC and 0.9998 HQ AUC on DeepfakeTIMIT (Shahzad et al., 2023). At the training-objective level, LR-AVSE replaces purely signal-level supervision with PPO fine-tuning under an LLM-derived quality reward, improving the AVSEC-4 baseline from PESQ 1.45 to 1.57 and from NISQA 0.99 to 1.29, with human preference also favoring the LLM-guided system over both the supervised baseline and a DNSMOS-based RL baseline (Chen et al., 14 Mar 2026).

4. Alignment, grounding, and robustness

A major theme in AV-learning is that paired modalities are often not strictly aligned. The surveyed work treats alignment as temporal, spatial, semantic, or label-structural rather than as a generic fusion problem. In ALIVE, lecture retrieval is explicitly anchored to the learner’s pause time: after FAISS retrieves semantically similar transcript segments, the score is temporally reranked as

$\tilde{d}_i = d_i - \lambda \frac{\left| (s_i + e_i)/2 - t \right|}{60},$

where $d_i$ is the retrieval score of segment $i$, $s_i$ and $e_i$ are its start and end times, $t$ is the learner's pause time, and $\lambda$ weights the temporal penalty, so clarification remains grounded in the exact lecture moment where confusion occurred (Islam et al., 24 Dec 2025). In CAE-AV, frame-level agreement between audio and video determines whether current-frame spatial enrichment or neighboring-frame temporal enrichment should dominate, and caption-aligned semantics are then injected only into salient positions; with frozen backbones this improves AVQA, AVS, AVE, and AVVP under common off-screen or cluttered mismatch cases (Hu et al., 9 Feb 2026).
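The temporal reranking formula used in ALIVE can be sketched in a few lines. The sketch assumes that a higher score means a better match, that segment times and the pause time are in seconds (so the division by 60 expresses the offset in minutes), and that `lam=0.1` is a placeholder value; the `Segment` class and `rerank` function are hypothetical names, not ALIVE's API.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    score: float      # retrieval score d_i (assumed: higher is better)
    text: str = ""

def rerank(segments, pause_time, lam=0.1):
    """Sort segments by score minus a penalty for distance from the pause time."""
    def adjusted(seg):
        midpoint = (seg.start + seg.end) / 2
        return seg.score - lam * abs(midpoint - pause_time) / 60
    return sorted(segments, key=adjusted, reverse=True)

# A slightly weaker match near the pause outranks a stronger match ten minutes away.
candidates = [
    Segment(start=600, end=620, score=0.82, text="far"),
    Segment(start=0, end=20, score=0.80, text="near"),
]
ranked = rerank(candidates, pause_time=30.0)
```

With `lam=0` the ordering falls back to the raw retrieval scores, which makes the weight a direct control over how strongly the pause time anchors the answer.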

Spatial grounding can also be built into the pretext task. AVSA uses 360° video with synchronized FOA to distinguish not just whether an audio segment and a crop co-occur, but whether the crop aligns with sound energy from the corresponding direction (Wang et al., 2022). At the dataset level, AVAgent treats misalignment as a preprocessing problem: separate audio and video captions are generated, an LLM planner selects one of eight audio-editing actions, and reflection scores determine whether the modified pair is retained, with a threshold of 0.85 in the published algorithm (Mo et al., 2024). At the label level, inferred latent interaction graphs replace the false-negative assumption that label absence means semantic absence, allowing dependency-linked but unlabeled AV pairs to be softly attracted rather than pushed apart (Zeng et al., 17 Jan 2026).

Robustness mechanisms often accompany alignment. XLAVS-R uses modality dropout with $p_m = 0.5$ and $p_a = 0.5$ during pretraining, alongside noisy audio augmentation, so the encoder learns audio-only, visual-only, and AV cases in one shared space (Han et al., 2024). AV-DTEC adopts an explicitly asymmetric design in which audio is primary and visual features are weighted by an estimated visual existence probability, improving dark-condition robustness for drone trajectory estimation and classification (Xiao et al., 2024). These designs suggest that AV-learning increasingly treats unreliable correspondence as a first-order modeling problem rather than as noise to be averaged away.
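Modality dropout is simple to sketch. The version below assumes the AV-HuBERT-style reading of the two probabilities: with probability `p_m` only one modality is kept, and in that case audio is the one kept with probability `p_a`. Zero-filling the dropped stream and the function name are illustrative assumptions, not details taken from XLAVS-R.

```python
import numpy as np

def modality_dropout(audio, video, rng, p_m=0.5, p_a=0.5):
    """Randomly keep both streams, audio only, or video only.

    audio, video: feature arrays for one training example.
    rng: a numpy.random.Generator.
    Returns (audio, video, mode), where a dropped stream is zero-filled.
    """
    if rng.random() >= p_m:
        return audio, video, "av"                        # keep both modalities
    if rng.random() < p_a:
        return audio, np.zeros_like(video), "audio"      # drop video
    return np.zeros_like(audio), video, "video"          # drop audio

rng = np.random.default_rng(0)
audio_feats = np.ones((3, 2))
video_feats = np.ones((3, 4))
modes = [modality_dropout(audio_feats, video_feats, rng)[2] for _ in range(4000)]
```

With both probabilities at 0.5, roughly half of the examples keep both streams and the rest split evenly between audio-only and video-only, which is what lets a single encoder serve all three input conditions.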

5. Non-speech, scene-level, and physically grounded AV-learning

Outside speech, AV-learning often targets semantic retrieval, scene understanding, acoustics, or interactive systems. AVLnet is a representative early case: it learns a shared audio-video embedding from raw HowTo100M clips without text, showing that naturally co-occurring narration, actions, and environmental sounds are sufficient to learn transferable retrieval-oriented semantics (Rouditchenko et al., 2020). AV-Unified extends this logic from one task to many, standardizing inputs and outputs across AVE, AVVP, VGG-SS, AVS, and MUSIC-AVQA and combining a multi-scale temporal perception module with a cross-modal guidance-based spatial perception module. Joint training improves most tasks relative to single-task training, for example raising AVQA average from 75.32 to 76.42 and improving VGG-SS from 38.94/40.80 to 39.16/41.24 in CIoU/AUC (Li et al., 6 Mar 2026).

A more physically grounded branch of AV-learning uses visual scene structure as an acoustic prior. AV-GS replaces NeRF-style implicit conditioning with an explicit Gaussian scene representation augmented by per-point audio-guidance variables, using source- and listener-relative context to synthesize novel-view binaural audio. On RWAVS, AV-GS improves over AV-NeRF from 1.504 to 1.417 in MAG and from 0.145 to 0.140 in ENV, while reducing inference to 0.08 s per query (Bhosale et al., 2024). AV-RIR similarly treats room acoustics as multimodal inference: reverberant speech carries evidence of the room filter, while panoramic RGB and Geo-Mat features provide geometry- and material-aware cues. The full model improves over S2IR-GAN by 36% on $T_{60}$, 42% on DRR, 63% on EDT, 89% on EMSE, and 98% on LMSE, and CRIP improves the late component LMSE by 86% during inference (Ratnarajah et al., 2023).

Educational and robotics-style systems show another direction. ALIVE turns recorded lecture video into a local, grounded tutoring environment: Whisper produces timestamps, transcript segments of about 20 seconds are embedded with SentenceTransformers and indexed in FAISS, and a local Llama 3.1 8B model answers text or voice questions, optionally rendered as a SadTalker-based avatar (Islam et al., 24 Dec 2025). AV-DTEC uses LiDAR-generated pseudo labels to train an audio-primary, visually assisted anti-UAV model with state-space backbones; it reaches mean APE 0.67 and mean accuracy 99.3, including dark-condition APE 0.75 and accuracy 98.9 (Xiao et al., 2024). These systems suggest that AV-learning is expanding from passive representation learning toward closed-loop interaction, sensing, and embodied inference.
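The transcript-segmentation step in the ALIVE description can be sketched without any external libraries. The example assumes the speech recognizer returns short timestamped pieces as `(start, end, text)` tuples and greedily merges them until a segment spans about 20 seconds. The function name, tuple layout, and greedy merging rule are assumptions for illustration; the embedding and FAISS indexing stages are deliberately left out.

```python
def group_into_segments(pieces, target_seconds=20.0):
    """Greedily merge timestamped transcript pieces into longer segments.

    pieces: iterable of (start, end, text) tuples in chronological order.
    Returns a list of dicts with 'start', 'end', and 'text' keys.
    """
    segments = []
    current = None
    for start, end, text in pieces:
        if current is None:
            current = {"start": start, "end": end, "text": text}
        else:
            current["end"] = end
            current["text"] += " " + text
        # Close the segment once it covers the target duration.
        if current["end"] - current["start"] >= target_seconds:
            segments.append(current)
            current = None
    if current is not None:
        segments.append(current)    # keep the shorter trailing segment
    return segments

pieces = [(0, 8, "a"), (8, 15, "b"), (15, 22, "c"), (22, 30, "d")]
segments = group_into_segments(pieces)
```

Each resulting segment keeps its start and end times, which is the information a pause-time reranker needs later in the pipeline.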

6. Terminological ambiguity, recurring limitations, and research directions

The abbreviation “AV” is not semantically stable across fields. In ophthalmic OCTA, AV-Net uses AV to mean artery-vein classification rather than audio-visual learning. It formulates the task as pixel-wise semantic classification over a 320 × 320 × 2 en face OCT/OCTA input and reports 86.75% average accuracy, 70.72% mean IoU, and 82.81% F1 with a modified U-shaped FCN and ImageNet-based encoder transfer (Alam et al., 2020). This separate usage matters because it shows that “AV-learning” can denote multimodal artery-vein learning in biomedical imaging, not only audio-visual representation learning.

Several limitations recur across the audio-visual literature itself. Data asymmetry remains central: XLAVS-R is motivated by the fact that audio-only speech is abundant and multilingual while AV speech is scarce and narrow in language coverage, and its translation evaluation remains limited to X-to-English directions while noisy testing uses only babble noise (Han et al., 2024). Systems that depend on visible mouths or aligned lecture playback inherit modality-availability constraints: AV-HuBERT-based speaker verification and AV-Lip-Sync+ require synchronized lip regions, while ALIVE notes that avatar synthesis remains the slowest and most hardware-sensitive component and that retrieval is text-only rather than directly visual (Shi et al., 2022, Shahzad et al., 2023, Islam et al., 24 Dec 2025). Scene-level acoustic models still generalize poorly across scenes: AV-GS is trained per scene and assumes static environments, while AV-RIR currently assumes stationary single-talker input or single-source audio without noise (Bhosale et al., 2024, Ratnarajah et al., 2023). Retrieval-style AV-learning also struggles with sparse labels and incidental co-occurrence, which motivates soft-label alignment and graph-based regularization (Zeng et al., 17 Jan 2026).

The papers themselves indicate several research directions. These include broader noise conditions and translation directions for multilingual AV speech; richer multimodal retrieval over slides, diagrams, and equations in educational systems; faster talking-head generation and multi-turn dialogue; larger multilingual AV corpora and stronger visual front ends; cross-scene transfer for acoustic synthesis and RIR estimation; and more principled handling of modality reliability, off-screen sources, and unannotated events (Han et al., 2024, Islam et al., 24 Dec 2025, Bhosale et al., 2024, Lian et al., 2023, Hu et al., 9 Feb 2026). A plausible implication is that future AV-learning will be shaped less by a single canonical fusion architecture than by three interacting questions: what structure the modalities truly share, how much alignment can be assumed, and which modality should dominate when that assumption fails.