
Noise-Robust AVSR Framework

Updated 25 January 2026
  • Noise-Robust AVSR is a framework that fuses audio and visual modalities to counteract noisy speech conditions by leveraging techniques like modality dropout and cross-modal training.
  • Its modular architecture employs convolutional encoders, Transformer layers, and robust noise augmentation to enable effective multilingual adaptation and zero-shot inference.
  • Empirical benchmarks show significant reductions in word error rates and improvements in BLEU scores, demonstrating the framework's scalability and cross-lingual transfer capabilities.

Noise-robust audio-visual speech recognition (AVSR) frameworks are sophisticated systems designed to maintain low word error rate (WER) when audio signals are contaminated by noise, leveraging visual cues such as lip motion to recover or supplement lost phonetic information. These systems address key challenges in realistic environments, such as limited availability of multilingual AV corpora and the necessity to generalize to unseen noise types, by integrating architectural innovations, pre-training strategies, robust feature fusion, and domain-appropriate evaluation protocols.

1. Problem Formulation and Cross-Modal Motivation

Noise-robust AVSR seeks, given $I_a \in \mathbb{R}^{T_a \times C_a}$ (audio waveform) and $I_v \in \mathbb{R}^{T_v \times C_v \times W \times H}$ (video frames), to learn

$$\hat y_{1:L} = \arg\max_{y_{1:L}} p(y_{1:L} \mid I_a, I_v)$$

so that WER is minimized under real-world noise conditions. Purely audio models—despite advances in self-supervised representation learning—show sharply increased WER under babble, overlapped speech, and other non-stationary noise. Integrating visual signals (lip motion, facial gestures) offers a mechanism to recover phonetic cues, as demonstrated by the Sumby–Pollack effect. However, most available corpora are audio-only and AV datasets remain scarce in non-English languages, intensifying the need for frameworks that exploit cross-lingual transfer (Han et al., 2024).

2. Architecture: Modular Encoders, Fusion, and Multilingual Adaptation

XLAVS-R exemplifies current best practices for large-scale, noise-robust AVSR:

  • Audio encoder: A convolutional wav2vec 2.0 front-end mapping the raw waveform to high-dimensional vectors, followed by 24–48 Transformer layers ($D = 1024$ or $1920$).
  • Visual encoder: ResNet-18 backbone mapping lip crops to embeddings, linearly projected to match audio embedding dimension.
  • Fusion module: Frame-wise weighted addition, $f_\text{fusion} = W_a f_a + W_v f_v$, augmented by modality dropout ($p_m = 0.5$), which randomly zeroes out audio or video during pre-training to enforce representational alignment and robustness.
  • Multilingual adaptation: XLAVS-R pre-trains on 436K hours of audio-only speech in 128 languages (XLS-R), followed by continued pre-training on a 9-language AV corpus (MuAViC). This sequence enables coverage of 100+ languages via the audio backbone, with AV refinement from limited data (Han et al., 2024).
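
The fusion rule above is simple enough to sketch directly. The following NumPy snippet is a minimal, illustrative implementation of frame-wise weighted additive fusion with modality dropout; the function name, shapes, and identity weights are assumptions for demonstration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(f_a, f_v, W_a, W_v, p_m=0.5, training=True):
    """Frame-wise weighted additive fusion with modality dropout.

    f_a, f_v: (T, D) audio / visual frame embeddings, already projected
    to a shared dimension D. W_a, W_v: (D, D) learnable weights.
    During training, with probability p_m one modality is zeroed out so
    the model learns to decode from either stream alone.
    """
    if training and rng.random() < p_m:
        if rng.random() < 0.5:
            f_a = np.zeros_like(f_a)  # drop the audio stream
        else:
            f_v = np.zeros_like(f_v)  # drop the visual stream
    return f_a @ W_a + f_v @ W_v

T, D = 4, 8
f_a = rng.standard_normal((T, D))
f_v = rng.standard_normal((T, D))
W_a = np.eye(D)  # identity weights, purely for the demo
W_v = np.eye(D)
fused = fuse(f_a, f_v, W_a, W_v, training=False)  # inference: both streams kept
```

Because the dropped modality is replaced by zeros rather than skipped, the fused representation keeps the same shape whether one or both streams are present, which is what lets a single decoder serve audio-only, visual-only, and AV inputs.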

3. Pre-training, Fine-tuning, and Robustness Strategies

The XLAVS-R training pipeline comprises:

  • Stage 1: Audio-only SSL (XLS-R): wav2vec 2.0 masked-contrastive objective, extracting robust phonetic features.
  • Stage 2: Audio-visual SSL (XLAVS-R): Masked prediction with 50% masking of audio and visual frames; targets are cluster IDs $z_t$ from K-means over XLS-R encoder outputs. Noise is injected into 25% of pre-training samples using babble noise at 0 dB SNR (MUSAN/MuAViC).
  • Supervised fine-tuning: 6-layer Transformer decoder for both AVSR and AVS2TT, cross-entropy over transcription or translation targets. Fine-tuning batches alternate between audio-only and AV, maintaining modality dropout and continuing noise augmentation at 0dB SNR in 50% of samples.
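
The 0 dB SNR noise injection used in Stages 2 and fine-tuning amounts to mixing a noise source into the clean waveform at equal power. A minimal sketch, with synthetic signals standing in for real speech and MUSAN babble:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR (in dB)
    relative to `clean`, then add it. At 0 dB SNR the signal and
    noise powers in the mixture are equal."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # 1 s of "speech" at 16 kHz (synthetic)
babble = rng.standard_normal(16000)  # stand-in for a MUSAN babble clip
noisy = mix_at_snr(clean, babble, snr_db=0.0)
```

Applying this to a random 25% of pre-training samples (and 50% of fine-tuning samples) reproduces the augmentation schedule described above.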

Robustness is further fostered by clean-to-noisy alignment and the fusion module’s ability to promote visual modality reliability when audio is degraded (Han et al., 2024).

4. Loss Functions and Optimization

  • Audio-visual SSL loss: $\mathcal{L}_\mathrm{ssl} = -\sum_{t\in M}\log p_t(z_t) - \alpha \sum_{t\notin M}\log p_t(z_t)$, where $M$ indexes masked frames and $p_t(z)$ is the softmax over cluster targets; $\alpha = 0.1$ down-weights unmasked regions.
  • Fine-tuning objective: $\mathcal{L}_\mathrm{ft} = -\sum_{i=1}^L \log p(y_i \mid y_{<i}, I_a, I_v)$. No contrastive or CTC terms are employed in XLAVS-R's AV stage.
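
The masked cluster-prediction loss can be written out in a few lines. This is an illustrative NumPy version of $\mathcal{L}_\mathrm{ssl}$ as defined above; the function name and the toy logits are assumptions for the demo.

```python
import numpy as np

def ssl_loss(logits, targets, mask, alpha=0.1):
    """Masked cluster-prediction loss.

    logits: (T, K) per-frame scores over K cluster IDs.
    targets: (T,) cluster IDs z_t from K-means over encoder outputs.
    mask: (T,) bool, True for masked frames.
    Masked frames get full weight; unmasked frames are scaled by alpha.
    """
    logp = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]  # per-frame -log p_t(z_t)
    return nll[mask].sum() + alpha * nll[~mask].sum()

T, K = 4, 3
targets = np.array([0, 1, 2, 0])
logits = np.full((T, K), -10.0)
logits[np.arange(T), targets] = 10.0   # confident, correct predictions
mask = np.array([True, True, False, False])
loss = ssl_loss(logits, targets, mask)  # near zero for correct predictions
```

The $\alpha$ term means errors on visible frames still contribute gradient, but an order of magnitude less than errors on masked frames.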

Regularization is enforced via dropout and noise-augmented batches throughout SSL and supervised training.

5. Evaluation Protocols and Quantitative Benchmarks

XLAVS-R's efficacy is established on:

  • MuAViC benchmark: 9-language AVSR and 6-language AVS2TT, under both clean and 0 dB babble-noise conditions.
  • FLEURS dataset (OOD): Audio-only ASR in the same languages.

Metrics:

  • WER: For AVSR, lower values signify improved robustness.
  • BLEU: For AVS2TT, higher indicates better translation quality.

Results:

  • Noisy AVSR: XLAVS-R 2B reduces WER to 37.3% versus AV-HuBERT’s 56.1%, an 18.8-percentage-point (≈33% relative) reduction.
  • AVS2TT: XLAVS-R 2B achieves 18.7 BLEU versus AV-HuBERT’s 13.9 (4.8 BLEU gain).
  • Zero-shot AV: With audio-only fine-tuning, XLAVS-R reaches 37.2% WER in noisy AV mode, matching its AV-fine-tuned performance. AV-HuBERT, by contrast, exhibits a large gap (≈56% vs. 79% WER) (Han et al., 2024).
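
The arithmetic behind the headline comparison, spelled out to distinguish the absolute (percentage-point) gap from the relative reduction:

```python
avhubert, xlavsr = 56.1, 37.3          # noisy-AVSR WER (%), from the results above
absolute_pp = avhubert - xlavsr         # gap in percentage points
relative = (avhubert - xlavsr) / avhubert  # fraction of AV-HuBERT's WER removed
print(f"{absolute_pp:.1f} pp absolute, {relative:.1%} relative reduction")
```

Reporting both numbers avoids the common ambiguity between "percent" and "percentage points" when comparing error rates.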

6. Ablation Studies and Modular Impact

Ablations illuminate the compound benefits of XLAVS-R’s architecture:

  • Single-round SSL with audio-only targets yields modest gains (1–2 pp WER).
  • A learnable audio front-end contributes the largest reduction (6 pp).
  • Multilingual AV pre-training over 9 languages (vs. English-only) improves WER by 2–3 pp.
  • Large-scale audio-only pre-training closes a further 3–4 pp.

These components are synergistic; their integration is necessary for maximal noise robustness (Han et al., 2024).

7. Design Insights, Recommendations, and Best Practices

Key principles extracted from XLAVS-R’s development and evaluation:

  • Pre-train extensively on audio-only SSL for broad phonetic/semantic knowledge—critical for low-resource domains.
  • Inject visual modality via AV SSL over small multilingual AV corpora, ensuring cross-modal representational alignment.
  • Robustness requires explicit noise augmentation during both pre-training and fine-tuning, aligning latent spaces across clean and noisy inputs.
  • Modality dropout must be incorporated to enable models to generalize when one stream is missing or unreliable.
  • Balance AV and audio-only batches during fine-tuning to enable strong zero-shot AV inference.
  • Scale model capacity in proportion to language and data diversity for effective cross-lingual transfer.

Practitioners addressing language coverage or noise robustness are recommended to follow the two-stage (audio-only → AV SSL) pipeline, maintain aggressive noise and modality dropout during training, and select model scaling to suit data availability (Han et al., 2024).

References

  • XLAVS-R: "XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception" (Han et al., 2024)

This framework demonstrates state-of-the-art performance under realistic noisy conditions, delivers strong multilingual and zero-shot AVSR ability, and establishes robust pre-training and fine-tuning protocols for reweighting modalities according to instantaneous input quality.
