LRS3 AV-ASR Benchmark: Multimodal Speech Recognition
- LRS3 AV-ASR Benchmark is a multimodal evaluation corpus combining 433 hours of TED talk excerpts with aligned audio and video data to test robust speech recognition under noisy conditions.
- It supports diverse architectures—from Conformer-based systems to video transformers and unified LLM-driven models—enabling rigorous, fair comparisons across approaches.
- The benchmark drives advancements in self-supervised learning, pseudo-labeling, and efficient token compression, leading to significant reductions in word error rates and enhanced real-time ASR performance.
The LRS3 AV-ASR benchmark is a critical evaluation corpus and protocol for automatic speech recognition leveraging both audio and video modalities, designed to rigorously assess the integration of multimodal information for robust transcription, especially under noisy conditions and data-scarce regimes. Its adoption has catalyzed methodological advances ranging from self-supervised representation learning and large-scale pseudo-labeling to unified LLM-based frameworks, making it the gold standard for the fair empirical comparison of AV-ASR systems.
1. Dataset Composition and Benchmark Protocol
The LRS3 dataset comprises approximately 433 hours of English-language TED and TEDx talk excerpts, with dense frame-level audio and video aligned to transcribed utterances. Standard splits consist of a 408 h pre-train partition, a 30 h “train-val” split for supervised fine-tuning, and a ≈1–1.6 h held-out test set (~1,300 utterances) (Serdyuk et al., 2022, Ren et al., 2023, Ma et al., 2022, Shi et al., 2022).
Audio Processing
- 16 kHz mono waveforms, log-Mel or filterbank features (26–80 dims), typical frame rates: 25–100 Hz.
- Augmentation: additive babble/pink/music noise (MUSAN, NOISEX-92), time-masking, SpecAugment.
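The masking-style augmentations above can be sketched with a minimal SpecAugment-like routine over a log-Mel feature matrix. This is an illustrative sketch, not the benchmark's official settings: mask counts, mask widths, and the mean-fill choice are assumptions.

```python
import numpy as np

def spec_augment(log_mel, num_time_masks=2, max_t=20, num_freq_masks=2, max_f=8,
                 rng=None):
    """SpecAugment-style masking on a (frames, mel_bins) array.
    Masked regions are set to the per-utterance mean (one common choice).
    Mask counts/widths here are illustrative, not canonical."""
    rng = rng if rng is not None else np.random.default_rng()
    x = log_mel.copy()
    fill = x.mean()
    T, F = x.shape
    for _ in range(num_time_masks):             # time masks of length 1..max_t
        t = int(rng.integers(1, max_t + 1))
        t0 = int(rng.integers(0, max(1, T - t)))
        x[t0:t0 + t, :] = fill
    for _ in range(num_freq_masks):             # frequency masks of width 1..max_f
        f = int(rng.integers(1, max_f + 1))
        f0 = int(rng.integers(0, max(1, F - f)))
        x[:, f0:f0 + f] = fill
    return x

feats = np.random.randn(100, 80)                # 1 s at 100 Hz, 80-dim log-Mel
aug = spec_augment(feats, rng=np.random.default_rng(0))
```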
Video Processing
- Face detection (Dlib, MediaPipe, RetinaFace), mouth region-of-interest (ROI) crops (e.g., 88×88, 96×96, 128×128 px), usually grayscale, frame rates of 25–33.3 Hz.
- Augmentation: horizontal flip, random crop, adaptive time masking, patch masking.
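The spatial augmentations above are typically applied consistently across all frames of a clip, so the lip motion stays coherent. A minimal sketch for random crop plus horizontal flip on a grayscale mouth-ROI clip (crop size and flip probability are illustrative):

```python
import numpy as np

def augment_mouth_roi(frames, crop=88, rng=None):
    """Clip-consistent random crop + horizontal flip for a (T, H, W)
    grayscale mouth-ROI sequence. The same crop window and flip decision
    apply to every frame so temporal dynamics are preserved."""
    rng = rng if rng is not None else np.random.default_rng()
    T, H, W = frames.shape
    y0 = int(rng.integers(0, H - crop + 1))
    x0 = int(rng.integers(0, W - crop + 1))
    out = frames[:, y0:y0 + crop, x0:x0 + crop]
    if rng.random() < 0.5:                      # horizontal flip with p = 0.5
        out = out[:, :, ::-1]
    return out

clip = np.random.rand(25, 96, 96)               # 1 s of video at 25 Hz, 96x96 px
aug = augment_mouth_roi(clip, rng=np.random.default_rng(1))
```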
Metric
- Word Error Rate (WER): WER = (S + D + I) / N, with S substitutions, D deletions, I insertions, and N reference words.
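WER is computed via a standard word-level edit-distance dynamic program; a minimal sketch:

```python
import numpy as np

def wer(reference, hypothesis):
    """Word Error Rate via Levenshtein distance over word sequences:
    WER = (S + D + I) / N, with N the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i, j] = edit distance between ref[:i] and hyp[:j]
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)           # all deletions
    d[0, :] = np.arange(len(hyp) + 1)           # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + sub)  # substitution / match
    return d[len(ref), len(hyp)] / len(ref)

score = wer("the cat sat on the mat", "the cat sat on mat")  # 1 deletion / 6 words
```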
Protocols may add babble or other noise at inference to gauge robustness, and all major studies report results using the official splits to ensure comparability.
2. Canonical and Advanced AV-ASR Architectures
Conformer-Based and Hybrid Models
Early state-of-the-art systems leverage a dual-stream architecture:
- Separate audio and video encoders (ResNet-18, Conformer stacks), fused via MLPs or addition, followed by joint CTC/seq2seq or attention-based decoders (Ma et al., 2022, Ma et al., 2023, Serdyuk et al., 2021).
- Innovations include streamable Conformer blocks (chunk-wise self-attention, causal convolution), triggered attention for streaming inference, and alignment regularization losses synchronizing encoder activations.
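Chunk-wise self-attention restricts each frame to attend within its own chunk and earlier chunks, which is what makes the encoder streamable. A sketch of the corresponding boolean attention mask (chunk size is illustrative):

```python
import numpy as np

def chunk_causal_mask(T, chunk):
    """Attention mask for chunk-wise self-attention: frame i may attend to
    frame j iff j's chunk index is not later than i's. Within a chunk,
    attention is bidirectional; across chunks, it is causal."""
    idx = np.arange(T) // chunk                 # chunk index per frame
    return idx[None, :] <= idx[:, None]         # (T, T), True = may attend

m = chunk_causal_mask(6, 2)                     # chunks: [0,0,1,1,2,2]
```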
Video Transformer and Efficient Encoders
Recent work demonstrates video transformers (ViT, tubelet-based) can match or surpass 3D-conv baselines for mouth ROI feature extraction (Serdyuk et al., 2022, Serdyuk et al., 2021).
- Fine-tuned on LRS3, ViT front-ends yield up to 15% relative WER reductions in lip-reading versus VGG-style convolutional nets.
Fusion Strategies and Modality Dropout
Fusion is achieved by concatenation, addition, or cross-modal attention; modality dropout randomly zeroes out either stream during training to ensure unimodal robustness (Lian et al., 2023, Shi et al., 2022, May et al., 2023, Rouditchenko et al., 2023).
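The modality dropout described above can be sketched as follows; the drop probabilities are illustrative, and the key invariant is that at most one stream is zeroed per training step:

```python
import numpy as np

def modality_dropout(audio, video, p_audio=0.25, p_video=0.25, rng=None):
    """Randomly zero one input stream during training so the fused model
    remains usable when a modality is missing at test time. At most one
    stream is dropped per call; probabilities here are illustrative."""
    rng = rng if rng is not None else np.random.default_rng()
    r = rng.random()
    if r < p_audio:
        audio = np.zeros_like(audio)            # audio dropped -> VSR-like step
    elif r < p_audio + p_video:
        video = np.zeros_like(video)            # video dropped -> ASR-like step
    return audio, video
```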
Innovations such as gated linear units (GLU) and Flamingo blocks address stable fusion and enable decoders to weigh modalities dynamically under variable SNR (Li et al., 26 Jan 2026, Ren et al., 2023).
3. Self-Supervised, Semi-Supervised, and Pseudo-Labeling Methods
The LRS3 benchmark catalyzed the development of self-supervised pretext tasks (masked cluster prediction, context regression) and pseudo-labeling strategies.
AV-HuBERT and Successors
AV-HuBERT (Shi et al., 2022), the seminal approach, iteratively clusters fused audio-visual features, masks the multi-stream inputs, and predicts the assigned phonetic units. With 1,759 h of unlabeled AV data, lip-reading WER drops from 33.6% (prior SOTA, trained on 31K h of transcribed data) to 26.9% (433 h labeled plus self-training). For audio-only ASR, pre-training with AV clusters yields a SOTA WER of 1.3% (versus a previous best of 2.3%).
AV-data2vec and Joint Contextualization
AV-data2vec (Lian et al., 2023) builds a joint (audio-video) transformer encoder trained to regress contextualized targets from a momentum teacher, leveraging early fusion and multi-layer averaging. On LRS3, audio-visual fine-tuning (433 h) achieves 1.7% WER (Large, 1,759 h SSL), −35.7% relative improvement over audio-only, with modality dropout scheduling critical for parameter efficiency and transfer across tasks.
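The momentum-teacher mechanism at the heart of data2vec-style training is an exponential moving average of the student's weights; a minimal sketch (the decay value is illustrative):

```python
import numpy as np

def ema_update(teacher_params, student_params, tau=0.999):
    """Momentum-teacher update used in data2vec-style training: each teacher
    parameter tracks an exponential moving average of the student's.
    The teacher provides contextualized regression targets and receives
    no gradients of its own."""
    return [tau * t + (1.0 - tau) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [np.zeros(3)]
student = [np.ones(3)]
teacher = ema_update(teacher, student, tau=0.9)   # moves 10% toward the student
```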
Continuous Pseudo-Labeling Without External Teacher
AV-CPL (Rouditchenko et al., 2023) replaces static external pseudo-labels (as in Auto-AVSR (Ma et al., 2023)) with on-the-fly, EMA-teacher-generated targets, trading off ASR against VSR performance via the modality dropout hyperparameter p_m'. For VSR, WER drops by 21.7% absolute (Base, p_m'=0.1).
Large-Scale Pseudo-Labeling and Data Expansion
Auto-AVSR (Ma et al., 2023) demonstrates large-scale noisy pseudo-labels (from Conformer, HuBERT, Whisper teachers) substantially reduce WER, saturating audio-only improvement, but continuing to benefit VSR. Best reported AVSR WER on LRS3 is 0.9% using 3,448 h of training data (818 h manually labeled, 2,630 h pseudo-labeled).
4. LLM-Driven AV-ASR
Unified frameworks leveraging LLMs have emerged for AV-ASR, enabled by aggressive token compression and elastic granularity.
Token Compression and Multi-Granularity
Omni-AVSR (Cappellazzo et al., 10 Nov 2025) and MMS-LLaMA (Yeo et al., 14 Mar 2025) employ audio and visual encoders (Whisper, AV-HuBERT) followed by matryoshka/early-fusion token compression, feeding a frozen LLaMA with LoRA adapters.
- Omni-AVSR achieves 1.0% WER (av-rates 4,2), outperforming baselines while using a single 58M param model across ASR, VSR, and AVSR.
- MMS-LLaMA compresses features to 3.5 multimodal tokens per second (versus 25 tps baseline), with negligible loss in accuracy (0.74% clean WER), and a 35.7% FLOPs reduction.
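One simple way to reduce the tokens-per-second fed to an LLM decoder is fixed-rate average pooling of encoder features. The matryoshka and query-based mechanisms in the cited systems are more elaborate, but the rate arithmetic is the same; this sketch uses illustrative dimensions:

```python
import numpy as np

def compress_tokens(tokens, rate):
    """Average-pool a (T, D) token sequence by an integer rate, reducing
    the number of multimodal tokens the LLM must attend over. Any
    trailing frames that do not fill a full group are dropped."""
    T, D = tokens.shape
    T_out = T // rate
    return tokens[:T_out * rate].reshape(T_out, rate, D).mean(axis=1)

x = np.random.randn(100, 256)        # 4 s of 25 Hz encoder features
y = compress_tokens(x, rate=7)       # 14 tokens over 4 s = 3.5 tokens/s
```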
Decoding and Losses
Decoding employs beam search on next-token cross-entropy or hybrid scoring (joint CTC/seq2seq), possibly integrating external 4-gram LMs (Cappellazzo et al., 10 Nov 2025, Rouditchenko et al., 2023).
Trade-offs between accuracy, computational cost, and token length are managed by dynamic query allocations based on estimated speech rate.
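The speech-rate-based allocation above can be illustrated with a hypothetical heuristic; this is not the actual mechanism of any cited system, and the function name, scaling factor, and floor are all assumptions:

```python
def num_queries(duration_s, est_words_per_s, tokens_per_word=1.2, min_q=4):
    """Hypothetical heuristic: allocate compression queries (output tokens)
    in proportion to estimated speech rate, so fast speech receives more
    tokens while a floor guards very short or slow utterances."""
    return max(min_q, round(duration_s * est_words_per_s * tokens_per_word))

q_fast = num_queries(2.0, 3.0)   # 2 s of fast speech
q_slow = num_queries(0.5, 1.0)   # short, slow utterance -> hits the floor
```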
5. Benchmark Performance and Comparative Analysis
Table: SOTA LRS3 AV-ASR Results from Recent Studies
| System | Training Data (h) | AVSR WER (%) | Robustness Under Noise |
|---|---|---|---|
| MMS-LLaMA (Yeo et al., 14 Mar 2025) | 1,759 | 0.74 | Maintained SOTA, dynamic tokens |
| Auto-AVSR (Ma et al., 2023) | 3,448 | 0.9 | SOTA with pseudo-labels |
| Omni-AVSR-ST (Cappellazzo et al., 10 Nov 2025) | 433 | 1.0 (rates 4,2) | 18% @ –5 dB SNR |
| AV-data2vec (Large, 1,759 h) (Lian et al., 2023) | 1,759 | 1.3 | – |
| AV-HuBERT (Large) (Shi et al., 2022) | 1,759 | 1.3 | – |
| Dual-Use Whisper (Li et al., 26 Jan 2026) | 1,929 | 4.08* | SOTA in noisy/MUSAN |
| Streaming Conformer (Ma et al., 2022) | 438 | 2.0 (offline) | – |
| VALLR (Visual-only) (Thomas et al., 27 Mar 2025) | 30 (labeled) | 18.7 | N/A |
*Under MUSAN babble, 0 dB SNR, using Whisper medium + AV-HuBERT visual.
Key findings
- SOTA WERs are now well below 1% (MMS-LLaMA, Auto-AVSR) given large-scale pseudo-labels and compressed LLM tokenization.
- Semi-supervised and SSL models (AV-data2vec, AV-HuBERT) approach SOTA at far lower annotation cost.
- Data scaling saturates audio-only gains, whereas VSR and AVSR continue to improve with pseudo-labeled augmentation.
6. Streaming and Practical ASR
Streaming AV-ASR architectures using chunk-wise and triggered attention enable real-time recognition, with delays as low as 1.06 s at the expense of a small WER degradation (2.0→2.6%) (Ma et al., 2022). Alignment regularization critically synchronizes modal streams, reducing prediction offsets to <6 ms across SNRs.
LLM-based approaches now support elastic inference, balancing model size, token compression, and accuracy for variable compute and latency budgets (Cappellazzo et al., 10 Nov 2025).
7. Outstanding Limitations and Future Directions
Key limitations include residual viseme ambiguity in visual-only models (VALLR (Thomas et al., 27 Mar 2025)), saturation of audio performance beyond ~1,500 h of pseudo-labels, and sensitivity of SSL methods to hyperparameters (modality scheduler, EMA rate). At present, robust AVSR under severe acoustic noise (SNR at or below –5 dB) remains challenging, with the best systems plateauing near 18% WER (Cappellazzo et al., 10 Nov 2025, Li et al., 26 Jan 2026).
Emerging directions:
- Speaker-adaptive and cross-lingual AVSR models (Mandarin CSTS (Ren et al., 2023)).
- Large-scale, truly end-to-end multimodal LLMs with contextual reasoning over compressed tokens.
- Unified parameter-efficient adaptation (e.g., LoRA, matryoshka) minimizing model multiplicity.
- Deeper investigation of interpretable intermediate structures (phonemes, articulatory features) for visual-only and AVSR pipelines (Thomas et al., 27 Mar 2025).
The LRS3 AV-ASR benchmark remains the definitive standard for evaluating progress in multimodal speech recognition, combining rigor, scale, and diversity; its systematic use across architectures, training regimes, and scalability studies ensures that it continues to drive innovation and fair comparison in the field.