Audio-Visual Efficient Conformer (AVEC)
- The paper presents a novel AVSR framework that integrates efficient Conformer encoders, early fusion, and optimized intermediate CTC losses to achieve state-of-the-art WER and improved noise robustness.
- AVEC pairs dedicated audio and visual front-ends (a convolutional stem and a ResNet backbone, respectively) with Efficient Conformer encoders that use patch attention, reducing computational cost while preserving accurate audio processing and lip-reading.
- Its training procedure leverages extensive pre-training, aggressive data augmentation, and efficient fine-tuning to significantly cut training time and support real-time transcription.
Audio-Visual Efficient Conformer (AVEC) refers to a class of neural architectures for Audio-Visual Speech Recognition (AVSR) that integrate Efficient Conformer backbones, parameter/compute-efficient cross-modal fusion strategies, and advanced training losses to achieve robust, low-latency, and highly accurate speech transcription from both audio and video (typically lip region) modalities. AVEC architectures have demonstrated state-of-the-art Word Error Rate (WER) on standard audio-visual and lipreading benchmarks, marked improvements in noise robustness, and significantly reduced training and inference costs compared to prior dual-stream Conformer models (Burchi et al., 2023, Wang et al., 31 Aug 2024, Burchi et al., 14 Mar 2024).
1. Architectural Components
AVEC models are built from four core blocks: audio front-end, visual front-end, modality-specific Efficient Conformer encoders, and an audio-visual fusion/joint encoder.
Audio Front-End:
- Input: raw waveform sampled at 16 kHz.
- Feature pipeline: 20 ms Short-Time Fourier Transform (STFT) with 10 ms hop, 257-point FFT, 80-dimensional mel-filterbanks.
- Initial convolutional stem: 2D Conv (3×3 kernel, 180 filters, stride 2×2), linear projection, frame-rate reduction to 40 ms, yielding a feature dimension of 180 (matching the first encoder stage).
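A minimal PyTorch sketch of this front-end (a 512-point FFT behind the 257 bins, the flattened projection, and a single stride-2 stem are assumptions; further reduction to the quoted 40 ms rate is left to the encoder stages):

```python
import torch
import torchaudio

class AudioFrontEnd(torch.nn.Module):
    """Sketch of the audio front-end: log-mel features + 2D conv stem + linear projection."""

    def __init__(self, d_model: int = 180):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000,
            n_fft=512,        # 257 frequency bins (assumption)
            win_length=320,   # 20 ms window
            hop_length=160,   # 10 ms hop
            n_mels=80,
        )
        # Stride 2x2 halves both time (10 ms -> 20 ms frames) and the 80 mel bins.
        self.stem = torch.nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1)
        self.proj = torch.nn.Linear(d_model * 40, d_model)  # flatten channel x freq, project to 180

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (B, samples) at 16 kHz
        feats = torch.log(self.melspec(wav) + 1e-6)       # (B, 80, T)
        x = self.stem(feats.unsqueeze(1))                 # (B, 180, 40, T/2)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)    # (B, T/2, 180*40)
        return self.proj(x)                               # (B, T/2, 180)
```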
Visual Front-End:
- Input: grayscale (normalized) lip crop, 96×96 pixels.
- Spatio-temporal extraction: 3D Conv (5×7×7, 64 filters, stride 1×2×2), max-pooling.
- Backbone: ResNet-18 trunk (four residual stages, channels 64→512); spatial global average pooling and linear projection to 256 dimensions, matching the 40 ms frame rate of 25 fps video.
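A sketch of this visual front-end, assuming torchvision's ResNet-18 residual stages stand in for the backbone:

```python
import torch
import torchvision

class VisualFrontEnd(torch.nn.Module):
    """Sketch: 3D conv stem + ResNet-18 trunk + global pooling + projection to 256 dims."""

    def __init__(self, d_out: int = 256):
        super().__init__()
        # 3D conv over (T, H, W): kernel 5x7x7, stride 1x2x2 preserves the 25 fps frame rate.
        self.stem = torch.nn.Sequential(
            torch.nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                            padding=(2, 3, 3), bias=False),
            torch.nn.BatchNorm3d(64),
            torch.nn.ReLU(inplace=True),
            torch.nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        resnet = torchvision.models.resnet18(weights=None)
        # Keep only the four residual stages (64 -> 512 channels); drop conv1/fc.
        self.trunk = torch.nn.Sequential(resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)
        self.proj = torch.nn.Linear(512, d_out)

    def forward(self, lips: torch.Tensor) -> torch.Tensor:
        # lips: (B, 1, T, 96, 96) normalized grayscale lip crops
        x = self.stem(lips)                            # (B, 64, T, 24, 24)
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch dimension
        x = self.trunk(x)                              # (B*T, 512, 3, 3)
        x = x.mean(dim=(2, 3))                         # spatial global average pooling
        return self.proj(x).reshape(b, t, -1)          # (B, T, 256)
```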
Efficient Conformer Encoders:
- Audio: multi-stage (e.g., 3 stages), with strided temporal downsampling (kernel size 15) between stages, increasing feature dimensions (180→256→360), and several Conformer blocks per stage.
- Visual: Typically 2 stages, 256→360 feature dimension.
- Block structure: Alternation of convolutional modules, feed-forward modules, and multi-head self-attention (MHSA), with default attention modified for efficiency (see below).
- Patch Attention: replaces grouped attention in the initial audio stages; average pooling reduces the number of time-steps by the patch size, relative-positional MHSA is computed on the lower-resolution feature map, and nearest-neighbor upsampling restores the original resolution, yielding roughly 7–10% FLOP savings.
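An illustrative patch-attention block capturing the downsample-attend-upsample pattern described above (a vanilla nn.MultiheadAttention stands in for the relative-positional MHSA used in the papers):

```python
import torch

class PatchAttention(torch.nn.Module):
    """Average-pool the sequence by a patch size, attend on the shorter sequence, upsample back."""

    def __init__(self, d_model: int, num_heads: int, patch_size: int):
        super().__init__()
        self.patch_size = patch_size
        self.pool = torch.nn.AvgPool1d(kernel_size=patch_size, stride=patch_size, ceil_mode=True)
        self.mhsa = torch.nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D)
        t = x.size(1)
        pooled = self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, ceil(T/P), D)
        attended, _ = self.mhsa(pooled, pooled, pooled)
        upsampled = torch.nn.functional.interpolate(
            attended.transpose(1, 2), scale_factor=self.patch_size, mode="nearest"
        ).transpose(1, 2)
        return upsampled[:, :t]                                  # crop back to the original length
```

Because attention cost is quadratic in sequence length, pooling by a patch size P cuts the MHSA FLOPs by roughly P², which is where the reported savings come from.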
Audio-Visual Fusion and Joint Processing:
- Early fusion: concatenation of audio and visual features at a matched frame rate and feature dimension; feed-forward processing and projection back to 360 dimensions.
- Joint Efficient Conformer encoder (typically 5 blocks, single stage, no downsampling) operates on fused features; final linear projection maps to vocabulary size for CTC decoding.
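A minimal sketch of the early-fusion step (the feed-forward expansion factor and activation are assumptions):

```python
import torch

class EarlyFusion(torch.nn.Module):
    """Concatenate time-aligned audio and visual features and project back to the joint width."""

    def __init__(self, d_model: int = 360, d_ff: int = 1440):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(2 * d_model, d_ff),  # 720 -> hidden (expansion factor assumed)
            torch.nn.SiLU(),
            torch.nn.Linear(d_ff, d_model),      # back to 360 for the joint encoder
        )

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio, video: (B, T, 360) at the same frame rate
        return self.ff(torch.cat([audio, video], dim=-1))
```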
2. Training Objective and Learning Rules
AVEC training is primarily based on Connectionist Temporal Classification (CTC) loss, but incorporates intermediate CTC losses and cross-modal conditioning to enhance optimization and expressivity.
Intermediate CTC (InterCTC) Loss and Residual Conditioning:
- Inserted regularly (e.g., every 3rd block) in audio, visual, and joint encoders.
- Block output yields class probabilities: $Z_k = \mathrm{Softmax}(\mathrm{Linear}(X_k))$, where $X_k$ is the output of encoder block $k$.
- Residual conditioning into the next block: $X_k' = X_k + \mathrm{Linear}(Z_k)$.
- InterCTC loss at block $k$: $\mathcal{L}_{\mathrm{inter}}^{(k)} = \mathrm{CTC}(Z_k, y)$ for target transcript $y$.
- Total training loss averages all $K$ intermediate CTC losses and combines them with the final CTC loss, weighted by $\lambda$: $\mathcal{L} = (1-\lambda)\,\mathcal{L}_{\mathrm{CTC}} + \frac{\lambda}{K}\sum_{k=1}^{K}\mathcal{L}_{\mathrm{inter}}^{(k)}$.
- This relaxes the conditional independence assumption of CTC and speeds convergence, especially in visually dominated (lipreading) regimes.
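A compact sketch of InterCTC with residual conditioning following the formulas above (the tapped-block schedule, the linear layers, and the loss reduction settings are illustrative, not the exact configuration):

```python
import torch
import torch.nn.functional as F

class InterCTCEncoder(torch.nn.Module):
    """Runs a stack of blocks, tapping intermediate CTC losses and feeding predictions back."""

    def __init__(self, blocks, d_model: int, vocab_size: int, tap_every: int = 3):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)
        self.tap_every = tap_every
        self.to_vocab = torch.nn.Linear(d_model, vocab_size)
        self.from_vocab = torch.nn.Linear(vocab_size, d_model)

    def forward(self, x, targets, in_lens, tgt_lens, blank: int = 0):
        inter_losses = []
        for i, block in enumerate(self.blocks, start=1):
            x = block(x)
            if i % self.tap_every == 0 and i < len(self.blocks):
                z = self.to_vocab(x).log_softmax(dim=-1)                         # (B, T, V)
                inter_losses.append(
                    F.ctc_loss(z.transpose(0, 1), targets, in_lens, tgt_lens, blank=blank)
                )
                x = x + self.from_vocab(z.exp())                                 # residual conditioning
        inter = torch.stack(inter_losses).mean() if inter_losses else x.new_zeros(())
        # Caller combines: total = (1 - lam) * final_ctc + lam * inter
        return x, inter
```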
Alternative Cross-Modal Fusion Approaches:
Later variants introduce Dual Conformer Interaction Modules (DCIMs), which generate cross-modal fusion via paired Conformer blocks and lightweight bottleneck adapters for efficient feature exchange. DCIM layers propagate information symmetrically, optionally using learned gating schemes to weight intra- and inter-modal signals (Wang et al., 31 Aug 2024).
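The exact DCIM design is specified in Wang et al. (31 Aug 2024); the following is only an illustrative sketch of a symmetric bottleneck-adapter exchange with learned gating, not the paper's module:

```python
import torch

class BottleneckExchange(torch.nn.Module):
    """Illustrative symmetric cross-modal exchange via bottleneck adapters with learned gates."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.a2v = torch.nn.Sequential(
            torch.nn.Linear(d_model, bottleneck), torch.nn.GELU(), torch.nn.Linear(bottleneck, d_model)
        )
        self.v2a = torch.nn.Sequential(
            torch.nn.Linear(d_model, bottleneck), torch.nn.GELU(), torch.nn.Linear(bottleneck, d_model)
        )
        self.gate_a = torch.nn.Parameter(torch.zeros(1))
        self.gate_v = torch.nn.Parameter(torch.zeros(1))

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # Each stream receives a gated, low-rank projection of the other stream.
        audio_out = audio + torch.sigmoid(self.gate_a) * self.v2a(video)
        video_out = video + torch.sigmoid(self.gate_v) * self.a2v(audio)
        return audio_out, video_out
```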
3. Training Procedure and Efficiency Optimizations
AVEC models are trained in several stages, with extensive regularization and data augmentation for both audio and video.
Pre-training:
- Visual branches: Pre-trained as visual-only models (e.g., on LRW for 30 epochs) using word-level cross-entropy.
- Audio and visual-specific encoders: Pre-trained on respective modalities.
- Joint audio-visual models: Trained on composite datasets (e.g., LRS2+LRS3).
- Batch size: Up to 256 aggregated via gradient accumulation.
- Optimizer: Adam (β₁=0.9, β₂=0.98), weight decay 1e-6; Noam learning rate schedule, 10k warmup steps, inverse-sqrt decay.
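A minimal sketch of this optimizer and schedule in PyTorch (scaling the peak rate by the model width is the usual Noam convention and is assumed here):

```python
import torch

def make_optimizer(model: torch.nn.Module, d_model: int = 360, warmup: int = 10000):
    """Adam with a Noam schedule: linear warm-up, then inverse-square-root decay."""
    opt = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), weight_decay=1e-6)

    def noam(step: int) -> float:
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=noam)  # multiplies the base lr of 1.0
    return opt, sched
```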
Data Augmentation:
- Audio: SpecAugment (2 frequency masks with F=27; 5 time masks).
- Video: Temporal masking (1 per second, max 0.4 s), random cropping (88×88), horizontal flip.
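An illustrative implementation of this recipe (torchaudio's masking transforms stand in for SpecAugment; the time-mask width parameter and mask placement are assumptions, and the 0.4 s cap assumes 25 fps video):

```python
import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)  # width parameter assumed

def augment_audio(mel: torch.Tensor) -> torch.Tensor:
    # mel: (B, 80, T) log-mel features; 2 frequency masks and 5 time masks.
    for _ in range(2):
        mel = freq_mask(mel)
    for _ in range(5):
        mel = time_mask(mel)
    return mel

def augment_video(frames: torch.Tensor) -> torch.Tensor:
    # frames: (B, T, 96, 96) -> random 88x88 crop, random horizontal flip,
    # one temporal mask per second of up to 0.4 s (10 frames at 25 fps).
    top, left = [int(v) for v in torch.randint(0, 9, (2,))]
    frames = frames[..., top:top + 88, left:left + 88].clone()
    if torch.rand(()) < 0.5:
        frames = frames.flip(-1)
    t = frames.size(1)
    for sec_start in range(0, t, 25):
        mask_len = int(torch.randint(0, 11, (1,)))
        mask_start = sec_start + int(torch.randint(0, 25, (1,)))
        frames[:, mask_start:mask_start + mask_len] = 0.0
    return frames
```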
Training Length:
- Audio-only: 200 epochs
- Visual-only: 100 epochs
- Audio-visual: 70 epochs (4× quicker convergence due to fusion and auxiliary losses).
Regularization:
- Stochastic Weight Averaging (SWA) over final checkpoints.
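A minimal sketch of checkpoint averaging with PyTorch's built-in SWA utilities (assumes the saved checkpoints are raw state dicts; the averaging window is an assumption):

```python
import torch
from torch.optim.swa_utils import AveragedModel

def average_checkpoints(model: torch.nn.Module, checkpoint_paths):
    """Average the weights of the final checkpoints into a single SWA model."""
    swa_model = AveragedModel(model)
    for path in checkpoint_paths:
        model.load_state_dict(torch.load(path, map_location="cpu"))
        swa_model.update_parameters(model)  # running average over loaded checkpoints
    return swa_model
```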
Language Modeling:
- Optional: 6-gram KenLM, or Transformer LM (GPT-3 small, pre-trained/fine-tuned), used in beam search and neural rescoring.
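A hypothetical decoding path using pyctcdecode with a KenLM n-gram model (the character vocabulary, LM path, and alpha/beta weights are placeholders; the papers' exact decoding stack may differ):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Blank token "" at index 0 must match the model's CTC head; vocabulary is a placeholder.
labels = [""] + list("abcdefghijklmnopqrstuvwxyz '")
decoder = build_ctcdecoder(labels, kenlm_model_path="lm/6gram.arpa", alpha=0.5, beta=1.0)

def decode(logits: np.ndarray) -> str:
    # logits: (T, V) per-frame outputs of the CTC head
    return decoder.decode(logits, beam_width=100)
```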
4. Experimental Setup and Empirical Results
Extensive empirical validation on LRS2 and LRS3 confirms the quantitative advantages of AVEC. Key results and metrics follow strict reporting conventions used in the underlying publications.
| Model / Setting | LRS2 WER (%) | LRS3 WER (%) | Params (M) | Epochs to Best |
|---|---|---|---|---|
| Audio-only | 2.8 / 2.4 | 2.1 / 2.0 | 61 | 200 |
| Visual-only | 32.6 / 29.8 | 39.2 / 37.5 | 61+ | 100 |
| AVEC (audio-visual) | 2.5 / 2.3 | 1.9 / 1.8 | 61 | 70 |
| DCIM-AVSR (53M params) | 2.04 | 1.68 | 53 | - |
| AV-Fast-Conformer (A+V) | - | 0.8 | 197 | - |
- "no LM / +neural LM": LRS2/LRS3 numbers denote WER without and with LLM rescoring (Burchi et al., 2023, Wang et al., 31 Aug 2024, Burchi et al., 14 Mar 2024).
- In noise (e.g., at –5 dB SNR), AVEC outperforms audio-only models by 15–50 WER percentage points; for babble noise, audio-only reaches 75.9% WER versus 33.5% for AVEC without babble augmentation and 11.2% with it (Burchi et al., 2023). A noise-mixing sketch follows this list.
- Inference efficiency: Audio-only inverse real time factor (Inv RTF) ≈ 51×, visual-only ≈ 5.3×, audio-visual ≈ 4.8× (Intel i7 CPU, batch=1) (Burchi et al., 2023).
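A sketch of SNR-controlled noise mixing, as referenced above for babble-noise evaluation and augmentation:

```python
import torch

def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Additively mix noise into speech so that the resulting SNR equals snr_db."""
    # speech, noise: (T,) waveforms at the same sample rate; noise is tiled/cropped to length.
    noise = noise.repeat(speech.numel() // noise.numel() + 1)[: speech.numel()]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean()
    # Scale so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: evaluate at -5 dB SNR
# noisy = mix_at_snr(clean_waveform, babble_waveform, snr_db=-5.0)
```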
Ablation experiments confirm the following:
- Visual back-end switch to Efficient Conformer: –2% absolute WER, –4B FLOPs.
- Patch attention (vs. grouped attention): identical WER, –0.5B FLOPs per 10 s of audio.
- Early fusion (concat+FF): 0.2–0.5% better WER than late/gated alternatives.
- InterCTC: –3.6% absolute WER for visual-only, –0.3% for audio-visual; reduces modality collapse.
5. Efficiency, Scaling, and Extensions
Parameter and FLOP Efficiency:
- DCIM-AVSR reduces parameter count to 53 M from 61 M (13.1% reduction), while further lowering WER by 6–7% and cutting fine-tuning GPU-hours by 40–50% (Wang et al., 31 Aug 2024).
- This is accomplished by trading two Conformer layers per branch for lightweight adapters in DCIM fusion, and by freezing most backbone parameters during cross-modal fine-tuning.
Scaling Law Notation:
- Let $P_F$ denote the conv/ResNet front-end parameters, $C$ the per-layer Conformer cost (feed-forward plus attention terms), $A$ the adapter size, and $N$ the number of Conformer layers per modality branch:
- Standard: $P_{\mathrm{branch}} \approx P_F + N\,C$
- DCIM: $P_{\mathrm{branch}} \approx P_F + (N-2)\,C + A$, with $A \ll C$, reflecting the replacement of two Conformer layers per branch by lightweight adapters.
Deployment and Extensibility:
- Latency: real-time or near real-time decoding (<30 ms per frame) is feasible on commodity and embedded GPUs; a measurement sketch follows this list.
- Selective fine-tuning of adapters enables sample-efficient adaptation to new domains/languages and supports streaming-mode variants.
- Fast front-ends (subsampling, patch attention, and depthwise convolution) offer further compute and memory savings, as in multilingual AVEC-like Fast-Conformer designs (Burchi et al., 14 Mar 2024).
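A hedged sketch for measuring inverse real-time factor (audio duration divided by wall-clock decoding time), matching how the Inv RTF figures in Section 4 are defined; `model` here is any recognizer callable on a batch of one:

```python
import time
import torch

@torch.inference_mode()
def inverse_rtf(model, waveform: torch.Tensor, sample_rate: int = 16000) -> float:
    """Return audio seconds processed per second of wall-clock time (higher is faster)."""
    audio_seconds = waveform.size(-1) / sample_rate
    start = time.perf_counter()
    model(waveform.unsqueeze(0))   # batch size 1, as in the reported CPU setup
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed
```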
6. Comparative Analysis and Relationships to Related Work
AVEC’s distinguishing properties relative to prior AVSR/ASR approaches:
- Grouped and Patch Attention: Patch attention reduces early-MHSA complexity vs grouped attention without degrading performance, facilitating wider and deeper encoders at constant FLOPs (Burchi et al., 2023).
- InterCTC/Intermediate Losses: Shared by recent Fast-Conformer architectures; significantly enhances lip-reading accuracy and prevents modality collapse (Burchi et al., 14 Mar 2024).
- Early Cross-Modal Fusion: Outperforms late/gated or cross-attentional schemes in AVSR, both for noise-robustness and convergence rate.
- Parameter Efficiency (DCIM): AVEC/AVSR models incorporating DCIM-alike adapters achieve both lower WER and resource usage compared to symmetric dual-encoder Conformer baselines (Wang et al., 31 Aug 2024).
- Hybrid Loss and Modality Dropout: In Fast-Conformer variants, a joint CTC/RNN-T loss and scheduled modality dropout enable robust unimodal inference and improved WER, particularly with noisy inputs or a missing modality (Burchi et al., 14 Mar 2024).
7. Datasets and Benchmarks
Benchmarking for AVEC architectures uses widely adopted audio-visual datasets:
| Dataset | Type | Size | Key Usage |
|---|---|---|---|
| LRW | Visual | 488k clips | VSR pre-train |
| LRS2 | Audio-Visual | 224 h | train/test/fusion |
| LRS3 | Audio-Visual | 438 h | train/test/fusion |
| MuAViC | Multilingual AV | 470 h | cross-lingual AVSR |
| VoxCeleb2+AVSpeech | Unlabeled AV | ≈3900 h | auto-labeling |
WER is evaluated following the standard protocols of each benchmark. Pre-trained models and configuration files for AVEC are publicly released to ensure reproducibility (Burchi et al., 2023). Fast-Conformer-based models further introduce large-scale automatic transcription of unlabeled data for low-resource languages, resulting in substantial average WER reductions on MuAViC (Burchi et al., 14 Mar 2024).
AVEC synthesizes architectural, algorithmic, and training innovations to produce AVSR systems that are both compute-efficient and empirically state-of-the-art for multilingual, noise-robust speech recognition (Burchi et al., 2023, Wang et al., 31 Aug 2024, Burchi et al., 14 Mar 2024).