Audio-Visual Efficient Conformer (AVEC)
- The paper presents a novel AVSR framework that integrates efficient Conformer encoders, early fusion, and optimized intermediate CTC losses to achieve state-of-the-art WER and improved noise robustness.
- AVEC pairs dedicated audio and visual front-ends (a convolutional stem and a ResNet backbone, respectively) with Efficient Conformer encoders that use patch attention, reducing computational cost while preserving accurate audio processing and lip-reading.
- Its training procedure leverages extensive pre-training, aggressive data augmentation, and efficient fine-tuning to significantly cut training time and support real-time transcription.
Audio-Visual Efficient Conformer (AVEC) refers to a class of neural architectures for Audio-Visual Speech Recognition (AVSR) that integrate Efficient Conformer backbones, parameter/compute-efficient cross-modal fusion strategies, and advanced training losses to achieve robust, low-latency, and highly accurate speech transcription from both audio and video (typically lip region) modalities. AVEC architectures have demonstrated state-of-the-art Word Error Rate (WER) on standard audio-visual and lipreading benchmarks, marked improvements in noise robustness, and significantly reduced training and inference costs compared to prior dual-stream Conformer models (Burchi et al., 2023, Wang et al., 31 Aug 2024, Burchi et al., 14 Mar 2024).
1. Architectural Components
AVEC models are built from four core blocks: audio front-end, visual front-end, modality-specific Efficient Conformer encoders, and an audio-visual fusion/joint encoder.
Audio Front-End:
- Input: raw waveform sampled at 16 kHz.
- Feature pipeline: 20 ms Short-Time Fourier Transform (STFT) with 10 ms hop, 257-point FFT, 80-dimensional mel-filterbanks.
- Initial convolutional stem: 2D Conv (3×3 kernel, 180 filters, stride 2×2), linear projection, frame-rate reduction to 40 ms, yielding a feature dimension of 180 (matching the first encoder stage).
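A minimal PyTorch sketch of this front-end (a 512-point FFT behind the 257 bins, the flattened projection, and a single stride-2 stem are assumptions; further reduction to the quoted 40 ms rate is left to the encoder stages):

```python
import torch
import torchaudio

class AudioFrontEnd(torch.nn.Module):
    """Sketch of the audio front-end: log-mel features + 2D conv stem + linear projection."""

    def __init__(self, d_model: int = 180):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000,
            n_fft=512,        # 257 frequency bins (assumption)
            win_length=320,   # 20 ms window
            hop_length=160,   # 10 ms hop
            n_mels=80,
        )
        # Stride 2x2 halves both time (10 ms -> 20 ms frames) and the 80 mel bins.
        self.stem = torch.nn.Conv2d(1, d_model, kernel_size=3, stride=2, padding=1)
        self.proj = torch.nn.Linear(d_model * 40, d_model)  # flatten channel x freq, project to 180

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (B, samples) at 16 kHz
        feats = torch.log(self.melspec(wav) + 1e-6)       # (B, 80, T)
        x = self.stem(feats.unsqueeze(1))                 # (B, 180, 40, T/2)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)    # (B, T/2, 180*40)
        return self.proj(x)                               # (B, T/2, 180)
```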
Visual Front-End:
- Input: grayscale (normalized) lip crop, 96×96 pixels.
- Spatio-temporal extraction: 3D Conv (5×7×7, 64 filters, stride 1×2×2), max-pooling.
- Backbone: ResNet-18 trunk (four residual stages, channels 64→512); spatial global average pooling and linear projection to 256 dimensions, matching the 40 ms frame rate of 25 fps video.
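A sketch of this visual front-end, assuming torchvision's ResNet-18 residual stages stand in for the backbone:

```python
import torch
import torchvision

class VisualFrontEnd(torch.nn.Module):
    """Sketch: 3D conv stem + ResNet-18 trunk + global pooling + projection to 256 dims."""

    def __init__(self, d_out: int = 256):
        super().__init__()
        # 3D conv over (T, H, W): kernel 5x7x7, stride 1x2x2 preserves the 25 fps frame rate.
        self.stem = torch.nn.Sequential(
            torch.nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                            padding=(2, 3, 3), bias=False),
            torch.nn.BatchNorm3d(64),
            torch.nn.ReLU(inplace=True),
            torch.nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        resnet = torchvision.models.resnet18(weights=None)
        # Keep only the four residual stages (64 -> 512 channels); drop conv1/fc.
        self.trunk = torch.nn.Sequential(resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4)
        self.proj = torch.nn.Linear(512, d_out)

    def forward(self, lips: torch.Tensor) -> torch.Tensor:
        # lips: (B, 1, T, 96, 96) normalized grayscale lip crops
        x = self.stem(lips)                            # (B, 64, T, 24, 24)
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # fold time into the batch dimension
        x = self.trunk(x)                              # (B*T, 512, 3, 3)
        x = x.mean(dim=(2, 3))                         # spatial global average pooling
        return self.proj(x).reshape(b, t, -1)          # (B, T, 256)
```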
Efficient Conformer Encoders:
- Audio: multi-stage (e.g., 3 stages), with strided temporal downsampling (kernel size 15) between stages, increasing feature dimensions (180→256→360), and several Conformer blocks per stage.
- Visual: Typically 2 stages, 256→360 feature dimension.
- Block structure: Alternation of convolutional modules, feed-forward modules, and multi-head self-attention (MHSA), with default attention modified for efficiency (see below).
- Patch Attention: replaces grouped attention in the initial audio stages; average pooling reduces the number of time-steps by the patch size, relative-positional MHSA is computed on the lower-resolution feature map, and nearest-neighbor upsampling restores the original resolution, yielding roughly 7–10% FLOP savings.
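An illustrative patch-attention block capturing the downsample-attend-upsample pattern described above (a vanilla nn.MultiheadAttention stands in for the relative-positional MHSA used in the papers):

```python
import torch

class PatchAttention(torch.nn.Module):
    """Average-pool the sequence by a patch size, attend on the shorter sequence, upsample back."""

    def __init__(self, d_model: int, num_heads: int, patch_size: int):
        super().__init__()
        self.patch_size = patch_size
        self.pool = torch.nn.AvgPool1d(kernel_size=patch_size, stride=patch_size, ceil_mode=True)
        self.mhsa = torch.nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D)
        t = x.size(1)
        pooled = self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, ceil(T/P), D)
        attended, _ = self.mhsa(pooled, pooled, pooled)
        upsampled = torch.nn.functional.interpolate(
            attended.transpose(1, 2), scale_factor=self.patch_size, mode="nearest"
        ).transpose(1, 2)
        return upsampled[:, :t]                                  # crop back to the original length
```

Because attention cost is quadratic in sequence length, pooling by a patch size P cuts the MHSA FLOPs by roughly P², which is where the reported savings come from.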
Audio-Visual Fusion and Joint Processing:
- Early fusion: concatenation of audio and visual features at a matched frame rate and feature dimension; feed-forward processing and projection back to 360 dimensions.
- Joint Efficient Conformer encoder (typically 5 blocks, single stage, no downsampling) operates on fused features; final linear projection maps to vocabulary size for CTC decoding.
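A minimal sketch of the early-fusion step (the feed-forward expansion factor and activation are assumptions):

```python
import torch

class EarlyFusion(torch.nn.Module):
    """Concatenate time-aligned audio and visual features and project back to the joint width."""

    def __init__(self, d_model: int = 360, d_ff: int = 1440):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(2 * d_model, d_ff),  # 720 -> hidden (expansion factor assumed)
            torch.nn.SiLU(),
            torch.nn.Linear(d_ff, d_model),      # back to 360 for the joint encoder
        )

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio, video: (B, T, 360) at the same frame rate
        return self.ff(torch.cat([audio, video], dim=-1))
```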
2. Training Objective and Learning Rules
AVEC training is primarily based on Connectionist Temporal Classification (CTC) loss, but incorporates intermediate CTC losses and cross-modal conditioning to enhance optimization and expressivity.
Intermediate CTC (InterCTC) Loss and Residual Conditioning:
- Inserted regularly (e.g., every 3rd block) in audio, visual, and joint encoders.
- Block output yields class probabilities: $Z_k = \mathrm{Softmax}(\mathrm{Linear}(X_k))$, where $X_k$ is the output of encoder block $k$.
- Residual conditioning into the next block: $X_k' = X_k + \mathrm{Linear}(Z_k)$.
- InterCTC loss at block $k$: $\mathcal{L}_{\mathrm{inter}}^{(k)} = \mathrm{CTC}(Z_k, y)$ for target transcript $y$.
- Total training loss averages all $K$ intermediate CTC losses and combines them with the final CTC loss, weighted by $\lambda$: $\mathcal{L} = (1-\lambda)\,\mathcal{L}_{\mathrm{CTC}} + \frac{\lambda}{K}\sum_{k=1}^{K}\mathcal{L}_{\mathrm{inter}}^{(k)}$.
- This relaxes the conditional independence assumption of CTC and speeds convergence, especially in visually dominated (lipreading) regimes.
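A compact sketch of InterCTC with residual conditioning following the formulas above (the tapped-block schedule, the linear layers, and the loss reduction settings are illustrative, not the exact configuration):

```python
import torch
import torch.nn.functional as F

class InterCTCEncoder(torch.nn.Module):
    """Runs a stack of blocks, tapping intermediate CTC losses and feeding predictions back."""

    def __init__(self, blocks, d_model: int, vocab_size: int, tap_every: int = 3):
        super().__init__()
        self.blocks = torch.nn.ModuleList(blocks)
        self.tap_every = tap_every
        self.to_vocab = torch.nn.Linear(d_model, vocab_size)
        self.from_vocab = torch.nn.Linear(vocab_size, d_model)

    def forward(self, x, targets, in_lens, tgt_lens, blank: int = 0):
        inter_losses = []
        for i, block in enumerate(self.blocks, start=1):
            x = block(x)
            if i % self.tap_every == 0 and i < len(self.blocks):
                z = self.to_vocab(x).log_softmax(dim=-1)                         # (B, T, V)
                inter_losses.append(
                    F.ctc_loss(z.transpose(0, 1), targets, in_lens, tgt_lens, blank=blank)
                )
                x = x + self.from_vocab(z.exp())                                 # residual conditioning
        inter = torch.stack(inter_losses).mean() if inter_losses else x.new_zeros(())
        # Caller combines: total = (1 - lam) * final_ctc + lam * inter
        return x, inter
```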
Alternative Cross-Modal Fusion Approaches:
Later variants introduce Dual Conformer Interaction Modules (DCIMs), which generate cross-modal fusion via paired Conformer blocks and lightweight bottleneck adapters for efficient feature exchange. DCIM layers propagate information symmetrically, optionally using learned gating schemes to weight intra- and inter-modal signals (Wang et al., 31 Aug 2024).
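The exact DCIM design is specified in Wang et al. (31 Aug 2024); the following is only an illustrative sketch of a symmetric bottleneck-adapter exchange with learned gating, not the paper's module:

```python
import torch

class BottleneckExchange(torch.nn.Module):
    """Illustrative symmetric cross-modal exchange via bottleneck adapters with learned gates."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.a2v = torch.nn.Sequential(
            torch.nn.Linear(d_model, bottleneck), torch.nn.GELU(), torch.nn.Linear(bottleneck, d_model)
        )
        self.v2a = torch.nn.Sequential(
            torch.nn.Linear(d_model, bottleneck), torch.nn.GELU(), torch.nn.Linear(bottleneck, d_model)
        )
        self.gate_a = torch.nn.Parameter(torch.zeros(1))
        self.gate_v = torch.nn.Parameter(torch.zeros(1))

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # Each stream receives a gated, low-rank projection of the other stream.
        audio_out = audio + torch.sigmoid(self.gate_a) * self.v2a(video)
        video_out = video + torch.sigmoid(self.gate_v) * self.a2v(audio)
        return audio_out, video_out
```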
3. Training Procedure and Efficiency Optimizations
AVEC models are trained in several stages, with extensive regularization and data augmentation for both audio and video.
Pre-training:
- Visual branches: Pre-trained as visual-only models (e.g., on LRW for 30 epochs) using word-level cross-entropy.
- Audio and visual-specific encoders: Pre-trained on respective modalities.
- Joint audio-visual models: Trained on composite datasets (e.g., LRS2+LRS3).
- Batch size: Up to 256 aggregated via gradient accumulation.
- Optimizer: Adam (β₁=0.9, β₂=0.98), weight decay 1e-6; Noam learning rate schedule, 10k warmup steps, inverse-sqrt decay.
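A minimal sketch of this optimizer and schedule in PyTorch (scaling the peak rate by the model width is the usual Noam convention and is assumed here):

```python
import torch

def make_optimizer(model: torch.nn.Module, d_model: int = 360, warmup: int = 10000):
    """Adam with a Noam schedule: linear warm-up, then inverse-square-root decay."""
    opt = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), weight_decay=1e-6)

    def noam(step: int) -> float:
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=noam)  # multiplies the base lr of 1.0
    return opt, sched
```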
Data Augmentation:
- Audio: SpecAugment (2 frequency masks with F=27; 5 time masks).
- Video: Temporal masking (1 per second, max 0.4 s), random cropping (88×88), horizontal flip.
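An illustrative implementation of this recipe (torchaudio's masking transforms stand in for SpecAugment; the time-mask width parameter and mask placement are assumptions, and the 0.4 s cap assumes 25 fps video):

```python
import torch
import torchaudio

freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)  # width parameter assumed

def augment_audio(mel: torch.Tensor) -> torch.Tensor:
    # mel: (B, 80, T) log-mel features; 2 frequency masks and 5 time masks.
    for _ in range(2):
        mel = freq_mask(mel)
    for _ in range(5):
        mel = time_mask(mel)
    return mel

def augment_video(frames: torch.Tensor) -> torch.Tensor:
    # frames: (B, T, 96, 96) -> random 88x88 crop, random horizontal flip,
    # one temporal mask per second of up to 0.4 s (10 frames at 25 fps).
    top, left = [int(v) for v in torch.randint(0, 9, (2,))]
    frames = frames[..., top:top + 88, left:left + 88].clone()
    if torch.rand(()) < 0.5:
        frames = frames.flip(-1)
    t = frames.size(1)
    for sec_start in range(0, t, 25):
        mask_len = int(torch.randint(0, 11, (1,)))
        mask_start = sec_start + int(torch.randint(0, 25, (1,)))
        frames[:, mask_start:mask_start + mask_len] = 0.0
    return frames
```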
Training Length:
- Audio-only: 200 epochs
- Visual-only: 100 epochs
- Audio-visual: 70 epochs (4× quicker convergence due to fusion and auxiliary losses).
Regularization:
- Stochastic Weight Averaging (SWA) over final checkpoints.
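A minimal sketch of checkpoint averaging with PyTorch's built-in SWA utilities (assumes the saved checkpoints are raw state dicts; the averaging window is an assumption):

```python
import torch
from torch.optim.swa_utils import AveragedModel

def average_checkpoints(model: torch.nn.Module, checkpoint_paths):
    """Average the weights of the final checkpoints into a single SWA model."""
    swa_model = AveragedModel(model)
    for path in checkpoint_paths:
        model.load_state_dict(torch.load(path, map_location="cpu"))
        swa_model.update_parameters(model)  # running average over loaded checkpoints
    return swa_model
```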
Language Modeling:
- Optional: 6-gram KenLM, or Transformer LM (GPT-3 small, pre-trained/fine-tuned), used in beam search and neural rescoring.
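A hypothetical decoding path using pyctcdecode with a KenLM n-gram model (the character vocabulary, LM path, and alpha/beta weights are placeholders; the papers' exact decoding stack may differ):

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Blank token "" at index 0 must match the model's CTC head; vocabulary is a placeholder.
labels = [""] + list("abcdefghijklmnopqrstuvwxyz '")
decoder = build_ctcdecoder(labels, kenlm_model_path="lm/6gram.arpa", alpha=0.5, beta=1.0)

def decode(logits: np.ndarray) -> str:
    # logits: (T, V) per-frame outputs of the CTC head
    return decoder.decode(logits, beam_width=100)
```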
4. Experimental Setup and Empirical Results
Extensive empirical validation on LRS2 and LRS3 confirms the quantitative advantages of AVEC. Key results and metrics follow strict reporting conventions used in the underlying publications.
| Model / Setting | LRS2 WER (%) | LRS3 WER (%) | Params (M) | Epochs to Best |
|---|---|---|---|---|
| Audio-only | 2.8 / 2.4 | 2.1 / 2.0 | 61 | 200 |
| Visual-only | 32.6 / 29.8 | 39.2 / 37.5 | 61+ | 100 |
| AVEC (audio-visual) | 2.5 / 2.3 | 1.9 / 1.8 | 61 | 70 |
| DCIM-AVSR (53M params) | 2.04 | 1.68 | 53 | - |
| AV-Fast-Conformer (A+V) | - | 0.8 | 197 | - |
- "no LM / +neural LM": LRS2/LRS3 numbers denote WER without and with LLM rescoring (Burchi et al., 2023, Wang et al., 31 Aug 2024, Burchi et al., 14 Mar 2024).
- In noise (e.g., at –5 dB SNR), AVEC outperforms audio-only models by 15–50 WER percentage points; for babble noise, audio-only reaches 75.9% WER versus 33.5% for AVEC without babble augmentation and 11.2% with it (Burchi et al., 2023). A noise-mixing sketch follows this list.
- Inference efficiency: Audio-only inverse real time factor (Inv RTF) ≈ 51×, visual-only ≈ 5.3×, audio-visual ≈ 4.8× (Intel i7 CPU, batch=1) (Burchi et al., 2023).
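A sketch of SNR-controlled noise mixing, as referenced above for babble-noise evaluation and augmentation:

```python
import torch

def mix_at_snr(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Additively mix noise into speech so that the resulting SNR equals snr_db."""
    # speech, noise: (T,) waveforms at the same sample rate; noise is tiled/cropped to length.
    noise = noise.repeat(speech.numel() // noise.numel() + 1)[: speech.numel()]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean()
    # Scale so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: evaluate at -5 dB SNR
# noisy = mix_at_snr(clean_waveform, babble_waveform, snr_db=-5.0)
```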
Ablation experiments confirm the following:
- Visual back-end switch to Efficient Conformer: –2% absolute WER, –4B FLOPs.
- Patch attention (vs. grouped attention): identical WER, –0.5B FLOPs per 10 s of audio.
- Early fusion (concat+FF): 0.2–0.5% better WER than late/gated alternatives.
- InterCTC: –3.6% absolute WER for visual-only, –0.3% for audio-visual; reduces modality collapse.
5. Efficiency, Scaling, and Extensions
Parameter and FLOP Efficiency:
- DCIM-AVSR reduces parameter count to 53 M from 61 M (13.1% reduction), while further lowering WER by 6–7% and cutting fine-tuning GPU-hours by 40–50% (Wang et al., 31 Aug 2024).
- This is accomplished by trading two Conformer layers per branch for lightweight adapters in DCIM fusion, and by freezing most backbone parameters during cross-modal fine-tuning.
Scaling Law Notation:
- Let $P_F$ denote the conv/ResNet front-end parameters, $C$ the per-layer Conformer cost (feed-forward plus attention terms), $A$ the adapter size, and $N$ the number of Conformer layers per modality branch:
- Standard: $P_{\mathrm{branch}} \approx P_F + N\,C$
- DCIM: $P_{\mathrm{branch}} \approx P_F + (N-2)\,C + A$, with $A \ll C$, reflecting the replacement of two Conformer layers per branch by lightweight adapters.
Deployment and Extensibility:
- Latency: real-time or near real-time decoding (<30 ms per frame) is feasible on commodity and embedded GPUs; a measurement sketch follows this list.
- Selective fine-tuning of adapters enables sample-efficient adaptation to new domains/languages and supports streaming-mode variants.
- Fast front-ends (subsampling, patch attention, and depthwise convolution) offer further compute and memory savings, as in multilingual AVEC-like Fast-Conformer designs (Burchi et al., 14 Mar 2024).
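A hedged sketch for measuring inverse real-time factor (audio duration divided by wall-clock decoding time), matching how the Inv RTF figures in Section 4 are defined; `model` here is any recognizer callable on a batch of one:

```python
import time
import torch

@torch.inference_mode()
def inverse_rtf(model, waveform: torch.Tensor, sample_rate: int = 16000) -> float:
    """Return audio seconds processed per second of wall-clock time (higher is faster)."""
    audio_seconds = waveform.size(-1) / sample_rate
    start = time.perf_counter()
    model(waveform.unsqueeze(0))   # batch size 1, as in the reported CPU setup
    elapsed = time.perf_counter() - start
    return audio_seconds / elapsed
```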
6. Comparative Analysis and Relationships to Related Work
AVEC’s distinguishing properties relative to prior AVSR/ASR approaches:
- Grouped and Patch Attention: Patch attention reduces early-MHSA complexity vs grouped attention without degrading performance, facilitating wider and deeper encoders at constant FLOPs (Burchi et al., 2023).
- InterCTC/Intermediate Losses: Shared by recent Fast-Conformer architectures; significantly enhances lip-reading accuracy and prevents modality collapse (Burchi et al., 14 Mar 2024).
- Early Cross-Modal Fusion: Outperforms late/gated or cross-attentional schemes in AVSR, both for noise-robustness and convergence rate.
- Parameter Efficiency (DCIM): AVEC/AVSR models incorporating DCIM-alike adapters achieve both lower WER and resource usage compared to symmetric dual-encoder Conformer baselines (Wang et al., 31 Aug 2024).
- Hybrid Loss and Modality Dropout: In Fast-Conformer variants, a joint CTC/RNN-T loss and scheduled modality dropout enable robust unimodal inference and improved WER, particularly with noisy inputs or a missing modality (Burchi et al., 14 Mar 2024).
7. Datasets and Benchmarks
Benchmarking for AVEC architectures uses widely adopted audio-visual datasets:
| Dataset | Type | Size | Key Usage |
|---|---|---|---|
| LRW | Visual | 488k clips | VSR pre-train |
| LRS2 | Audio-Visual | 224 h | train/test/fusion |
| LRS3 | Audio-Visual | 438 h | train/test/fusion |
| MuAViC | Multilingual AV | 470 h | cross-lingual AVSR |
| VoxCeleb2+AVSpeech | Unlabeled AV | ≈3900 h | auto-labeling |
WER is evaluated following the standard protocols of each benchmark. Pre-trained models and configuration files for AVEC are publicly released to ensure reproducibility (Burchi et al., 2023). Fast-Conformer-based models further introduce large-scale automatic transcription of unlabeled data for low-resource languages, resulting in substantial average WER reductions on MuAViC (Burchi et al., 14 Mar 2024).
AVEC synthesizes architectural, algorithmic, and training innovations to produce AVSR systems that are both compute-efficient and empirically state-of-the-art for multilingual, noise-robust speech recognition (Burchi et al., 2023, Wang et al., 31 Aug 2024, Burchi et al., 14 Mar 2024).