Omni-AVSR: Unified Multimodal Speech Recognition
- Omni-AVSR is a unified multimodal speech recognition system that integrates audio, visual, and audio-visual cues for robust performance.
- The framework employs multi-granularity token compression and parameter-efficient LoRA adapters to balance accuracy and computational cost.
- Fusion strategies such as early cross-modal and late decision-level fusion enable resilient recognition even in noisy or occluded environments.
Omni-AVSR refers to a class of unified multimodal speech recognition systems that integrate auditory, visual, and audio-visual cues within a single, adaptable architecture. The main objectives of Omni-AVSR frameworks are: (i) achieving robust recognition under adverse noise, occlusion, and overlap conditions; (ii) supporting audio-only, video-only, and joint audio-visual recognition within a shared parameterization; (iii) enabling elastic inference where computational cost and accuracy can be traded off dynamically; and (iv) delivering state-of-the-art performance with high efficiency in both training and deployment. Recent advances leverage LLMs, multi-granularity "matryoshka" token compression, parameter-efficient fine-tuning via low-rank adapters, and multi-stream fusion to fulfill these goals.
1. Unified Model Architectures for Multimodal Speech Recognition
Omni-AVSR architectures instantiate a unified recognition backbone, typically based on large pre-trained autoregressive transformers (such as LLaMA or Qwen2.5), into which both audio and visual inputs are injected after appropriate preprocessing and embedding. The architecture employs:
- Audio and Video Feature Extraction: Pre-trained encoders (e.g., Whisper-medium for audio, AV-HuBERT Large for video) transform raw input into frame-level feature embeddings.
- Multi-Granularity Compression: Both audio and video features are compressed at multiple predefined rates (e.g., audio at stride 4 or 16; video at stride 2 or 5) via average pooling, supporting flexible trade-offs between accuracy and speed at inference.
- Token Concatenation and Prompting: Task-specific prompts indicating ASR, VSR, or AVSR mode are concatenated with the compressed tokens prior to the LLM.
- Parameter-Efficient Adaptation: A frozen LLM backbone is augmented with LoRA (Low-Rank Adaptation) modules of three types: shared (Omni-LoRA-S), task-specific (Omni-LoRA-T), or both (Omni-LoRA-ST), which are injected into the query and value projections of the attention layers.
A single model is jointly trained across all modes, with LoRA adapters activated only for the relevant task during inference, thus minimizing overhead and simplifying deployment (Cappellazzo et al., 10 Nov 2025).
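A minimal PyTorch-style sketch of this pipeline is shown below. It is an illustration under stated assumptions, not the released implementation: the encoder modules, LLM interface, prompt handling, and adapter-selection call are hypothetical placeholders that only mirror the structure described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAVSR(nn.Module):
    """Sketch: frozen encoders + rate-selectable pooling + frozen LLM with per-task prompts/adapters."""

    def __init__(self, audio_encoder, video_encoder, llm, d_audio, d_video, d_llm):
        super().__init__()
        self.audio_encoder, self.video_encoder, self.llm = audio_encoder, video_encoder, llm
        self.proj_a = nn.Linear(d_audio, d_llm)   # map pooled audio features into the LLM embedding space
        self.proj_v = nn.Linear(d_video, d_llm)
        # one learnable prompt per task; real systems may instead use text prompts such as "Transcribe ..."
        self.prompts = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(1, 4, d_llm)) for t in ("asr", "vsr", "avsr")})

    @staticmethod
    def compress(x: torch.Tensor, rate: int) -> torch.Tensor:
        # average pooling over non-overlapping windows: (B, T, D) -> (B, ceil(T/rate), D)
        return F.avg_pool1d(x.transpose(1, 2), kernel_size=rate, ceil_mode=True).transpose(1, 2)

    def forward(self, task: str, audio=None, video=None, rates=(4, 2)):
        parts = []
        if task in ("asr", "avsr"):
            parts.append(self.proj_a(self.compress(self.audio_encoder(audio), rates[0])))
        if task in ("vsr", "avsr"):
            parts.append(self.proj_v(self.compress(self.video_encoder(video), rates[1])))
        batch = parts[0].size(0)
        inputs = torch.cat([self.prompts[task].expand(batch, -1, -1)] + parts, dim=1)
        # the LLM backbone is frozen; only the LoRA adapters associated with `task` are active (Section 3)
        return self.llm(task=task, inputs_embeds=inputs)
```

At inference the same weights serve all three tasks; only the prompt, the active adapters, and the chosen rate pair change.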
2. Multi-Granularity Representation and Elastic Inference with the Matryoshka Paradigm
Omni-AVSR exploits the matryoshka paradigm for multi-granularity representation, wherein both modalities are represented at several temporal compressions, and any combination can be chosen at runtime. The main procedure involves:
- Compression Function: Given frame-level encodings $\mathbf{X}_a$ (audio) and $\mathbf{X}_v$ (video), compressed representations for rates $r_a$ and $r_v$ (e.g., $r_a \in \{4, 16\}$, $r_v \in \{2, 5\}$) are computed as $\mathbf{X}_a^{(r_a)} = \mathrm{AvgPool}(\mathbf{X}_a;\, r_a)$ and $\mathbf{X}_v^{(r_v)} = \mathrm{AvgPool}(\mathbf{X}_v;\, r_v)$, where $\mathrm{AvgPool}(\cdot\,;\, r)$ denotes average pooling over non-overlapping windows of length $r$, reducing the token count by a factor of roughly $r$.
- Elastic Trade-off: Inference-time compute scales roughly quadratically with the input token count in the LLM's attention layers. For example, the largest reduction, moving from the finest rate pair (4,2) to the coarsest (16,5), shrinks the token sequence several-fold and yields a correspondingly large reduction in attention FLOPs and memory, at the cost of moderate WER degradation.
- Single-Model, Multi-Rate Training: Instead of exhaustively enumerating all rate combinations during joint training (which would require a separate forward pass for every audio-video rate pair), Omni-AVSR samples one rate pair per task per batch, substantially reducing training compute (Cappellazzo et al., 10 Nov 2025); a brief sketch of this procedure is given below.
This mechanism supports real-time, resource-constrained applications, and permits system operators to dial accuracy vs. speed without retraining or storing multiple models.
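The sketch below illustrates the elastic attention-cost trade-off and the sampled multi-rate training schedule; the frame rates, the uniform sampling policy, and the model/loss interface are assumptions made for the sake of the example rather than the paper's exact recipe.

```python
import random

AUDIO_RATES, VIDEO_RATES = (4, 16), (2, 5)    # example matryoshka rates from the text

def token_count(t_audio: int, t_video: int, rates: tuple[int, int]) -> int:
    """Number of multimodal tokens fed to the LLM after average-pooling at the given rates."""
    return t_audio // rates[0] + t_video // rates[1]

def relative_attention_cost(t_audio: int, t_video: int,
                            fine=(4, 2), coarse=(16, 5)) -> float:
    """Self-attention cost grows ~quadratically with sequence length, so the saving from
    coarser rates is roughly (fine_tokens / coarse_tokens) ** 2."""
    return (token_count(t_audio, t_video, fine) / token_count(t_audio, t_video, coarse)) ** 2

def training_step(model, batch, optimizer):
    """Sampled multi-rate training: draw one (audio, video) rate pair per task per batch
    instead of enumerating every combination."""
    for task in ("asr", "vsr", "avsr"):
        rates = (random.choice(AUDIO_RATES), random.choice(VIDEO_RATES))
        loss = model(task=task, rates=rates, **batch[task])   # hypothetical loss-returning forward
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# e.g., 10 s of speech with assumed 25 Hz audio features and 25 fps video:
print(relative_attention_cost(t_audio=250, t_video=250))   # ~8x cheaper attention at (16,5) vs (4,2)
```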
3. Fusion Strategies and Adaptation Mechanisms
Recent Omni-AVSR frameworks explore both token-level early fusion and late decision-level fusion:
- Early Fusion: Architectures such as AVATAR perform early cross-modal fusion via a Multimodal Bottleneck Transformer (MBT), introducing learnable tokens that aggregate and redistribute joint modality context within the encoder stack. This facilitates strong interaction between spectrogram and visual tokens, especially in unconstrained settings where the speaker may be off-camera (Gabeur et al., 2022).
- Late Fusion: Hybrid approaches train per-modality models and align their state posteriors via a recurrent Decision Fusion Net (DFN), using reliability indicators (entropy, dispersion, SNR, face detector confidence, etc.) as side inputs. The DFN is a deep BLSTM network that outputs fused posteriors for a WFST decoder and demonstrates superior noise robustness, even surpassing dynamic oracle stream weighting (Yu et al., 2021); a minimal sketch of this scheme is given at the end of this section.
- Parameter-Efficient LoRA Adaptation: LoRA modules inserted into LLM attention projections allow the model to retain core language understanding while specializing in each modality or combination mode. Three sharing strategies (S, T, ST) enable flexibility between parameter efficiency and task specialization.
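The three sharing strategies can be pictured as follows; this is a hedged sketch of one way to wire shared and task-specific low-rank updates into a frozen projection (the class name, rank, and scaling are illustrative assumptions, not the released code).

```python
import torch
import torch.nn as nn

class OmniLoRALinear(nn.Module):
    """Frozen base projection plus low-rank updates (sketch).

    mode "S" : one adapter shared across ASR/VSR/AVSR    (Omni-LoRA-S)
    mode "T" : one adapter per task                       (Omni-LoRA-T)
    mode "ST": shared and task-specific adapters, summed  (Omni-LoRA-ST)
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0, mode: str = "ST"):
        super().__init__()
        self.base, self.mode, self.scale = base, mode, alpha / rank
        for p in self.base.parameters():          # the backbone projection stays frozen
            p.requires_grad_(False)
        d_out, d_in = base.weight.shape

        def adapter() -> nn.ParameterDict:
            return nn.ParameterDict({
                "A": nn.Parameter(torch.randn(rank, d_in) * 0.01),
                "B": nn.Parameter(torch.zeros(d_out, rank)),  # zero-init: training starts from the base model
            })

        if mode in ("S", "ST"):
            self.shared = adapter()
        if mode in ("T", "ST"):
            self.per_task = nn.ModuleDict({t: adapter() for t in ("asr", "vsr", "avsr")})

    def forward(self, x: torch.Tensor, task: str) -> torch.Tensor:
        out = self.base(x)
        if self.mode in ("S", "ST"):
            out = out + self.scale * (x @ self.shared["A"].T) @ self.shared["B"].T
        if self.mode in ("T", "ST"):
            ad = self.per_task[task]
            out = out + self.scale * (x @ ad["A"].T) @ ad["B"].T
        return out
```

Wrapping only the query and value projections of each attention block with such a module keeps the trainable delta small, while the "ST" variant shares cross-task structure yet retains per-task specialization.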
The choice of fusion level depends on the desired interpretability, training infrastructure, and real-time constraints of the target application.
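To make the late-fusion route concrete, here is a minimal sketch in the spirit of the DFN of Yu et al. (2021): a BLSTM consumes per-stream state posteriors together with frame-wise reliability indicators and emits fused posteriors for the downstream decoder. Layer sizes and the exact reliability features are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DecisionFusionNet(nn.Module):
    """BLSTM decision fusion: per-stream posteriors + reliability features -> fused posteriors."""

    def __init__(self, n_states: int, n_streams: int = 2, n_reliability: int = 4,
                 hidden: int = 512, layers: int = 3):
        super().__init__()
        in_dim = n_streams * n_states + n_reliability   # e.g. audio + video posteriors, plus SNR, entropy, ...
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_states)

    def forward(self, stream_posteriors, reliability):
        """stream_posteriors: list of (B, T, n_states) tensors; reliability: (B, T, n_reliability)."""
        x = torch.cat(stream_posteriors + [reliability], dim=-1)
        h, _ = self.blstm(x)
        return torch.log_softmax(self.out(h), dim=-1)   # fused log-posteriors for the WFST decoder
```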
4. Performance, Robustness, and Scaling Laws
Empirical results establish Omni-AVSR as state-of-the-art in both efficiency and accuracy:
- LRS2/LRS3 Performance: Omni-AVSR-ST achieves WER of 1.2% (ASR), 26.8% (VSR), and 1.0% (AVSR) on LRS3 under the finest rates (4,2), notably leading all prior unified systems despite a 3–6× reduction in trainable parameter count (Cappellazzo et al., 10 Nov 2025).
- Noise Robustness: Under babble noise (NOISEX-92) at SNRs from 5 down to –5 dB, WERs rise gracefully from 2.5% to 18.0%; these results consistently surpass prior LLM-based or hybrid baselines.
- Comparison of Fusion Methods: BLSTM-DFN achieves a mean WER of 16.09% vs. 27.83% (audio-only) on LRS2 under heavy noise, representing a 42.2% relative improvement (Yu et al., 2021).
- Overlapped Speech: Multi-channel audio-visual models employing mask-based MVDR or filter-and-sum beamforming (with visual attention) and joint CTC/Si-SNR fine-tuning reduce WER from 25.38% (audio-only) to 16.1% (AV) under simulated overlapped conditions (Yu et al., 2020).
- Scaling Behavior: WER decreases roughly linearly with log(parameter count) for LLM backbones in the 1B–8B range but with diminishing returns versus inference cost. Models in the 1–3B range offer the best practical trade-off.
A plausible implication is that parameter-efficient unified architectures can now rival or outperform much larger specialist baselines across all input conditions and modes.
5. Extensions: Multi-View, Task Generalization, and Future Directions
Omni-AVSR’s design is directly extensible to further modalities and use cases:
- Multi-View/Multi-Camera Inputs: Integration of multiple synchronized video encoders (one per camera or region, such as face/hands/objects) and hierarchical fusion enables the system to attend dynamically to reliable views in the presence of occlusion or viewpoint variation (Gabeur et al., 2022); a toy sketch of such view weighting appears at the end of this section.
- Semantic Priors: Branches using pretrained vision–language models (e.g., CLIP, VideoBERT) inject global semantic context via cross-attention to bottleneck tokens, boosting robustness to non-standard linguistic or visual scenes.
- Adaptive Denoising: Joint training with an audio enhancement front-end (learned denoiser) under high noise, using skip connections and curriculum masking, improves recognition in extreme environments.
- Efficient Deployment: Distillation to mixture-of-experts or dynamic routing architectures, quantization, and structured pruning are explicitly proposed for lightweight real-time edge deployments (Gabeur et al., 2022).
- Open-Set and Domain Adaptation: Pretraining on massive, weakly labeled multi-modal data (e.g., HowTo100M, YouCook2, VisSpeech), with aggressive noise augmentation and domain adaptation, improves generalization to new environments, noise types, and speaker populations.
These principles facilitate expansion to multilingual, generative, or open-domain multimodal recognition, and indicate that Omni-AVSR systems are adaptable to emerging multimodal datasets and hardware.
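As a purely illustrative sketch of the multi-view direction (a hypothetical construction, not a published component), per-view embeddings could be pooled with a learned, frame-wise attention weighting so that occluded or unreliable views are down-weighted:

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """Attention-weighted pooling over per-view (camera/region) video embeddings."""

    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)   # scalar reliability score per view and frame

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        """views: (B, V, T, D) -- V synchronized, time-aligned view encodings."""
        w = torch.softmax(self.score(views), dim=1)   # normalize over the view axis
        return (w * views).sum(dim=1)                 # (B, T, D) fused video stream
```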
6. Comparison to Prior AVSR and Real-Time Systems
In contrast to prior AVSR and ASR paradigms:
- Single-Modality Specialists: Audio- or video-only systems show rapid performance degradation under noise, occlusion, or channel corruption; in large-vocabulary settings, audio-only baselines far surpass video-only ones, and robust fusion is what closes this gap (Yu et al., 2021).
- Real-Time Processing: Streamable architectures such as AV Taris employ fixed windowed cross-modal attention and soft-count gating to realize low-latency online decoding while integrating visual context and optimizing a jointly differentiable loss without forced alignments (Sterpu et al., 2020).
- Classical Fusion: Traditional dynamic weighting or oracle-based late fusion is consistently outperformed by deep recurrent fusion networks that learn non-linear, temporally global weighting schemes from reliability features.
Omni-AVSR synthesizes and generalizes these advances within a parameter- and compute-efficient framework supporting multi-modal and multi-task flexibility.
7. Insights, Limitations, and Avenues for Further Research
Key findings across the literature are:
- Cross-Task Synergy: Unified modeling of ASR, VSR, and AVSR yields substantial VSR gains (1–3% absolute WER reduction), indicating that the LLM learns to transfer auditory priors into visual token processing (Cappellazzo et al., 10 Nov 2025).
- Graceful Degradation: Performance degrades smoothly as token compression increases, enabling principled accuracy-latency scaling.
- Parameter-Efficiency and Maintenance: LoRA-based adaptation allows a <60M trainable parameter delta atop a frozen 1B–3B LLM; this drastically reduces model management overhead.
- Current Limitations: There remains a pronounced accuracy gap between the ASR and VSR modes, and robustness under challenging noise or occlusion is still limited; further improvements require better modeling of long-range visual context, finer-grained adaptive masking, and open-set generalization. A plausible implication is that future systems will need to incorporate dynamic policy networks for rate selection and online adaptation to non-stationary conditions.
Future research directions include multi-lingual training, multitask sequence generation (e.g., joint transcript and scene/action captioning), mixture-of-experts fusion for variable token streams, and integration of emerging audio-visual benchmarks with highly diverse environmental conditions.