Scaling and Enhancing LLM-based AVSR: A Sparse Mixture of Projectors Approach (2505.14336v2)
Abstract: Audio-Visual Speech Recognition (AVSR) enhances robustness in noisy environments by integrating visual cues. While recent work integrates LLMs into AVSR, their high computational cost hinders deployment in resource-constrained settings. To address this, we propose Llama-SMoP, an efficient Multimodal LLM that employs a Sparse Mixture of Projectors (SMoP) module to scale model capacity without increasing inference costs. By incorporating sparsely-gated mixture-of-experts (MoE) projectors, Llama-SMoP enables the use of smaller LLMs while maintaining strong performance. We explore three SMoP configurations and show that Llama-SMoP DEDR (Disjoint-Experts, Disjoint-Routers), which uses modality-specific routers and experts, achieves superior performance on ASR, VSR, and AVSR tasks. Ablation studies confirm its effectiveness in expert activation, scalability, and noise robustness.
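To make the DEDR idea concrete, here is a minimal PyTorch sketch of a sparse mixture of projectors with disjoint routers and disjoint expert pools per modality, as the abstract describes. The expert count, top-k value, two-layer MLP experts, and all class/parameter names are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEProjector(nn.Module):
    """Sparsely-gated MoE projector: a router scores expert MLPs per token,
    the top-k experts run, and their outputs are mixed by the gate weights.
    (num_experts=4 and top_k=2 are assumed defaults, not from the paper.)"""
    def __init__(self, in_dim: int, llm_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(in_dim, num_experts)  # per-token gating logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, llm_dim), nn.GELU(),
                          nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, in_dim)
        gates = F.softmax(self.router(x), dim=-1)          # (B, T, num_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)      # keep k experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize kept gates
        out = torch.zeros(*x.shape[:-1], self.experts[0][-1].out_features,
                          device=x.device, dtype=x.dtype)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[..., k] == e                    # tokens routed to expert e
                if mask.any():                             # only routed tokens run
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

class LlamaSMoPDEDR(nn.Module):
    """DEDR configuration: each modality gets its own router *and* its own
    expert pool, so audio and video tokens never share projector parameters."""
    def __init__(self, audio_dim: int, video_dim: int, llm_dim: int):
        super().__init__()
        self.audio_proj = SparseMoEProjector(audio_dim, llm_dim)  # audio router + experts
        self.video_proj = SparseMoEProjector(video_dim, llm_dim)  # video router + experts

    def forward(self, audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # Project each modality with its own MoE, then concatenate the token
        # sequences along time to form the LLM's multimodal input prefix.
        return torch.cat([self.audio_proj(audio_feats),
                          self.video_proj(video_feats)], dim=1)
```

Because only the top-k experts execute per token, capacity grows with the number of experts while per-token inference cost stays roughly constant, which is the efficiency argument the abstract makes.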
- Umberto Cappellazzo
- Minsu Kim
- Stavros Petridis
- Daniele Falavigna
- Alessio Brutti