The paper "Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM" presents a novel approach to improving Automatic Speech Recognition (ASR) under Code-Switching (CS), the phenomenon in which speakers alternate between languages within a single conversation. CS poses a significant challenge for traditional ASR systems, which are typically trained on monolingual data.
Key Contributions
- Speech-Conditioned LLM Integration:
  - The authors integrate an LLM conditioned on speech input. This integration leverages the strong text generation capabilities of LLMs to produce more accurate transcriptions.
- Mixture of Experts (MoE) Connector:
  - A Mixture of Experts-based connector is employed to handle multiple languages efficiently. This modular approach lets the system dynamically route input speech to language-specialized "experts," improving recognition accuracy in multilingual contexts.
- Insertion and Deletion of Interruption Token (IDIT) Mechanism:
  - The IDIT mechanism is introduced to improve the LLM's transfer from text generation to speech recognition. It selectively inserts or deletes tokens to mark interruptions or language switches, maintaining the flow and coherence of the recognized text.
- Two-Stage Progressive Training Strategy:
  - The first stage trains the unfrozen connector with language-specialized experts to map speech representations into the textual space, leveraging specialized knowledge of each language.
  - In the second stage, the connector and the LLM's LoRA adaptor are trained with the IDIT mechanism, and all experts are activated to learn generalized representations. This progressive process equips the system to handle diverse linguistic inputs efficiently.
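The routing idea behind the MoE connector can be illustrated with a minimal NumPy sketch. The dimensions, the frame-level gating, and the two-expert setup are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEConnector:
    """Toy MoE connector: each expert projects speech frames into the
    LLM's textual embedding space; a gate mixes expert outputs per frame."""
    def __init__(self, d_speech, d_text, n_experts):
        self.experts = [rng.standard_normal((d_speech, d_text)) * 0.02
                        for _ in range(n_experts)]
        self.gate = rng.standard_normal((d_speech, n_experts)) * 0.02

    def __call__(self, frames):  # frames: (T, d_speech)
        weights = softmax(frames @ self.gate)                # (T, n_experts)
        outs = np.stack([frames @ W for W in self.experts],
                        axis=1)                              # (T, E, d_text)
        return (weights[..., None] * outs).sum(axis=1)       # (T, d_text)

conn = MoEConnector(d_speech=80, d_text=64, n_experts=2)
emb = conn(rng.standard_normal((10, 80)))
```

In a code-switching setting, the gate would ideally learn to weight the Mandarin-specialized expert on Mandarin frames and the English-specialized expert on English frames.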
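The insertion/deletion idea behind IDIT can be sketched conceptually as follows. The token name, the per-token language heuristic, and the boundary rule are all hypothetical placeholders for whatever the paper actually uses; the sketch only shows the general notion of marking language switches with a special token that can later be stripped:

```python
IDIT = "<idit>"  # hypothetical interruption-token name

def lang_of(token):
    # Crude illustrative heuristic: ASCII-alphabetic tokens -> English,
    # everything else (e.g. CJK characters) -> Mandarin.
    return "en" if token.isascii() and token.isalpha() else "zh"

def insert_interruption_tokens(tokens):
    """Insert an interruption token wherever the language switches."""
    out, prev = [], None
    for tok in tokens:
        cur = lang_of(tok)
        if prev is not None and cur != prev:
            out.append(IDIT)
        out.append(tok)
        prev = cur
    return out

def delete_interruption_tokens(tokens):
    """Strip interruption tokens to recover the plain transcription."""
    return [t for t in tokens if t != IDIT]
```

Marking switch points explicitly gives the LLM a textual cue for the code-switch boundary, while deletion guarantees the final transcription is unchanged.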
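The two-stage strategy above amounts to changing which parameter groups are unfrozen per stage. A minimal sketch, with module names that are purely illustrative:

```python
def trainable_params(stage, modules):
    """Return the parameter groups updated in each training stage.

    Stage 1: only the MoE connector (language-specialized experts) is
    unfrozen; the speech encoder and LLM stay fixed.
    Stage 2: the connector plus the LLM's LoRA adaptor are trained,
    with all experts active.
    """
    if stage == 1:
        return [m for m in modules if m.startswith("connector")]
    if stage == 2:
        return [m for m in modules
                if m.startswith("connector") or m.startswith("lora")]
    raise ValueError("stage must be 1 or 2")

# Hypothetical parameter groups of the full system.
modules = ["encoder", "connector.gate", "connector.expert_en",
           "connector.expert_zh", "llm.base", "lora.q_proj"]
```

Note that the LLM's base weights are never updated directly; only the low-rank LoRA adaptor is trained, which keeps the LLM's text-generation ability intact.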
Experimental Results
The proposed method significantly outperforms current state-of-the-art models, including both end-to-end ASR systems and large-scale audio-LLMs. The experimental evaluation demonstrates improved accuracy and robustness in recognizing code-switched speech, highlighting the effectiveness of the Mixture of Experts architecture and the IDIT mechanism.
Implications
This research underscores the potential of combining LLMs with specialized architectures like Mixture of Experts to address complex ASR challenges, particularly in multilingual and code-switching contexts. By bridging the gap between language modeling and speech recognition, the proposed system offers a promising direction for future advancements in ASR technologies.