The paper "Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM" presents a novel approach to improving Automatic Speech Recognition (ASR) under Code-Switching (CS), the phenomenon in which speakers alternate between languages within a single conversation. CS poses a significant challenge for traditional ASR systems, which are typically trained on monolingual data.
Key Contributions
- Speech-Conditioned LLM Integration:
  - The authors integrate an LLM conditioned on speech input. This integration leverages the strong text generation capabilities of LLMs to produce more accurate transcriptions.
- Mixture of Experts (MoE) Connector:
  - A Mixture of Experts-based connector is employed to handle multiple languages efficiently. This modular approach lets the system dynamically route input speech to language-specialized "experts," improving recognition accuracy in multilingual contexts.
- Insertion and Deletion of Interruption Token (IDIT) Mechanism:
  - The IDIT mechanism is introduced to improve the LLM's transfer from text generation to speech recognition. It selectively inserts or deletes tokens to mark interruptions or language switches, maintaining the flow and coherence of the recognized text.
- Two-Stage Progressive Training Strategy:
  - The first stage trains the unfrozen connector with language-specialized experts to map speech representations into the textual space, leveraging specialized knowledge of each language.
  - In the second stage, the connector and the LLM's LoRA adaptor are trained with the IDIT mechanism, and all experts are activated to learn generalized representations. This progressive process equips the system to handle diverse linguistic inputs efficiently.
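The routing idea behind the MoE connector can be illustrated with a minimal NumPy sketch. The dimensions, the frame-level gating, and the two-expert setup are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoEConnector:
    """Toy MoE connector: each expert projects speech frames into the
    LLM's textual embedding space; a gate mixes expert outputs per frame."""
    def __init__(self, d_speech, d_text, n_experts):
        self.experts = [rng.standard_normal((d_speech, d_text)) * 0.02
                        for _ in range(n_experts)]
        self.gate = rng.standard_normal((d_speech, n_experts)) * 0.02

    def __call__(self, frames):  # frames: (T, d_speech)
        weights = softmax(frames @ self.gate)                # (T, n_experts)
        outs = np.stack([frames @ W for W in self.experts],
                        axis=1)                              # (T, E, d_text)
        return (weights[..., None] * outs).sum(axis=1)       # (T, d_text)

conn = MoEConnector(d_speech=80, d_text=64, n_experts=2)
emb = conn(rng.standard_normal((10, 80)))
```

In a code-switching setting, the gate would ideally learn to weight the Mandarin-specialized expert on Mandarin frames and the English-specialized expert on English frames.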
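The insertion/deletion idea behind IDIT can be sketched conceptually as follows. The token name, the per-token language heuristic, and the boundary rule are all hypothetical placeholders for whatever the paper actually uses; the sketch only shows the general notion of marking language switches with a special token that can later be stripped:

```python
IDIT = "<idit>"  # hypothetical interruption-token name

def lang_of(token):
    # Crude illustrative heuristic: ASCII-alphabetic tokens -> English,
    # everything else (e.g. CJK characters) -> Mandarin.
    return "en" if token.isascii() and token.isalpha() else "zh"

def insert_interruption_tokens(tokens):
    """Insert an interruption token wherever the language switches."""
    out, prev = [], None
    for tok in tokens:
        cur = lang_of(tok)
        if prev is not None and cur != prev:
            out.append(IDIT)
        out.append(tok)
        prev = cur
    return out

def delete_interruption_tokens(tokens):
    """Strip interruption tokens to recover the plain transcription."""
    return [t for t in tokens if t != IDIT]
```

Marking switch points explicitly gives the LLM a textual cue for the code-switch boundary, while deletion guarantees the final transcription is unchanged.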
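The two-stage strategy above amounts to changing which parameter groups are unfrozen per stage. A minimal sketch, with module names that are purely illustrative:

```python
def trainable_params(stage, modules):
    """Return the parameter groups updated in each training stage.

    Stage 1: only the MoE connector (language-specialized experts) is
    unfrozen; the speech encoder and LLM stay fixed.
    Stage 2: the connector plus the LLM's LoRA adaptor are trained,
    with all experts active.
    """
    if stage == 1:
        return [m for m in modules if m.startswith("connector")]
    if stage == 2:
        return [m for m in modules
                if m.startswith("connector") or m.startswith("lora")]
    raise ValueError("stage must be 1 or 2")

# Hypothetical parameter groups of the full system.
modules = ["encoder", "connector.gate", "connector.expert_en",
           "connector.expert_zh", "llm.base", "lora.q_proj"]
```

Note that the LLM's base weights are never updated directly; only the low-rank LoRA adaptor is trained, which keeps the LLM's text-generation ability intact.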
Experimental Results
The proposed method significantly outperforms current state-of-the-art models, including both end-to-end ASR systems and large-scale audio-LLMs. The experimental evaluation demonstrates improved accuracy and robustness in recognizing code-switched speech, highlighting the effectiveness of the Mixture of Experts architecture and the IDIT mechanism.
Implications
This research underscores the potential of combining LLMs with specialized architectures like Mixture of Experts to address complex ASR challenges, particularly in multilingual and code-switching contexts. By bridging the gap between language modeling and speech recognition, the proposed system offers a promising direction for future advancements in ASR technologies.