Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM

Published 24 Sep 2024 in cs.SD, cs.AI, and eess.AS | (2409.15905v2)

Abstract: In this paper, we introduce a speech-conditioned LLM integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) in Automatic Speech Recognition (ASR). Specifically, we propose an Insertion and Deletion of Interruption Token (IDIT) mechanism to better transfer the text generation ability of the LLM to the speech recognition task. We also present a connector with an MoE architecture that manages multiple languages efficiently. To further enhance the collaboration of multiple experts and leverage the understanding capabilities of the LLM, we propose a two-stage progressive training strategy: 1) The connector is unfrozen and trained with language-specialized experts to map speech representations to the text space. 2) The connector and the LLM's LoRA adaptor are trained with the proposed IDIT mechanism, and all experts are activated to learn general representations. Experimental results demonstrate that our method significantly outperforms state-of-the-art models, including end-to-end and large-scale audio-LLMs.


Summary

  • The paper integrates a speech-conditioned LLM with a Mixture of Experts connector to dynamically select language-specialized models for improved code-switching recognition.
  • It introduces an IDIT mechanism to selectively insert or delete tokens, boosting transcript coherence and transfer learning from text to speech tasks.
  • Experimental results show significant accuracy gains over state-of-the-art ASR systems, underscoring the approach’s robustness in multilingual contexts.

The paper "Boosting Code-Switching ASR with Mixture of Experts Enhanced Speech-Conditioned LLM" presents an approach to enhancing Automatic Speech Recognition (ASR) systems in the face of Code-Switching (CS), the phenomenon where speakers alternate between languages within a single conversation, which poses a significant challenge for traditional ASR systems.

Key Contributions

  1. Speech-Conditioned LLM Integration:
    • The authors integrate an LLM conditioned on speech input. This integration facilitates more accurate transcriptions by leveraging the text generation capabilities of LLMs.
  2. Mixture of Experts (MoE) Connector:
    • A Mixture of Experts-based connector is employed to manage multiple languages efficiently. This modular approach allows the system to dynamically select language-specialized "experts" based on the input speech, thereby improving recognition accuracy in multilingual contexts.
  3. Insertion and Deletion of Interruption Token (IDIT) Mechanism:
    • The introduction of the IDIT mechanism aims to enhance the transfer learning capability of the LLM from text generation to speech recognition tasks. This method selectively adds or removes tokens to manage interruptions or switches in language, thereby maintaining the flow and coherence of the recognized text.
  4. Two-Stage Progressive Training Strategy:
    • The first stage involves training the unfrozen connector with language-specialized experts to map speech representations into a textual space. This step is crucial for leveraging specialized knowledge of distinct languages.
    • In the second stage, both the connector and LLM's LoRA adaptor are trained using the IDIT mechanism, and all experts are activated to learn generalized representations. This comprehensive training process ensures that the system can handle diverse linguistic inputs efficiently.
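The MoE connector described in item 2 can be sketched as a soft routing of speech-encoder frames through per-language expert projections into the LLM's embedding space. The class name, dimensions, and expert structure below are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MoEConnector(nn.Module):
    """Illustrative MoE connector: routes each speech frame through
    language-specialized experts and mixes their outputs into the
    LLM embedding space. Dimensions and names are assumptions."""

    def __init__(self, speech_dim=1024, llm_dim=4096, num_experts=2):
        super().__init__()
        # One expert per language (e.g. Mandarin, English)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(speech_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
            for _ in range(num_experts)
        ])
        # Router predicts per-frame mixing weights over experts
        self.router = nn.Linear(speech_dim, num_experts)

    def forward(self, speech_feats):  # speech_feats: (B, T, speech_dim)
        gates = torch.softmax(self.router(speech_feats), dim=-1)  # (B, T, E)
        expert_out = torch.stack(
            [e(speech_feats) for e in self.experts], dim=-2
        )  # (B, T, E, llm_dim)
        # Weighted sum of expert outputs per frame
        return (gates.unsqueeze(-1) * expert_out).sum(dim=-2)  # (B, T, llm_dim)
```

A soft, per-frame gate like this lets the connector blend experts at language-switch boundaries rather than committing to a single language per utterance.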
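The IDIT mechanism in item 3 can be approximated by inserting a special interruption token at language-switch boundaries in training transcripts and deleting it from decoded output. The token name `<sw>` and the CJK-boundary heuristic below are hypothetical stand-ins; the paper's actual token placement rule may differ:

```python
INTERRUPT = "<sw>"  # hypothetical interruption token

def is_cjk(tok):
    """Crude language check: does the token contain a CJK character?"""
    return any("\u4e00" <= ch <= "\u9fff" for ch in tok)

def insert_interruptions(tokens):
    """Insert an interruption token wherever the language switches
    (approximated here by a CJK/non-CJK boundary)."""
    out, prev = [], None
    for t in tokens:
        cur = is_cjk(t)
        if prev is not None and cur != prev:
            out.append(INTERRUPT)
        out.append(t)
        prev = cur
    return out

def delete_interruptions(tokens):
    """Strip interruption tokens to recover a clean transcript."""
    return [t for t in tokens if t != INTERRUPT]
```

Marking switch points explicitly gives the LLM a supervised signal for where interruptions occur, while deletion keeps the final transcript clean.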
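The two-stage schedule in item 4 amounts to controlling which parameter groups receive gradients at each stage. A minimal sketch, assuming the connector, frozen LLM, and LoRA parameters are passed in separately (the grouping is an assumption, not the paper's training code):

```python
import torch
import torch.nn as nn

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(connector, llm, lora_params, stage):
    """Stage 1: train only the connector (language-specialized experts).
    Stage 2: train the connector plus the LLM's LoRA adaptor,
    with all experts active. The base LLM stays frozen throughout."""
    set_trainable(llm, False)        # base LLM weights never updated
    set_trainable(connector, True)   # connector trains in both stages
    for p in lora_params:
        p.requires_grad = (stage == 2)
```

Keeping the base LLM frozen preserves its text generation ability while the connector learns the speech-to-text mapping; the LoRA adaptor is then unlocked only once the experts produce usable representations.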

Experimental Results

The proposed method significantly outperforms current state-of-the-art models, including both end-to-end ASR systems and large-scale audio-LLMs. The experimental evaluation demonstrates improved accuracy and robustness in recognizing code-switched speech, highlighting the effectiveness of the Mixture of Experts architecture and the IDIT mechanism.

Implications

This research underscores the potential of combining LLMs with specialized architectures like Mixture of Experts to address complex ASR challenges, particularly in multilingual and code-switching contexts. By bridging the gap between language modeling and speech recognition, the proposed system offers a promising direction for future advancements in ASR technologies.
