Adapting the adapters for code-switching in multilingual ASR

Published 11 Oct 2023 in cs.CL, cs.SD, and eess.AS | (2310.07423v1)

Abstract: Recently, large pre-trained multilingual speech models have shown potential in scaling Automatic Speech Recognition (ASR) to many low-resource languages. Some of these models employ language adapters in their formulation, which helps to improve monolingual performance and avoids some of the drawbacks of multi-lingual modeling on resource-rich languages. However, this formulation restricts the usability of these models on code-switched speech, where two languages are mixed together in the same utterance. In this work, we propose ways to effectively fine-tune such models on code-switched speech, by assimilating information from both language adapters at each language adaptation point in the network. We also model code-switching as a sequence of latent binary sequences that can be used to guide the flow of information from each language adapter at the frame level. The proposed approaches are evaluated on three code-switched datasets encompassing Arabic, Mandarin, and Hindi languages paired with English, showing consistent improvements in code-switching performance with at least 10% absolute reduction in CER across all test sets.

Summary

The paper introduces novel adapter configurations, namely PACS and TCS, to optimize multilingual ASR performance on code-switched speech.
It builds on the MMS model, which uses the Wav2Vec 2.0 framework, and adds frame-level binary switching to modulate language adapter outputs.
Experiments on Mandarin-English, Arabic-English, and Hindi-English datasets show reductions in Character Error Rate (CER) of at least 10% absolute on all test sets, along with lower Mixed Error Rates (MER).

Adapting the Adapters for Code-Switching in Multilingual ASR

The paper "Adapting the adapters for code-switching in multilingual ASR" (2310.07423) presents methods for fine-tuning large pre-trained multilingual speech models on code-switched speech. The goal is to address the limitations inherent in the existing formulations of language adapters, particularly when dealing with code-switching scenarios where two languages are mixed within the same utterance.

Introduction and Motivation

Multilingual ASR systems that incorporate language adapters improve performance by capturing language-specific features while sharing the remaining parameters for cross-lingual transfer. However, because each adapter targets a single language, these systems degrade on code-switched (CS) speech, where the model must recognize two languages mixed within one utterance. The paper proposes ways to fine-tune such models on code-switched speech by combining information from both language adapters. Specifically, it models code-switching as latent binary sequences that guide the flow of information from each language adapter at the frame level.

Proposed Approaches

The paper introduces two methods for adapting the language adapters in the MMS ASR architecture, which is built on the Wav2Vec 2.0 framework:

Post Adapter Code Switching (PACS)

PACS integrates outputs from two language adapters corresponding to the matrix and embedded languages, merging them at each language adaptation point in the network.

Architecture: As depicted in Figure 1a, the outputs of the two language adapters are concatenated and passed through a PACS network layer, which modulates the hidden outputs so that information from both languages can flow through the network.
Training: The PACS networks are fine-tuned on a small amount of code-switched data while all other model parameters are frozen.
Implementation: The output (LM) heads of the two languages are concatenated and fine-tuned, so that CS recognition builds on the existing language-specific knowledge.
Figure 1: Framework of proposed approaches with MMS, a) Post Adapter Switching Approach b) Transformer Code Switching. Transformer blocks range from 1 to 48 and the grey color indicates frozen model parameters during training.
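
The merge step described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: it assumes the PACS layer is a single linear projection from the concatenated 2D-dimensional vector back to D dimensions, and the names `pacs_merge`, `weight`, and `bias` are hypothetical. Plain Python lists stand in for tensors to keep the example self-contained.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single frame vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def pacs_merge(matrix_out, embedded_out, weight, bias):
    """Merge two adapter outputs frame by frame.

    matrix_out, embedded_out: T frames, each a list of D floats
        (outputs of the matrix- and embedded-language adapters).
    weight: D x 2D projection (assumed form of the PACS layer).
    bias: D floats.
    """
    merged = []
    for h_m, h_e in zip(matrix_out, embedded_out):
        concat = h_m + h_e  # concatenate along the feature axis -> 2D values
        merged.append(linear(concat, weight, bias))
    return merged


# Toy example: 3 frames, hidden size 4.
T, D = 3, 4
rng = random.Random(0)
matrix_out = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
embedded_out = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]

# Initialise W = [0.5*I | 0.5*I] so the untrained layer averages both adapters.
weight = [[0.5 if j % D == i else 0.0 for j in range(2 * D)] for i in range(D)]
bias = [0.0] * D

merged = pacs_merge(matrix_out, embedded_out, weight, bias)
```

In a setup like this only `weight` and `bias` (plus the concatenated output heads) would receive gradients, which mirrors the frozen-backbone training described above.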

Transformer Code Switching (TCS)

TCS introduces a transformer network to estimate binary sequences demarcating CS points within utterances.

Architecture: TCS predicts frame-level CS points, allowing the network to switch adapter output paths dynamically based on estimated binary codes.
Activation: The transformer block applies a sigmoid activation, and its outputs are turned into mask sequences that are used to blend the language-specific adapter outputs.
Output Regulation: The binary code sequence is multiplied with the adapter outputs, so that at each frame it determines how the two adapters are combined.
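
A minimal sketch of this frame-level switching is given below. It makes two assumptions that the summary does not spell out: the sigmoid output is binarised with a 0.5 threshold, and the embedded-language adapter is weighted by the complement of the mask. The names `tcs_blend` and `logits` are hypothetical, and `logits` stands in for the per-frame output of the switching transformer.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tcs_blend(logits, matrix_out, embedded_out, threshold=0.5):
    """Select an adapter output per frame using a binary code.

    logits: T floats, one per frame (assumed switching-network output).
    matrix_out, embedded_out: T frames, each a list of D floats.
    Returns (codes, blended), where codes[t] is 1 when the
    matrix-language adapter is used for frame t and 0 otherwise.
    """
    codes, blended = [], []
    for z, h_m, h_e in zip(logits, matrix_out, embedded_out):
        code = 1 if sigmoid(z) >= threshold else 0
        codes.append(code)
        # Multiply each adapter output by its mask value and sum.
        blended.append([code * m + (1 - code) * e for m, e in zip(h_m, h_e)])
    return codes, blended


# Toy example: 3 frames, hidden size 2.
logits = [2.0, -1.5, 0.3]
matrix_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
embedded_out = [[-1.0, -2.0], [-3.0, -4.0], [-5.0, -6.0]]

codes, blended = tcs_blend(logits, matrix_out, embedded_out)
print(codes)    # [1, 0, 1]
print(blended)  # [[1.0, 2.0], [-3.0, -4.0], [5.0, 6.0]]
```

A hard threshold like this is not differentiable, so a trainable version would need some relaxation during fine-tuning; how the paper handles that is not covered in this summary.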

Data Preparation and Evaluation

The evaluation uses three code-switched datasets: ASCEND, ESCWA, and MUCS, covering Mandarin-English, Arabic-English, and Hindi-English respectively. Baseline results show that pre-trained MMS models perform poorly on CS speech when only a single language adapter is used, which motivates CS-specific fine-tuning. The proposed approaches reduce Character Error Rate (CER) on all three datasets by combining the outputs of both adapters.
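
For reference, CER is the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below uses the standard Levenshtein recurrence; the function names are illustrative, and details such as text normalisation before scoring are not specified in this summary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


print(edit_distance("kitten", "sitting"))  # 3
print(cer("abcd", "abcf"))                 # 0.25
```

An "absolute" reduction refers to a difference in these percentages: going from a CER of 40% to 30%, for instance, is a 10% absolute reduction and a 25% relative one. These figures are illustrative only and are not results from the paper.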

Experimental Setup

The models are fine-tuned with the CTC loss function and a learning rate schedule that includes an initial warm-up phase. Experiments are run on Nvidia A100 GPUs using the HuggingFace Transformers library. Both proposed approaches yield lower CER and Mixed Error Rate (MER) than the baseline MMS models, and the results are also compared against Whisper.
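
A warm-up schedule of this kind can be sketched as below. The exact shape used in the paper is not given here, so linear warm-up followed by linear decay is an assumption, as are the function name `warmup_lr` and all the numeric values.

```python
def warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    if step < warmup_steps:
        # Ramp up: reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Hypothetical settings: 10 warm-up steps out of 100, peak learning rate 1.0.
schedule = [warmup_lr(s, peak_lr=1.0, warmup_steps=10, total_steps=100)
            for s in range(101)]
print(schedule[0], schedule[9], schedule[55], schedule[100])  # 0.1 1.0 0.5 0.0
```

Starting from a small learning rate is a common way to avoid large early updates when only a few newly added layers are trained on top of a frozen pre-trained network.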

Results and Discussion

Both PACS and TCS outperform direct fine-tuning of MMS language adapters, offering consistent reductions in error rates across all datasets. The TCS model demonstrates superior performance, underscoring its ability to dynamically adapt language representations at the frame level while maintaining a minimal increase in parameter count.

One notable observation is that the Whisper model is already strong on CS tasks without any retraining, which makes it a demanding point of comparison. The proposed approaches, however, provide a modular way to rapidly deploy multilingual ASR systems with CS capabilities, especially for lower-resourced languages.

Conclusion

The paper introduces novel configurations for modulating language adapters in the MMS framework for enhanced CS speech recognition. By allowing controlled, frame-level language predictions, these strategies offer robust improvements without substantial increases in computational demands. Future work may explore integrating external LLMs or reinforcement learning techniques to further refine CS recognition capabilities.
