Quantifying and improving code-switching in multilingual ASR

Investigate rigorous methodologies to quantify and improve code-switching performance in multilingual automatic speech recognition systems by (i) training on synthetic code-switching datasets constructed, as in the paper's benchmark, by concatenating LibriSpeech and Multilingual LibriSpeech segments across languages, and (ii) characterizing the trade-off between enforcing explicit language tokens during decoding and code-switching transcription accuracy.

Background

The paper evaluates code-switching by constructing synthetic benchmarks that concatenate segments from LibriSpeech (English) and Multilingual LibriSpeech (MLS; Spanish, French, German) into mixed-language audio, comparing Universal-1 (a Conformer RNN-T model) against Whisper large-v3 and Canary-1B.
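The concatenation scheme can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: `make_code_switch_sample` is a hypothetical helper, the segments here are placeholder waveform arrays standing in for LibriSpeech/MLS audio, and details such as sample rate, gap duration, and transcript joining are assumptions.

```python
import numpy as np

def make_code_switch_sample(segments, sr=16000, gap_s=0.1):
    """Concatenate (waveform, transcript, language) segments into one
    mixed-language utterance, inserting a short silence between segments.

    Returns the mixed audio, the joined reference transcript, and the
    per-segment language labels (useful for later error analysis).
    """
    gap = np.zeros(int(sr * gap_s), dtype=np.float32)
    audio_parts, texts, langs = [], [], []
    for i, (wav, text, lang) in enumerate(segments):
        if i > 0:
            audio_parts.append(gap)  # brief pause at the switch point
        audio_parts.append(np.asarray(wav, dtype=np.float32))
        texts.append(text)
        langs.append(lang)
    return np.concatenate(audio_parts), " ".join(texts), langs

# Placeholder segments: silent arrays standing in for real corpus audio.
sr = 16000
en_wav = np.zeros(sr)        # 1 s "English" segment
es_wav = np.zeros(2 * sr)    # 2 s "Spanish" segment
audio, ref, langs = make_code_switch_sample(
    [(en_wav, "hello world", "en"), (es_wav, "hola mundo", "es")], sr=sr
)
```

The same routine could generate training data rather than only evaluation data, which is exactly the direction the authors leave open.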

The authors find Universal-1 to be more robust to code-switching, but describe their benchmark as an initial exploration and explicitly leave more systematic investigation as open work, including training on similarly constructed code-switching data and analyzing how explicit language tokens affect performance.
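Studying the language-token trade-off requires scoring hypotheses from each decoding condition (forced language token vs. automatic language detection) against the same mixed-language reference. A minimal sketch of such a metric, assuming whitespace tokenization and treating the mixed-language reference as a single token sequence so that errors in either language count equally:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via Levenshtein edit distance over whitespace tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

In practice one would compute this per decoding condition over the whole synthetic benchmark, and additionally break errors down by segment language using the per-segment labels, to see whether forcing a language token helps the matching-language spans at the cost of the others.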

References

This benchmark was our first foray into quantifying code-switching, and we leave this open as an area for further work. Possible areas to explore include training on data created in a fashion similar to the benchmark we used and studying the trade-off between the use of explicit language tokens and code-switching performance.

Anatomy of Industrial Scale Multilingual ASR (2404.09841 - Ramirez et al., 15 Apr 2024) in Subsection "Code-switching" (Experimental Results)