- The paper introduces PIER (Point-of-Interest Error Rate), a metric that evaluates ASR performance specifically on code-switched words, addressing the limitations of traditional metrics like WER.
- Traditional metrics like WER can provide misleading results for code-switching, showing improvements while masking poor performance on code-switched words.
- Experiments using models like Whisper and MMS show that PIER offers deeper insights into code-switching handling, revealing difficulties like intra-word code-switching more effectively than WER.
The paper "PIER: A Novel Metric for Evaluating What Matters in Code-Switching" explores the challenges faced by Automatic Speech Recognition (ASR) systems in handling code-switching (CSW) scenarios. Code-switching, which involves the alternation of languages within a discourse, poses significant challenges for ASR systems that are typically evaluated using conventional metrics like Word Error Rate (WER) and Character Error Rate (CER). However, these metrics often fail to accurately measure ASR performance in code-switching contexts.
The authors question whether traditional ASR metrics are applicable to code-switching scenarios, arguing that these metrics can mask the true performance on code-switched words. Using Connectionist Temporal Classification (CTC) and encoder-decoder models, the paper shows empirically that fine-tuning on non-code-switched data from both the matrix and embedded languages improves classical evaluation metrics on code-switched test sets, but at the expense of recognizing the code-switched words themselves.
To address these limitations, the paper proposes the Point-of-Interest Error Rate (PIER), an evaluation metric that focuses on specific words of interest, particularly those involved in code-switching. By isolating and measuring performance on code-switched words, PIER provides a more accurate assessment of ASR systems in code-switching contexts; it is also well suited to distinguishing between inter-word and intra-word code-switching, highlighting areas for improvement.
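The core idea can be sketched as a word-level alignment in which errors are counted only at the points of interest. The following is a minimal illustration, not the paper's reference implementation: the alignment preference order, the treatment of insertions (ignored here), and the German-English example sentence are all assumptions.

```python
def levenshtein_ops(ref, hyp):
    """Edit-distance alignment of token lists; returns (op, ref_index)
    pairs with op in {"ok", "sub", "del"} for each reference token."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match/substitution
    # Backtrace, labelling each reference token
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                dp[i][j] == dp[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1)):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("del", i - 1))
            i -= 1
        else:
            j -= 1  # insertion: consumes no reference token
    return ops[::-1]

def pier(ref, hyp, poi):
    """Error rate computed only over the points of interest (indices into ref)."""
    errs = sum(1 for op, idx in levenshtein_ops(ref, hyp) if idx in poi and op != "ok")
    return errs / len(poi)

# Hypothetical German sentence with one English insertion ("meeting"):
ref = "ich habe ein meeting heute".split()
hyp = "ich habe ein treffen heute".split()
print(pier(ref, hyp, poi={3}))  # 1.0: the code-switched word was missed
```

In this toy case the overall WER is 1/5 = 0.2, yet PIER on the single code-switched word is 1.0, which is exactly the kind of discrepancy the metric is designed to surface.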
The paper includes detailed experiments using well-known multilingual models such as whisper-large-v3 (W-large), whisper-small (W-small), and the massively multilingual MMS model, across datasets like Fisher, ArzEn, and SEAME, covering language pairs such as Mandarin-English and Arabic-English. The experiments demonstrate that PIER provides deeper insights into a model's ability to handle code-switching than WER alone, which often reports misleading performance improvements by emphasizing monolingual capabilities.
Furthermore, PIER allows for fine-grained analysis of ASR systems by distinguishing between inter-word and intra-word code-switching challenges, with results indicating that intra-word code-switching is particularly difficult. Evaluation with PIER reveals that apparent improvements under conventional metrics may in reality conceal a deterioration in handling code-switched speech.
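To make the inter- vs intra-word distinction concrete for a pair like Mandarin-English, mixed-script tokens can serve as a rough proxy for intra-word switches. The script-based heuristic, the labels, and the example tokens below are assumptions for illustration only; the paper's own annotation of switch points may differ.

```python
import unicodedata

def token_scripts(token):
    """Return the set of rough scripts ('cjk' or 'latin') used in a token."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            scripts.add("cjk" if "CJK" in unicodedata.name(ch, "") else "latin")
    return scripts

def classify_switches(tokens):
    """Label each token: 'intra' if it mixes scripts internally,
    'inter' if its script differs from the preceding token's, else 'none'."""
    labels, prev = [], None
    for tok in tokens:
        s = token_scripts(tok)
        if len(s) > 1:
            labels.append("intra")
        elif prev and s and s != prev:
            labels.append("inter")
        else:
            labels.append("none")
        # A mixed token makes the next comparison ambiguous, so reset
        prev = s if len(s) == 1 else None
    return labels

print(classify_switches(["我", "想", "开", "meeting"]))  # inter-word switch at the end
print(classify_switches(["他", "很", "会", "打call"]))    # intra-word switch in last token
```

An intra-word switch such as "打call" requires the recognizer to change languages mid-token, which helps explain why PIER scores for intra-word code-switching lag behind those for inter-word switches.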
The paper concludes by emphasizing the value of PIER in providing a more focused evaluation of ASR systems in code-switching contexts, paving the way for targeted improvements and a deeper understanding of the complexities involved in multilingual speech recognition. The work invites further research and application of PIER to enhance the evaluation and modeling of ASR systems handling diverse linguistic phenomena.