- The paper introduces a benchmark dataset (HAC) to quantify human judgment in code-switching ASR through minimal editing guidelines.
- It shows that combining transliteration with phonetic similarity methods improves correlation with human judgments compared to traditional metrics like WER and CER.
- Experimental results validate the robustness of these metrics across diverse recording setups and varying degrees of code-switching.
Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition
Introduction
Code-switching (CS) presents unique challenges in the field of automatic speech recognition (ASR) due to the frequent alternation between languages within the same discourse. This phenomenon is prevalent across multilingual societies, necessitating robust and fair evaluation metrics for ASR systems that handle CS. The paper "Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition" takes a substantial step in this direction by introducing a reference benchmark dataset for CS speech recognition and examining a variety of evaluation metrics against human judgments.
The authors explore multiple metrics that differ in representation, directness, granularity, and similarity computation. They achieve the highest correlation with human judgment using a combination of transliteration and text normalization methods. This approach mitigates the influence of cross-transcription errors and orthographic inconsistencies, which are common in code-switched language pairs written in different scripts.
Development of Human Acceptability Corpus
The cornerstone of this research is the Human Acceptability Corpus (HAC), designed to quantify human judgment through minimal editing of ASR hypotheses. The dataset is derived from the ArzEn corpus, which features Egyptian Arabic-English CS conversational speech. Annotators edit each hypothesis only as much as needed to reproduce the meaning of the original audio, following guidelines that cover script segregation, readability, and minimal intervention, so that the edited hypothesis remains easily understandable.
Inter-annotator agreement was rigorously assessed to establish the reliability of the annotations. The analysis reveals the inherent difficulty of the task: unstandardized orthography introduces variance in which words annotators accept and which edits they choose, underscoring the need for more nuanced evaluation techniques.
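To make the notion of post-editing effort concrete, the sketch below scores an ASR hypothesis against its minimally edited version. It assumes the `jiwer` package for edit-distance-based error rates; the example strings are illustrative placeholders, not HAC data.

```python
# Sketch: quantify human acceptability as the edit effort between an ASR
# hypothesis and its minimally edited (human-corrected) version.
# Assumes the `jiwer` package; strings below are illustrative, not from HAC.
import jiwer

asr_hypothesis = "ana kont fel meeting embare7"
human_minimal_edit = "ana kont fel meeting embareh"

# Word- and character-level edit effort: 0.0 means the hypothesis was
# fully acceptable as-is.
post_edit_wer = jiwer.wer(human_minimal_edit, asr_hypothesis)
post_edit_cer = jiwer.cer(human_minimal_edit, asr_hypothesis)

print(f"post-edit WER: {post_edit_wer:.3f}, post-edit CER: {post_edit_cer:.3f}")
```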
Evaluation Metrics
Orthographic Metrics
The study evaluates traditional orthographic metrics such as Word Error Rate (WER) and Character Error Rate (CER) alongside newer measures like Match Error Rate (MER) and Word Information Lost (WIL). On their own, these metrics cannot handle the cross-transcription challenges of CS, where the same word may legitimately be written in either script. To alleviate this, the authors apply transliteration to unify scripts, combined with normalization practices such as Alif/Ya normalization to reduce orthographic disparities.
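The sketch below illustrates this idea under stated assumptions: the transliteration step is a toy word-lookup stand-in for a real transliteration model, and the Alif/Ya mapping follows one common Arabic normalization convention rather than the paper's exact rules.

```python
# Sketch: unify scripts and orthography before computing WER.
# The transliteration function is a toy stand-in; real pipelines would use a
# trained transliteration model rather than a word lookup.
import re
import jiwer

def normalize_arabic(text: str) -> str:
    """Collapse common orthographic variants (Alif/Ya normalization)."""
    text = re.sub("[أإآ]", "ا", text)  # Alif variants -> bare Alif
    text = re.sub("ى", "ي", text)       # Alif Maqsura -> Ya (one common convention)
    return text

def transliterate_to_arabic(text: str) -> str:
    """Toy stand-in: map Latin-script (romanized) words into Arabic script."""
    toy_map = {"meeting": "ميتينج"}  # illustrative only
    return " ".join(toy_map.get(word, word) for word in text.split())

def script_unified_wer(reference: str, hypothesis: str) -> float:
    ref = normalize_arabic(transliterate_to_arabic(reference))
    hyp = normalize_arabic(transliterate_to_arabic(hypothesis))
    return jiwer.wer(ref, hyp)
```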
Phonological Metrics
To move beyond the surface-level comparisons of orthographic metrics, the authors propose the Phone Similarity Edit Distance (PSD), computed over IPA phonetic mappings. This phonological approach yields a language-independent, deterministic metric by scaling substitution penalties according to phonetic dissimilarity, improving evaluation accuracy in multilingual settings.
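A minimal sketch of this kind of phonetically weighted edit distance is shown below. The feature table and distance function are illustrative assumptions, not the paper's phone inventory or exact weighting; the point is only that substitutions between similar phones should cost less than a full edit.

```python
# Sketch in the spirit of PSD: a Levenshtein distance whose substitution cost
# scales with phone dissimilarity instead of a flat 1.
FEATURES = {
    "p": {"labial", "stop", "voiceless"},
    "b": {"labial", "stop", "voiced"},
    "t": {"coronal", "stop", "voiceless"},
    "s": {"coronal", "fricative", "voiceless"},
}

def phone_distance(a: str, b: str) -> float:
    """Jaccard-style dissimilarity between two phones' feature sets (0..1)."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a, {a}), FEATURES.get(b, {b})
    return 1.0 - len(fa & fb) / len(fa | fb)

def weighted_edit_distance(ref: list[str], hyp: list[str]) -> float:
    """Levenshtein distance where substitutions cost phone_distance(r, h)."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + phone_distance(ref[i - 1], hyp[j - 1]),  # substitution
            )
    return d[n][m]

# /b/ and /p/ differ only in voicing, so the penalty is 0.5 rather than 1.0.
print(weighted_edit_distance(list("bat"), list("pat")))
```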
Semantic Metrics
The paper introduces semantic ASR evaluation by using pretrained transformer models to compute the cosine similarity between reference and hypothesis embeddings. A novel aspect is translating hypotheses and references into monolingual form to sidestep cross-transcription problems. While semantic measures correlate less strongly with human judgment than the phonological and transliteration-based solutions, they still outperform standard WER, suggesting room for improvement as CS machine translation systems advance.
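The embedding-based comparison can be sketched as follows, assuming the `sentence-transformers` package; the multilingual model name and sentences are plausible placeholders, not the paper's specific setup.

```python
# Sketch of a semantic metric: cosine similarity between reference and
# hypothesis sentence embeddings. Model choice is an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

reference = "I was in the meeting yesterday"
hypothesis = "I was in the meeting yesterday morning"

# Encode both sentences and compare them in embedding space.
emb_ref, emb_hyp = model.encode([reference, hypothesis], convert_to_tensor=True)
semantic_score = util.cos_sim(emb_ref, emb_hyp).item()
print(f"semantic similarity: {semantic_score:.3f}")
```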
Experimental Results
Sentence- and system-level evaluations expose the limitations of conventional metrics such as CER and WER and highlight the advantages of transliteration and phonetic similarity methods. Transliteration followed by phonetic similarity consistently correlates more strongly with human post-editing effort than traditional metrics do. Sentence-level assessment further shows consistent behavior across recording setups with different levels of CS, supporting the robustness of the proposed metrics.
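The correlation analysis itself is straightforward; the sketch below shows the computation with `scipy`, using made-up placeholder scores rather than results from the paper.

```python
# Sketch: sentence-level correlation between an automatic metric and human
# post-editing effort. The score lists are placeholders for illustration.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.12, 0.40, 0.05, 0.33, 0.21]  # e.g. transliterated WER per utterance
human_effort  = [0.10, 0.45, 0.00, 0.30, 0.25]  # e.g. post-edit rate per utterance

pearson_r, _ = pearsonr(metric_scores, human_effort)
spearman_rho, _ = spearmanr(metric_scores, human_effort)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```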
Conclusion
The research paves the way for improved evaluation standards in CS ASR, demonstrating that traditional metrics fail to capture the intricacies introduced by cross-transcription and orthographic variation, while transliteration and phonetic approaches emerge as superior alternatives. Future work includes extending the HAC to more language pairs and exploring the human acceptability of CS ASR across diverse linguistic contexts, steps toward more accurate and reliable evaluation of ASR systems in multilingual environments.