- The paper introduces a benchmark dataset (HAC) to quantify human judgment in code-switching ASR through minimal editing guidelines.
- It shows that combining transliteration with phonetic similarity methods improves correlation with human judgments compared to traditional metrics like WER and CER.
- Experimental results validate the robustness of these metrics across diverse recording setups and varying degrees of code-switching.
Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition
Introduction
Code-switching (CS) presents unique challenges in the field of automatic speech recognition (ASR) due to the frequent alternation between languages within the same discourse. This phenomenon is prevalent across multilingual societies, necessitating robust and fair evaluation metrics for ASR systems that handle CS. The paper "Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition" takes a substantial step in this direction by introducing a reference benchmark dataset for CS speech recognition and examining a variety of evaluation metrics against human judgments.
The authors explore multiple metrics that differ in representation, directness, granularity, and similarity computation. They achieve the highest correlation with human judgment using a combination of transliteration and text normalization methods. This approach mitigates the influence of cross-transcription errors and orthographic inconsistencies, which are common in code-switched language pairs written in different scripts.
Development of Human Acceptability Corpus
The cornerstone of this research is the Human Acceptability Corpus (HAC), designed to quantify human judgment through minimal editing of ASR hypotheses. The dataset is derived from the ArzEn corpus, which features Egyptian Arabic-English CS conversational speech. Annotators edit each hypothesis only as much as needed to reproduce the meaning of the original audio, following guidelines that cover script segregation, readability, and minimal intervention, so that the edited hypothesis remains easily understandable.
Inter-annotator agreement was rigorously assessed to establish the reliability of the annotations. The analysis reveals the inherent difficulty of the task: unstandardized orthography introduces variance in which words annotators accept and which edits they choose, underscoring the need for more nuanced evaluation techniques.
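To make the notion of post-editing effort concrete, the sketch below scores an ASR hypothesis against its minimally edited version. It assumes the `jiwer` package for edit-distance-based error rates; the example strings are illustrative placeholders, not HAC data.

```python
# Sketch: quantify human acceptability as the edit effort between an ASR
# hypothesis and its minimally edited (human-corrected) version.
# Assumes the `jiwer` package; strings below are illustrative, not from HAC.
import jiwer

asr_hypothesis = "ana kont fel meeting embare7"
human_minimal_edit = "ana kont fel meeting embareh"

# Word- and character-level edit effort: 0.0 means the hypothesis was
# fully acceptable as-is.
post_edit_wer = jiwer.wer(human_minimal_edit, asr_hypothesis)
post_edit_cer = jiwer.cer(human_minimal_edit, asr_hypothesis)

print(f"post-edit WER: {post_edit_wer:.3f}, post-edit CER: {post_edit_cer:.3f}")
```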
Evaluation Metrics
Orthographic Metrics
The study evaluates traditional orthographic metrics such as Word Error Rate (WER) and Character Error Rate (CER) alongside newer measures like Match Error Rate (MER) and Word Information Lost (WIL). On their own, these metrics cannot handle the cross-transcription challenges of CS, where the same word may legitimately be written in either script. To alleviate this, the authors apply transliteration to unify scripts, combined with normalization practices such as Alif/Ya normalization to reduce orthographic disparities.
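The sketch below illustrates this idea under stated assumptions: the transliteration step is a toy word-lookup stand-in for a real transliteration model, and the Alif/Ya mapping follows one common Arabic normalization convention rather than the paper's exact rules.

```python
# Sketch: unify scripts and orthography before computing WER.
# The transliteration function is a toy stand-in; real pipelines would use a
# trained transliteration model rather than a word lookup.
import re
import jiwer

def normalize_arabic(text: str) -> str:
    """Collapse common orthographic variants (Alif/Ya normalization)."""
    text = re.sub("[أإآ]", "ا", text)  # Alif variants -> bare Alif
    text = re.sub("ى", "ي", text)       # Alif Maqsura -> Ya (one common convention)
    return text

def transliterate_to_arabic(text: str) -> str:
    """Toy stand-in: map Latin-script (romanized) words into Arabic script."""
    toy_map = {"meeting": "ميتينج"}  # illustrative only
    return " ".join(toy_map.get(word, word) for word in text.split())

def script_unified_wer(reference: str, hypothesis: str) -> float:
    ref = normalize_arabic(transliterate_to_arabic(reference))
    hyp = normalize_arabic(transliterate_to_arabic(hypothesis))
    return jiwer.wer(ref, hyp)
```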
Phonological Metrics
To move beyond the surface-level comparisons of orthographic metrics, the authors propose the Phone Similarity Edit Distance (PSD), computed over IPA phonetic mappings. This phonological approach yields a language-independent, deterministic metric by scaling substitution penalties according to phonetic dissimilarity, improving evaluation accuracy in multilingual settings.
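A minimal sketch of this kind of phonetically weighted edit distance is shown below. The feature table and distance function are illustrative assumptions, not the paper's phone inventory or exact weighting; the point is only that substitutions between similar phones should cost less than a full edit.

```python
# Sketch in the spirit of PSD: a Levenshtein distance whose substitution cost
# scales with phone dissimilarity instead of a flat 1.
FEATURES = {
    "p": {"labial", "stop", "voiceless"},
    "b": {"labial", "stop", "voiced"},
    "t": {"coronal", "stop", "voiceless"},
    "s": {"coronal", "fricative", "voiceless"},
}

def phone_distance(a: str, b: str) -> float:
    """Jaccard-style dissimilarity between two phones' feature sets (0..1)."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a, {a}), FEATURES.get(b, {b})
    return 1.0 - len(fa & fb) / len(fa | fb)

def weighted_edit_distance(ref: list[str], hyp: list[str]) -> float:
    """Levenshtein distance where substitutions cost phone_distance(r, h)."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + phone_distance(ref[i - 1], hyp[j - 1]),  # substitution
            )
    return d[n][m]

# /b/ and /p/ differ only in voicing, so the penalty is 0.5 rather than 1.0.
print(weighted_edit_distance(list("bat"), list("pat")))
```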
Semantic Metrics
The paper introduces semantic ASR evaluation by using pretrained transformer models to compute the cosine similarity between reference and hypothesis embeddings. A novel aspect is translating hypotheses and references into monolingual form to sidestep cross-transcription problems. While semantic measures correlate less strongly with human judgment than the phonological and transliteration-based solutions, they still outperform standard WER, suggesting room for improvement as CS machine translation systems advance.
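The embedding-based comparison can be sketched as follows, assuming the `sentence-transformers` package; the multilingual model name and sentences are plausible placeholders, not the paper's specific setup.

```python
# Sketch of a semantic metric: cosine similarity between reference and
# hypothesis sentence embeddings. Model choice is an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

reference = "I was in the meeting yesterday"
hypothesis = "I was in the meeting yesterday morning"

# Encode both sentences and compare them in embedding space.
emb_ref, emb_hyp = model.encode([reference, hypothesis], convert_to_tensor=True)
semantic_score = util.cos_sim(emb_ref, emb_hyp).item()
print(f"semantic similarity: {semantic_score:.3f}")
```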
Experimental Results
Sentence- and system-level evaluations expose the limitations of conventional metrics such as CER and WER and highlight the advantages of transliteration and phonetic similarity methods. Transliteration followed by phonetic similarity consistently correlates more strongly with human post-editing effort than traditional metrics do. Sentence-level assessment further shows consistent behavior across recording setups with different levels of CS, supporting the robustness of the proposed metrics.
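The correlation analysis itself is straightforward; the sketch below shows the computation with `scipy`, using made-up placeholder scores rather than results from the paper.

```python
# Sketch: sentence-level correlation between an automatic metric and human
# post-editing effort. The score lists are placeholders for illustration.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.12, 0.40, 0.05, 0.33, 0.21]  # e.g. transliterated WER per utterance
human_effort  = [0.10, 0.45, 0.00, 0.30, 0.25]  # e.g. post-edit rate per utterance

pearson_r, _ = pearsonr(metric_scores, human_effort)
spearman_rho, _ = spearmanr(metric_scores, human_effort)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```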
Conclusion
The research paves the way for improved evaluation standards in CS ASR, demonstrating that traditional metrics fail to capture the intricacies introduced by cross-transcription and orthographic variation, while transliteration and phonetic approaches emerge as superior alternatives. Future work includes extending the HAC to more language pairs and exploring the human acceptability of CS ASR across diverse linguistic contexts, steps toward more accurate and reliable evaluation of ASR systems in multilingual environments.