Toward Practical Automatic Speech Recognition and Post-Processing: a Call for Explainable Error Benchmark Guideline (2401.14625v1)
Abstract: Automatic speech recognition (ASR) outcomes serve as input for downstream tasks, substantially impacting the satisfaction level of end-users. Hence, the diagnosis and enhancement of the vulnerabilities present in the ASR model bear significant importance. However, traditional evaluation methodologies of ASR systems generate a singular, composite quantitative metric, which fails to provide comprehensive insight into specific vulnerabilities. This lack of detail extends to the post-processing stage, resulting in further obfuscation of potential weaknesses. Despite an ASR model's ability to recognize utterances accurately, subpar readability can negatively affect user satisfaction, giving rise to a trade-off between recognition accuracy and user-friendliness. To effectively address this, it is imperative to consider both the speech level, crucial for recognition accuracy, and the text level, critical for user-friendliness. Consequently, we propose the development of an Error Explainable Benchmark (EEB) dataset. This dataset, while considering both the speech and text levels, enables a granular understanding of the model's shortcomings. Our proposition provides a structured pathway for a more 'real-world-centric' evaluation, a marked shift away from abstracted, traditional methods, allowing for the detection and rectification of nuanced system weaknesses, ultimately aiming for an improved user experience.