H_eval: A new hybrid evaluation metric for automatic speech recognition tasks (2211.01722v3)
Abstract: Many studies have examined the shortcomings of word error rate (WER) as an evaluation metric for automatic speech recognition (ASR) systems. Since WER considers only literal word-level correctness, new evaluation metrics based on semantic similarity, such as semantic distance (SD) and BERTScore, have been developed. However, we found that these metrics have their own limitations, such as a tendency to overly prioritise keywords. We propose H_eval, a new hybrid evaluation metric for ASR systems that considers both semantic correctness and error rate and performs well in scenarios where WER and SD perform poorly. Because it is computationally lighter than BERTScore, it reduces metric computation time by a factor of 49. Furthermore, we show that H_eval correlates strongly with downstream NLP tasks. In addition, to further reduce metric computation time, we built several fast and lightweight models using distillation techniques.
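The abstract does not give the H_eval formula itself. As a rough illustration of the kind of hybrid metric described (one that blends semantic correctness with error rate), the sketch below combines a sentence-embedding similarity score with (1 − WER). The embedding model, the blending weight `alpha`, and the helper names are assumptions made for illustration only, not the authors' definition.

```python
# Illustrative sketch of a hybrid ASR metric that blends semantic similarity
# with word error rate. The weighting scheme below is a hypothetical choice,
# not the H_eval definition from the paper.
from jiwer import wer                                    # word error rate
from sentence_transformers import SentenceTransformer, util

# Lightweight sentence-embedding model (the model choice is an assumption).
_model = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_similarity(reference: str, hypothesis: str) -> float:
    """Cosine similarity between sentence embeddings, clipped to [0, 1]."""
    ref_emb, hyp_emb = _model.encode([reference, hypothesis], convert_to_tensor=True)
    return max(0.0, float(util.cos_sim(ref_emb, hyp_emb)))


def hybrid_eval(reference: str, hypothesis: str, alpha: float = 0.5) -> float:
    """Hypothetical hybrid score in [0, 1]; higher is better.

    Blends semantic correctness with (1 - WER); alpha is illustrative.
    """
    sem = semantic_similarity(reference, hypothesis)
    err = min(1.0, wer(reference, hypothesis))           # cap WER at 1 for blending
    return alpha * sem + (1.0 - alpha) * (1.0 - err)


if __name__ == "__main__":
    ref = "book a flight from boston to denver tomorrow morning"
    hyp = "book a flight from boston to denver tomorrow mourning"
    print(f"WER:    {wer(ref, hyp):.3f}")
    print(f"Hybrid: {hybrid_eval(ref, hyp):.3f}")
```

In this sketch, a hypothesis with a minor spelling error keeps a high hybrid score because its sentence embedding stays close to the reference, which is the intuition behind combining semantic similarity with a literal error rate.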