
A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision (2306.13114v1)

Published 21 Jun 2023 in cs.CL and eess.AS

Abstract: The common standard for quality evaluation of automatic speech recognition (ASR) systems is reference-based metrics such as the Word Error Rate (WER), computed using manual ground-truth transcriptions that are time-consuming and expensive to obtain. This work proposes a multi-language referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground-truth transcriptions. To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised manner. In experiments conducted on several unseen test datasets consisting of outputs from top commercial ASR engines in various languages, the proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from a state-of-the-art multilingual LM in all experiments, and also reduces WER by more than $7\%$ when used for ensembling hypotheses. The fine-tuned model and experiments are made available for reproducibility: https://github.com/aixplain/NoRefER
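The abstract's central evaluation idea can be illustrated with a small sketch: score several ASR hypotheses with a (stand-in) referenceless metric, compute their true WER against a reference transcript, and check the rank correlation between the two. This is not the paper's code; the hypotheses, the hard-coded metric scores, and the helper names are all illustrative assumptions.

```python
# Hedged sketch (not the NoRefER code): a good referenceless quality metric
# should rank hypotheses the same way reference-based WER does, i.e. its
# scores should correlate (negatively) with WER.

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(len(r), 1)

def spearman(xs, ys):
    """Spearman rank correlation (toy version, assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        rk = [0] * len(v)
        for rank, i in enumerate(order):
            rk[i] = rank
        return rk
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy hypotheses from three imaginary ASR engines for one utterance.
reference = "the quick brown fox jumps over the lazy dog"
hypotheses = [
    "the quick brown fox jumps over the lazy dog",  # best
    "the quick brown fox jump over lazy dog",       # middle
    "quick brown box jump over the hazy dog",       # worst
]
wers = [wer(reference, h) for h in hypotheses]
# Stand-in for the fine-tuned model's output (hypothetical scores,
# higher = better quality).
metric_scores = [0.95, 0.70, 0.40]
# Perfect agreement with WER ranks gives a correlation of -1.0.
print(spearman(metric_scores, wers))
```

In the paper's setting this rank correlation is measured against WER on unseen test sets; the same scores can also drive hypothesis ensembling by picking the highest-scoring hypothesis per utterance.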

Authors (5)
  1. Kamer Ali Yuksel (14 papers)
  2. Thiago Ferreira (9 papers)
  3. Ahmet Gunduz (22 papers)
  4. Mohamed Al-Badrashiny (6 papers)
  5. Golara Javadi (5 papers)
Citations (4)