BLSTM-Based Confidence Estimation for End-to-End Speech Recognition (2312.14609v1)

Published 22 Dec 2023 in eess.AS and cs.CL

Abstract: Confidence estimation, in which we estimate the reliability of each recognized token (e.g., word, sub-word, and character) in automatic speech recognition (ASR) hypotheses and detect incorrectly recognized tokens, is an important function for developing ASR applications. In this study, we perform confidence estimation for end-to-end (E2E) ASR hypotheses. Recent E2E ASR systems show high performance (e.g., around 5% token error rates) for various ASR tasks. In such situations, confidence estimation becomes difficult since we need to detect infrequent incorrect tokens from mostly correct token sequences. To tackle this imbalanced dataset problem, we employ a bidirectional long short-term memory (BLSTM)-based model as a strong binary-class (correct/incorrect) sequence labeler that is trained with a class balancing objective. We experimentally confirmed that, by utilizing several types of ASR decoding scores as its auxiliary features, the model steadily shows high confidence estimation performance under highly imbalanced settings. We also confirmed that the BLSTM-based model outperforms Transformer-based confidence estimation models, which greatly underestimate incorrect tokens.
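
To make the approach in the abstract concrete, below is a minimal sketch (not the authors' implementation) of a BLSTM-based binary confidence estimator in PyTorch: each hypothesis token is embedded, concatenated with auxiliary ASR decoding scores, passed through a bidirectional LSTM, and labeled correct/incorrect with a class-weighted loss to counter the heavy correct/incorrect imbalance. All layer sizes, the score dimensionality, and the inverse-frequency weighting are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of a BLSTM confidence estimator for E2E ASR hypotheses.
# Per-token inputs: token id + auxiliary decoding scores (e.g. attention,
# CTC, and LM scores); per-token output: logit for "incorrectly recognized".
import torch
import torch.nn as nn
import torch.nn.functional as F

class BLSTMConfidenceEstimator(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, score_dim=3, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.blstm = nn.LSTM(embed_dim + score_dim, hidden_dim,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, 1)  # per-token binary logit

    def forward(self, tokens, scores):
        # tokens: (batch, seq_len) hypothesis token ids
        # scores: (batch, seq_len, score_dim) auxiliary decoding scores
        x = torch.cat([self.embed(tokens), scores], dim=-1)
        h, _ = self.blstm(x)                     # (batch, seq_len, 2*hidden_dim)
        return self.out(h).squeeze(-1)           # (batch, seq_len) logits

def balanced_bce_loss(logits, labels):
    # labels: 1.0 = incorrect token, 0.0 = correct token.
    # Weight the rare "incorrect" class by inverse frequency; the paper's
    # exact class-balancing objective may differ.
    pos_frac = labels.float().mean().clamp(min=1e-6)
    pos_weight = (1.0 - pos_frac) / pos_frac
    return F.binary_cross_entropy_with_logits(logits, labels.float(),
                                              pos_weight=pos_weight)
```

At inference time, applying a sigmoid to each logit gives a per-token confidence score, and thresholding it flags tokens that are likely misrecognized.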

Authors (4)
  1. Atsunori Ogawa (15 papers)
  2. Naohiro Tawara (20 papers)
  3. Takatomo Kano (9 papers)
  4. Marc Delcroix (94 papers)
Citations (4)