Weak Alignment Supervision from Hybrid Model Improves End-to-end ASR (2311.14835v2)

Published 24 Nov 2023 in cs.SD, cs.CL, and eess.AS

Abstract: In this paper, we aim to create weak alignment supervision from an existing hybrid system to aid end-to-end modeling of automatic speech recognition. To this end, we use the existing hybrid ASR system to produce triphone alignments of the training audio. We then apply a cross-entropy loss at a chosen encoder layer using the derived alignments. In contrast to the conventional one-hot cross-entropy loss, we use a cross-entropy loss with a label smoothing parameter to regularize the supervision. For comparison, we also conduct experiments with one-hot cross-entropy losses and with CTC losses under loss weighting. The results show that placing the weak alignment supervision with a label smoothing parameter of 0.5 at the third encoder layer outperforms the other two approaches and leads to about a 5% relative WER reduction on the TED-LIUM 2 dataset over the baseline. We see similar improvements when applying the method out of the box to a Tagalog end-to-end ASR system.
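The supervision described in the abstract can be pictured as an auxiliary frame-level loss attached to an intermediate encoder layer. The PyTorch snippet below is a minimal sketch, not the paper's implementation: the module name WeakAlignmentAuxLoss, the tensor shapes, and the aux_weight combination are assumptions, and it presumes the hybrid-system triphone alignments have already been mapped to the encoder's frame rate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeakAlignmentAuxLoss(nn.Module):
    """Label-smoothed cross-entropy over frame-level triphone alignments,
    attached to an intermediate encoder layer (illustrative sketch only)."""

    def __init__(self, encoder_dim: int, num_triphones: int, smoothing: float = 0.5):
        super().__init__()
        # Frame-wise classifier over the triphone (alignment) labels.
        self.proj = nn.Linear(encoder_dim, num_triphones)
        self.smoothing = smoothing

    def forward(self, enc_hidden, alignment, frame_mask):
        # enc_hidden: (B, T, D) hidden states of the chosen encoder layer (e.g. layer 3)
        # alignment:  (B, T) triphone ids produced by the hybrid ASR system
        # frame_mask: (B, T) 1.0 for real frames, 0.0 for padding
        logits = self.proj(enc_hidden)              # (B, T, C)
        per_frame = F.cross_entropy(
            logits.transpose(1, 2),                 # (B, C, T), as cross_entropy expects
            alignment,
            label_smoothing=self.smoothing,
            reduction="none",
        )                                           # (B, T)
        # Average over real (non-padded) frames only.
        return (per_frame * frame_mask).sum() / frame_mask.sum()


# Hypothetical combination with the main end-to-end objective in a training step:
#   total_loss = main_loss + aux_weight * aux_loss(enc_layer3_out, hybrid_alignment, mask)
```

In this sketch the label_smoothing argument plays the role of the paper's smoothing parameter (0.5 performs best in the reported experiments). Replacing the smoothed cross-entropy with a one-hot cross-entropy or an intermediate CTC loss would correspond to the two comparison approaches mentioned in the abstract.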
