Alternating Weak Triphone/BPE Alignment Supervision from Hybrid Model Improves End-to-End ASR (2402.15594v1)

Published 23 Feb 2024 in cs.CL, cs.SD, and eess.AS

Abstract: In this paper, alternating weak triphone/BPE alignment supervision is proposed to improve end-to-end model training. To this end, triphone and BPE alignments are extracted using a pre-existing hybrid ASR system. A regularization effect is then obtained through cross-entropy-based intermediate auxiliary losses computed on these alignments, applied at a mid-layer representation of the encoder for triphone alignments and at the encoder output for BPE alignments. Weak supervision is achieved through strong label smoothing with a smoothing parameter of 0.5. Experimental results on TED-LIUM 2 indicate that either triphone- or BPE-alignment-based weak supervision improves ASR performance over a standard CTC auxiliary loss. Moreover, their combination lowers the word error rate further. We also investigate alternating the two auxiliary tasks during model training and observe an additional performance gain. Overall, the proposed techniques yield over 10% relative error rate reduction over a CTC-regularized baseline system.
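
The abstract describes two frame-level auxiliary cross-entropy losses with strong label smoothing, attached at different depths of the encoder and alternated during training. Below is a minimal PyTorch sketch of that idea; the module and tensor names (mid_repr, top_repr, the alignment tensors) and the per-step alternation rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class WeakAlignmentAux(nn.Module):
    """Sketch of weak triphone/BPE alignment auxiliary losses (assumed interface)."""

    def __init__(self, d_model, n_triphones, n_bpe, smoothing=0.5):
        super().__init__()
        # Frame-level classifiers on top of encoder representations.
        self.triphone_head = nn.Linear(d_model, n_triphones)
        self.bpe_head = nn.Linear(d_model, n_bpe)
        # "Weak" supervision via strong label smoothing (0.5 per the abstract).
        self.ce = nn.CrossEntropyLoss(label_smoothing=smoothing)

    def forward(self, mid_repr, top_repr, triphone_align, bpe_align, step):
        # mid_repr: (B, T, d_model) mid-layer encoder output;
        # top_repr: (B, T, d_model) final encoder output;
        # triphone_align / bpe_align: (B, T) frame-level labels taken from a
        # pre-existing hybrid ASR system's alignments.
        loss_tri = self.ce(self.triphone_head(mid_repr).transpose(1, 2), triphone_align)
        loss_bpe = self.ce(self.bpe_head(top_repr).transpose(1, 2), bpe_align)
        # Alternate the two auxiliary tasks during training; the step-parity
        # schedule here is only one possible way to realize the alternation.
        return loss_tri if step % 2 == 0 else loss_bpe
```

In training, such an auxiliary term would be added with some weight to the main end-to-end objective, playing the role that the CTC auxiliary loss plays in the baseline system mentioned in the abstract.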
