Lightweight Transducer Based on Frame-Level Criterion (2409.13698v2)

Published 5 Sep 2024 in cs.CL, cs.SD, and eess.AS

Abstract: The transducer model trained with a sequence-level criterion requires a large amount of memory because it generates a large probability matrix. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced-alignment algorithm to determine the label for each frame. The encoder output can then be combined with the decoder output at the corresponding time step, rather than adding every element output by the encoder to every element output by the decoder as in the standard transducer. This significantly reduces memory and computation requirements. To address the class imbalance caused by the excessive number of blanks in the frame-level labels, we decouple the blank and non-blank probabilities and truncate the gradient from the blank classifier to the main network. Experiments on AISHELL-1 demonstrate that this enables the lightweight transducer to achieve results similar to the standard transducer. Additionally, by using richer information to predict the blank probability, it achieves results superior to the transducer.
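
Below is a minimal sketch, in PyTorch, of the frame-level combination the abstract describes: decoder states are gathered at the label-prefix index that CTC forced alignment assigns to each frame, so the joint is computed over T frames rather than the full T x U lattice, and the blank classifier's input is detached to truncate its gradient to the main network. The module name, tensor shapes, nonlinearity, and blank/non-blank split here are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class LightweightJoint(nn.Module):
    """Frame-level joint: combine each encoder frame with the decoder state
    that CTC forced alignment assigns to that frame (illustrative sketch)."""

    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.nonblank = nn.Linear(dim, vocab_size)  # non-blank classifier
        self.blank = nn.Linear(dim, 1)              # decoupled blank classifier

    def forward(self, enc, dec, alignment):
        # enc:       (B, T, D)   one encoder vector per acoustic frame
        # dec:       (B, U+1, D) one decoder vector per label prefix
        # alignment: (B, T) long tensor, prefix index active at each frame
        #            (from CTC forced alignment), values in [0, U]
        idx = alignment.unsqueeze(-1).expand(-1, -1, dec.size(-1))  # (B, T, D)
        dec_at_frame = torch.gather(dec, 1, idx)                    # (B, T, D)

        # Frame-wise combination: no (B, T, U, D) lattice is materialized.
        hidden = torch.tanh(enc + dec_at_frame)

        nonblank_logits = self.nonblank(hidden)       # (B, T, V)
        # Detach so the blank classifier's gradient does not flow back into
        # the encoder/decoder ("truncate the gradient to the main network").
        blank_logit = self.blank(hidden.detach())     # (B, T, 1)
        return nonblank_logits, blank_logit


# Example usage with dummy tensors (batch 2, 100 frames, 20 labels, dim 256):
joint = LightweightJoint(dim=256, vocab_size=4000)
enc = torch.randn(2, 100, 256)
dec = torch.randn(2, 21, 256)
align = torch.randint(0, 21, (2, 100))
nonblank_logits, blank_logit = joint(enc, dec, align)
```

For a rough sense of scale (illustrative numbers, not from the paper): with T = 500 frames, U = 50 labels, and a 5,000-token vocabulary, the standard transducer's T x U x V logit tensor holds about 125 million entries per utterance, whereas the frame-level T x V tensor holds about 2.5 million, which is where the memory and computation savings come from.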
