Lightweight Transducer Based on Frame-Level Criterion (2409.13698v2)
Abstract: A transducer model trained with a sequence-level criterion requires a large amount of memory because it generates a probability matrix over all pairs of encoder and decoder positions. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced-alignment algorithm to determine the label for each frame. The encoder output can then be combined with the decoder output at the corresponding time step, rather than adding every element of the encoder output to every element of the decoder output as in the standard transducer. This significantly reduces memory and computation requirements. To address the class imbalance caused by the large proportion of blanks in the frame-level labels, we decouple the blank and non-blank probabilities and truncate the gradient flowing from the blank classifier into the main network. Experiments on AISHELL-1 demonstrate that this enables the lightweight transducer to achieve results similar to the transducer. Additionally, we use richer information to predict the blank probability, achieving results superior to the transducer.
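As a rough illustration of the frame-level combination and the decoupled blank classifier described in the abstract, below is a minimal PyTorch-style sketch. All module names, shapes, and the gathering of decoder states via CTC forced alignment are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FrameLevelJoint(nn.Module):
    """Hypothetical frame-level joint network: each encoder frame is paired with
    exactly one decoder state (chosen by CTC forced alignment), so the joint
    tensor is (B, T, D) instead of the transducer's (B, T, U, D)."""

    def __init__(self, enc_dim: int, dec_dim: int, joint_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim + dec_dim, joint_dim)
        self.nonblank_out = nn.Linear(joint_dim, vocab_size)  # non-blank token classifier
        self.blank_out = nn.Linear(joint_dim, 1)               # separate blank classifier

    def forward(self, enc: torch.Tensor, dec_aligned: torch.Tensor):
        # enc:         (B, T, enc_dim)  encoder output, one vector per frame
        # dec_aligned: (B, T, dec_dim)  decoder output gathered at the label position
        #                               assigned to each frame by CTC forced alignment
        joint = torch.tanh(self.proj(torch.cat([enc, dec_aligned], dim=-1)))
        nonblank_logits = self.nonblank_out(joint)             # (B, T, V)
        # detach() truncates the gradient from the blank classifier to the main
        # network, as described in the abstract, so blank-dominated frames do not
        # dominate the encoder/decoder updates.
        blank_logit = self.blank_out(joint.detach())           # (B, T, 1)
        return nonblank_logits, blank_logit

if __name__ == "__main__":
    B, T, E, D, J, V = 2, 50, 256, 256, 320, 4000  # illustrative sizes
    joint = FrameLevelJoint(E, D, J, V)
    enc = torch.randn(B, T, E)
    dec_aligned = torch.randn(B, T, D)  # stand-in for alignment-gathered decoder states
    nonblank_logits, blank_logit = joint(enc, dec_aligned)
    print(nonblank_logits.shape, blank_logit.shape)  # (2, 50, 4000) (2, 50, 1)
```

Because the joint is computed once per frame rather than once per (frame, label) pair, the memory and compute savings over the standard transducer grow with the label sequence length, which is the reduction the abstract claims.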