CIF-T: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition (2307.14132v4)
Abstract: RNN-T models are widely used in ASR and rely on the RNN-T loss to align the lengths of the input audio and the target sequence. However, the implementation complexity of the RNN-T loss leads to computational redundancy, and its alignment-based optimization target reduces the role of the predictor network. In this paper, we propose CIF-Transducer (CIF-T), a novel model that incorporates the Continuous Integrate-and-Fire (CIF) mechanism into the RNN-T architecture to achieve efficient alignment. This allows the RNN-T loss to be abandoned, reducing computation and giving the predictor network a more significant role. We further introduce Funnel-CIF, Context Blocks, a Unified Gating and Bilinear Pooling joint network, and an auxiliary training strategy to improve performance. Experiments on the 178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art results with lower computational overhead than RNN-T models.
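The alignment idea the abstract relies on is the standard CIF scheme: a scalar weight is predicted for each encoder frame, the weights are accumulated along time, and whenever the running sum reaches a threshold (typically 1.0) the weighted encoder states integrated so far are emitted ("fired") as one label-level representation, giving a monotonic audio-to-label alignment without the RNN-T loss. The sketch below is a minimal illustration of that integrate-and-fire step, not the authors' implementation; the function name, tensor shapes, and threshold value are assumptions for illustration.

```python
# Minimal sketch of the Continuous Integrate-and-Fire (CIF) mechanism.
# Not the CIF-T authors' code; names and shapes are illustrative assumptions.
import torch


def cif(encoder_out: torch.Tensor, alpha: torch.Tensor, threshold: float = 1.0):
    """Integrate frame-level states into label-level states.

    encoder_out: (T, D) acoustic encoder outputs for one utterance.
    alpha:       (T,)   non-negative per-frame weights (e.g. from a sigmoid).
    Returns a (U, D) tensor with one fired vector per emitted label.
    """
    fired = []
    accum_weight = 0.0                               # running sum of weights
    accum_state = torch.zeros_like(encoder_out[0])   # integrated embedding

    for t in range(encoder_out.size(0)):
        a = alpha[t].item()
        if accum_weight + a < threshold:
            # Keep integrating: not enough accumulated weight for a label yet.
            accum_weight += a
            accum_state = accum_state + a * encoder_out[t]
        else:
            # Fire: spend only the weight needed to reach the threshold,
            # then carry the remainder of this frame into the next label.
            needed = threshold - accum_weight
            fired.append(accum_state + needed * encoder_out[t])
            remainder = a - needed
            accum_weight = remainder
            accum_state = remainder * encoder_out[t]

    if fired:
        return torch.stack(fired)
    return encoder_out.new_zeros(0, encoder_out.size(1))


# Example usage with random tensors.
enc = torch.randn(50, 256)             # 50 frames, 256-dim encoder states
w = torch.sigmoid(torch.randn(50))     # per-frame weights in (0, 1)
labels = cif(enc, w)
print(labels.shape)                    # roughly (floor(sum(w)), 256)
```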
Authors: Tian-Hao Zhang, Dinghao Zhou, Guiping Zhong, Jiaming Zhou, Baoxiang Li