CIF-T: A Novel CIF-based Transducer Architecture for Automatic Speech Recognition (2307.14132v4)

Published 26 Jul 2023 in cs.SD, cs.CL, and eess.AS

Abstract: RNN-T models are widely used in ASR and rely on the RNN-T loss to achieve length alignment between the input audio and the target sequence. However, the implementation complexity and the alignment-based optimization target of the RNN-T loss lead to computational redundancy and a reduced role for the predictor network, respectively. In this paper, we propose a novel model named CIF-Transducer (CIF-T), which incorporates the Continuous Integrate-and-Fire (CIF) mechanism into the RNN-T model to achieve efficient alignment. In this way, the RNN-T loss is abandoned, reducing computation and allowing the predictor network to play a more significant role. We also introduce Funnel-CIF, Context Blocks, a Unified Gating and Bilinear Pooling joint network, and an auxiliary training strategy to further improve performance. Experiments on the 178-hour AISHELL-1 and 10000-hour WenetSpeech datasets show that CIF-T achieves state-of-the-art results with lower computational overhead compared to RNN-T models.
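The core of the CIF mechanism is a weight-accumulation loop: the encoder emits a scalar weight for each acoustic frame, the weights are integrated over time, and whenever the running sum crosses a threshold (typically 1.0) the weighted sum of the accumulated frames is "fired" as one label-level representation, with the leftover weight carried into the next unit. This yields a label-synchronous alignment without the frame-by-label lattice that the RNN-T loss marginalizes over. Below is a minimal sketch of that fire step, assuming NumPy arrays; the function name cif_fire, the threshold default, and the toy inputs are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of the CIF fire step (illustrative only).
import numpy as np

def cif_fire(encoder_frames: np.ndarray, alphas: np.ndarray, threshold: float = 1.0):
    """Integrate frame features weighted by alphas; emit ("fire") one
    label-level vector each time the accumulated weight reaches the threshold."""
    fired = []                                       # label-synchronous outputs
    acc_weight = 0.0                                 # integrated weight so far
    acc_state = np.zeros(encoder_frames.shape[1])    # integrated feature so far
    for h_t, a_t in zip(encoder_frames, alphas):
        if acc_weight + a_t < threshold:
            # No boundary yet: keep integrating this frame.
            acc_weight += a_t
            acc_state += a_t * h_t
        else:
            # Boundary falls inside this frame: split its weight.
            part = threshold - acc_weight            # portion that completes the current unit
            fired.append(acc_state + part * h_t)
            acc_weight = a_t - part                  # remainder starts the next unit
            acc_state = acc_weight * h_t
    return np.stack(fired) if fired else np.empty((0, encoder_frames.shape[1]))

# Toy usage: 6 frames of 4-dim features whose weights sum to about 2,
# so two label-level vectors are fired.
frames = np.random.randn(6, 4)
weights = np.array([0.3, 0.4, 0.5, 0.2, 0.4, 0.3])
units = cif_fire(frames, weights)
print(units.shape)   # (2, 4) for this weight pattern

In the full CIF-T model these fired representations, rather than every encoder frame, are passed to the joint network, which is what removes the alignment lattice and lets the model be trained with a simple cross-entropy objective instead of the RNN-T loss.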

Authors (5)
  1. Tian-Hao Zhang (9 papers)
  2. Dinghao Zhou (7 papers)
  3. Guiping Zhong (3 papers)
  4. Jiaming Zhou (41 papers)
  5. Baoxiang Li (6 papers)
Citations (2)