Multi-blank Transducers for Speech Recognition (2211.03541v2)

Published 4 Nov 2022 in eess.AS, cs.LG, and cs.SD

Abstract: This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols, which consume two or more input frames when emitted. We refer to the added symbols as big blanks, and the method multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method in order to prioritize emissions of big blanks. With experiments on multiple languages and datasets, we show that multi-blank RNN-T methods could bring relative speedups of over +90%/+139% to model inference for English Librispeech and German Multilingual Librispeech datasets, respectively. The multi-blank RNN-T method also improves ASR accuracy consistently. We will release our implementation of the method in the NeMo (https://github.com/NVIDIA/NeMo) toolkit.
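The inference-time change described in the abstract is that a big blank, once emitted, advances the input by two or more frames instead of one, which is where the decoding speedup comes from. The sketch below illustrates this idea with a greedy decoding loop; the `decoder` and `joint` callables, the vocabulary layout (big blanks placed right after the standard blank), and the `big_blank_durations` mapping are illustrative assumptions, not the NeMo implementation.

```python
import torch


def multi_blank_greedy_decode(encoder_out, decoder, joint,
                              blank_id, big_blank_durations,
                              max_symbols_per_frame=10):
    """Greedy decoding sketch for a multi-blank transducer (illustrative only).

    encoder_out: (T, D) acoustic embeddings for one utterance.
    blank_id: index of the standard (one-frame) blank symbol.
    big_blank_durations: e.g. [2, 4, 8]; the big blank at vocabulary index
        blank_id + 1 + i is assumed to consume big_blank_durations[i] frames.
    """
    T = encoder_out.size(0)
    hyp = []
    dec_out, state = decoder(tokens=None, state=None)   # start-of-sequence prediction step
    t, symbols_on_frame = 0, 0
    while t < T:
        logits = joint(encoder_out[t], dec_out)          # (V,) joint-network logits
        k = int(logits.argmax())
        if k == blank_id:
            t += 1                                       # standard blank: consume one frame
            symbols_on_frame = 0
        elif k > blank_id:
            t += big_blank_durations[k - blank_id - 1]   # big blank: skip several frames at once
            symbols_on_frame = 0
        else:
            hyp.append(k)                                # real token: stay on the same frame
            dec_out, state = decoder(tokens=k, state=state)
            symbols_on_frame += 1
            if symbols_on_frame >= max_symbols_per_frame:
                t += 1                                   # safety cap against non-advancing loops
                symbols_on_frame = 0
    return hyp
```

Because every big-blank emission replaces several standard blank emissions, the loop performs fewer joint-network evaluations per utterance, consistent with the reported speedups.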
