Multi-blank Transducers for Speech Recognition (2211.03541v2)
Abstract: This paper proposes a modification to RNN-Transducer (RNN-T) models for automatic speech recognition (ASR). In standard RNN-T, the emission of a blank symbol consumes exactly one input frame; in our proposed method, we introduce additional blank symbols that consume two or more input frames when emitted. We refer to the added symbols as big blanks, and to the method as multi-blank RNN-T. For training multi-blank RNN-Ts, we propose a novel logit under-normalization method to prioritize emissions of big blanks. In experiments on multiple languages and datasets, we show that multi-blank RNN-T brings relative inference speedups of over 90% and 139% on the English LibriSpeech and German Multilingual LibriSpeech datasets, respectively, while also consistently improving ASR accuracy. We will release our implementation of the method in the NeMo (https://github.com/NVIDIA/NeMo) toolkit.
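To make the frame-skipping mechanism concrete, below is a minimal greedy-decoding sketch. The symbol layout, durations, and the `joint(...)` callable are illustrative assumptions rather than the NeMo implementation; the point is only that emitting a big blank advances the time index by several frames, so fewer joint evaluations are needed per utterance.

```python
import numpy as np

# Minimal sketch of greedy decoding with big blanks (illustrative only; not the NeMo API).
# Assumed layout: blank symbols occupy ids 0..NUM_BLANKS-1, each with a duration in frames;
# regular tokens use the remaining ids.
BLANK_DURATIONS = {0: 1, 1: 2, 2: 4}   # id -> frames consumed: standard blank plus two big blanks
NUM_BLANKS = len(BLANK_DURATIONS)

def greedy_decode(joint, encoder_out, max_symbols_per_step=3):
    """joint(enc_frame, hyp) -> logits over [blanks..., tokens...] (assumed signature)."""
    hyp = []
    t, T = 0, len(encoder_out)
    while t < T:
        for _ in range(max_symbols_per_step):
            k = int(np.argmax(joint(encoder_out[t], hyp)))
            if k < NUM_BLANKS:
                t += BLANK_DURATIONS[k]   # a big blank skips several frames -> fewer joint calls
                break
            hyp.append(k)                 # regular token: emit and stay on the current frame
        else:
            t += 1                        # symbol cap reached without a blank: force progress
    return hyp

# Toy usage: a dummy joint that always emits the duration-4 big blank,
# so a 16-frame input needs only 4 joint evaluations.
dummy_joint = lambda frame, hyp: np.array([0.0, 0.0, 1.0, -1.0])
print(greedy_decode(dummy_joint, np.zeros((16, 8))))   # -> []
```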