A Token-Wise Beam Search Algorithm for RNN-T (2302.14357v2)
Abstract: Standard Recurrent Neural Network Transducer (RNN-T) decoding algorithms for speech recognition iterate over the time axis, decoding one time step before moving on to the next. These algorithms make a large number of calls to the joint network, which previous work has shown to be an important factor in reducing decoding speed. We present a beam search decoding algorithm that batches the joint network calls across a segment of time steps, which yields decoding speedups of 20%-96% consistently across all models and settings we experimented with. In addition, aggregating emission probabilities over a segment may be seen as a better approximation to finding the most likely model output. As a result, our algorithm improves oracle word error rate by up to 11% relative as the segment size increases, and slightly improves general word error rate.
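The two ideas in the abstract, scoring a whole segment of time steps with one joint network call and aggregating emission probabilities over that segment, can be illustrated with a small NumPy sketch. This is not the paper's implementation: `toy_joint`, `segment_token_log_probs`, the tensor shapes, the blank index 0, and the aggregation rule (a token's segment score is the sum over frames of "only blanks so far, then this token") are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def toy_joint(enc_frames, pred_state, W):
    """Hypothetical stand-in for a joint network (assumption, not a real API).

    enc_frames: [S, D] encoder outputs for S time steps
    pred_state: [D] prediction network output for one hypothesis
    W:          [D, V] output projection
    Returns log-probabilities of shape [S, V] from a single call.
    """
    return log_softmax(np.tanh(enc_frames + pred_state) @ W)

def segment_token_log_probs(log_probs, blank=0):
    """Aggregate per-frame log-probs [S, V] into one score per token.

    For a non-blank token v, the score is the log of
    sum_s (prod_{j<s} P(blank at j)) * P(v at s),
    i.e. the probability that v is the first token emitted in the segment.
    The blank entry holds the log-probability of emitting only blanks.
    """
    blank_lp = log_probs[:, blank]
    # Log-probability of only blanks before frame s (0 for the first frame).
    prefix = np.concatenate([[0.0], np.cumsum(blank_lp)[:-1]])
    per_frame = prefix[:, None] + log_probs            # [S, V]
    scores = np.logaddexp.reduce(per_frame, axis=0)    # [V]
    scores[blank] = blank_lp.sum()                     # no token in segment
    return scores

# Toy usage: a segment of 4 frames, 8-dim features, 5-token vocabulary.
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8))
pred = rng.normal(size=8)
W = rng.normal(size=(8, 5))

segment_log_probs = toy_joint(enc, pred, W)   # one call instead of four
scores = segment_token_log_probs(segment_log_probs)
```

Under these assumptions the aggregated scores form a proper distribution over "next token emitted in the segment, or none", so hypotheses can be ranked per token rather than per time step.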
Author: Gil Keren