A Token-Wise Beam Search Algorithm for RNN-T (2302.14357v2)

Published 28 Feb 2023 in cs.LG, cs.SD, and eess.AS

Abstract: Standard decoding algorithms for Recurrent Neural Network Transducers (RNN-T) in speech recognition iterate over the time axis, decoding one time step before moving on to the next. These algorithms result in a large number of calls to the joint network, which previous work has shown to be an important factor in slowing decoding. We present a beam search decoding algorithm that batches the joint network calls across a segment of time steps, yielding 20%-96% decoding speedups consistently across all models and settings tested. In addition, aggregating emission probabilities over a segment can be seen as a better approximation to finding the most likely model output; as the segment size increases, our algorithm improves oracle word error rate by up to 11% relative and slightly improves overall word error rate.
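The core idea described in the abstract, replacing one joint-network call per time step with a single batched call over a segment of encoder frames, can be illustrated with a small sketch. The snippet below is a minimal illustration only, not the paper's algorithm: the joint and prediction networks are random linear stand-ins, and the segment size, the aggregation rule (log-sum-exp over frames), and all names and shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, BLANK = 32, 16, 0                      # assumed vocab size, hidden size, blank id
W_enc = rng.normal(size=(H, V))              # stand-in joint-network weights
W_pred = rng.normal(size=(H, V))

def joint(enc, pred):
    """Stand-in joint network: accepts one frame (H,) or a segment (S, H),
    returns log-softmax scores over the vocabulary."""
    logits = enc @ W_enc + pred @ W_pred
    return logits - np.logaddexp.reduce(logits, axis=-1, keepdims=True)

def predictor(prefix):
    """Stand-in prediction network: a fixed embedding of the last emitted label."""
    last = prefix[-1] if prefix else BLANK
    vec = np.zeros(H)
    vec[last % H] = 1.0
    return vec

enc = rng.normal(size=(8, H))                # 8 stand-in encoder frames
prefix = [5, 7]                              # labels of the current hypothesis

# Standard time-synchronous decoding: one joint call per encoder frame.
per_frame_scores = [joint(enc[t], predictor(prefix)) for t in range(len(enc))]

# Segment-batched variant: a single joint call for a 4-frame segment, then
# emission scores are aggregated over the segment (log-sum-exp across frames)
# before choosing which token to extend the hypothesis with.
segment = enc[0:4]
batched_scores = joint(segment, predictor(prefix))        # shape (4, V)
aggregated = np.logaddexp.reduce(batched_scores, axis=0)  # shape (V,)
best_label = int(np.argmax(aggregated[1:]) + 1)           # best non-blank label

print("joint calls, per-frame decoding  :", len(per_frame_scores))
print("joint calls, segment-batched     :", 1)
print("best non-blank label over segment:", best_label)
```

The intended takeaway, following the abstract, is that the batched variant scores all frames of a segment in one matrix call (fewer, larger joint invocations), and that aggregating emission scores over the segment is what drives the reported oracle word error rate improvements.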
