Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models (2402.17184v1)

Published 27 Feb 2024 in cs.CL, cs.SD, and eess.AS

Abstract: The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to improve as they are scaled to larger sizes, with some now reaching billions of parameters. Widespread deployment and adoption of these models, however, requires computationally efficient strategies for decoding. In the present work, we study one such strategy: applying multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames. While similar techniques have been investigated in previous work, we achieve dramatically more reduction than has previously been demonstrated through the use of multiple funnel reduction layers. Through ablations, we study the impact of various architectural choices in the encoder to identify the most effective strategies. We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task, while improving encoder and decoder latencies by 48% and 92% respectively, relative to a strong but computationally expensive baseline.
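The core idea in the abstract, compressing the encoder output frame rate by stacking several reduction layers, can be illustrated with a simplified sketch. The paper uses funnel-style attention pooling interleaved with encoder blocks; the version below substitutes plain strided average pooling and made-up frame rates (10 ms input features, eight 2x reduction steps for a 256x total reduction, giving one output frame per 2.56 s) purely to show the arithmetic.

```python
import numpy as np

def frame_reduce(frames: np.ndarray, factor: int = 2) -> np.ndarray:
    """One reduction step: average-pool groups of `factor` consecutive
    frames along the time axis, halving (by default) the frame rate.
    A stand-in for the funnel pooling used in the paper."""
    T, D = frames.shape
    T_out = T // factor  # drop any trailing remainder frames
    return frames[: T_out * factor].reshape(T_out, factor, D).mean(axis=1)

# Hypothetical setup: 80-dim log-mel features at 100 frames/sec (10 ms hop),
# 25.6 seconds of audio -> 2560 input frames.
x = np.random.randn(2560, 80)

# Eight 2x reductions = 2^8 = 256x total, so each output frame
# covers 256 * 10 ms = 2.56 s of input speech.
for _ in range(8):
    x = frame_reduce(x)

print(x.shape)  # (10, 80): ten output frames for 25.6 s of audio
```

In the actual model these reductions are spread through the encoder stack rather than applied back-to-back, so later layers attend over progressively shorter sequences; that is where the reported 48% encoder and 92% decoder latency improvements come from, since decoder cost scales with the number of encoder output frames.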
