USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models (2312.08553v3)

Published 13 Dec 2023 in eess.AS and cs.SD

Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N:M structured sparsity aware paradigm on the model weights, reducing the model complexity from parameter precision and matrix topology perspectives. We conducted extensive experiments with a 2-billion parameter USM on a large-scale voice search dataset to evaluate our proposed method. A series of ablation studies validate the effectiveness of up to int4 quantization and 2:4 sparsity. However, a single compression technique fails to recover the performance well under extreme setups including int2 quantization and 1:4 sparsity. By contrast, our proposed method can compress the model to 9.4% of its original size, at the cost of only a 7.3% relative word error rate (WER) regression. We also provide in-depth analyses of the results and discuss the limitations and potential solutions, which would be valuable for future studies.
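
As a rough illustration (not the authors' implementation), the NumPy sketch below shows the two compression primitives the abstract combines: an N:M structured sparsity mask (e.g. 2:4, keeping the 2 largest-magnitude weights in every group of 4) and symmetric low-bit "fake" quantization (e.g. int4). The function names, per-tensor scaling, and magnitude-based selection are assumptions made for exposition only; the paper applies these jointly during fine-tuning of a 2-billion parameter USM.

```python
import numpy as np

def nm_sparsity_mask(weights, n=2, m=4):
    """Build an N:M mask: keep the n largest-magnitude entries in each
    consecutive group of m weights and zero the rest
    (assumes weights.size is divisible by m). Illustrative sketch only."""
    groups = weights.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group.
    drop = np.argsort(np.abs(groups), axis=1)[:, : m - n]
    mask = np.ones_like(groups)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return mask.reshape(weights.shape)

def fake_quantize(weights, bits=4):
    """Symmetric per-tensor fake quantization: round onto an integer grid
    with 2**(bits-1) - 1 positive levels, then rescale back to float."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for int4
    scale = np.max(np.abs(weights)) / qmax
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale

# Toy usage: 2:4 sparsity followed by int4 fake quantization of one weight matrix.
w = np.random.randn(8, 8).astype(np.float32)
w_compressed = fake_quantize(w * nm_sparsity_mask(w, n=2, m=4), bits=4)
```

In quantization- and sparsity-aware fine-tuning, such masking and rounding are typically applied only in the forward pass, with gradients flowing to the underlying float weights via a straight-through estimator.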

Authors (13)
  1. Shaojin Ding (12 papers)
  2. David Rim (4 papers)
  3. Yanzhang He (41 papers)
  4. Oleg Rybakov (15 papers)
  5. Bo Li (1107 papers)
  6. Rohit Prabhavalkar (59 papers)
  7. Weiran Wang (65 papers)
  8. Tara N. Sainath (79 papers)
  9. Shivani Agrawal (11 papers)
  10. Zhonglin Han (3 papers)
  11. Jian Li (667 papers)
  12. Amir Yazdanbakhsh (38 papers)
  13. David Qiu (12 papers)
Citations (6)
