Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition (2309.07988v3)

Published 14 Sep 2023 in cs.LG, cs.AR, cs.SD, and eess.AS

Abstract: Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer inference, typically for long-context applications, center on simplifying attention score calculations. However, streaming speech recognition models usually process a limited number of tokens each time, making attention score calculation less of a bottleneck. Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, which constitute a substantial portion of the model size and contribute significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique targeting these linear layers, significantly reducing model size and improving memory and power efficiency. Experiments on on-device Transformer-based streaming speech recognition models show that folding attention reduces model size (and corresponding memory consumption) by up to 24% and power consumption by up to 23%, all without degrading model accuracy or increasing computation overhead.
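
The abstract's core observation, that the projection layers rather than the attention scores dominate cost when the per-step token count is small, can be illustrated with a minimal sketch. The sketch below assumes, based only on the abstract's description, that "folding" splits the embedding dimension by a fold factor and moves the resulting folds onto the token axis, so the query/key/value/output projections shrink while attention runs over proportionally more tokens. The class name FoldedSelfAttention, the fold parameter, and the exact reshaping rule are illustrative assumptions, not the paper's implementation.

    # Hypothetical sketch of a "folding" self-attention block in PyTorch.
    # Assumption: a fold factor f splits the embedding dimension and stacks the
    # resulting folds along the token axis, so the Q/K/V/output projections
    # shrink from (d x d) to (d/f x d/f) while attention sees f-times more tokens.
    import torch
    import torch.nn as nn

    class FoldedSelfAttention(nn.Module):
        def __init__(self, embed_dim: int, num_heads: int, fold: int):
            super().__init__()
            assert embed_dim % fold == 0, "embed_dim must be divisible by fold"
            inner_dim = embed_dim // fold  # reduced projection width
            assert inner_dim % num_heads == 0, "folded dim must split across heads"
            self.fold = fold
            # Standard multi-head attention, but over the smaller folded dimension.
            self.attn = nn.MultiheadAttention(inner_dim, num_heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, tokens, embed_dim)
            b, t, d = x.shape
            # Fold the embedding dimension into the token axis:
            # (b, t, d) -> (b, t * fold, d / fold)
            x_folded = x.reshape(b, t * self.fold, d // self.fold)
            y, _ = self.attn(x_folded, x_folded, x_folded, need_weights=False)
            # Unfold back to the original shape.
            return y.reshape(b, t, d)

    if __name__ == "__main__":
        layer = FoldedSelfAttention(embed_dim=512, num_heads=4, fold=4)
        x = torch.randn(2, 16, 512)   # small token count, as in streaming ASR
        print(layer(x).shape)         # torch.Size([2, 16, 512])

In this sketch with embed_dim 512 and fold 4, the four projection matrices shrink from 4·512² to 4·128² parameters, while the attention-score computation grows with the extra folded tokens; that trade stays cheap precisely because streaming recognition processes only a few tokens per step, which is the regime the abstract describes.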
