Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition (2309.07988v3)
Abstract: Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer inference, typically for long-context applications, center on simplifying attention score calculations. However, streaming speech recognition models usually process a limited number of tokens each time, making attention score calculation less of a bottleneck. Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, which constitute a substantial portion of the model size and contribute significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique targeting these linear layers that significantly reduces model size and improves memory and power efficiency. Experiments on on-device Transformer-based streaming speech recognition models show that folding attention reduces model size (and corresponding memory consumption) by up to 24% and power consumption by up to 23%, all without compromising model accuracy or increasing computation overhead.
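The abstract's claim that the linear projections, rather than the attention-score computation, dominate on-device cost can be made concrete with a back-of-the-envelope count. The sketch below is not the paper's implementation; it simply tallies parameters and multiply-accumulates for one standard Transformer layer under illustrative, assumed dimensions (d_model=512, a 4x feedforward expansion, 32 tokens per streaming step). With only a few dozen tokens per step, the projections account for the bulk of both parameters and compute.

```python
# Minimal sketch (illustrative dimensions, not the paper's model configuration):
# parameter and MAC counts for one standard Transformer layer, showing why the
# linear projections in multi-head attention and the feedforward network dominate
# when few tokens are processed per step, as in streaming speech recognition.

def layer_stats(d_model: int, ffn_mult: int, tokens: int):
    # Multi-head attention: Q, K, V, and output projections, each d_model x d_model.
    attn_proj_params = 4 * d_model * d_model
    # Feedforward network: two linear layers, d_model -> ffn_mult*d_model -> d_model.
    ffn_params = 2 * d_model * (ffn_mult * d_model)

    # Multiply-accumulates per step (biases and normalization omitted).
    proj_macs = tokens * (attn_proj_params + ffn_params)
    # Attention-score computation: QK^T plus the weighted sum over values.
    score_macs = 2 * tokens * tokens * d_model

    return attn_proj_params + ffn_params, proj_macs, score_macs


params, proj_macs, score_macs = layer_stats(d_model=512, ffn_mult=4, tokens=32)
print(f"linear-layer params per layer: {params / 1e6:.1f}M")   # ~3.1M
print(f"MACs in linear projections:    {proj_macs / 1e6:.1f}M")  # ~100.7M
print(f"MACs in attention scores:      {score_macs / 1e6:.1f}M")  # ~1.0M
```

Under these assumed dimensions the projections require roughly two orders of magnitude more multiply-accumulates than the attention scores, which is the imbalance folding attention targets.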