
SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding (2307.07421v3)

Published 12 Jul 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "SummaryMixing". Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.

References (42)
  1. IoT-based smart cities: A survey. In 2016 IEEE 16th International Conference on Environment and Electrical Engineering (EEEIC), pp. 1–6. IEEE, 2016.
  2. Common Voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 4218–4222, 2020.
  3. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
  4. SLURP: A spoken language understanding resource package. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7252–7262, 2020.
  5. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  6. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5. IEEE, 2017.
  7. Efficient Conformer: Progressive downsampling and grouped attention for automatic speech recognition. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 8–15. IEEE, 2021.
  8. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  9. PointMixer: MLP-Mixer for point cloud understanding. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII, pp. 620–640. Springer, 2022.
  10. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
  11. Connectionist temporal classification. Supervised Sequence Labelling with Recurrent Neural Networks, pp. 61–93, 2012.
  12. Conformer: Convolution-augmented transformer for speech recognition. Proc. Interspeech 2020, pp. 5036–5040, 2020.
  13. Recent developments on ESPnet toolkit boosted by Conformer. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5874–5878. IEEE, 2021.
  14. ContextNet: Improving convolutional neural networks for automatic speech recognition with global context. Proc. Interspeech 2020, pp. 3610–3614, 2020.
  15. A study of BFLOAT16 for deep learning training. arXiv preprint arXiv:1905.12322, 2019.
  16. A comparative study on Transformer vs RNN in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 449–456. IEEE, 2019.
  17. E-Branchformer: Branchformer with enhanced merging for speech recognition. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 84–91. IEEE, 2023.
  18. Squeezeformer: An efficient transformer for automatic speech recognition. Advances in Neural Information Processing Systems, 35:9361–9373, 2022.
  19. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839. IEEE, 2017.
  20. HyperMixer: An MLP-based low cost alternative to transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15632–15654, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.871. URL https://aclanthology.org/2023.acl-long.871.
  21. Speech recognition using deep neural networks: A systematic review. IEEE Access, 7:19143–19165, 2019.
  22. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. IEEE, 2015.
  23. The energy and carbon footprint of training end-to-end speech recognizers. In Proc. Interspeech 2021, pp. 4583–4587, 2021. doi: 10.21437/Interspeech.2021-456.
  24. Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In International Conference on Machine Learning, pp. 17627–17643. PMLR, 2022.
  25. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022.
  26. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624, 2021.
  27. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In LREC, pp. 3935–3939, 2014.
  28. MLP-based architecture with variable length input for automatic speech recognition. OpenReview, 2021.
  29. Conformer-based speech recognition with linear Nyström attention and rotary position embedding. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8012–8016. IEEE, 2022.
  30. Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6783–6787. IEEE, 2021.
  31. Understanding the role of self attention for efficient speech recognition. In International Conference on Learning Representations, 2022.
  32. Locality matters: A locality-biased linear attention for automatic speech recognition. arXiv preprint arXiv:2203.15609, 2022.
  33. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022.
  34. MLP-Mixer: An all-MLP architecture for vision. In Advances in Neural Information Processing Systems, 2021.
  35. Attention is all you need. In Advances in Neural Information Processing Systems, 2017a.
  36. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017b.
  37. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
  38. Fastformer: Additive attention can be all you need. arXiv preprint arXiv:2108.09084, 2021.
  39. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
  40. WeNet 2.0: More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455, 2022.
  41. Poolingformer: Long document modeling with pooling attention. In International Conference on Machine Learning, pp. 12437–12446. PMLR, 2021a.
  42. On the usefulness of self-attention for automatic speech recognition with transformers. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 89–96. IEEE, 2021b.

Summary

  • The paper introduces SummaryMixing, a method that reduces computational complexity from quadratic to linear by using a global summary vector for token mixing.
  • It demonstrates training and inference that are up to 28% faster, with roughly half the memory usage, compared to conventional multi-head self-attention in ASR systems.
  • The approach maintains ASR accuracy across diverse datasets, paving the way for efficient speech recognition on resource-constrained devices.

SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

The paper describes a novel approach, termed "SummaryMixing," which serves as a linear-complexity alternative to multi-head self-attention (MHSA) in automatic speech recognition (ASR) systems. This method addresses the inherent computational inefficiency of self-attention, which scales quadratically with the input sequence length, posing significant challenges in terms of training time and memory consumption, especially for long sequences and resource-constrained environments.
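
To make the scaling argument concrete, the per-layer token-mixing cost can be sketched as follows, with T the number of acoustic frames and d the feature dimension. This is a back-of-the-envelope comparison implied by the abstract, ignoring constant factors and the per-frame projections that both approaches share.

```latex
% Back-of-the-envelope token-mixing cost per layer, constants omitted.
% T = number of acoustic frames, d = feature dimension.
\underbrace{\mathcal{O}\!\left(T^{2} d\right)}_{\text{self-attention: all pairwise frame interactions}}
\quad \text{vs.} \quad
\underbrace{\mathcal{O}\!\left(T d\right)}_{\text{SummaryMixing: one mean vector reused at every frame}}
```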

Key Contributions

The research introduces SummaryMixing, which condenses an entire speech utterance into a single summary vector obtained by averaging per-frame contributions across all time steps. This summary is then combined with time-specific information at each frame to produce the output. Because the summary is computed once and reused at every frame, SummaryMixing reduces the complexity of token mixing from quadratic to linear in the sequence length, yielding substantial gains in training and inference speed as well as memory use; a simplified sketch of such a layer follows.
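
The mechanism described above maps onto a small neural module. The PyTorch sketch below is a minimal illustration based on the abstract's description, not the authors' implementation; the class name, layer sizes, and the choice of GELU activations are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SummaryMixingSketch(nn.Module):
    """Illustrative SummaryMixing-style cell (hypothetical reimplementation).

    Each frame is transformed locally, the whole utterance is summarised by a
    mean over per-frame summary projections, and the two streams are
    concatenated and mixed, giving cost linear in the sequence length.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.summary = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.combine = nn.Sequential(nn.Linear(2 * d_hidden, d_model), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        local = self.local(x)                             # per-frame information
        summary = self.summary(x).mean(dim=1)             # one vector per utterance
        summary = summary.unsqueeze(1).expand_as(local)   # broadcast over time
        return self.combine(torch.cat([local, summary], dim=-1))

# Minimal usage example with random features standing in for acoustic frames.
if __name__ == "__main__":
    layer = SummaryMixingSketch(d_model=256, d_hidden=512)
    frames = torch.randn(4, 1000, 256)   # 4 utterances, 1000 frames each
    out = layer(frames)
    print(out.shape)                     # torch.Size([4, 1000, 256])
```

A real implementation would additionally mask padded frames before taking the mean so that padding does not dilute the utterance summary.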

The authors evaluate SummaryMixing in state-of-the-art ASR models and report substantial reductions in resource usage: training and inference were up to 28% faster and memory use was roughly halved compared to models using MHSA, without degrading accuracy across five datasets of varying linguistic and acoustic conditions. The scaling behaviour behind these savings can be checked with a rough timing experiment like the one below.
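
The following snippet is a rough, hardware-dependent sanity check of the scaling argument, not the paper's benchmark: the "summary-mixing" branch here is a toy stand-in (a mean plus a small combiner), not the authors' layer, and absolute timings will differ across machines.

```python
import time
import torch
import torch.nn as nn

torch.set_grad_enabled(False)  # forward-only timing

d_model, batch = 256, 4
mhsa = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
mix = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.GELU())  # toy combiner

def time_fn(fn, reps=3):
    fn()  # warm-up
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

for frames in (500, 1000, 2000, 4000):
    x = torch.randn(batch, frames, d_model)
    t_attn = time_fn(lambda: mhsa(x, x, x, need_weights=False))
    t_mix = time_fn(lambda: mix(torch.cat(
        [x, x.mean(dim=1, keepdim=True).expand_as(x)], dim=-1)))
    print(f"T={frames:5d}  MHSA {t_attn*1e3:7.1f} ms   summary-mixing {t_mix*1e3:7.1f} ms")
```

Because the mean is computed once per utterance, the cost of the summary branch grows linearly with T, whereas the attention map grows with T squared, so the gap widens on long utterances.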

Comparative Analysis with Existing Work

The paper situates SummaryMixing within the broader literature surrounding efficient alternatives to self-attention, such as low-rank approximations, linearization techniques, and sparsification methods. However, existing approaches have not been able to consistently match the performance of self-attention-equipped systems in ASR contexts.

SummaryMixing draws inspiration from previous works suggesting that self-attention's pairwise operations might act similarly to simple linear operations under certain circumstances. Moreover, the method leverages insights from the HyperMixer framework, extending concepts from the MLP-Mixer to handle variable-length sequence processing more effectively. The comparison with established models like Fastformer indicates that SummaryMixing is among the most effective linear alternatives for token mixing in speech-processing models.

Implications and Future Directions

The implications of adopting SummaryMixing in speech recognition are significant. By achieving comparable or superior performance with lower computational demands, SummaryMixing provides a pathway to deploy efficient ASR models on edge devices where computational resources are limited. Additionally, the methodology can be adapted to other speech-processing tasks like spoken language understanding (SLU) and keyword spotting with promising results.

From a theoretical perspective, the work challenges the prevalent view that MHSA is indispensable for capturing complex interactions in ASR systems. The empirical evidence supporting the utility of the global summary vector suggests that much of the essential information for high-level acoustic modeling can be concentrated into more compact representations.

Future research may focus on refining the mathematical and architectural frameworks of SummaryMixing to further enhance its generalizability and robustness across diverse applications in natural language processing and speech understanding. Additionally, investigating the applicability of SummaryMixing to multi-modal and multi-task learning settings could present new opportunities to optimize models for a broader array of input types while maintaining scalable performance.

In conclusion, SummaryMixing offers a compelling, low-complexity alternative to self-attention mechanisms in state-of-the-art speech recognition models, paving the way for more efficient, accessible, and environmentally sustainable AI technologies in speech and language processing.
