SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding (2307.07421v3)
Abstract: Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "SummaryMixing". Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.
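The abstract describes SummaryMixing as replacing pairwise token interactions with one utterance-level mean that is re-injected at every time step. Below is a minimal PyTorch-style sketch of that idea, not the authors' reference implementation: the module name, the hidden size, and the use of two linear branches plus concatenation are assumptions made for illustration; only the "mean over all time steps, combined with per-step features" structure is taken from the abstract.

```python
# Hedged sketch of the SummaryMixing idea (assumed layer names and sizes):
# a per-frame "local" transform is combined with a single utterance-level
# summary obtained by averaging over time, giving linear cost in sequence length.
import torch
import torch.nn as nn

class SummaryMixingSketch(nn.Module):
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())       # per-time-step transform
        self.summary = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())     # transform fed into the mean
        self.combine = nn.Sequential(nn.Linear(2 * hidden, dim), nn.GELU()) # merges local + summary info

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        local = self.local(x)                                   # (B, T, H)
        summary = self.summary(x).mean(dim=1, keepdim=True)     # (B, 1, H): one summary vector per utterance
        summary = summary.expand(-1, x.size(1), -1)             # broadcast the summary to every time step
        return self.combine(torch.cat([local, summary], dim=-1))

# Usage: a batch of 8 utterances, 200 frames, 80-dim features.
if __name__ == "__main__":
    layer = SummaryMixingSketch(dim=80)
    out = layer(torch.randn(8, 200, 80))
    print(out.shape)  # torch.Size([8, 200, 80])
```

Because the summary is a single mean over time, the cost per layer grows linearly with the number of frames, in contrast to the quadratic cost of self-attention noted in the abstract.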