
SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding (2307.07421v3)

Published 12 Jul 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption. Cheaper alternatives to self-attention for ASR have been developed, but they fail to consistently reach the same level of accuracy. This paper, therefore, proposes a novel linear-time alternative to self-attention. It summarises an utterance with the mean over vectors for all time steps. This single summary is then combined with time-specific information. We call this method "SummaryMixing". Introducing SummaryMixing in state-of-the-art ASR models makes it feasible to preserve or exceed previous speech recognition performance while making training and inference up to 28% faster and reducing memory use by half.

References (42)
  1. IoT-based smart cities: A survey. In 2016 IEEE 16th International Conference on Environment and Electrical Engineering (EEEIC), pp. 1–6. IEEE, 2016.
  2. Common Voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 4218–4222, 2020.
  3. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
  4. SLURP: A spoken language understanding resource package. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7252–7262, 2020.
  5. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  6. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5. IEEE, 2017.
  7. Efficient Conformer: Progressive downsampling and grouped attention for automatic speech recognition. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 8–15. IEEE, 2021.
  8. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  9. PointMixer: MLP-Mixer for point cloud understanding. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVII, pp. 620–640. Springer, 2022.
  10. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, 2019.
  11. Connectionist temporal classification. Supervised Sequence Labelling with Recurrent Neural Networks, pp. 61–93, 2012.
  12. Conformer: Convolution-augmented transformer for speech recognition. Proc. Interspeech 2020, pp. 5036–5040, 2020.
  13. Recent developments on ESPnet toolkit boosted by Conformer. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5874–5878. IEEE, 2021.
  14. ContextNet: Improving convolutional neural networks for automatic speech recognition with global context. Proc. Interspeech 2020, pp. 3610–3614, 2020.
  15. A study of BFLOAT16 for deep learning training. arXiv preprint arXiv:1905.12322, 2019.
  16. A comparative study on Transformer vs RNN in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 449–456. IEEE, 2019.
  17. E-Branchformer: Branchformer with enhanced merging for speech recognition. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 84–91. IEEE, 2023.
  18. Squeezeformer: An efficient transformer for automatic speech recognition. Advances in Neural Information Processing Systems, 35:9361–9373, 2022.
  19. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4835–4839. IEEE, 2017.
  20. HyperMixer: An MLP-based low cost alternative to transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15632–15654, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.871. URL https://aclanthology.org/2023.acl-long.871.
  21. Speech recognition using deep neural networks: A systematic review. IEEE Access, 7:19143–19165, 2019.
  22. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. IEEE, 2015.
  23. The energy and carbon footprint of training end-to-end speech recognizers. In Proc. Interspeech 2021, pp. 4583–4587, 2021. doi: 10.21437/Interspeech.2021-456.
  24. Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In International Conference on Machine Learning, pp. 17627–17643. PMLR, 2022.
  25. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022.
  26. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624, 2021.
  27. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In LREC, pp. 3935–3939, 2014.
  28. MLP-based architecture with variable length input for automatic speech recognition. OpenReview, 2021.
  29. Conformer-based speech recognition with linear Nyström attention and rotary position embedding. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8012–8016. IEEE, 2022.
  30. Emformer: Efficient memory transformer based acoustic model for low latency streaming speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6783–6787. IEEE, 2021.
  31. Understanding the role of self attention for efficient speech recognition. In International Conference on Learning Representations, 2022.
  32. Locality matters: A locality-biased linear attention for automatic speech recognition. arXiv preprint arXiv:2203.15609, 2022.
  33. Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28, 2022.
  34. MLP-Mixer: An all-MLP architecture for vision. In Advances in Neural Information Processing Systems, 2021.
  35. Attention is all you need. In Advances in Neural Information Processing Systems, 2017a.
  36. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017b.
  37. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
  38. Fastformer: Additive attention can be all you need. arXiv preprint arXiv:2108.09084, 2021.
  39. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297, 2020.
  40. WeNet 2.0: More productive end-to-end speech recognition toolkit. arXiv preprint arXiv:2203.15455, 2022.
  41. Poolingformer: Long document modeling with pooling attention. In International Conference on Machine Learning, pp. 12437–12446. PMLR, 2021a.
  42. On the usefulness of self-attention for automatic speech recognition with transformers. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 89–96. IEEE, 2021b.

Summary

  • The paper introduces SummaryMixing, a method that reduces computational complexity from quadratic to linear by using a global summary vector for token mixing.
  • It demonstrates training and inference that are up to 28% faster, with roughly half the memory usage, compared to conventional multi-head self-attention in ASR systems.
  • The approach maintains ASR accuracy across diverse datasets, paving the way for efficient speech recognition on resource-constrained devices.

SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

The paper describes a novel approach, termed "SummaryMixing," which serves as a linear-complexity alternative to multi-head self-attention (MHSA) in automatic speech recognition (ASR) systems. This method addresses the inherent computational inefficiency of self-attention, which scales quadratically with the input sequence length, posing significant challenges in terms of training time and memory consumption, especially for long sequences and resource-constrained environments.
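
To make the scaling argument concrete, the per-layer token-mixing cost can be sketched as follows, with T the number of acoustic frames and d the feature dimension. This is a back-of-the-envelope comparison implied by the abstract, ignoring constant factors and the per-frame projections that both approaches share.

```latex
% Back-of-the-envelope token-mixing cost per layer, constants omitted.
% T = number of acoustic frames, d = feature dimension.
\underbrace{\mathcal{O}\!\left(T^{2} d\right)}_{\text{self-attention: all pairwise frame interactions}}
\quad \text{vs.} \quad
\underbrace{\mathcal{O}\!\left(T d\right)}_{\text{SummaryMixing: one mean vector reused at every frame}}
```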

Key Contributions

The research introduces SummaryMixing, which condenses an entire speech utterance into a single summary vector obtained by averaging per-frame contributions across all time steps. This summary is then combined with time-specific information at each frame to produce the output. Because the summary is computed once and reused at every frame, SummaryMixing reduces the complexity of token mixing from quadratic to linear in the sequence length, yielding substantial gains in training and inference speed as well as memory use; a simplified sketch of such a layer follows.
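
The mechanism described above maps onto a small neural module. The PyTorch sketch below is a minimal illustration based on the abstract's description, not the authors' implementation; the class name, layer sizes, and the choice of GELU activations are assumptions made for the example.

```python
import torch
import torch.nn as nn

class SummaryMixingSketch(nn.Module):
    """Illustrative SummaryMixing-style cell (hypothetical reimplementation).

    Each frame is transformed locally, the whole utterance is summarised by a
    mean over per-frame summary projections, and the two streams are
    concatenated and mixed, giving cost linear in the sequence length.
    """

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.summary = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.combine = nn.Sequential(nn.Linear(2 * d_hidden, d_model), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        local = self.local(x)                             # per-frame information
        summary = self.summary(x).mean(dim=1)             # one vector per utterance
        summary = summary.unsqueeze(1).expand_as(local)   # broadcast over time
        return self.combine(torch.cat([local, summary], dim=-1))

# Minimal usage example with random features standing in for acoustic frames.
if __name__ == "__main__":
    layer = SummaryMixingSketch(d_model=256, d_hidden=512)
    frames = torch.randn(4, 1000, 256)   # 4 utterances, 1000 frames each
    out = layer(frames)
    print(out.shape)                     # torch.Size([4, 1000, 256])
```

A real implementation would additionally mask padded frames before taking the mean so that padding does not dilute the utterance summary.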

The authors evaluate SummaryMixing in state-of-the-art ASR models and report substantial reductions in resource usage: training and inference were up to 28% faster and memory use was roughly halved compared to models using MHSA, without degrading accuracy across five datasets of varying linguistic and acoustic conditions. The scaling behaviour behind these savings can be checked with a rough timing experiment like the one below.
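
The following snippet is a rough, hardware-dependent sanity check of the scaling argument, not the paper's benchmark: the "summary-mixing" branch here is a toy stand-in (a mean plus a small combiner), not the authors' layer, and absolute timings will differ across machines.

```python
import time
import torch
import torch.nn as nn

torch.set_grad_enabled(False)  # forward-only timing

d_model, batch = 256, 4
mhsa = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
mix = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.GELU())  # toy combiner

def time_fn(fn, reps=3):
    fn()  # warm-up
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

for frames in (500, 1000, 2000, 4000):
    x = torch.randn(batch, frames, d_model)
    t_attn = time_fn(lambda: mhsa(x, x, x, need_weights=False))
    t_mix = time_fn(lambda: mix(torch.cat(
        [x, x.mean(dim=1, keepdim=True).expand_as(x)], dim=-1)))
    print(f"T={frames:5d}  MHSA {t_attn*1e3:7.1f} ms   summary-mixing {t_mix*1e3:7.1f} ms")
```

Because the mean is computed once per utterance, the cost of the summary branch grows linearly with T, whereas the attention map grows with T squared, so the gap widens on long utterances.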

Comparative Analysis with Existing Work

The paper situates SummaryMixing within the broader literature surrounding efficient alternatives to self-attention, such as low-rank approximations, linearization techniques, and sparsification methods. However, existing approaches have not been able to consistently match the performance of self-attention-equipped systems in ASR contexts.

SummaryMixing draws inspiration from previous works suggesting that self-attention's pairwise operations might act similarly to simple linear operations under certain circumstances. Moreover, the method leverages insights from the HyperMixer framework, extending concepts from the MLP-Mixer to handle variable-length sequence processing more effectively. The comparison with established models like Fastformer indicates that SummaryMixing is among the most effective linear alternatives for token mixing in speech-processing models.

Implications and Future Directions

The implications of adopting SummaryMixing in speech recognition are significant. By achieving comparable or superior performance with lower computational demands, SummaryMixing provides a pathway to deploy efficient ASR models on edge devices where computational resources are limited. Additionally, the methodology can be adapted to other speech-processing tasks like spoken language understanding (SLU) and keyword spotting with promising results.

From a theoretical perspective, the work challenges the prevalent view that MHSA is indispensable for capturing complex interactions in ASR systems. The empirical evidence supporting the utility of the global summary vector suggests that much of the essential information for high-level acoustic modeling can be concentrated into more compact representations.

Future research may focus on refining the mathematical and architectural frameworks of SummaryMixing to further enhance its generalizability and robustness across diverse applications in natural language processing and speech understanding. Additionally, investigating the applicability of SummaryMixing to multi-modal and multi-task learning settings could present new opportunities to optimize models for a broader array of input types while maintaining scalable performance.

In conclusion, SummaryMixing offers a compelling, low-complexity alternative to self-attention mechanisms in state-of-the-art speech recognition models, paving the way for more efficient, accessible, and environmentally sustainable AI technologies in speech and language processing.
