Efficient Attention Mechanisms Balancing Scalability and Accuracy

Develop attention mechanisms for Transformer-based models that simultaneously achieve scalability to long sequences (reduced computational and memory complexity) and accuracy comparable to softmax self-attention across tasks such as NLP, vision, and generative modeling.

Background

The paper surveys efficient alternatives to quadratic self-attention, including linear and sparse attention as well as state space models. While these approaches improve scalability, they often sacrifice accuracy, especially in tasks requiring rich token interactions.
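To make the scalability gap concrete, the minimal sketch below contrasts standard softmax attention, which materializes an n-by-n score matrix (O(n^2) time and memory), with a generic kernelized linear attention that reorders the matrix products to run in O(n) memory. This is an illustrative, non-causal formulation using the common elu-plus-one feature map; it is an assumption for exposition, not the paper's MHLA method.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # q, k, v: (batch, n, d); builds an (n x n) score matrix,
    # so cost is O(n^2 * d) time and O(n^2) memory.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (batch, n, n)
    return torch.softmax(scores, dim=-1) @ v                # (batch, n, d)

def linear_attention(q, k, v, eps=1e-6):
    # Generic kernelized attention (illustrative, not MHLA): replace
    # exp(q.k) with phi(q).phi(k) and use associativity to compute
    # phi(K)^T V first, giving O(n * d^2) time and O(n * d) memory.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1                # (batch, n, d)
    kv = phi_k.transpose(-2, -1) @ v                         # (batch, d, d)
    norm = phi_q @ phi_k.sum(dim=1, keepdim=True).transpose(-2, -1)  # (batch, n, 1)
    return (phi_q @ kv) / (norm + eps)                       # (batch, n, d)

if __name__ == "__main__":
    q = k = v = torch.randn(2, 1024, 64)
    print(softmax_attention(q, k, v).shape)  # torch.Size([2, 1024, 64])
    print(linear_attention(q, k, v).shape)   # torch.Size([2, 1024, 64])
```

The accuracy cost referred to above stems from this reordering: the kernelized form never computes token-pairwise softmax weights, which limits the sharpness of token interactions it can express.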

Given this trade-off, the authors explicitly note that achieving both scalability and accuracy in a single attention mechanism remains unresolved, motivating methods such as MHLA that aim to restore expressivity without reintroducing quadratic complexity.

References

Despite these advances, designing efficient attention mechanisms that maintain both scalability and accuracy remains an open challenge.

MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head (2601.07832 - Zhang et al., 12 Jan 2026), Appendix, Section "Full Related Works", Transformer paragraph.