An Examination of Gated Attention for LLMs
The paper "Gated Attention for LLMs: Non-linearity, Sparsity, and Attention-Sink-Free" presents a nuanced exploration of gated mechanisms within softmax attention layers and their implications on LLM performance and training dynamics. Leveraging rigorous experiments involving extensive variants of Mixture-of-Experts (MoE) models and dense architectures, this research offers valuable insights into how gated attention affects learning processes, stability, and model scaling.
Overview of Gated Mechanisms in Attention
The paper focuses on augmenting traditional softmax attention mechanisms with gating techniques. The authors employ head-specific sigmoid gates after Scaled Dot-Product Attention (SDPA) in both MoE and dense models, observing notable enhancements across multiple dimensions, such as reduced perplexity and improved generalization in long-context settings. The deployment of gated mechanisms introduces non-linearity and sparsity into the attention framework, addressing issues such as the attention-sink phenomenon.
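To make the mechanism concrete, the sketch below applies an elementwise, head-specific sigmoid gate to the SDPA output before the output projection. This is a minimal illustration, assuming the gate is computed by a linear projection of the same layer input that produces the queries; the paper evaluates several gate placements and granularities, so this should not be read as the exact configuration used in the released models.

```python
# Minimal sketch of head-specific sigmoid gating after SDPA (PyTorch).
# Assumption: the gate is a linear projection of the layer input, applied
# elementwise to each head's attention output before the output projection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)  # one gate value per head dimension
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # head-specific sigmoid gate computed from the layer input
        gate = torch.sigmoid(self.gate_proj(x)).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = gate * attn  # non-linear, potentially sparse filtering of the SDPA output
        attn = attn.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head)
        return self.out_proj(attn)
```

For example, `GatedAttention(d_model=512, n_heads=8)(torch.randn(2, 16, 512))` returns a tensor of shape (2, 16, 512). The gate here is per head and per channel, reflecting the paper's finding that head-specific gating matters because different heads capture different features.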
Key Empirical Findings
- Performance Improvements: Applying gating, particularly to the SDPA output, yields lower perplexity and higher benchmark scores than non-gated baselines. Experimental validation across 30 model variants corroborates the benefits of gated attention in modern architectures.
- Training Stability: Gated mechanisms substantially mitigate the training instabilities often encountered with larger learning rates and batch sizes. This stabilization lets models tolerate more aggressive hyperparameter settings, with practical implications for efficiently scaling large models.
- Non-Linearity and Sparsity: Non-linearity induced by gating elevates the expressive capabilities of low-rank mappings between attention layers. The gating strategy enforces sparsity by judiciously filtering attention outputs based on token relevance, effectively curtailing attention sinks (see the diagnostic sketch after this list).
- Extended Context Performance: Sparse gating also enhances the model's ability to generalize across extended context lengths, as indicated by performance on the RULER benchmark at sequence lengths of up to 128k tokens.
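To make the attention-sink claim measurable, one simple diagnostic (illustrative only, not taken from the paper) is the fraction of post-softmax attention mass that queries place on the first key position; sink-prone models concentrate a large share of their weight there.

```python
# Illustrative diagnostic (not from the paper): measure "attention sink" mass,
# i.e. the average attention weight that queries assign to the first key position.
import torch

def sink_mass(attn_weights: torch.Tensor) -> torch.Tensor:
    """attn_weights: (batch, heads, q_len, k_len) post-softmax attention maps.
    Returns the per-head mean weight on key position 0, skipping the first query,
    which trivially attends only to itself under a causal mask."""
    return attn_weights[:, :, 1:, 0].mean(dim=(0, 2))

# Toy usage with random causal attention maps.
b, h, t = 2, 4, 16
scores = torch.randn(b, h, t, t)
causal = torch.tril(torch.ones(t, t, dtype=torch.bool))
attn = scores.masked_fill(~causal, float("-inf")).softmax(dim=-1)
print(sink_mass(attn))  # large per-head values would indicate a strong sink
```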
Analytical Insights
The paper explores the mechanistic subtleties of gating in softmax attention layers, attributing performance gains to two principal factors: increased non-linearity and the sparsity of gating scores. The research confirms that implementing head-specific gating scores is crucial for optimizing performance since different attention heads capture diverse input features. Furthermore, sparse gating proves beneficial by dynamically adapting context information to specific tokens, mitigating uniform attention bias across sequences, and supporting efficient long-context processing.
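The head-specificity and sparsity arguments can be probed in the same spirit. The sketch below is a hedged illustration, assuming gate scores have already been extracted as a (batch, seq, heads, head_dim) tensor and using an arbitrary 0.1 threshold; it reports what fraction of each head's gate values sit near zero, where per-head differences would reflect the diverse features different heads capture.

```python
# Illustrative sketch: per-head sparsity of sigmoid gate scores.
# Assumes gate scores are available as a (batch, seq, heads, head_dim) tensor;
# the 0.1 threshold is an arbitrary choice for this example.
import torch

def gate_sparsity(gate_scores: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """Fraction of gate values below `threshold`, reported per attention head."""
    return (gate_scores < threshold).float().mean(dim=(0, 1, 3))

gates = torch.sigmoid(torch.randn(2, 16, 8, 64))  # stand-in for real gate scores
print(gate_sparsity(gates))  # higher values = more aggressive filtering in that head
```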
Future Directions
The release of attention-sink-free models not only represents a technical contribution to the open-source ecosystem but also signals potential paths for future exploration. The broader implications of gating mechanisms for transformers' adaptability to scale and generalization in autoregressive tasks warrant further investigation. Continued advancement may explore hybrid strategies that integrate gating with other architectural innovations to further refine model efficiency and accuracy across diverse applications.
In conclusion, the paper advances discourse on the functional role of gating mechanisms within neural architectures, contributing both theoretical insights and practical tools for enriching the design of next-generation LLMs. The systematic evaluation and open sourcing of models provide a foundation for subsequent scholars to build upon, further evolving the understanding of gated attention dynamics in deep learning systems.