Overview of "When Attention Sink Emerges in LLMs: An Empirical View"
The paper "When Attention Sink Emerges in LLMs: An Empirical View" provides a comprehensive paper of the attention sink phenomenon in LLMs (LMs). This paper addresses the understanding of how attention sinks are ubiquitously present in LMs across varying inputs and model sizes, and the implications of this phenomenon during the LM pre-training phase.
Key Contributions
The authors show that attention sinks appear universally in LMs, concentrating attention disproportionately on the first token of a sequence regardless of that token's semantic significance. They further explore the factors that influence the emergence of attention sinks during LM pre-training, including optimization, data distribution, loss functions, and model architecture. The paper presents several critical insights:
- Universality of Attention Sinks: Attention sinks are shown to exist in both small and large LMs and across diverse inputs, underscoring their universal relevance to LM behavior.
- Influential Factors: Through rigorous experimentation, the authors establish that attention sinks emerge only after effective optimization on sufficient training data, and that their prominence depends on the loss function and the data distribution. Optimization choices such as the learning rate schedule and weight decay also modulate how strongly the sink appears.
- Role of Sink Tokens: The findings suggest that sink tokens function like key biases, storing excess attention scores that do not contribute meaningfully to value computations. The paper further attributes the phenomenon to the inner dependence imposed by softmax normalization, under which attention scores must sum to one; replacing softmax with alternatives such as sigmoid attention without normalization prevents attention sinks in models with up to one billion parameters (a minimal sketch contrasting the two follows this list).
- Architectural Influence: Various positional embedding (PE) mechanisms and transformer block designs (pre-norm vs. post-norm) were examined. The emergence of attention sinks is largely indifferent to the type of positional embedding used, whether rotary, absolute, or learnable.
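To illustrate the normalization point above, here is a minimal single-head sketch contrasting standard softmax attention with a sigmoid variant in which scores are squashed independently rather than normalized to sum to one. The single-head setup, the toy shapes, and the omission of projections are simplifying assumptions; this is not the paper's exact formulation.

```python
import math
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v, causal_mask):
    # Standard attention: each row of scores is normalized to sum to 1,
    # so any "excess" probability mass must land somewhere -- often token 0.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(~causal_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def sigmoid_attention(q, k, v, causal_mask):
    # Sigmoid variant: each score is squashed independently, with no
    # requirement that a row sums to 1, removing the inter-score
    # dependence that the paper links to attention sinks.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.sigmoid(scores) * causal_mask
    return weights @ v

# Toy single-head example (shapes are illustrative).
T, d = 8, 16
q, k, v = (torch.randn(T, d) for _ in range(3))
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
out_softmax = softmax_attention(q, k, v, mask)
out_sigmoid = sigmoid_attention(q, k, v, mask)
```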
Implications and Future Work
This research paves the way for a deeper theoretical and practical understanding of how LMs allocate attention, and in particular questions current design paradigms for attention mechanisms. By dissecting attention sinks, the paper prompts a reevaluation of how LMs process information and highlights potential inefficiencies. It also indicates that the biases introduced through attention sinks, unless deliberately managed or mitigated, could misguide learning dynamics and hinder interpretability.
The empirical insights may spur advances in training strategies for LMs, potentially reducing the attention allocated to non-informative tokens. Furthermore, relaxing the normalization constraint via softmax alternatives might improve model generalization and robustness.
Future work could examine attention sinks on tokens beyond the first position, such as punctuation or common syntactic markers, and assess how they relate to the sink mechanisms identified in this paper. Continued exploration of attention operations without normalization may also yield novel architectures that alleviate implicit biases in LMs.
This paper provides a foundational understanding of attention sinks, setting the stage for future research and improvements in the training and operational efficiency of LMs, with implications for broader applications in AI models.