
Massive Activations in Large Language Models (2402.17762v2)

Published 27 Feb 2024 in cs.CL and cs.LG

Abstract: We observe an empirical phenomenon in LLMs -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.


Summary

  • The paper identifies massive activations as fixed biases vital for model performance, evidenced by performance collapse when nullified.
  • It systematically documents these activations across LLMs such as LLaMA2 and Mixtral, highlighting that their values are largely independent of the input and that they occupy a small set of fixed feature dimensions.
  • The study demonstrates that explicit attention biases can replace massive activations, suggesting optimization avenues for future architectures.

Unveiling the Role of Massive Activations in LLMs

Introduction

LLMs have captured the interest of the research community for their state-of-the-art performance across a broad spectrum of natural language processing tasks. While the focus has predominantly been on improving these models' external behaviors, understanding their internal mechanisms remains equally crucial. This paper presents a comprehensive study of a previously underexplored phenomenon within LLMs: massive activations, hidden-state values that are disproportionately larger (by several orders of magnitude) than the vast majority of other activations.

Existence and Properties of Massive Activations

The paper meticulously documents the occurrence of massive activations across various LLM architectures, including the LLaMA2 and Mixtral families. Characterized by their sheer magnitude, often orders of magnitude larger than the median activation value, these activations are exceedingly rare yet consistently observed across models. Notably, massive activations are largely input agnostic: their values persist across different inputs, they are confined to a handful of fixed feature dimensions, and they appear at specific tokens such as the starting token of the sequence and certain delimiter tokens.
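
As a concrete illustration, the sketch below scans a causal LM's hidden states for such outliers. It assumes a Hugging Face transformers model (the checkpoint name is only an example) and uses an informal threshold, magnitude above 100 and roughly 1,000 times the median, in the spirit of the paper rather than as its exact criterion.

```python
# Sketch: scan a causal LM's hidden states for massive activations.
# Assumes a Hugging Face causal LM; the checkpoint name and the
# 1,000x-median threshold are illustrative, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

inputs = tokenizer("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # tuple of (batch, seq, dim)

for layer_idx, h in enumerate(hidden_states):
    h = h[0].abs()                               # (seq_len, hidden_dim)
    median = h.median()
    mask = (h > 100) & (h > 1000 * median)       # informal "massive" criterion
    for tok_idx, dim in zip(*torch.where(mask)):
        print(f"layer {layer_idx:2d}  token {tok_idx.item():3d}  "
              f"dim {dim.item():5d}  |value| {h[tok_idx, dim].item():10.1f}")
```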

Functional Role in LLMs

Delving deeper, the paper explores the functionality of these massive activations, revealing their pivotal role as fixed biases within the LLM architecture. This assertion was substantiated through interventions that either nullified these activations or set them to their mean values, with the former causing a catastrophic collapse in model performance and the latter having negligible impact. This strongly suggests that the massive activations act as vital, constant bias terms, intrinsic to the model's successful performance.
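
A hedged sketch of these two interventions, assuming the (token position, feature dimension) coordinates of the massive activations at a given layer have already been located (for example with the scan above), could use a PyTorch forward hook. The module path and dimension indices in the commented usage are illustrative, not prescriptive.

```python
# Sketch of the nullification / mean-substitution interventions, assuming
# the massive activations' (token position, feature dimension) coordinates
# at a given layer are already known. All names below are illustrative.
import torch

def make_intervention_hook(positions, dims, new_values):
    """Overwrite selected hidden-state entries in a decoder layer's output.

    new_values: zeros for the nullification test, or precomputed mean
    values (estimated on held-out text) for the mean-substitution test.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        for pos, dim, val in zip(positions, dims, new_values):
            hidden[:, pos, dim] = val            # edit the residual stream in place
        return output
    return hook

# Example usage (LLaMA-style module path; indices are illustrative):
# layer = model.model.layers[2]
# handle = layer.register_forward_hook(
#     make_intervention_hook(positions=[0, 0], dims=[1415, 2533],
#                            new_values=[0.0, 0.0]))
# ... evaluate perplexity or zero-shot accuracy, then: handle.remove()
```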

Impact on Attention Mechanism

An intriguing connection between massive activations and self-attention was uncovered. The paper shows that these activations cause attention probabilities to concentrate on their corresponding tokens, effectively introducing an implicit bias into the attention output. Moreover, it demonstrates that augmenting self-attention with explicit, learnable attention biases removes the need for LLMs to form massive activations in the first place, since the model no longer has to repurpose certain tokens as a mechanism for prioritizing parts of its attention computation.
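
The sketch below illustrates what such an explicit attention bias can look like: a learnable key/value pair is prepended so that every query has a default slot to attend to. It is a simplified multi-head module written from scratch under these assumptions, not the authors' implementation, and causal masking is omitted for brevity.

```python
# Minimal sketch of self-attention augmented with explicit bias parameters
# (a learnable key/value pair prepended to K and V). Written from scratch
# for illustration; causal masking and dropout are omitted for brevity.
import torch
import torch.nn as nn

class AttentionWithExplicitBias(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Learnable bias key/value, one per head: (heads, 1, head_dim).
        self.k_bias = nn.Parameter(torch.zeros(num_heads, 1, self.head_dim))
        self.v_bias = nn.Parameter(torch.zeros(num_heads, 1, self.head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim).
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        # Prepend the bias key/value so every query has a default slot
        # to attend to, instead of relying on a massive-activation token.
        k = torch.cat([self.k_bias.expand(b, -1, -1, -1), k], dim=2)
        v = torch.cat([self.v_bias.expand(b, -1, -1, -1), v], dim=2)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v           # (batch, heads, seq, head_dim)
        return self.out(out.transpose(1, 2).reshape(b, t, d))
```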

Extension to Vision Transformers

The phenomenon of massive activations is not limited to language models; it is also observable in Vision Transformers (ViTs), albeit less frequently. In ViTs these activations likewise function as fixed biases, appearing most prominently in later layers and in specific feature dimensions. The paper also draws a parallel between massive activations and the recently introduced “register tokens” in ViTs, suggesting a common underlying principle: both act as fixed biases that facilitate the model's computation.
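
For reference, register tokens amount to a few extra learnable tokens concatenated to the patch sequence, giving the model dedicated slots that can play the fixed-bias role discussed above. The sketch below is an illustrative module, not the reference implementation from the registers paper; placing the registers right after the class token is a design choice.

```python
# Sketch of register tokens in a ViT: a few extra learnable tokens are
# concatenated to the patch sequence so the model has dedicated slots that
# can serve as fixed biases. Illustrative module, not the reference code.
import torch
import torch.nn as nn

class TokensWithRegisters(nn.Module):
    def __init__(self, embed_dim: int = 768, num_registers: int = 4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim), already projected.
        b = patch_tokens.shape[0]
        return torch.cat([self.cls_token.expand(b, -1, -1),
                          self.registers.expand(b, -1, -1),
                          patch_tokens], dim=1)
```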

Contributions and Future Directions

This work contributes significantly to the understanding of internal mechanisms of LLMs, identifying massive activations as crucial bias terms that influence both model performance and attention allocation. The paper not only elucidates the phenomenon across text and vision models but also provides a pathway toward optimizing model architecture by incorporating explicit attention biases, potentially eliminating the need for these internal massive activations.

The implications of this research are vast, opening new avenues for more efficient model designs and a deeper comprehension of the underlying operations of current LLMs. Future work can explore broader model families and applications, further refining our understanding of these foundational AI models.

Conclusion

Understanding the internal dynamics of LLMs, including phenomena like massive activations, is key to unlocking their potential and guiding the development of next-generation AI systems. This paper takes a significant step forward, offering insights into the pivotal roles these activations play within models’ architectures and how they can be harnessed or optimized for improved performance and efficiency.
