
Massive Activations in Large Language Models (2402.17762v2)

Published 27 Feb 2024 in cs.CL and cs.LG

Abstract: We observe an empirical phenomenon in LLMs -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.


Summary

  • The paper identifies massive activations as fixed biases vital for model performance, evidenced by performance collapse when nullified.
  • It systematically documents these activations across LLMs such as LLaMA2 and Mixtral, highlighting that their values are largely independent of the input and that they occupy a small set of fixed feature dimensions.
  • The study demonstrates that explicit attention biases can replace massive activations, suggesting optimization avenues for future architectures.

Unveiling the Role of Massive Activations in LLMs

Introduction

LLMs have captured the interest of the research community for their state-of-the-art performance across a broad spectrum of natural language processing tasks. While the focus has predominantly been on improving these models' external behaviors, understanding their internal mechanisms remains equally crucial. This paper presents a comprehensive study of a previously underexplored phenomenon within LLMs: massive activations, hidden-state values that are disproportionately larger (by several orders of magnitude) than the vast majority of other activations.

Existence and Properties of Massive Activations

The paper meticulously documents the occurrence of massive activations across various LLM architectures, including the LLaMA2 and Mixtral families. Characterized by their sheer magnitude, often orders of magnitude larger than the median activation value, these activations are exceedingly rare yet consistently observed across models. Notably, massive activations are largely input agnostic: their values persist across different inputs, they are confined to a handful of fixed feature dimensions, and they appear at specific tokens such as the starting token of the sequence and certain delimiter tokens.
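
As a concrete illustration, the sketch below scans a causal LM's hidden states for such outliers. It assumes a Hugging Face transformers model (the checkpoint name is only an example) and uses an informal threshold, magnitude above 100 and roughly 1,000 times the median, in the spirit of the paper rather than as its exact criterion.

```python
# Sketch: scan a causal LM's hidden states for massive activations.
# Assumes a Hugging Face causal LM; the checkpoint name and the
# 1,000x-median threshold are illustrative, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

inputs = tokenizer("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states  # tuple of (batch, seq, dim)

for layer_idx, h in enumerate(hidden_states):
    h = h[0].abs()                               # (seq_len, hidden_dim)
    median = h.median()
    mask = (h > 100) & (h > 1000 * median)       # informal "massive" criterion
    for tok_idx, dim in zip(*torch.where(mask)):
        print(f"layer {layer_idx:2d}  token {tok_idx.item():3d}  "
              f"dim {dim.item():5d}  |value| {h[tok_idx, dim].item():10.1f}")
```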

Functional Role in LLMs

Delving deeper, the paper explores the functionality of these massive activations, revealing their pivotal role as fixed biases within the LLM architecture. This assertion was substantiated through interventions that either nullified these activations or set them to their mean values, with the former causing a catastrophic collapse in model performance and the latter having negligible impact. This strongly suggests that the massive activations act as vital, constant bias terms, intrinsic to the model's successful performance.
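
A hedged sketch of these two interventions, assuming the (token position, feature dimension) coordinates of the massive activations at a given layer have already been located (for example with the scan above), could use a PyTorch forward hook. The module path and dimension indices in the commented usage are illustrative, not prescriptive.

```python
# Sketch of the nullification / mean-substitution interventions, assuming
# the massive activations' (token position, feature dimension) coordinates
# at a given layer are already known. All names below are illustrative.
import torch

def make_intervention_hook(positions, dims, new_values):
    """Overwrite selected hidden-state entries in a decoder layer's output.

    new_values: zeros for the nullification test, or precomputed mean
    values (estimated on held-out text) for the mean-substitution test.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        for pos, dim, val in zip(positions, dims, new_values):
            hidden[:, pos, dim] = val            # edit the residual stream in place
        return output
    return hook

# Example usage (LLaMA-style module path; indices are illustrative):
# layer = model.model.layers[2]
# handle = layer.register_forward_hook(
#     make_intervention_hook(positions=[0, 0], dims=[1415, 2533],
#                            new_values=[0.0, 0.0]))
# ... evaluate perplexity or zero-shot accuracy, then: handle.remove()
```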

Impact on Attention Mechanism

An intriguing connection between massive activations and self-attention was uncovered. The paper shows that these activations cause attention probabilities to concentrate on their corresponding tokens, effectively introducing an implicit bias into the attention output. Moreover, it demonstrates that augmenting self-attention with explicit, learnable attention biases removes the need for LLMs to form massive activations in the first place, since the model no longer has to repurpose certain tokens as a mechanism for prioritizing parts of its attention computation.
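
The sketch below illustrates what such an explicit attention bias can look like: a learnable key/value pair is prepended so that every query has a default slot to attend to. It is a simplified multi-head module written from scratch under these assumptions, not the authors' implementation, and causal masking is omitted for brevity.

```python
# Minimal sketch of self-attention augmented with explicit bias parameters
# (a learnable key/value pair prepended to K and V). Written from scratch
# for illustration; causal masking and dropout are omitted for brevity.
import torch
import torch.nn as nn

class AttentionWithExplicitBias(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Learnable bias key/value, one per head: (heads, 1, head_dim).
        self.k_bias = nn.Parameter(torch.zeros(num_heads, 1, self.head_dim))
        self.v_bias = nn.Parameter(torch.zeros(num_heads, 1, self.head_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, seq, head_dim).
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        # Prepend the bias key/value so every query has a default slot
        # to attend to, instead of relying on a massive-activation token.
        k = torch.cat([self.k_bias.expand(b, -1, -1, -1), k], dim=2)
        v = torch.cat([self.v_bias.expand(b, -1, -1, -1), v], dim=2)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v           # (batch, heads, seq, head_dim)
        return self.out(out.transpose(1, 2).reshape(b, t, d))
```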

Extension to Vision Transformers

The phenomenon of massive activations is not limited to language models; it is also observable in Vision Transformers (ViTs), albeit less frequently. In ViTs these activations likewise function as fixed biases, appearing most prominently in later layers and in specific feature dimensions. The paper also draws a parallel between massive activations and the recently introduced “register tokens” in ViTs, suggesting a common underlying principle: both act as fixed biases that facilitate the model's computation.
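
For reference, register tokens amount to a few extra learnable tokens concatenated to the patch sequence, giving the model dedicated slots that can play the fixed-bias role discussed above. The sketch below is an illustrative module, not the reference implementation from the registers paper; placing the registers right after the class token is a design choice.

```python
# Sketch of register tokens in a ViT: a few extra learnable tokens are
# concatenated to the patch sequence so the model has dedicated slots that
# can serve as fixed biases. Illustrative module, not the reference code.
import torch
import torch.nn as nn

class TokensWithRegisters(nn.Module):
    def __init__(self, embed_dim: int = 768, num_registers: int = 4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim), already projected.
        b = patch_tokens.shape[0]
        return torch.cat([self.cls_token.expand(b, -1, -1),
                          self.registers.expand(b, -1, -1),
                          patch_tokens], dim=1)
```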

Contributions and Future Directions

This work contributes significantly to the understanding of internal mechanisms of LLMs, identifying massive activations as crucial bias terms that influence both model performance and attention allocation. The paper not only elucidates the phenomenon across text and vision models but also provides a pathway toward optimizing model architecture by incorporating explicit attention biases, potentially eliminating the need for these internal massive activations.

The implications of this research are vast, opening new avenues for more efficient model designs and a deeper comprehension of the underlying operations of current LLMs. Future work can explore broader model families and applications, further refining our understanding of these foundational AI models.

Conclusion

Understanding the internal dynamics of LLMs, including phenomena like massive activations, is key to unlocking their potential and guiding the development of next-generation AI systems. This paper takes a significant step forward, offering insights into the pivotal roles these activations play within models’ architectures and how they can be harnessed or optimized for improved performance and efficiency.
