Massive Activations in Large Language Models (2402.17762v2)
Abstract: We observe an empirical phenomenon in LLMs -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find that their values stay largely constant regardless of the input, and that they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities on their corresponding tokens and, further, to implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.
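The sketch below (not the authors' released code) illustrates one way to look for the phenomenon the abstract describes: run a prompt through an open causal LM, collect per-layer hidden states, and flag entries whose magnitude dwarfs the layer's median. The model name, prompt, and 1000x threshold are illustrative assumptions, not values from the paper.

```python
# Minimal sketch: probe hidden states of a causal LM for "massive activations",
# i.e., individual entries orders of magnitude larger than the layer's typical value.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; the paper studies larger LLMs such as LLaMA-2
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tokenizer("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states: tuple of (num_layers + 1) tensors, each of shape
# (batch, seq_len, hidden_dim); index 0 is the embedding output.
for layer_idx, h in enumerate(outputs.hidden_states):
    abs_h = h[0].abs()                       # (seq_len, hidden_dim)
    top_val = abs_h.max().item()
    median_val = abs_h.median().item()
    if top_val > 1000 * median_val:          # threshold is an arbitrary illustration
        token_idx, dim_idx = divmod(abs_h.argmax().item(), abs_h.shape[1])
        token = tokenizer.decode(inputs.input_ids[0, token_idx].item())
        print(f"layer {layer_idx}: |act|={top_val:.1f} "
              f"(median {median_val:.4f}) at token '{token}', dim {dim_idx}")
```

If the phenomenon is present, only a handful of (token, dimension) positions are flagged, and they tend to recur at the same dimensions across different inputs, consistent with the paper's observation that their values are largely input-independent.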