When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models (2404.08634v3)
Abstract: LLMs rely on the transformer architecture and its self-attention mechanism to deliver strong performance across tasks. However, we uncover a structural inefficiency in standard pre-trained decoder-style LLMs: in many of the deeper layers, attention matrices frequently collapse to near rank-one, single-column patterns. We refer to these underutilized components as lazy layers, which are redundant and computationally inefficient. To address this, we propose Inheritune, a simple and effective training recipe for building smaller, more efficient, and high-performing LLMs. Inheritune initializes a compact model by inheriting the useful early layers from a larger pre-trained model, then progressively retrains and expands it. Our experiments across multiple models and datasets show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts. This approach yields compact, performant models and offers a practical path for efficient LLM compression. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune
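To make the abstract's two ideas concrete, the following is a minimal sketch, not the authors' implementation: it probes how close each layer's attention maps are to rank one (a rough proxy for the "lazy layer" behavior described above), then initializes a smaller student model from the first k transformer blocks of the larger one, in the spirit of Inheritune. It assumes a Hugging Face GPT-2 checkpoint as a stand-in teacher; the choice k = 6, the probe sentence, and the top-singular-value heuristic are illustrative, not the paper's exact procedure.

```python
# Sketch only: detect near rank-one attention in deeper layers, then inherit
# the first k blocks of a pretrained decoder model into a smaller student.
import torch
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# eager attention so the model returns explicit attention matrices
teacher = GPT2LMHeadModel.from_pretrained("gpt2", attn_implementation="eager")
teacher.eval()

# 1) Probe attention maps: if most singular-value mass sits in the top
#    singular value, the attention matrix is close to rank one ("lazy").
inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = teacher(**inputs, output_attentions=True)

for layer_idx, attn in enumerate(out.attentions):   # each: (batch, heads, seq, seq)
    a = attn[0].float()                              # single sequence, all heads
    s = torch.linalg.svdvals(a)                      # singular values, (heads, seq)
    top1_mass = (s[:, 0] / s.sum(dim=-1)).mean().item()
    print(f"layer {layer_idx:2d}: mean top-1 singular-value mass = {top1_mass:.3f}")

# 2) Inherit the first k transformer blocks into a smaller student
#    (k = 6 is a hypothetical choice for illustration).
k = 6
student = GPT2LMHeadModel(GPT2Config.from_pretrained("gpt2", n_layer=k))

student.transformer.wte.load_state_dict(teacher.transformer.wte.state_dict())
student.transformer.wpe.load_state_dict(teacher.transformer.wpe.state_dict())
for i in range(k):
    student.transformer.h[i].load_state_dict(teacher.transformer.h[i].state_dict())
student.transformer.ln_f.load_state_dict(teacher.transformer.ln_f.state_dict())
student.tie_weights()  # re-tie the LM head to the copied input embeddings

# The student would then be retrained (and optionally grown) on the
# pretraining corpus, per the recipe summarized in the abstract.
```

A layer whose heads report top-1 mass near 1.0 is attending almost entirely along a single direction, which is the kind of degenerate pattern the paper argues makes deeper layers safe to drop when building the compact model.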