
When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models (2404.08634v3)

Published 12 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs rely on the transformer architecture and its self-attention mechanism to deliver strong performance across tasks. However, we uncover a structural inefficiency in standard pre-trained decoder-style LLMs: in many of the deeper layers, attention matrices frequently collapse to near rank-one, single-column patterns. We refer to these underutilized components as lazy layers, which are redundant and computationally inefficient. To address this, we propose Inheritune, a simple and effective training recipe for building smaller, more efficient, and high-performing LLMs. Inheritune initializes a compact model by inheriting the useful early layers from a larger pre-trained model, then progressively retrains and expands it. Our experiments across multiple models and datasets show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts. This approach yields compact, performant models and offers a practical path for efficient LLM compression. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune
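
To make the "lazy layer" observation concrete, the sketch below estimates how close each layer's attention maps are to rank one by measuring how much of the singular-value mass is concentrated in the top singular value. This is an illustrative diagnostic only: the model used here (gpt2, as a small stand-in decoder-style LLM) and the top-singular-value-mass proxy are assumptions, not the paper's exact measurement.

```python
# Rough diagnostic: how close is each layer's attention matrix to rank one?
# A mean top-singular-value mass near 1.0 suggests a near rank-one ("lazy") pattern.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in decoder-style LLM (assumption, not a model from the paper)
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

for layer_idx, attn in enumerate(out.attentions):      # each: (batch, heads, seq, seq)
    A = attn[0]                                         # drop the batch dimension
    s = torch.linalg.svdvals(A)                         # singular values per head
    top_mass = (s[:, 0] / s.sum(dim=-1)).mean().item()  # fraction carried by the top singular value
    print(f"layer {layer_idx:2d}: mean top-singular-value mass = {top_mass:.3f}")
```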

Summary

  • The paper demonstrates that Inheritune achieves 89% of a larger model's accuracy using just 0.1% of the pre-training data on a single GPU.
  • The methodology leverages initial transformer layers from larger models to reduce data and computation requirements.
  • Results indicate that scaling inherited layers enhances performance across benchmarks, enabling efficient language model development.

Exploring Efficient Pre-Training Methods for Smaller LLMs with Inheritune

Introduction

This paper proposes a method for pre-training small base LLMs (LMs), termed Inheritune, that leverages a subset of transformer blocks from a larger LM and trains the resulting smaller model on a fraction of the original pre-training data. The paper examines Inheritune's potential for developing compact but effective LMs under limited computational resources, reporting experiments that use only 0.1% of the larger base model's training data and a single GPU, with a corresponding reduction in training time. The resulting smaller LM performs competitively on multiple evaluation datasets and benchmarks, comparing favorably with base models of similar or larger size that were pre-trained from scratch on far larger datasets.

Method: Inheritune

Inheritune is an efficient approach for crafting smaller base LMs from larger reference models when only a small portion of the pre-training data is publicly available. Its key steps are inheriting the first few transformer layers of a larger pre-trained model and then further training the smaller model on a much smaller dataset, which significantly reduces both compute and data requirements. The paper implements Inheritune with various reference models and data regimes, showing its versatility and effectiveness across different settings.
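
As a concrete illustration of the inheritance step, the minimal sketch below copies the embeddings, the first k transformer blocks, and the final norm and output head from a pre-trained reference into a shallower child model. It assumes a GPT-2-style checkpoint from Hugging Face Transformers; the reference name (gpt2-large), the choice of k, and the attribute names are illustrative stand-ins rather than the paper's released code (see the linked repository for the actual implementation).

```python
# Minimal sketch of Inheritune-style initialization for a GPT-2-style model (assumed stand-in).
from transformers import AutoConfig, AutoModelForCausalLM

def inherit_submodel(reference_name: str = "gpt2-large", k: int = 6):
    """Build a smaller LM whose first k transformer blocks come from the reference."""
    ref = AutoModelForCausalLM.from_pretrained(reference_name)

    # Child config: same width, heads, and vocabulary as the reference, but only k layers.
    cfg = AutoConfig.from_pretrained(reference_name)
    cfg.n_layer = k
    child = AutoModelForCausalLM.from_config(cfg)

    # Inherit token/position embeddings, the first k blocks, and the final norm/head.
    child.transformer.wte.load_state_dict(ref.transformer.wte.state_dict())
    child.transformer.wpe.load_state_dict(ref.transformer.wpe.state_dict())
    for i in range(k):
        child.transformer.h[i].load_state_dict(ref.transformer.h[i].state_dict())
    child.transformer.ln_f.load_state_dict(ref.transformer.ln_f.state_dict())
    child.lm_head.load_state_dict(ref.lm_head.state_dict())
    return child

small_lm = inherit_submodel("gpt2-large", k=6)
# small_lm is then further pre-trained on the (much smaller) available dataset.
```

The child is then further pre-trained on the small available dataset; starting this continued training from inherited weights rather than from scratch is where the reported data and compute savings come from.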

Results: Inheritune with 1B Data

Using just 1B tokens for pre-training, Inheritune produces a smaller base LM that performs well across diverse evaluation datasets. Notably, this model achieves 89% of the downstream accuracy of its reference model on various tasks, despite the reference being twice its size and trained on roughly 1000 times more data. These findings underscore Inheritune's computational efficiency and its potential for developing performant base models under stringent data and compute constraints.

Scaling Across Different Model Sizes

Inheritune's scalability is tested by deriving several small base LMs of different sizes from the same large base model. Results indicate a positive relationship between the number of inherited transformer layers and performance on the MMLU benchmark, highlighting Inheritune's flexibility in crafting smaller LMs of varying capacity while maintaining competitive performance.
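
For intuition on how the number of inherited layers maps to model capacity, the short sketch below builds child configurations with different layer counts k and reports their parameter counts. The gpt2-large configuration and the particular values of k are assumptions for illustration, not the reference models or ablation settings used in the paper.

```python
# Sketch: submodel capacity as a function of the number of inherited layers k.
from transformers import AutoConfig, AutoModelForCausalLM

for k in (4, 8, 12, 16):
    cfg = AutoConfig.from_pretrained("gpt2-large")   # 36-layer reference (assumed stand-in)
    cfg.n_layer = k                                  # keep only k blocks in the child
    child = AutoModelForCausalLM.from_config(cfg)    # randomly initialized shell, used only for counting
    n_params = sum(p.numel() for p in child.parameters())
    print(f"k={k:2d} layers -> {n_params / 1e6:.0f}M parameters")
```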

Additional Analysis with Larger Reference LMs and 50B Data

When the analysis is extended to settings with more available data (50B tokens) and larger reference models (up to 7B parameters), the smaller LMs show further performance gains. This extension confirms Inheritune's applicability and effectiveness across a broader range of scenarios, with performance improving as more data becomes available and larger reference models are leveraged.

Exploratory Analysis in the Presence of Full Pre-Training Data

In scenarios where the complete pre-training dataset is available, Inheritune can match or even exceed the performance of the larger reference model with a significantly smaller model. This reaffirms Inheritune's utility for reducing model size without sacrificing validation loss, offering a pragmatic option when computational resources are limited but the full pre-training data is accessible.

Implications

The Inheritune methodology offers an economical and computationally efficient pathway for developing small base LMs, challenging approaches that rely heavily on large datasets and extensive compute. It provides a robust baseline for future pre-training efforts aimed at smaller model variants and elucidates the notion of "sufficient depth," contributing to more deliberate architectural decisions in LLM development.

Conclusion

Inheritune introduces a remarkably efficient approach for developing small base LMs through strategic inheritance of transformer blocks and smart utilization of limited data resources. Its success across various settings and model sizes emphasizes the potential to democratize access to performant LMs, paving the way for broader experimentation and innovation within the field of AI and natural language processing.
