
When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models (2404.08634v3)

Published 12 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs rely on the transformer architecture and its self-attention mechanism to deliver strong performance across tasks. However, we uncover a structural inefficiency in standard pre-trained decoder-style LLMs: in many of the deeper layers, attention matrices frequently collapse to near rank-one, single-column patterns. We refer to these underutilized components as lazy layers, which are redundant and computationally inefficient. To address this, we propose Inheritune, a simple and effective training recipe for building smaller, more efficient, and high-performing LLMs. Inheritune initializes a compact model by inheriting the useful early layers from a larger pre-trained model, then progressively retrains and expands it. Our experiments across multiple models and datasets show that Inheritune-trained models, despite having significantly fewer layers, can match or even outperform their larger counterparts. This approach yields compact, performant models and offers a practical path for efficient LLM compression. Code is available at https://github.com/sanyalsunny111/LLM-Inheritune
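
To make the "lazy layer" observation concrete, the sketch below estimates how close each layer's attention maps are to rank one by measuring how much of the singular-value mass is concentrated in the top singular value. This is an illustrative diagnostic only: the model used here (gpt2, as a small stand-in decoder-style LLM) and the top-singular-value-mass proxy are assumptions, not the paper's exact measurement.

```python
# Rough diagnostic: how close is each layer's attention matrix to rank one?
# A mean top-singular-value mass near 1.0 suggests a near rank-one ("lazy") pattern.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small stand-in decoder-style LLM (assumption, not a model from the paper)
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

for layer_idx, attn in enumerate(out.attentions):      # each: (batch, heads, seq, seq)
    A = attn[0]                                         # drop the batch dimension
    s = torch.linalg.svdvals(A)                         # singular values per head
    top_mass = (s[:, 0] / s.sum(dim=-1)).mean().item()  # fraction carried by the top singular value
    print(f"layer {layer_idx:2d}: mean top-singular-value mass = {top_mass:.3f}")
```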

Summary

  • The paper demonstrates that Inheritune achieves 89% of a larger model's accuracy using just 0.1% of the pre-training data on a single GPU.
  • The methodology leverages initial transformer layers from larger models to reduce data and computation requirements.
  • Results indicate that scaling inherited layers enhances performance across benchmarks, enabling efficient language model development.

Exploring Efficient Pre-Training Methods for Smaller LLMs with Inheritune

Introduction

This paper proposes a method for pre-training small base LLMs (LMs), termed Inheritune, that leverages a subset of transformer blocks from a larger LM and trains the resulting smaller model on a fraction of the original pre-training data. The paper examines Inheritune's potential for developing compact but effective LMs under limited computational resources, reporting experiments that use only 0.1% of the larger base model's training data and a single GPU, with a corresponding reduction in training time. The resulting smaller LM performs competitively on multiple evaluation datasets and benchmarks, comparing favorably with base models of similar or larger size that were pre-trained from scratch on far larger datasets.

Method: Inheritune

Inheritune is an efficient approach for crafting smaller base LMs from larger reference models when only a small portion of the pre-training data is publicly available. Its key steps are inheriting the first few transformer layers of a larger pre-trained model and then further training the smaller model on a much smaller dataset, which significantly reduces both compute and data requirements. The paper implements Inheritune with various reference models and data regimes, showing its versatility and effectiveness across different settings.
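
As a concrete illustration of the inheritance step, the minimal sketch below copies the embeddings, the first k transformer blocks, and the final norm and output head from a pre-trained reference into a shallower child model. It assumes a GPT-2-style checkpoint from Hugging Face Transformers; the reference name (gpt2-large), the choice of k, and the attribute names are illustrative stand-ins rather than the paper's released code (see the linked repository for the actual implementation).

```python
# Minimal sketch of Inheritune-style initialization for a GPT-2-style model (assumed stand-in).
from transformers import AutoConfig, AutoModelForCausalLM

def inherit_submodel(reference_name: str = "gpt2-large", k: int = 6):
    """Build a smaller LM whose first k transformer blocks come from the reference."""
    ref = AutoModelForCausalLM.from_pretrained(reference_name)

    # Child config: same width, heads, and vocabulary as the reference, but only k layers.
    cfg = AutoConfig.from_pretrained(reference_name)
    cfg.n_layer = k
    child = AutoModelForCausalLM.from_config(cfg)

    # Inherit token/position embeddings, the first k blocks, and the final norm/head.
    child.transformer.wte.load_state_dict(ref.transformer.wte.state_dict())
    child.transformer.wpe.load_state_dict(ref.transformer.wpe.state_dict())
    for i in range(k):
        child.transformer.h[i].load_state_dict(ref.transformer.h[i].state_dict())
    child.transformer.ln_f.load_state_dict(ref.transformer.ln_f.state_dict())
    child.lm_head.load_state_dict(ref.lm_head.state_dict())
    return child

small_lm = inherit_submodel("gpt2-large", k=6)
# small_lm is then further pre-trained on the (much smaller) available dataset.
```

The child is then further pre-trained on the small available dataset; starting this continued training from inherited weights rather than from scratch is where the reported data and compute savings come from.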

Results: Inheritune with 1B Data

Using just 1B tokens for pre-training, Inheritune produces a smaller base LM that performs well across diverse evaluation datasets. Notably, this model achieves 89% of the downstream accuracy of its reference model on various tasks, despite the reference being twice its size and trained on roughly 1000 times more data. These findings underscore Inheritune's computational efficiency and its potential for developing performant base models under stringent data and compute constraints.

Scaling Across Different Model Sizes

Inheritune's scalability is tested by deriving several small base LMs of different sizes from the same large base model. Results indicate a positive relationship between the number of inherited transformer layers and performance on the MMLU benchmark, highlighting Inheritune's flexibility in crafting smaller LMs of varying capacity while maintaining competitive performance.
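
For intuition on how the number of inherited layers maps to model capacity, the short sketch below builds child configurations with different layer counts k and reports their parameter counts. The gpt2-large configuration and the particular values of k are assumptions for illustration, not the reference models or ablation settings used in the paper.

```python
# Sketch: submodel capacity as a function of the number of inherited layers k.
from transformers import AutoConfig, AutoModelForCausalLM

for k in (4, 8, 12, 16):
    cfg = AutoConfig.from_pretrained("gpt2-large")   # 36-layer reference (assumed stand-in)
    cfg.n_layer = k                                  # keep only k blocks in the child
    child = AutoModelForCausalLM.from_config(cfg)    # randomly initialized shell, used only for counting
    n_params = sum(p.numel() for p in child.parameters())
    print(f"k={k:2d} layers -> {n_params / 1e6:.0f}M parameters")
```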

Additional Analysis with Larger Reference LMs and 50B Data

When the analysis is extended to settings with more available data (50B tokens) and larger reference models (up to 7B parameters), the smaller LMs show further performance gains. This extension confirms Inheritune's applicability and effectiveness across a broader range of scenarios, with performance improving as more data becomes available and larger reference models are leveraged.

Exploratory Analysis in the Presence of Full Pre-Training Data

In scenarios where the complete pre-training dataset is available, Inheritune can match or even exceed the performance of the larger reference model with a significantly smaller model. This reaffirms Inheritune's utility for reducing model size without sacrificing validation loss, offering a pragmatic option when computational resources are limited but the full pre-training data is accessible.

Implications

The Inheritune methodology offers an economical and computationally efficient pathway for developing small base LMs, challenging approaches that rely heavily on large datasets and extensive compute. It provides a robust baseline for future pre-training efforts aimed at smaller model variants and elucidates the notion of "sufficient depth," contributing to more deliberate architectural decisions in LLM development.

Conclusion

Inheritune introduces a remarkably efficient approach for developing small base LMs through strategic inheritance of transformer blocks and smart utilization of limited data resources. Its success across various settings and model sizes emphasizes the potential to democratize access to performant LMs, paving the way for broader experimentation and innovation within the field of AI and natural language processing.
