
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning (2310.06694v2)

Published 10 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized LLMs highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs

Sheared LLaMA: Efficient LLM Scaling Through Structured Pruning

The paper "Sheared LLaMA: Scaling Down LLMs Efficiently via Structured Pruning" presents a methodological innovation in the domain of LLM development. This work primarily addresses the computational inefficiencies involved in training LLMs from scratch by introducing a novel approach to derive smaller, high-performing models from pre-existing larger models through structured pruning.

Methodological Innovations

The cornerstone of this approach is the use of structured pruning to efficiently downscale large pre-trained models. The paper introduces two main techniques:

  1. Targeted Structured Pruning: The source model is pruned end-to-end to a pre-specified target architecture by removing layers, attention heads, and hidden and intermediate dimensions, with the goal of retaining as much of the original performance as possible in the compact target shape (a simplified mask-based sketch follows this list).
  2. Dynamic Batch Loading: The domain composition of each training batch is updated on the fly based on how losses vary across data domains, which makes data use more efficient during the continued pre-training that follows pruning.
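
To make the pruning mechanics concrete, the sketch below shows one simplified way to gate a transformer block's attention heads and feed-forward dimensions with learnable masks and to penalize deviation from a target shape. This is an illustration under stated assumptions, not the authors' implementation: the paper learns relaxed binary masks under Lagrangian constraints, whereas the names MaskedBlockSketch and shape_penalty and the specific target sizes here are hypothetical.

```python
# Illustrative only: gate attention heads and FFN dimensions with learnable
# masks, and penalize deviation from a target pruned shape. The paper learns
# relaxed binary masks under Lagrangian constraints; the names and target
# sizes below are hypothetical.
import torch
import torch.nn as nn

class MaskedBlockSketch(nn.Module):
    """Holds per-head and per-FFN-dimension mask logits for one block."""
    def __init__(self, n_heads: int, d_ff: int):
        super().__init__()
        self.head_logits = nn.Parameter(torch.zeros(n_heads))
        self.ffn_logits = nn.Parameter(torch.zeros(d_ff))

    def masks(self):
        # Soft masks in (0, 1); during pruning these would multiply head
        # outputs and FFN activations, and be thresholded to 0/1 at the end.
        return torch.sigmoid(self.head_logits), torch.sigmoid(self.ffn_logits)

def shape_penalty(blocks, target_heads: int, target_ff: int, weight: float = 1.0):
    """Quadratic penalty pushing the expected kept shape toward the target
    (a fixed-weight stand-in for the paper's Lagrange multipliers)."""
    penalty = torch.zeros(())
    for block in blocks:
        z_head, z_ffn = block.masks()
        penalty = penalty + (z_head.sum() - target_heads) ** 2
        penalty = penalty + (z_ffn.sum() - target_ff) ** 2
    return weight * penalty

# During the pruning phase this penalty would be added to the language-modeling
# loss; here we only check that gradients flow into the mask logits.
blocks = [MaskedBlockSketch(n_heads=32, d_ff=11008) for _ in range(4)]
shape_penalty(blocks, target_heads=20, target_ff=6912).backward()
```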

These techniques yield the Sheared-LLaMA series, obtained by pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. These reduced models outperform several state-of-the-art open-source alternatives of comparable size, such as Pythia, INCITE, OpenLLaMA, and the concurrent TinyLlama, on a broad range of downstream and instruction-tuning evaluations, while using only about 3% of the compute required to train such models from scratch.

Numerical and Comparative Insights

The research demonstrates strong quantitative results, with Sheared-LLaMA models surpassing both prior small-scale and some contemporary LLMs in terms of efficiency and performance:

  • The LLaMA2-7B model is reduced to 1.3B and 2.7B variants that deliver stronger performance than existing models of similar size.
  • Pruning and the subsequent continued pre-training consume only about 50 billion tokens, making the approach far more sample-efficient than training an equivalent model from scratch.

These gains come from a pruning process that retains the most important parameters and substructures of the original larger model, followed by dynamic adjustment of the training-batch composition so that performance improves evenly across data domains; a simplified version of such a weight update is sketched below.
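
The following sketch illustrates the general idea of a domain-weight update for dynamic batch loading, assuming a simplified exponential rule: domains whose current loss still exceeds a reference loss are up-weighted before the next batch is sampled. The function name update_domain_weights, the step size lr, and the example values are hypothetical, and the paper's exact update rule may differ.

```python
# Illustrative only: up-weight domains whose current loss still exceeds a
# reference loss, then renormalize to get the next batch's sampling mixture.
# `update_domain_weights`, `lr`, and the example values are hypothetical.
import numpy as np

def update_domain_weights(weights, current_loss, reference_loss, lr=1.0):
    """All arguments are arrays indexed by data domain."""
    excess = np.maximum(current_loss - reference_loss, 0.0)  # remaining loss gap
    boosted = weights * np.exp(lr * excess)                  # boost lagging domains
    return boosted / boosted.sum()                           # back to a distribution

# Example with three hypothetical domains (e.g. web text, code, books):
w = np.array([0.6, 0.2, 0.2])
current = np.array([2.1, 1.4, 1.9])
reference = np.array([1.9, 1.5, 1.8])
print(update_domain_weights(w, current, reference))
# Domains 0 and 2 have not yet reached their reference loss, so they are
# sampled more heavily in the next batch.
```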

Implications and Future Prospects

The practical implications of these findings are substantial:

  • Cost Efficiency: Computational requirements drop sharply while benchmark performance remains strong, making capable LLMs attainable for organizations without large-scale compute budgets.
  • Scalability: The method offers a practical path to deploying LLMs across applications and platforms where the trade-off between computational efficiency and performance is a central constraint.

Theoretically, the approach suggests that substantial efficiency gains can be realized by intelligently reusing existing LLMs, challenging the assumption that competitive models must always be trained from scratch. It also opens avenues for further research into adaptive model architectures that reshape themselves according to task-specific demands.

Conclusion

The Sheared-LLaMA work presents a compelling case for rethinking current LLM training paradigms by leveraging structured pruning and dynamic data utilization. The results indicate not only the feasibility but also the advantages of this approach for building efficient, smaller models that perform strongly across a variety of NLP tasks. This work is a valuable contribution to the field, suggesting that the future of LLM development may lie in optimizing and refining existing resources rather than continually expanding them.

Authors (4)
  1. Mengzhou Xia (34 papers)
  2. Tianyu Gao (35 papers)
  3. Zhiyuan Zeng (23 papers)
  4. Danqi Chen (84 papers)
Citations (189)