ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2403.03853v3)

Published 6 Mar 2024 in cs.CL

Abstract: As LLMs continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.

ShortGPT: Highlighting Layer Redundancy in LLMs

Exploring Model Compression Through Layer Pruning

Recent advances in LLMs have significantly increased model size, making deployment challenging due to high computational and hardware requirements. This paper approaches model compression by directly addressing layer redundancy in LLMs, an aspect that has received relatively little prior attention. Its straightforward layer-removal strategy, termed ShortGPT, rests on evaluating layer importance with a new metric, Block Influence (BI). The technique uncovers substantial redundancy in current models and offers an efficient path to compression without the complexity inherent in other pruning methods.

Analyzing Layer Redundancy

The structure of LLMs, particularly those based on the Transformer architecture, is a stack of layers with attention mechanisms at their core. The paper shows that these layers are substantially redundant: not all of them contribute equally to the model's output. Since specific layers can be removed without a significant impact on performance, layer redundancy becomes a viable target for model compression. For instance, removing up to 55% of the total layers from a LLaMA model retains a majority of its measured performance, a surprising outcome that calls into question whether every layer is necessary.
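
As a concrete illustration of this kind of experiment, the following is a minimal sketch (not the authors' released code) that deletes a chosen subset of decoder layers from a LLaMA-style Hugging Face checkpoint and inspects a generation afterwards; the checkpoint name and layer indices are illustrative assumptions.

```python
# Minimal layer-removal sketch for a LLaMA-style model (assumed checkpoint and indices).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint, used only for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layers_to_drop = {21, 22, 23, 24}  # hypothetical indices of suspected redundant layers

# LLaMA-style models in transformers expose the decoder stack as model.model.layers.
model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in layers_to_drop]
)
model.config.num_hidden_layers = len(model.model.layers)

inputs = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**inputs, max_new_tokens=10)[0]))
```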

Introducing Block Influence (BI)

The core of ShortGPT's methodology is the quantification of layer importance through the BI metric. BI measures how strongly each layer transforms its hidden states: for a given layer, it is defined as one minus the average cosine similarity between the layer's input and output hidden states, so a layer whose output closely resembles its input receives a low score. Unlike metrics based on weight magnitudes or gradient-based importance, BI captures the layer's operational impact, providing a more functional assessment of its significance.
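
A minimal sketch of this computation follows, assuming the token-wise cosine-similarity definition above; the checkpoint name and the single calibration sentence are illustrative placeholders, whereas the paper computes BI over a calibration set.

```python
# Sketch: estimate Block Influence (BI) per layer from hidden states (assumed setup).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

text = "Large language models stack dozens of transformer layers."  # toy calibration sample
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[i] is the input to decoder layer i; hidden_states[i + 1] is its output.
hs = out.hidden_states
bi_scores = []
for i in range(len(hs) - 1):
    cos = F.cosine_similarity(hs[i], hs[i + 1], dim=-1)  # per-token cosine similarity
    bi_scores.append(1.0 - cos.mean().item())            # low BI suggests a redundant layer

for i, score in enumerate(bi_scores):
    print(f"layer {i:2d}  BI = {score:.4f}")
```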

ShortGPT: A Pragmatic Approach

The methodology underpinning ShortGPT is elegantly simple. By evaluating each layer's BI score on a calibration set, layers are ranked by their influence. This ranking facilitates a judicious layer removal strategy, where those with the lowest BI scores are pruned. This process not only maintains the integrity and performance of the model but also significantly reduces its size. Empirical results underscore ShortGPT's efficiency, showing notable model size reductions with minimal impact on performance metrics.
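
Continuing the sketch above (still an assumed illustration rather than the released implementation), the ranking and removal step could look like this, with the pruning budget chosen arbitrarily:

```python
# Sketch: prune the n layers with the lowest BI scores (bi_scores from the previous sketch).
import torch

n_prune = 8  # hypothetical pruning budget

# Indices of the n layers with the smallest Block Influence.
prune_idx = set(sorted(range(len(bi_scores)), key=lambda i: bi_scores[i])[:n_prune])

model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in prune_idx]
)
model.config.num_hidden_layers = len(model.model.layers)
# The pruned model can now be evaluated on benchmarks or saved with model.save_pretrained(...).
```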

Beyond Pruning: Implications and Future Directions

This paper's findings provoke a reconsideration of model architecture and its efficiencies. The identification of substantial redundancies within LLMs indicates that a more nuanced approach to model construction might be warranted, possibly affecting future architectural choices. Additionally, ShortGPT's compatibility with quantization techniques opens further avenues for comprehensive model size reduction, marrying simplicity with effectiveness.
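
As one hedged illustration of that compatibility (an assumed tooling choice, not the paper's pipeline), the pruned checkpoint from the sketches above could be saved and then reloaded with 8-bit weights via the bitsandbytes integration in transformers:

```python
# Sketch: combine layer removal with 8-bit weight quantization (assumed tooling).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model.save_pretrained("shortgpt-pruned")  # pruned model from the sketches above
tok.save_pretrained("shortgpt-pruned")

quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
quantized = AutoModelForCausalLM.from_pretrained(
    "shortgpt-pruned",
    quantization_config=quant_cfg,
    device_map="auto",
)
```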

Conclusion

ShortGPT's investigation of layer redundancy and its BI metric provide a novel lens through which the architecture of LLMs can be optimized for practical deployment. By challenging the assumed necessity of every layer and offering a pruning method that is both simpler and more effective than prior approaches, this research moves sophisticated LLMs closer to being deployable across a variety of platforms and applications. As the AI community continues to push the boundaries of what is possible with LLMs, studies like this one help ensure that such advances remain within reach, both technically and practically.

Authors (8)
  1. Xin Men
  2. Mingyu Xu
  3. Qingyu Zhang
  4. Bingning Wang
  5. Hongyu Lin
  6. Yaojie Lu
  7. Xianpei Han
  8. Weipeng Chen
Citations (57)