ShortGPT: Layers in Large Language Models are More Redundant Than You Expect (2403.03853v3)

Published 6 Mar 2024 in cs.CL

Abstract: As LLMs continue to advance in performance, their size has escalated significantly, with current LLMs containing billions or even trillions of parameters. However, in this study, we discovered that many layers of LLMs exhibit high similarity, and some layers play a negligible role in network functionality. Based on this observation, we define a metric called Block Influence (BI) to gauge the significance of each layer in LLMs. We then propose a straightforward pruning approach: layer removal, in which we directly delete the redundant layers in LLMs based on their BI scores. Experiments demonstrate that our method, which we call ShortGPT, significantly outperforms previous state-of-the-art (SOTA) methods in model pruning. Moreover, ShortGPT is orthogonal to quantization-like methods, enabling further reduction in parameters and computation. The ability to achieve better results through simple layer removal, as opposed to more complex pruning techniques, suggests a high degree of redundancy in the model architecture.

ShortGPT: Highlighting Layer Redundancy in LLMs

Exploring Model Compression Through Layer Pruning

Recent advances in LLMs have significantly increased model size, making deployment challenging due to high computational and hardware requirements. This paper approaches model compression by directly addressing layer redundancy in LLMs, an aspect that has received relatively little prior attention. Its straightforward layer-removal strategy, termed ShortGPT, rests on evaluating layer importance with a new metric, Block Influence (BI). The technique uncovers substantial redundancy in current models and offers an efficient path to compression without the complexity inherent in other pruning methods.

Analyzing Layer Redundancy

The structure of LLMs, particularly those based on the Transformer architecture, is a stack of layers with attention mechanisms at their core. The paper shows that these layers are substantially redundant: not all of them contribute equally to the model's output. Since specific layers can be removed without a significant impact on performance, layer redundancy becomes a viable target for model compression. For instance, removing up to 55% of the total layers from a LLaMA model retains a majority of its measured performance, a surprising outcome that calls into question whether every layer is necessary.
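
As a concrete illustration of this kind of experiment, the following is a minimal sketch (not the authors' released code) that deletes a chosen subset of decoder layers from a LLaMA-style Hugging Face checkpoint and inspects a generation afterwards; the checkpoint name and layer indices are illustrative assumptions.

```python
# Minimal layer-removal sketch for a LLaMA-style model (assumed checkpoint and indices).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint, used only for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layers_to_drop = {21, 22, 23, 24}  # hypothetical indices of suspected redundant layers

# LLaMA-style models in transformers expose the decoder stack as model.model.layers.
model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in layers_to_drop]
)
model.config.num_hidden_layers = len(model.model.layers)

inputs = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**inputs, max_new_tokens=10)[0]))
```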

Introducing Block Influence (BI)

The core of ShortGPT's methodology is the quantification of layer importance through the BI metric. BI measures how strongly each layer transforms its hidden states: for a given layer, it is defined as one minus the average cosine similarity between the layer's input and output hidden states, so a layer whose output closely resembles its input receives a low score. Unlike metrics based on weight magnitudes or gradient-based importance, BI captures the layer's operational impact, providing a more functional assessment of its significance.
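
A minimal sketch of this computation follows, assuming the token-wise cosine-similarity definition above; the checkpoint name and the single calibration sentence are illustrative placeholders, whereas the paper computes BI over a calibration set.

```python
# Sketch: estimate Block Influence (BI) per layer from hidden states (assumed setup).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

text = "Large language models stack dozens of transformer layers."  # toy calibration sample
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[i] is the input to decoder layer i; hidden_states[i + 1] is its output.
hs = out.hidden_states
bi_scores = []
for i in range(len(hs) - 1):
    cos = F.cosine_similarity(hs[i], hs[i + 1], dim=-1)  # per-token cosine similarity
    bi_scores.append(1.0 - cos.mean().item())            # low BI suggests a redundant layer

for i, score in enumerate(bi_scores):
    print(f"layer {i:2d}  BI = {score:.4f}")
```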

ShortGPT: A Pragmatic Approach

The methodology underpinning ShortGPT is elegantly simple. By evaluating each layer's BI score on a calibration set, layers are ranked by their influence. This ranking facilitates a judicious layer removal strategy, where those with the lowest BI scores are pruned. This process not only maintains the integrity and performance of the model but also significantly reduces its size. Empirical results underscore ShortGPT's efficiency, showing notable model size reductions with minimal impact on performance metrics.
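
Continuing the sketch above (still an assumed illustration rather than the released implementation), the ranking and removal step could look like this, with the pruning budget chosen arbitrarily:

```python
# Sketch: prune the n layers with the lowest BI scores (bi_scores from the previous sketch).
import torch

n_prune = 8  # hypothetical pruning budget

# Indices of the n layers with the smallest Block Influence.
prune_idx = set(sorted(range(len(bi_scores)), key=lambda i: bi_scores[i])[:n_prune])

model.model.layers = torch.nn.ModuleList(
    [layer for i, layer in enumerate(model.model.layers) if i not in prune_idx]
)
model.config.num_hidden_layers = len(model.model.layers)
# The pruned model can now be evaluated on benchmarks or saved with model.save_pretrained(...).
```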

Beyond Pruning: Implications and Future Directions

This paper's findings provoke a reconsideration of model architecture and its efficiencies. The identification of substantial redundancies within LLMs indicates that a more nuanced approach to model construction might be warranted, possibly affecting future architectural choices. Additionally, ShortGPT's compatibility with quantization techniques opens further avenues for comprehensive model size reduction, marrying simplicity with effectiveness.
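
As one hedged illustration of that compatibility (an assumed tooling choice, not the paper's pipeline), the pruned checkpoint from the sketches above could be saved and then reloaded with 8-bit weights via the bitsandbytes integration in transformers:

```python
# Sketch: combine layer removal with 8-bit weight quantization (assumed tooling).
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model.save_pretrained("shortgpt-pruned")  # pruned model from the sketches above
tok.save_pretrained("shortgpt-pruned")

quant_cfg = BitsAndBytesConfig(load_in_8bit=True)
quantized = AutoModelForCausalLM.from_pretrained(
    "shortgpt-pruned",
    quantization_config=quant_cfg,
    device_map="auto",
)
```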

Conclusion

ShortGPT's investigation of layer redundancy and its BI metric provide a novel lens through which the architecture of LLMs can be optimized for practical deployment. By challenging the assumed necessity of every layer and offering a pruning method that is both simpler and more effective than prior approaches, this research moves sophisticated LLMs closer to being deployable across a variety of platforms and applications. As the AI community continues to push the boundaries of what is possible with LLMs, studies like this one help ensure that such advances remain within reach, both technically and practically.

Authors (8)
  1. Xin Men
  2. Mingyu Xu
  3. Qingyu Zhang
  4. Bingning Wang
  5. Hongyu Lin
  6. Yaojie Lu
  7. Xianpei Han
  8. Weipeng Chen
Citations (57)