A deeper look at depth pruning of LLMs (2407.16286v1)

Published 23 Jul 2024 in cs.LG and cs.AI

Abstract: LLMs are not only resource-intensive to train but even more costly to deploy in production. Therefore, recent work has attempted to prune blocks of LLMs based on cheap proxies for estimating block importance, effectively removing 10% of blocks in well-trained LLaMa-2 and Mistral 7b models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics by considering adaptive metrics such as the Shapley value in addition to the static ones explored in prior work. We show that adaptive metrics exhibit a trade-off in performance between tasks, i.e., improvement on one task may degrade performance on another due to differences in the computed block influences. Furthermore, we extend this analysis from a complete block to individual self-attention and feed-forward layers, highlighting the propensity of the self-attention layers to be more amenable to pruning, even allowing removal of up to 33% of the self-attention layers without incurring any performance degradation on MMLU for Mistral 7b (a significant reduction in costly KV-cache maintenance). Finally, we look at simple performance recovery techniques to emulate the pruned layers by training lightweight additive bias or low-rank linear adapters. Performance recovery using emulated updates avoids performance degradation for the initial blocks (up to 5% absolute improvement on MMLU) and is either competitive with or superior to the learning-based technique.

Authors (7)
  1. Shoaib Ahmed Siddiqui (22 papers)
  2. Xin Dong (90 papers)
  3. Greg Heinrich (12 papers)
  4. Thomas Breuel (16 papers)
  5. Jan Kautz (215 papers)
  6. David Krueger (75 papers)
  7. Pavlo Molchanov (70 papers)

Summary

An Expert Perspective on Depth Pruning of LLMs

The paper "A deeper look at depth pruning of LLMs" by Shoaib Ahmed Siddiqui et al. presents an in-depth analysis of depth pruning methodologies for LLMs, particularly focusing on minimizing resource consumption without sacrificing model performance. This work builds upon prior research by introducing advanced metrics for block importance and exploring fine-grained pruning strategies within model layers.

Core Contributions

The paper is structured around several key contributions:

  1. Evaluation of Block Importance Metrics: The authors critically assess block-influence metrics beyond the cosine similarity used in prior work, introducing adaptive metrics such as the Shapley value and evaluating their efficacy for pruning decisions. The analysis underscores a trade-off inherent in adaptive metrics: optimizing for one task can inadvertently degrade performance on another (a sketch of both metric families follows this list).
  2. Layer-Specific Pruning: Extending beyond whole-block pruning, the paper dissects transformer blocks into their self-attention and feed-forward layers. The findings indicate that self-attention layers tolerate pruning far better than feed-forward layers: up to 33% of the self-attention layers in Mistral 7b can be removed without significant performance degradation on the MMLU benchmark, which also substantially reduces the KV-cache that must be maintained at inference time.
  3. Performance Recovery Techniques: To address the performance drop from pruning, the authors propose a simple yet effective technique: emulating a pruned block with the empirical mean of its update, applied as an additive bias. Compared against learned low-rank linear adapters, this straightforward average update achieves competitive, if not superior, results.
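To make the metric comparison in item 1 concrete, here is a minimal sketch on a toy residual stack (not the paper's implementation): a static cosine-similarity block-influence score of the kind used in prior work, and a Monte Carlo estimate of the Shapley value, in which each block's importance is its average marginal effect on a task loss over random block orderings. The module structure, loss, and sampling budget are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): two block-importance metrics on a
# toy residual stack. The "static" cosine-similarity influence is computed in one
# forward pass from hidden states; the Monte Carlo Shapley estimate averages each
# block's marginal contribution to a task loss over random block orderings.
import torch, torch.nn as nn, random

torch.manual_seed(0)
d, n_blocks, n_tokens = 64, 8, 32
blocks = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
                        for _ in range(n_blocks)])  # stand-ins for transformer blocks
x = torch.randn(n_tokens, d)                        # stand-in hidden states
target = torch.randn(n_tokens, d)                   # stand-in target defining a "task loss"

def forward(h, keep):
    # Residual stack: a skipped block acts as the identity.
    for i, blk in enumerate(blocks):
        if i in keep:
            h = h + blk(h)
    return h

@torch.no_grad()
def cosine_influence():
    """Static metric: 1 - cos(block input, block output), one pass over all blocks."""
    h, scores = x, []
    for blk in blocks:
        out = h + blk(h)
        cos = nn.functional.cosine_similarity(h, out, dim=-1).mean()
        scores.append(1.0 - cos.item())   # low similarity -> block changes the stream a lot
        h = out
    return scores

@torch.no_grad()
def shapley_influence(n_perms=100):
    """Adaptive metric: Monte Carlo Shapley value of each block w.r.t. the task loss."""
    def loss(keep):
        return nn.functional.mse_loss(forward(x, keep), target).item()
    phi = [0.0] * n_blocks
    for _ in range(n_perms):
        order = random.sample(range(n_blocks), n_blocks)
        keep, prev = set(), loss(set())
        for i in order:
            keep.add(i)
            cur = loss(keep)
            phi[i] += (prev - cur) / n_perms   # marginal loss reduction from adding block i
            prev = cur
    return phi

print("cosine influence:", cosine_influence())
print("shapley estimate:", shapley_influence())
```

In practice both estimators would be run over real transformer blocks and a calibration set, with the lowest-scoring blocks removed; the cosine score needs only a single forward pass, whereas the Shapley estimate is costlier and, as the paper observes, depends on the task used to define the loss.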

Experimental Design and Findings

The experiments are comprehensive, utilizing two notable models, LLaMa-2 7b and Mistral 7b. The evaluation spans multiple metrics and tasks, providing nuanced insights into the impact of depth pruning. Key findings include:

  • Block Influence Metrics: Static metrics such as cosine similarity offer stable performance across a broad range of tasks. The Shapley value, an adaptive metric, yields a larger improvement in model loss but can hurt performance on specific tasks, such as MMLU accuracy. This suggests potential for task-specific pruning strategies.
  • Layer Pruning: When layers are evaluated individually, self-attention layers can be pruned with minimal impact on overall performance, in contrast to feed-forward layers. This insight is valuable for optimizing model efficiency, since the self-attention mechanism accounts for a large share of compute and KV-cache memory.
  • Performance Recovery: Simple emulated updates, which replace a pruned block with its average update, effectively mitigate performance drops. This approach is on par with, or outperforms, more complex learning-based techniques such as low-rank adapters, offering a pragmatic way to maintain accuracy after pruning (see the sketch following this list).
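As a concrete illustration of the recovery comparison above, the sketch below removes a single toy residual block and restores its contribution in two ways: an emulated update that adds the empirical mean of the block's output computed on calibration data, and a small low-rank linear adapter trained to mimic the block. The block definition, calibration data, rank, and training schedule are illustrative assumptions rather than the authors' setup.

```python
# Illustrative sketch (assumptions, not the authors' code): recovering performance
# after removing one residual block. The "emulated update" replaces the block with
# the empirical mean of its output delta on calibration data; a low-rank linear
# adapter is the learning-based alternative discussed in the paper.
import torch, torch.nn as nn

torch.manual_seed(0)
d = 64
block = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))  # block to prune
calib = torch.randn(1024, d)                                        # calibration hidden states

# 1) Emulated update: additive bias = mean of the block's residual update.
with torch.no_grad():
    mean_update = block(calib).mean(dim=0)            # shape (d,)

def pruned_with_bias(h):
    return h + mean_update                            # block replaced by a constant bias

# 2) Low-rank linear adapter trained to mimic the removed block's update.
rank = 4
adapter = nn.Sequential(nn.Linear(d, rank, bias=False), nn.Linear(rank, d, bias=False))
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)
for _ in range(200):
    idx = torch.randint(0, calib.size(0), (64,))
    h = calib[idx]
    loss = nn.functional.mse_loss(adapter(h), block(h).detach())
    opt.zero_grad(); loss.backward(); opt.step()

def pruned_with_adapter(h):
    return h + adapter(h)                             # block replaced by a low-rank map

with torch.no_grad():
    h = torch.randn(8, d)
    full = h + block(h)                               # original (unpruned) output
    print("bias recovery error:   ", nn.functional.mse_loss(pruned_with_bias(h), full).item())
    print("adapter recovery error:", nn.functional.mse_loss(pruned_with_adapter(h), full).item())
```

The constant-bias variant requires only one pass over the calibration data and no training, which is what makes its competitive accuracy against the learned adapter appealing in practice.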

Implications and Future Directions

The implications of this research are multifaceted, impacting both theoretical understanding and practical deployment of LLMs. Practically, the insights into block and layer-specific pruning can lead to significant reductions in computational and memory requirements, making LLMs more accessible for deployment at scale.

Theoretically, the work opens avenues for further exploration of adaptive metrics tailored to specific tasks, potentially leveraging the nuances of Shapley values. The trade-offs identified between different metrics and tasks underline the need for more sophisticated, perhaps hybrid, pruning strategies that can balance performance across varied applications.

Future research may focus on dynamically adjusting model architecture based on real-time performance feedback, further optimizing efficiency without a priori fixed pruning schedules. Additionally, enhancing the robustness of simple performance recovery techniques could provide more reliable fallback mechanisms, ensuring models maintain high utility even with significant structural modifications.

In conclusion, the paper "A deeper look at depth pruning of LLMs" provides a rigorous and detailed examination of pruning strategies, presenting actionable insights and laying the groundwork for future advancements in the efficient deployment of LLMs.
