The Unreasonable Ineffectiveness of the Deeper Layers (2403.17887v1)

Published 26 Mar 2024 in cs.CL, cs.LG, and stat.ML

Abstract: We empirically study a simple layer-pruning strategy for popular families of open-weight pretrained LLMs, finding minimal degradation of performance on different question-answering benchmarks until after a large fraction (up to half) of the layers are removed. To prune these models, we identify the optimal block of layers to prune by considering similarity across layers; then, to "heal" the damage, we perform a small amount of finetuning. In particular, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single A100 GPU. From a practical perspective, these results suggest that layer pruning methods can complement other PEFT strategies to further reduce computational resources of finetuning on the one hand, and can improve the memory and latency of inference on the other hand. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge.

Authors (5)
  1. Andrey Gromov (50 papers)
  2. Kushal Tirumala (17 papers)
  3. Hassan Shapourian (43 papers)
  4. Paolo Glorioso (32 papers)
  5. Daniel A. Roberts (22 papers)
Citations (55)

Summary

An Analysis of "The Unreasonable Ineffectiveness of the Deeper Layers"

The paper "The Unreasonable Ineffectiveness of the Deeper Layers" by Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, and Daniel A. Roberts investigates a layer-pruning strategy for large-scale open-weight pretrained LLMs. Their primary contribution is the empirical finding that significant fractions of model layers, particularly the deeper ones, can be pruned with minimal degradation in performance across various question-answering (QA) benchmarks. The implications of their work span both practical efficiency improvements and theoretical insights into the architecture and robustness of modern LLMs.

Summary of Findings

The key finding of this paper is that models such as Llama-2-70B can tolerate the removal of up to nearly half of their layers before experiencing a critical degradation in performance. This robustness is observed across multiple model families and benchmarks, indicating that the deeper layers may not be as crucial as previously assumed and challenging the notion that depth is uniformly critical for maintaining high performance.

Methodology

To prune the models, the authors compute the angular distance between representations at different layers,

$$d\big(x^{(\ell)}, x^{(\ell+n)}\big) = \frac{1}{\pi} \arccos\!\left( \frac{x^{(\ell)}_T \cdot x^{(\ell+n)}_T}{\left\lVert x^{(\ell)}_T \right\rVert \, \left\lVert x^{(\ell+n)}_T \right\rVert} \right),$$

where $x^{(\ell)}_T$ denotes the activation of the final token $T$ at layer $\ell$. They identify the most redundant (most similar) block of layers, remove it, and, to mitigate any resulting performance drop, apply parameter-efficient finetuning (PEFT), specifically quantization and Low-Rank Adapters (QLoRA). This combined strategy allows the researchers to perform each pruning experiment on a single A100 GPU.
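
The block-selection heuristic is simple to express in code. The sketch below is a minimal illustration under assumptions (it presumes last-token activations have already been collected for a sample of inputs); the function names and data layout are ours, not the authors' released code.

```python
# Minimal sketch of the angular-distance block-selection heuristic.
# Assumes `hidden_states` is a list of tensors, one per layer (index 0 being the
# embedding output), each of shape [num_examples, hidden_dim] holding the
# activation of the final token T for every example.
import torch
import torch.nn.functional as F

def angular_distance(h_a: torch.Tensor, h_b: torch.Tensor) -> torch.Tensor:
    """d(x^(l), x^(l+n)) = arccos(cosine similarity) / pi, averaged over examples."""
    cos = F.cosine_similarity(h_a, h_b, dim=-1).clamp(-1.0, 1.0)
    return torch.arccos(cos).mean() / torch.pi

def most_prunable_block(hidden_states: list[torch.Tensor], n: int) -> int:
    """Return the layer index l* minimizing d(x^(l), x^(l+n)); the block of
    layers (l*+1, ..., l*+n) is then the candidate for removal."""
    num_layers = len(hidden_states) - 1
    distances = torch.stack([
        angular_distance(hidden_states[ell], hidden_states[ell + n])
        for ell in range(num_layers - n + 1)
    ])
    return int(distances.argmin())
```

For example, `hidden_states` can be populated from `model(input_ids, output_hidden_states=True).hidden_states` in Hugging Face Transformers by taking the last-token slice of each layer's output.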

Evaluation

The effectiveness of this pruning strategy is evaluated on several LLMs, including the Llama-2, Qwen, Mistral, and Phi-2 models, using benchmarks such as MMLU (Massive Multitask Language Understanding) and BoolQ (Boolean Questions). Their experiments reveal:

  1. Performance Robustness: Models retain high performance on QA tasks up to pruning fractions of 20-55%, depending on the model family and size. For instance, Llama-2-70B retains robustness until approximately 50% of its layers are pruned.
  2. Healing Efficacy: After pruning, a small amount of finetuning (termed "healing") modestly improves QA performance and is especially critical for the autoregressive loss, which otherwise increases sharply without it; a minimal QLoRA healing sketch follows this list.
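
To make the healing step concrete, the following is a hedged sketch of removing a contiguous block of decoder layers from a Llama-style Hugging Face checkpoint and attaching 4-bit QLoRA adapters for a brief finetune. The checkpoint name, block indices, LoRA rank, and target modules are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch: prune a block of decoder layers, then "heal" with QLoRA.
# Checkpoint, block indices, and LoRA hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # assumed checkpoint for illustration
    quantization_config=bnb_config,
    device_map="auto",
)

# Remove the block of layers chosen by the similarity-based selection step.
start, n = 21, 8                        # placeholder values
keep = [layer for i, layer in enumerate(model.model.layers)
        if not (start <= i < start + n)]
model.model.layers = torch.nn.ModuleList(keep)
model.config.num_hidden_layers = len(keep)

# Attach low-rank adapters; only these small matrices are trained during healing.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# A standard Trainer / SFTTrainer loop on a small finetuning corpus then
# performs the brief "healing" finetune.
```

Because only the quantized base weights and small adapter matrices are kept in memory, this adapter-only update is consistent with the paper's claim that each experiment fits on a single A100 GPU.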

Key Insights and Implications

Several theoretical and practical insights can be derived from these findings:

  1. Parameter Utilization: The robustness of LLMs to layer pruning suggests a potential inefficiency in the current utilization of deeper layers. Either current pretraining methods are not optimizing these parameters effectively, or the shallow layers are playing a disproportionately significant role in storing and processing information.
  2. Design of Efficient Models: Understanding that deeper layers can be pruned without severe performance loss opens pathways for designing more compute and memory-efficient models. This could significantly reduce the resource requirements for running large models, making them more accessible for practical applications such as real-time inference on consumer-grade hardware.
  3. Implications for Theoretical Research: By sharpening the understanding of layer significance, these results motivate a deeper investigation into the design and training procedures of LLMs. Specifically, whether different tasks require different depths for optimal performance, and how layer-wise similarity metrics can guide further architectural refinements, remain open questions for future research.

Future Directions

The paper concludes by suggesting several directions for future research, such as exploring better layer-pruning and healing strategies, understanding the decoupling of QA performance from next-token prediction loss, and investigating how different pretraining methods and datasets influence the ability to prune. A particularly intriguing direction is examining how the deeper layers could be used more effectively, potentially leading to training paradigms that leverage all model parameters more fully.

In summary, this paper significantly contributes to the understanding and practical handling of LLMs by demonstrating that substantial layer pruning is feasible and beneficial. This finding not only aids in resource optimization but also prompts a reevaluation of how these models are architecturally and functionally understood.
