A deeper look at depth pruning of LLMs (2407.16286v1)
Abstract: LLMs are not only resource-intensive to train but even more costly to deploy in production. Recent work has therefore attempted to prune blocks of LLMs using cheap proxies for block importance, effectively removing 10% of the blocks in well-trained Llama 2 and Mistral 7B models without any significant degradation of downstream metrics. In this paper, we explore different block importance metrics, considering adaptive metrics such as the Shapley value in addition to the static ones studied in prior work. We show that adaptive metrics exhibit a performance trade-off between tasks: an improvement on one task may degrade performance on another, owing to differences in the computed block influences. Furthermore, we extend the analysis from whole blocks to individual self-attention and feed-forward layers, and find that self-attention layers are markedly more amenable to pruning; up to 33% of them can be removed without any performance degradation on MMLU for Mistral 7B, significantly reducing the cost of maintaining the KV-cache. Finally, we study simple performance recovery techniques that emulate the pruned layers by training a lightweight additive bias or a low-rank linear adapter. Recovery with these emulated updates avoids performance degradation when pruning the initial blocks (up to 5% absolute improvement on MMLU) and is competitive with or superior to the learning-based technique.
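To make the contrast between static and adaptive importance metrics concrete, here is a minimal sketch that scores the blocks of a toy residual stack in two ways: a static proxy (one minus the cosine similarity between a block's input and output, in the spirit of ShortGPT-style metrics) and a Monte Carlo estimate of each block's Shapley value with respect to a calibration loss. The toy model, calibration batch, and all function names are illustrative assumptions, not the paper's actual code; the key property the sketch relies on is that skipping a residual block reduces it to the identity map.

```python
# Sketch: static vs. adaptive (Shapley) block-importance scoring on a toy stack.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """Toy pre-norm residual block standing in for a full transformer block."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.ff = nn.Linear(d, d)

    def forward(self, x):
        return x + self.ff(self.norm(x))

d, n_blocks, vocab = 64, 8, 100
embed = nn.Embedding(vocab, d)
blocks = nn.ModuleList(Block(d) for _ in range(n_blocks))
head = nn.Linear(d, vocab)
calib_batch = torch.randint(0, vocab, (4, 32))  # (batch, seq) calibration token ids

@torch.no_grad()
def eval_loss(active):
    """Next-token loss with only the blocks in `active` enabled."""
    x = embed(calib_batch)
    for i, blk in enumerate(blocks):
        if i in active:            # a pruned residual block reduces to the identity
            x = blk(x)
    logits = head(x)
    return F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                           calib_batch[:, 1:].reshape(-1)).item()

@torch.no_grad()
def static_importance():
    """Static proxy: 1 - cos(block input, block output), averaged over the batch."""
    x = embed(calib_batch)
    scores = []
    for blk in blocks:
        y = blk(x)
        cos = F.cosine_similarity(x.flatten(1), y.flatten(1)).mean().item()
        scores.append(1.0 - cos)   # low score => block barely changes the stream
        x = y
    return scores

def shapley_importance(n_perms=20):
    """Adaptive metric: Monte Carlo Shapley value of each block w.r.t. the loss."""
    phi = [0.0] * n_blocks
    for _ in range(n_perms):
        order = random.sample(range(n_blocks), n_blocks)
        active, prev = set(), eval_loss(set())
        for i in order:
            active.add(i)
            cur = eval_loss(active)
            phi[i] += (prev - cur) / n_perms  # marginal loss reduction from block i
            prev = cur
    return phi

print("static :", [round(s, 4) for s in static_importance()])
print("shapley:", [round(s, 4) for s in shapley_importance()])
```

Because the Shapley estimate averages a block's marginal contribution over random subsets of the other blocks, it is "adaptive": the same block can receive different scores depending on which calibration task defines the loss, which is exactly the source of the cross-task trade-off described in the abstract.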
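Similarly, the recovery technique of emulating a pruned layer can be sketched as a small regression problem: a low-rank linear adapter, whose bias term alone corresponds to the additive-bias variant, is trained to reproduce the removed block's output on calibration activations. The stand-in block, rank, and optimizer settings below are assumptions chosen for illustration, not the paper's configuration.

```python
# Sketch: emulating a pruned residual block with a low-rank linear adapter.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 64
pruned_block = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d))  # hypothetical stand-in
calib_in = torch.randn(512, d)                    # hidden states entering the pruned block
with torch.no_grad():
    calib_out = calib_in + pruned_block(calib_in)  # residual output we want to emulate

class LowRankAdapter(nn.Module):
    """x + up(down(x)): a rank-r linear stand-in for a pruned residual block."""
    def __init__(self, d, rank=4):
        super().__init__()
        self.down = nn.Linear(d, rank, bias=False)
        self.up = nn.Linear(rank, d, bias=True)   # the bias alone gives the additive-bias variant

    def forward(self, x):
        return x + self.up(self.down(x))

adapter = LowRankAdapter(d)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)
for step in range(500):
    opt.zero_grad()
    loss = F.mse_loss(adapter(calib_in), calib_out)  # regress onto the pruned block's output
    loss.backward()
    opt.step()
print(f"final emulation MSE: {loss.item():.6f}")
```

The appeal of this scheme is its cost: the adapter adds only 2·d·r + d parameters per pruned block and is fit by simple regression on cached activations, rather than by fine-tuning the full model.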
Authors: Shoaib Ahmed Siddiqui, Xin Dong, Greg Heinrich, Thomas Breuel, Jan Kautz, David Krueger, Pavlo Molchanov