
Streamlining Redundant Layers to Compress Large Language Models (2403.19135v4)

Published 28 Mar 2024 in cs.CL and cs.AI

Abstract: This paper introduces LLM-Streamline, a pioneering work on layer pruning for LLMs. It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned. LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on a target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers and mitigate performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency.
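The layer-pruning step described in the abstract can be illustrated with a minimal sketch. The sketch below assumes that a layer's importance is measured by how much it changes the hidden state (here, one minus the cosine similarity between a layer's input and output hidden states, averaged over calibration data), and that the least important contiguous block of layers is selected for removal. All names (hidden_in, hidden_out, target count) are illustrative assumptions, not the paper's actual API.

```python
# Hedged sketch of LLM-Streamline-style layer pruning (assumed scoring rule).
import torch
import torch.nn.functional as F

def layer_importance(hidden_in: torch.Tensor, hidden_out: torch.Tensor) -> float:
    """Score a layer by how much it changes the hidden state.

    hidden_in / hidden_out: (batch, seq_len, dim) activations collected on
    calibration data. A layer whose output is nearly identical to its input
    (high cosine similarity) is considered less important.
    """
    cos = F.cosine_similarity(hidden_in.flatten(1), hidden_out.flatten(1), dim=-1)
    return 1.0 - cos.mean().item()

def choose_layers_to_prune(importances: list[float], num_prune: int) -> range:
    """Pick the contiguous block of `num_prune` layers with the lowest total importance."""
    best_start, best_score = 0, float("inf")
    for start in range(len(importances) - num_prune + 1):
        score = sum(importances[start:start + num_prune])
        if score < best_score:
            best_start, best_score = start, score
    return range(best_start, best_start + num_prune)
```

In the layer-replacement step, the removed block would then be substituted by a lightweight network (for example, a small MLP) trained to map the block's input hidden states to its original output hidden states, which is what the sketch's calibration activations would be reused for.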
