Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity (2310.05175v3)

Published 8 Oct 2023 in cs.LG

Abstract: LLMs, renowned for their remarkable performance across diverse domains, present a challenge when it comes to practical deployment due to their colossal model size. In response to this challenge, efforts have been directed toward the application of traditional network pruning techniques to LLMs, uncovering a massive number of parameters that can be pruned in one-shot without hurting performance. Prevailing LLM pruning strategies have consistently adhered to the practice of uniformly pruning all layers at equivalent sparsity, resulting in robust performance. However, this observation stands in contrast to the prevailing trends observed in the field of vision models, where non-uniform layerwise sparsity typically yields stronger results. To understand the underlying reasons for this disparity, we conduct a comprehensive study and discover a strong correlation with the emergence of activation outliers in LLMs. Inspired by this finding, we introduce a novel LLM pruning methodology that incorporates a tailored set of non-uniform layerwise sparsity ratios, termed as Outlier Weighed Layerwise sparsity (OWL). The sparsity ratio of OWL is proportional to the outlier ratio observed within each layer, facilitating a more effective alignment between layerwise weight sparsity and outlier ratios. Our empirical evaluation, conducted across the LLaMA-V1 family and OPT, spanning various benchmarks, demonstrates the distinct advantages offered by OWL over previous methods. For instance, OWL exhibits a remarkable performance gain, surpassing the state-of-the-art Wanda and SparseGPT by 61.22 and 6.80 perplexity at a high sparsity level of 70%, respectively, while delivering 2.6x end-to-end inference speed-up in the DeepSparse inference engine. Codes are available at https://github.com/luuyin/OWL.

Outlier Weighed Layerwise Sparsity: Advanced Pruning Techniques for LLMs

The paper "Outlier Weighed Layerwise Sparsity (OWL 69): A Missing Secret Sauce for Pruning LLMs to High Sparsity" introduces a novel approach to pruning LLMs with a focus on leveraging the unique properties of activation outliers. OWL, as proposed by the authors, emerges as a groundbreaking concept that challenges the conventional wisdom of uniform sparsity pruning, advocating for a tailored, non-uniform layerwise sparsity approach. The paper provides an empirical investigation into the distribution of activation outliers across layers of LLMs and proposes a layerwise pruning strategy that aligns sparsity with outlier distributions, enhancing model performance and inference speed without requiring extensive retraining.

The authors analyze the limitations of existing pruning methodologies, such as SparseGPT and Wanda, which apply a uniform sparsity ratio to every layer. They reveal a strong relationship between the layerwise distribution of activation outliers and how well each layer tolerates pruning, suggesting that accounting for these outliers when allocating sparsity can yield substantial performance improvements. By assigning each layer a sparsity ratio according to the prevalence of outliers it contains, OWL preserves the outlier structure that is often pivotal to LLM performance.
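
To make the allocation concrete, the sketch below shows one way to turn per-layer outlier statistics into non-uniform sparsity targets. It assumes Wanda-style importance scores (|W_ij| * ||X_j||_2 from a small calibration set) have already been computed; the threshold `m`, the bound `lam`, and the function name `owl_layerwise_sparsity` are illustrative choices rather than the authors' exact implementation.

```python
import torch

def owl_layerwise_sparsity(layer_scores, target_sparsity=0.7, m=5.0, lam=0.08):
    """Sketch of OWL-style non-uniform sparsity allocation (illustrative).

    layer_scores: per-layer importance tensors, e.g. Wanda-style scores
        |W_ij| * ||X_j||_2 computed from a small calibration set.
    m:   a weight counts as an "outlier" if its score exceeds m * layer mean.
    lam: each layer's sparsity stays within target_sparsity +/- lam.
    """
    # Fraction of weights in each layer with unusually large scores.
    outlier_ratios = torch.tensor(
        [(s > m * s.mean()).float().mean().item() for s in layer_scores]
    )

    # Layers with more outliers should be pruned less: negate the outlier
    # ratio, center it, and bound the per-layer deviation by +/- lam.
    shift = (outlier_ratios.mean() - outlier_ratios).clamp(-lam, lam)

    # Re-center so the average sparsity still matches the global target.
    sparsities = target_sparsity + shift - shift.mean()
    return sparsities.clamp(0.0, 1.0)


# Toy usage: heavy-tailed random scores stand in for calibration statistics.
fake_scores = [torch.randn(512, 512).abs().pow(3) for _ in range(4)]
print(owl_layerwise_sparsity(fake_scores))
```

Bounding the adjustment to +/- lam and re-centering keeps the average sparsity at the global target while still pruning outlier-heavy layers more gently.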

Experimental results underscore the effectiveness of OWL, demonstrating notable improvements over baseline methods in both perplexity and inference efficiency. At a high sparsity level of 70%, OWL surpasses Wanda by 61.22 perplexity and SparseGPT by 6.80 perplexity, while delivering a 2.6x end-to-end inference speedup in the DeepSparse engine. OWL not only performs well on LLaMA-V1 and OPT models spanning parameter scales from billions to tens of billions, but also exhibits robustness across model sizes and architectures.
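
Given per-layer ratios like those above, pruning itself proceeds one layer at a time with the chosen scoring metric. The snippet below is a simplified, hypothetical illustration of that step (Wanda-style pruning actually ranks scores within each output row rather than across the whole layer):

```python
import torch

def prune_layer(weight, scores, sparsity):
    """Zero out the lowest-scored fraction of weights in a single layer.

    Simplified one-shot pruning step for illustration; Wanda itself compares
    scores per output row rather than across the whole layer.
    """
    k = int(sparsity * weight.numel())           # number of weights to drop
    if k == 0:
        return weight
    threshold = scores.flatten().kthvalue(k).values
    return weight * (scores > threshold)         # keep higher-scored weights


# Example: prune one toy layer at the sparsity OWL assigned to it.
w = torch.randn(512, 512)
x_norms = torch.rand(512)                        # stand-in for ||X_j||_2
scores = w.abs() * x_norms                       # Wanda-style metric
pruned = prune_layer(w, scores, sparsity=0.68)
print((pruned == 0).float().mean())              # approximately 0.68
```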

Theoretical implications of OWL extend beyond the immediate field of LLM pruning. This method promotes a deeper understanding of how feature magnitudes and network architectures interplay, potentially influencing future research on the compression and efficiency of neural networks. The findings suggest that the optimal pruning strategy should vary across different model architectures and use cases, taking into account unique characteristics such as layer-specific outlier distributions and their impact on computational resources.

The paper also explores practical applications of OWL in diverse contexts, including structured pruning, mixed-precision quantization, and low-rank approximation, suggesting promising avenues for deployment in hardware-constrained environments. The comprehensive evaluation attests to OWL's potential to reshape how LLM sparsity is approached, setting the stage for more nuanced strategies tailored to resource-limited deployment scenarios.

In summary, the introduction of Outlier Weighed Layerwise Sparsity marks a pivotal advancement in the field of model pruning. By addressing the limitations of conventional uniform layerwise sparsity, OWL not only advances the current understanding of LLM pruning strategies but also opens pathways for future research to explore adaptive pruning regimes that align model efficiency with practical deployment needs. As AI continues to progress towards more fine-grained, tailor-fit methodologies, OWL stands as a testament to the importance of considering emergent phenomena within large-scale models to drive forward the practical applicability and sustainability of artificial intelligence systems.

References (50)
  1. Leveraging redundancy in attention with reuse transformers. arXiv preprint arXiv:2110.06821, 2021.
  2. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020.
  3. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
  4. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
  5. LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems (NeurIPS), 2022.
  6. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  7. On random graphs i. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
  8. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning (ICML), pp.  2943–2952, 2020.
  9. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.
  10. Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning (ICML), 2023.
  11. The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
  12. Sparse GPU kernels for deep learning. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.  1–14. IEEE, 2020.
  13. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NeurIPS), pp.  1135–1143, 2015.
  14. Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pp.  293–299. IEEE, 1993.
  15. The emergence of essential sparsity in large pre-trained models: The weights that matter. arXiv preprint arXiv:2306.03805, 2023.
  16. Steven A Janowsky. Pruning versus clipping in neural networks. Physical Review A, 39(12):6600, 1989.
  17. The optimal BERT surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259, 2022.
  18. Optimal brain damage. In Advances in Neural Information Processing Systems (NeurIPS), pp.  598–605, 1989.
  19. Layer-adaptive sparsity for the magnitude-based pruning. arXiv preprint arXiv:2010.07611, 2020.
  20. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations (ICLR), 2019.
  21. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
  22. Towards optimal structured CNN pruning via generative adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.  2790–2799, 2019.
  23. Sparse training via boosting pruning plasticity with neuroregeneration. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  24. The unreasonable effectiveness of random pruning: Return of the most naive baseline for sparse training. arXiv preprint arXiv:2202.02643, 2022.
  25. Estimating the carbon footprint of BLOOM, a 176B parameter language model. arXiv preprint arXiv:2211.02001, 2022.
  26. LLM-Pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023.
  27. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016a.
  28. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016b.
  29. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
  30. Studying the plasticity in deep convolutional neural networks using random pruning. Machine Vision and Applications, 30(2):203–216, 2019.
  31. A topological insight into restricted boltzmann machines. Machine Learning, 104(2):243–270, Sep 2016. ISSN 1573-0565. doi: 10.1007/s10994-016-5570-z. URL https://doi.org/10.1007/s10994-016-5570-z.
  32. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9:1–12, 2018.
  33. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems (NeurIPS), pp.  107–115, 1989.
  34. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
  35. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  36. WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
  37. Movement pruning: Adaptive sparsity by fine-tuning. arXiv preprint arXiv:2005.07683, 2020.
  38. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.
  39. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
  40. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  41. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  42. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
  43. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations (ICLR), 2020.
  44. Rethinking the value of transformer components. arXiv preprint arXiv:2011.03803, 2020.
  45. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.
  46. Learning intrinsic sparse structures within long short-term memory. arXiv preprint arXiv:1709.05027, 2017.
  47. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning (ICML), pp.  38087–38099. PMLR, 2023.
  48. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  49. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  50. To prune, or not to prune: exploring the efficacy of pruning for model compression. In International Conference on Learning Representations Workshop (ICLRW), 2017.
Authors (13)
  1. Lu Yin (85 papers)
  2. You Wu (60 papers)
  3. Zhenyu Zhang (249 papers)
  4. Cheng-Yu Hsieh (23 papers)
  5. Yaqing Wang (59 papers)
  6. Yiling Jia (10 papers)
  7. Mykola Pechenizkiy (118 papers)
  8. Yi Liang (58 papers)
  9. Zhangyang Wang (374 papers)
  10. Shiwei Liu (75 papers)
  11. Gen Li (143 papers)
  12. Ajay Jaiswal (35 papers)
  13. Michael Bendersky (63 papers)
Citations (52)