Essay: SparseGPT: Efficient Pruning of Large-Scale LLMs
Introduction and Core Contribution
The paper "SparseGPT: Massive LLMs Can be Accurately Pruned in One-Shot" presents a novel methodology for pruning large-scale generative pretrained transformer (GPT) models, achieving significant sparsity without the necessity for retraining. This approach, termed SparseGPT, addresses the challenge of compressing LLMs, which possess millions to billions of parameters, thus reducing computational costs and resource intensity.
Whereas most prior post-training compression of such models has focused on quantization, SparseGPT targets pruning, a complementary way of reducing model size. Its contributions are demonstrated on the largest open-source LLMs available at the time, OPT-175B and BLOOM-176B, which it can prune to roughly 60% unstructured sparsity with negligible loss in accuracy, as measured by perplexity and zero-shot task accuracy.
SparseGPT Methodology
SparseGPT works by reformulating the pruning problem as a series of extremely large-scale sparse regression problems, one per layer. This perspective lets the method solve for sparsity masks and updated weights with a new approximate sparse regression solver that is efficient enough to scale to models with more than 100 billion parameters. Because pruning is performed locally, layer by layer, SparseGPT avoids any need for global gradient information, which would add substantial computational cost.
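As a concrete illustration of the layer-wise view, the sketch below prunes one linear layer in NumPy: given dense weights W and calibration inputs X, each row keeps the weights with the highest OBS-style saliency and re-fits them by least squares so that the sparse layer approximately reproduces the dense outputs, i.e. minimizes ||WX − ŴX||². This is a simplified stand-in under stated assumptions, not the paper's much more efficient solver; the function name, damping constant, and per-row loop are illustrative choices.

```python
import numpy as np

def prune_layer_rowwise(W, X, sparsity=0.5):
    """Toy layer-wise sparse regression: pick a mask per row via an
    OBS-style saliency score, then re-fit the surviving weights so the
    pruned layer reproduces the dense layer's outputs on X."""
    d_out, d_in = W.shape
    H = X @ X.T + 1e-2 * np.eye(d_in)         # damped Hessian proxy X X^T
    hinv_diag = np.diag(np.linalg.inv(H))     # diagonal of H^{-1}
    k = int(round(d_in * (1 - sparsity)))     # weights kept per row
    W_hat = np.zeros_like(W)

    for i in range(d_out):
        saliency = W[i] ** 2 / hinv_diag      # cost of zeroing each weight
        keep = np.argsort(saliency)[-k:]      # indices of weights to keep
        # Least-squares re-fit on the kept inputs:
        # minimize ||W[i] X - w X[keep]||^2 over the kept weights w.
        W_hat[i, keep] = np.linalg.lstsq(X[keep].T, W[i] @ X, rcond=None)[0]
    return W_hat

# Usage: X_calib is a (d_in, n_samples) matrix of calibration activations.
# W_sparse = prune_layer_rowwise(W_dense, X_calib, sparsity=0.6)
```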
The approach builds on Optimal Brain Surgeon (OBS) weight updates, combining them with layer-wise sparse regression to balance sparsity against accuracy. SparseGPT also supports semi-structured sparsity patterns, such as the 2:4 and 4:8 configurations that are compatible with existing hardware acceleration for sparse inference.
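The snippet below sketches what an n:m pattern such as 2:4 means in practice: within every group of m consecutive weights along a row, only n may remain nonzero. For simplicity the mask here is chosen by weight magnitude; SparseGPT itself selects the mask with its Hessian-based criterion, and the helper name and shapes are hypothetical.

```python
import numpy as np

def apply_nm_sparsity(W, n=2, m=4):
    """Zero out the (m - n) smallest-magnitude weights in every group of m
    consecutive weights along each row -- the n:m pattern (e.g. 2:4) that
    sparse tensor-core hardware can accelerate. Magnitude-based selection
    is used here only for illustration."""
    d_out, d_in = W.shape
    assert d_in % m == 0, "row length must be divisible by the group size m"
    groups = W.reshape(d_out, d_in // m, m)
    # Rank weights inside each group by absolute value and keep the top n.
    top_idx = np.argsort(np.abs(groups), axis=-1)[..., -n:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, top_idx, True, axis=-1)
    return (groups * mask).reshape(d_out, d_in)

# Example: a 2:4 pattern keeps exactly 2 of every 4 consecutive weights.
W_24 = apply_nm_sparsity(np.random.randn(4, 8), n=2, m=4)
```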
Experimental Validation
SparseGPT was evaluated extensively on standard language-modeling benchmarks and zero-shot tasks. The experiments consistently show that it outperforms magnitude-based pruning and other state-of-the-art post-training methods, with the gap widening for models beyond roughly 10 billion parameters.
Across datasets, SparseGPT retains accuracy at far higher sparsity levels than prior methods, removing more than 100 billion weights from models like OPT-175B and BLOOM-176B with minimal degradation. This robustness indicates that the approach adapts well to very large model architectures.
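For context, perplexity comparisons of this kind are typically obtained by running the dense and pruned checkpoints over the same token stream and exponentiating the average negative log-likelihood. The sketch below is a hypothetical helper assuming a Hugging Face-style causal language model that returns a .loss when given labels; it is not taken from the paper's evaluation code.

```python
import math
import torch

@torch.no_grad()
def perplexity(model, token_ids, seq_len=2048, device="cuda"):
    """Sliding-window perplexity of a causal LM over a flat 1-D tensor of
    token ids; the usual yardstick for comparing dense vs. pruned models."""
    model.eval().to(device)
    nll, n_tokens = 0.0, 0
    for start in range(0, token_ids.numel() - seq_len, seq_len):
        chunk = token_ids[start:start + seq_len].unsqueeze(0).to(device)
        out = model(chunk, labels=chunk)      # HF-style models return .loss
        nll += out.loss.item() * (seq_len - 1)
        n_tokens += seq_len - 1
    return math.exp(nll / n_tokens)
```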
Implications and Future Directions
The success of SparseGPT has both theoretical and practical implications. Theoretically, the results suggest that large GPT models contain enough parameter redundancy that substantial sparsity is achievable without retraining. Practically, this eases deployment by reducing memory and compute requirements, and contributes to greener AI by lowering the energy cost of inference.
SparseGPT opens avenues for further research into sparsity in large models and into architectures designed for sparse execution. Future work may examine applying SparseGPT during end-to-end training, or jointly optimizing it with other compression techniques such as quantization, potentially achieving better trade-offs between model size, speed, and accuracy.
Conclusion
SparseGPT represents a pivotal stride toward efficient and scalable pruning of massive LLMs. By providing a practical and theoretically sound approach to model compression, it significantly advances the field of model efficiency. Its ability to maintain high accuracy even with high levels of sparsity positions it as a critical tool in the ongoing effort to optimize LLMs for real-world applications and resource constraints. As the landscape of LLMs continues to evolve, tools like SparseGPT will be instrumental in bringing cutting-edge AI capabilities into accessible and sustainable use.