Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning (2208.11580v2)

Published 24 Aug 2022 in cs.LG

Abstract: We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data. This problem has become popular in view of the emerging software and hardware support for executing models compressed via pruning and/or quantization with speedup, and well-performing solutions have been proposed independently for both compression approaches. In this paper, we introduce a new compression framework which covers both weight pruning and quantization in a unified setting, is time- and space-efficient, and considerably improves upon the practical performance of existing post-training methods. At the technical level, our approach is based on an exact and efficient realization of the classical Optimal Brain Surgeon (OBS) framework of [LeCun, Denker, and Solla, 1990] extended to also cover weight quantization at the scale of modern DNNs. From the practical perspective, our experimental results show that it can improve significantly upon the compression-accuracy trade-offs of existing post-training methods, and that it can enable the accurate compound application of both pruning and quantization in a post-training setting.

Analyzing the "Optimal Brain Compression" Framework for Post-Training Quantization and Pruning

The paper "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning" presents an innovative approach to model compression in deep neural networks (DNNs) within a one-shot/post-training setting. The authors, Elias Frantar, Sidak Pal Singh, and Dan Alistarh, from IST Austria and ETH Zurich, propose a unified framework that addresses both weight pruning and quantization, leveraging the classical Optimal Brain Surgeon (OBS) framework. This contribution is of particular significance in scenarios where only limited calibration data is available, yet efficient deep neural network deployment is required.

Technical Approach

The core of the paper is the application of an enhanced OBS framework to both pruning and quantization tasks. The framework uses second-order information, in the form of layer-wise Hessian matrices, to determine which weights can be pruned or quantized with minimal impact on accuracy. The authors design algorithms that substantially reduce the computational cost of this approach, making it feasible at the scale of modern DNNs. Specifically, by exploiting the fact that the layer-wise reconstruction error decomposes across rows of the weight matrix, they bring the cost down to $O(d \cdot d_{col}^2)$, where $d$ is the total number of weights in the layer and $d_{col}$ is the number of weights per row of the weight matrix.
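
To make the greedy procedure concrete, below is a minimal NumPy sketch of the per-row OBS step, assuming the layer-wise squared-error objective with row Hessian $H = 2XX^\top$ computed from the calibration inputs $X$; the function and variable names are illustrative, not the authors' reference implementation.

```python
import numpy as np

def obs_prune_row(w, H_inv, sparsity):
    """Greedy OBS pruning of one weight-matrix row (illustrative sketch).

    w        : (d_col,) one row of the layer's weight matrix
    H_inv    : (d_col, d_col) inverse of the row Hessian H = 2 X X^T,
               with X the calibration inputs to the layer
    sparsity : fraction of the row's weights to prune
    """
    w, H_inv = w.copy(), H_inv.copy()
    d_col = w.shape[0]
    alive = np.ones(d_col, dtype=bool)
    for _ in range(int(sparsity * d_col)):
        # OBS saliency (up to a constant 1/2): the loss increase from
        # optimally zeroing weight q; prune the cheapest live weight.
        scores = np.where(alive, w ** 2 / np.diag(H_inv), np.inf)
        q = int(np.argmin(scores))

        # Optimal compensating update of all remaining weights.
        w = w - (w[q] / H_inv[q, q]) * H_inv[:, q]
        w[q] = 0.0
        alive[q] = False

        # Remove q from the inverse Hessian via one step of Gaussian
        # elimination -- O(d_col^2), the key to the method's efficiency.
        H_inv = H_inv - np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
        H_inv[q, :] = 0.0
        H_inv[:, q] = 0.0
        H_inv[q, q] = 1.0  # dummy pivot so later reads stay well-defined
    return w, alive
```

Each iteration costs $O(d_{col}^2)$ for the rank-one update of the inverse Hessian; with up to $d_{col}$ removals per row and $d / d_{col}$ rows, this recovers the $O(d \cdot d_{col}^2)$ total quoted above. The quantization variant of the framework replaces the zeroing step with rounding the selected weight to the nearest point on the quantization grid, followed by the analogous compensating update.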

Experimental Evaluation

The paper provides a thorough experimental evaluation across standard tasks such as image classification, object detection, and language modeling. Significant improvements in compression-accuracy trade-offs are reported compared to existing post-training methods. For instance, a ResNet50 model compressed with the proposed method exhibits only a slight drop in accuracy (around 2%) while achieving a 12× reduction in theoretical operations. In terms of measured runtime, a 4× speedup with only about a 1% accuracy drop is reported on a CPU-based sparsity-aware inference runtime.

Implications and Future Directions

This research potentially shifts the paradigm in post-training model compression by demonstrating that carefully applied, combined pruning and quantization can achieve results competitive with far more resource-intensive retraining-based approaches. The practical implications are significant, particularly as hardware increasingly supports sparse and low-bitwidth operations. Promising avenues remain for extending the framework to other forms of structured pruning and to very large-scale models, such as those used in natural language processing.

Conclusion

The proposed Optimal Brain Compression framework represents a substantial stride toward efficient post-training model compression. By realizing the OBS framework exactly and efficiently at the scale of modern DNNs, it addresses the dual challenge of maintaining model accuracy while achieving significant compression and computational savings. Future research could build on this groundwork to explore additional optimizations, broader model architectures, and deeper integration with emerging hardware capabilities.

Authors (3)
  1. Elias Frantar
  2. Sidak Pal Singh
  3. Dan Alistarh
Citations (173)