A Programmable Approach to Neural Network Compression
The paper "A Programmable Approach to Neural Network Compression" introduces an innovative framework known as Condensa, aimed at automating and optimizing neural network compression. Neural networks, notorious for their high parameter count and necessity for high precision, often possess a degree of redundancy that can be exploited to reduce both their memory footprint and computational demands without significant loss of accuracy. Model compression techniques such as weight pruning and quantization play a pivotal role here; however, the challenge has traditionally been to determine the most effective compression strategy and target sparsity.
Condensa addresses these challenges by providing a programmable environment in which users specify compression strategies as concise Python code. A novel Bayesian optimization algorithm then automatically infers sparsity levels that satisfy user-defined objectives. The framework's central strength is that a single specification can be optimized across a variety of deep neural network architectures and hardware platforms.
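The snippet below mimics the programmable style described here using plain Python stand-ins; `Prune`, `Quantize`, and `Compose` are assumed names for illustration and not necessarily Condensa's exact API:

```python
import numpy as np
from typing import Callable, Sequence

# Illustrative stand-ins; Condensa exposes similar composable
# operators, but these names and signatures are assumptions.
Scheme = Callable[[np.ndarray], np.ndarray]

def Prune(sparsity: float) -> Scheme:
    def apply(w: np.ndarray) -> np.ndarray:
        k = int(sparsity * w.size)
        t = np.partition(np.abs(w).ravel(), k - 1)[k - 1] if k else -np.inf
        return np.where(np.abs(w) <= t, 0.0, w)
    return apply

def Quantize(dtype=np.float16) -> Scheme:
    # Simulated cast: round to the target precision, keep original dtype
    return lambda w: w.astype(dtype).astype(w.dtype)

def Compose(steps: Sequence[Scheme]) -> Scheme:
    def apply(w: np.ndarray) -> np.ndarray:
        for step in steps:
            w = step(w)
        return w
    return apply

# A user-defined scheme: aggressive pruning followed by fp16 storage.
MEM = Compose([Prune(0.95), Quantize(np.float16)])
w_compressed = MEM(np.random.randn(128, 128).astype(np.float32))
```

Treating schemes as ordinary functions is what makes them composable: a memory-focused scheme and a throughput-focused scheme can share the same underlying operators.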
Key Contributions
- Condensa lets users express comprehensive compression schemes directly: strategies are scripted as Python operators that can be composed to match specific architectural and hardware requirements, as sketched above.
- A core feature is Condensa's sample-efficient Bayesian optimization algorithm. Because each objective evaluation involves compressing and fine-tuning a model, keeping the number of evaluations small is what keeps the search computationally tractable (the sketch following this list illustrates the loop).
- The paper reports strong empirical results: memory footprint reductions of up to 188x and runtime throughput improvements of up to 2.59x, using at most ten samples per search. These results underscore how efficiently the proposed method navigates the compression space.
- The authors introduce a new acquisition function, Domain-Restricted Upper Confidence Bound (DR-UCB), which progressively narrows the search to home in on the highest sparsity that still satisfies the accuracy constraint, further enhancing the framework's utility in practical scenarios.
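To illustrate the flavor of this search, the following simplified one-dimensional sketch pairs a Gaussian-process surrogate with a UCB rule and shrinks the candidate domain to sparsities above the best feasible point found so far. This is an interpretation of the idea behind DR-UCB, not the paper's exact algorithm; the toy objective and the `target` and `beta` values are assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def evaluate_accuracy(s: float) -> float:
    """Toy stand-in for the expensive objective: in reality this would
    compress the model at sparsity `s`, fine-tune it, and measure accuracy."""
    return 0.93 - 0.3 * s ** 4

target = 0.90                           # accuracy constraint (assumed value)
beta = 2.0                              # UCB exploration weight (assumed)
domain = np.linspace(0.0, 1.0, 200)     # candidate sparsity ratios
best_s = 0.0                            # best feasible sparsity so far

X, y = [], []
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):                     # at most ten samples, as in the paper
    if X:
        gp.fit(np.array(X).reshape(-1, 1), np.array(y))
        mu, sigma = gp.predict(domain.reshape(-1, 1), return_std=True)
    else:
        mu, sigma = np.zeros_like(domain), np.ones_like(domain)
    mask = domain > best_s              # the "domain restriction" in DR-UCB
    if not mask.any():
        break
    ucb = mu + beta * sigma             # optimism in the face of uncertainty
    s_next = float(domain[mask][np.argmax(ucb[mask])])
    acc = evaluate_accuracy(s_next)
    X.append(s_next)
    y.append(acc)
    if acc >= target:                   # feasible: raise the domain's floor
        best_s = max(best_s, s_next)

print(f"highest feasible sparsity found: {best_s:.3f}")
```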
Significance and Implications
Condensa's automation of compression-parameter selection has significant practical implications. It replaces the manual trial-and-error process traditionally associated with model compression, thereby accelerating the deployment of efficient neural networks in resource-constrained environments such as mobile and edge devices. This becomes increasingly valuable as models grow in size and complexity.
From a theoretical perspective, integrating Bayesian optimization into the compression loop paves the way for further research into hyperparameter tuning for deep learning architectures. As neural networks grow and specialize, automating these iterative processes becomes essential for conserving computational resources and energy.
Future Directions
The research opens several avenues for enhancement and exploration. Future work might focus on extending Condensa's capabilities to consider additional hyperparameters such as quantization data types and compression of non-parameter components like activations and batch normalization layers. Another promising direction includes combining Condensa with automated machine learning frameworks, further expanding its utility in developing efficient, scalable networks.
In conclusion, this paper presents a well-substantiated approach to model compression, offering a programmable framework that reduces the manual burden on researchers while delivering measurable performance improvements. Condensa exemplifies a forward-thinking approach to making neural network compression robust, accessible, and adaptive to diverse deployment environments.