- The paper presents a retraining procedure that integrates Gaussian mixture models as priors to cluster weights into shared groups.
- The approach simultaneously achieves quantization and pruning, compressing models up to 64.2x with minimal accuracy loss.
- The method enables efficient deployment of deep learning models on resource-constrained devices by reducing computational and memory demands.
Analyzing Soft Weight-Sharing Techniques for Neural Network Compression
The paper "Soft Weight-Sharing for Neural Network Compression" presents a novel approach for compressing neural networks by exploiting the concepts of soft weight-sharing. The authors focus on reducing the computational, memory, and energy requirements of deep neural networks, which are often substantial due to their typically large parameter spaces. The compression of neural networks has significant implications, especially for deploying models on resource-constrained devices like mobile phones.
Key Contributions and Methodology
The method proposed by Ullrich, Meeds, and Welling involves a retraining procedure using a mixture of Gaussians as a prior over the network weights. The primary goal is to cluster the weights into a set of predefined groups, thereby achieving quantization and pruning simultaneously in a simplified retraining process. This approach diverges from traditional techniques that typically perform pruning, quantization, and retraining as separate stages.
Theoretical Foundations:
The paper draws upon the Minimum Description Length (MDL) principle, which frames model compression as a trade-off between model complexity and fitting accuracy. By combining variational inference with the MDL principle, the authors reinterpret weight compression as a variational learning problem. This theoretical underpinning aligns compression and model regularization objectives, arguing that a suitable prior can facilitate encoding network parameters with minimal loss of predictive performance.
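Concretely, and in notation close to the paper's (the symbols below are a reconstruction rather than a quotation), the mixture prior and the retraining objective take the following form, with a coefficient $\tau$ trading the error term off against the complexity term:

```latex
% Mixture-of-Gaussians prior over the I network weights, using J+1 components;
% one component is typically fixed at zero to induce pruning.
p(\mathbf{w}) \;=\; \prod_{i=1}^{I} \sum_{j=0}^{J} \pi_j \,
    \mathcal{N}\!\left(w_i \mid \mu_j, \sigma_j^2\right)

% Retraining objective: negative log-likelihood (error term) plus a scaled
% complexity term; \tau controls the accuracy/compressibility trade-off.
\mathcal{L}\!\left(\mathbf{w}, \{\mu_j, \sigma_j, \pi_j\}\right)
    \;=\; -\log p(\mathcal{D} \mid \mathbf{w})
    \;-\; \tau \log p\!\left(\mathbf{w}, \{\mu_j, \sigma_j, \pi_j\}\right)
```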
Soft Weight-Sharing Mechanism:
The soft weight-sharing mechanism places a Gaussian Mixture Model (GMM) prior over the network weights, so that each weight is softly assigned to a mixture component whose mean is optimized jointly with the weights during retraining. This clustering drives groups of weights to "share" the same value, effectively reducing the number of distinct parameters that must be stored. The authors also address the practical intricacies of initializing the mixture components and detail a factorized Dirac (point-mass) posterior for retraining pre-trained networks, which simplifies integration with existing models.
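To make the mechanism concrete, the following is a minimal PyTorch sketch of how such a mixture prior can be folded into retraining. It is an illustration under stated assumptions, not the authors' reference implementation: the names (`gmm_log_prior`, `tau`) and the parameterization via log mixing proportions and log standard deviations are choices made here for clarity.

```python
import math
import torch

def gmm_log_prior(weights, log_pi, mu, log_sigma):
    """Log-probability of all weights under a mixture of Gaussians.

    weights:   1-D tensor of (flattened) network weights.
    log_pi:    log mixing proportions, shape (J,), assumed normalized
               (e.g. produced by log_softmax over unconstrained logits).
    mu:        component means, shape (J,); one is typically fixed at 0.
    log_sigma: log standard deviations, shape (J,).
    """
    sigma = log_sigma.exp()
    w = weights.reshape(-1, 1)                      # (num_weights, 1)
    # Per-weight, per-component log densities, broadcast to (num_weights, J).
    log_comp = (
        log_pi
        - 0.5 * math.log(2.0 * math.pi)
        - log_sigma
        - 0.5 * ((w - mu) / sigma) ** 2
    )
    # Mixture log-density per weight, summed over all weights.
    return torch.logsumexp(log_comp, dim=1).sum()

# During retraining, the scaled negative log-prior is added to the task loss:
#   loss = task_loss - tau * gmm_log_prior(all_weights, log_pi, mu, log_sigma)
# Gradients then flow into both the network weights and the mixture parameters,
# so the prior is learned jointly with the weights.
```

Learning log-parameters keeps the mixing proportions and standard deviations positive without explicit constraints; it is one of several workable parameterizations rather than the one prescribed by the paper.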
Experimental Evaluation
The proposed method was evaluated on several neural network architectures, including LeNet models and a light version of ResNet. The experiments demonstrated that soft weight-sharing achieves state-of-the-art compression ratios while maintaining competitive accuracy: LeNet-300-100, for example, was compressed by a factor of 64.2 without significant loss of accuracy. Incorporating hyper-priors, such as Gamma and Beta distributions over the mixture parameters, further improved the model's flexibility and compressibility.
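The compression figures above rest on a post-retraining step in which each weight is collapsed onto the mean of its most responsible mixture component, with weights claimed by the zero-centered component pruned outright; only a small codebook of means, the cluster indices, and the sparsity pattern then need to be stored. A hedged sketch of that assignment step, reusing the hypothetical mixture parameterization from the snippet above:

```python
import math
import torch

@torch.no_grad()
def quantize_weights(weights, log_pi, mu, log_sigma, zero_idx=0):
    """Hard-assign each weight to its most responsible mixture component and
    replace it with that component's mean; weights assigned to the
    zero-centered component (index zero_idx) are set to exactly 0 (pruned)."""
    sigma = log_sigma.exp()
    w = weights.reshape(-1, 1)
    # Unnormalized log responsibilities, shape (num_weights, J).
    log_resp = (
        log_pi
        - 0.5 * math.log(2.0 * math.pi)
        - log_sigma
        - 0.5 * ((w - mu) / sigma) ** 2
    )
    assignment = log_resp.argmax(dim=1)             # hard cluster index per weight
    quantized = mu[assignment].clone()
    quantized[assignment == zero_idx] = 0.0         # prune the zero component
    return quantized.reshape(weights.shape), assignment
```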
Implications and Future Work
The paper offers a fresh perspective on model compression by unifying pruning and quantization within a single prior-based retraining objective. This contribution is a step toward democratizing deep learning by allowing models to run efficiently on edge devices with significant resource constraints. Notably, learning the prior jointly with the weights introduces additional flexibility into the model-compression ecosystem.
However, the computational cost and implementation complexity of the soft weight-sharing method pose challenges, particularly for large models with tens of millions of parameters or more. The authors suggest potential scalability remedies, including stochastic optimization strategies for handling larger parameter spaces more efficiently.
Speculative Future Directions
Future work could extend beyond Dirac distributions and explore more comprehensive and adaptive Bayesian models to support complex network architectures. Additionally, incorporating structured pruning at different network levels, such as convolutional filters, could leverage this compression method's strengths, enabling speed enhancements and further memory savings. Furthermore, the method holds promise for training networks from scratch, potentially obviating the need for large pre-trained models.
In summary, this paper presents a compelling method for neural network compression using soft weight-sharing, underlined by strong theoretical support from MDL and Bayesian inference principles. The approach achieves impressive compression rates with minimal accuracy loss, offering a tractable solution for deploying deep learning models in computationally constrained environments.