
Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (1510.00149v5)

Published 1 Oct 2015 in cs.CV and cs.NE

Abstract: Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three-stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing; finally, we apply Huffman coding. After the first two steps we retrain the network to fine-tune the remaining connections and the quantized centroids. Pruning reduces the number of connections by 9x to 13x; quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x, from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, the compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.

Deep Compression (Han et al., 2015) introduces a three-stage pipeline designed to significantly reduce the storage footprint of deep neural networks, making them practical for deployment on resource-constrained devices like mobile phones and embedded systems. The core idea is to combine pruning, trained quantization with weight sharing, and Huffman coding to achieve high compression rates without sacrificing model accuracy. The paper demonstrates compression factors of 35× to 49× on prominent networks like AlexNet and VGG-16.

The motivation behind Deep Compression stems from the observation that state-of-the-art neural networks are often computationally and memory-intensive. Large model sizes pose challenges for mobile application deployment (due to download size constraints) and energy efficiency (due to costly off-chip DRAM accesses). The paper highlights that DRAM access is orders of magnitude more energy-consuming than on-chip SRAM access or computation. Therefore, reducing model size such that it fits into on-chip caches is a primary goal.

Here are the practical implementation details and applications of each stage:

1. Network Pruning

Goal: Remove redundant connections (weights) that contribute little to the network's accuracy.

Method:

  1. Train the network: Start with a pre-trained, dense neural network.
  2. Prune small weights: Identify and remove connections whose absolute weight values fall below a predefined threshold, chosen to reach the desired sparsity level.
  3. Retrain the network: Fine-tune the weights of the remaining sparse connections. This step is crucial to recover the accuracy lost to pruning.

Implementation:

  • Pruning produces sparse weight matrices, which must be stored efficiently, typically in Compressed Sparse Row (CSR) or Compressed Sparse Column (CSC) format.
  • The paper stores the index difference between consecutive non-zero elements instead of absolute indices. Because non-zero elements in a typical sparse network are relatively close together, this difference is small and can be encoded with fewer bits (8 bits for convolutional layers, 5 bits for fully connected layers). When a difference exceeds what the encoding can represent, a filler zero is inserted.
  • During retraining, a mask is applied to the weight updates so that the pruned connections remain zero (see the sketch after this list).

Results: Pruning alone reduces the number of weights by 9× (AlexNet) to 13× (VGG-16) while maintaining accuracy. This significantly reduces the storage required for the weight values themselves, but introduces the overhead of storing indices.
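The pruning loop and the sparse storage scheme can be sketched compactly in NumPy. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the function names, the quantile-based threshold, and the 3-bit index width in the example are hypothetical (the paper encodes index differences with 8 bits for CONV layers and 5 bits for FC layers).

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights to reach the target sparsity.

    Returns the pruned weights and a binary mask of surviving connections.
    """
    threshold = np.quantile(np.abs(W), sparsity)
    mask = (np.abs(W) > threshold).astype(W.dtype)
    return W * mask, mask

def masked_sgd_step(W, grad, mask, lr=1e-2):
    """Retraining step: masking the update keeps pruned connections at zero."""
    return (W - lr * grad) * mask

def relative_index_encode(flat_w, index_bits=3):
    """Store index differences between consecutive non-zeros.

    When a gap exceeds what `index_bits` can represent, a filler zero is
    emitted and counting continues from that padded position.
    """
    max_gap = 2 ** index_bits
    diffs, values = [], []
    last = -1
    for i in np.flatnonzero(flat_w):
        gap = i - last
        while gap > max_gap:          # pad long gaps with filler zeros
            diffs.append(max_gap)
            values.append(0.0)
            gap -= max_gap
        diffs.append(gap)
        values.append(flat_w[i])
        last = i
    return np.array(diffs), np.array(values)

# Usage: prune, retrain with the mask, then store the sparse layer compactly.
W = np.random.randn(64, 64).astype(np.float32)
W_pruned, mask = prune_by_magnitude(W, sparsity=0.9)
diffs, values = relative_index_encode(W_pruned.ravel(), index_bits=3)
```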

2. Trained Quantization and Weight Sharing

Goal: Reduce the number of bits required to represent each remaining weight by having multiple connections share the same weight value.

Method:

  1. Apply k-means clustering: After pruning and retraining, apply one-dimensional k-means clustering to the non-zero weights of each layer (independently per layer). Each cluster centroid represents a shared weight value.
  2. Weight sharing: Replace each weight with the index of the cluster centroid it belongs to. Only the set of centroid values (the "codebook") and the per-weight indices need to be stored.
  3. Fine-tune shared weights: Retrain the network again, this time updating the centroid values instead of individual weights. Gradients of all weights belonging to a cluster are summed and used to update that cluster's centroid.

Implementation:

  • The network weights are replaced by indices into a codebook (a small array of shared weight values).
  • During the forward and backward passes, an indirect lookup retrieves the actual weight value from its index.
  • The gradient update rule for a centroid $C_k$ is $\frac{\partial \mathcal{L}}{\partial C_k} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial W_{ij}} \mathds{1}(I_{ij}=k)$, where $W_{ij}$ is the weight at position $(i,j)$, $I_{ij}$ is its centroid index, and $\mathds{1}(\cdot)$ is the indicator function (see the sketch below).
  • The paper investigates centroid initialization methods (Forgy, density-based, linear) and finds that linear initialization generally yields the best accuracy by ensuring large weights are well represented (Han et al., 2015).

Results: Quantization further compresses the pruned network. For both AlexNet and VGG-16, weights are quantized to 8 bits in the CONV layers and 5 bits in the FC layers, reducing each non-zero weight from 32-bit floating point to 5-8 bits. The combination of pruning and quantization achieves compression rates of 27× to 31× for ImageNet models without accuracy loss.
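A minimal NumPy sketch of per-layer weight sharing with linear centroid initialization, plus the centroid gradient accumulation described above. The hand-rolled 1-D k-means and the function names are illustrative assumptions; a real pipeline would cluster only the surviving (non-pruned) weights and keep the sparsity mask alongside the index matrix.

```python
import numpy as np

def quantize_layer(W, bits=5, iters=20):
    """Cluster a layer's non-zero weights into 2**bits shared values.

    Centroids are linearly initialized over [min, max] of the weights,
    the scheme the paper reports working best. Returns the codebook and
    a per-weight index matrix.
    """
    k = 2 ** bits
    nz = W[W != 0]
    codebook = np.linspace(nz.min(), nz.max(), k)            # linear init
    for _ in range(iters):                                   # plain 1-D k-means
        assign = np.abs(nz[:, None] - codebook[None, :]).argmin(axis=1)
        for c in range(k):
            members = nz[assign == c]
            if members.size:
                codebook[c] = members.mean()
    # Nearest-centroid index for every weight; pruned positions should be
    # tracked by a separate mask in a real implementation.
    indices = np.abs(W[..., None] - codebook).argmin(axis=-1)
    return codebook, indices

def centroid_gradient(grad_W, indices, k):
    """dL/dC_k = sum of dL/dW_ij over all weights assigned to centroid k."""
    return np.bincount(indices.ravel(), weights=grad_W.ravel(), minlength=k)

# One fine-tuning step on the shared weights (lr, grad_W, mask assumed given):
#   codebook -= lr * centroid_gradient(grad_W * mask, indices, len(codebook))
#   W_effective = codebook[indices] * mask   # lookup for the next forward pass
```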

3. Huffman Coding

Goal: Leverage the non-uniform distribution of the quantized weight indices and the sparse index differences for additional lossless compression.

Method: Apply standard Huffman coding to the bitstreams representing the quantized weight indices and the sparse index differences; symbols that appear more frequently are assigned shorter codewords.

Implementation: This is typically an offline step applied after the pruning and quantization fine-tuning are complete. The Huffman codebook must be stored along with the compressed model data (a minimal coder is sketched below).

Results: Huffman coding provides an additional 20%-30% compression, boosting the total compression rate from 27× (pruning + quantization) to 35× for AlexNet and from 31× to 49× for VGG-16 (Han et al., 2015).
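A minimal Huffman coder over a stream of quantized indices, using only the Python standard library. Symbol counting and bit packing are simplified for illustration, and the 5-bit fixed-width baseline in the usage example is an assumption matching the FC-layer quantization above.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman codebook (symbol -> bitstring) from a symbol stream."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: a single symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie_breaker, {symbol: partial_code})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Usage: encode one layer's quantized weight indices and compare against a
# fixed-width 5-bit encoding (the FC-layer bit width used above).
indices = [3, 3, 3, 7, 3, 1, 3, 0, 3, 3, 7, 1]   # toy index stream
book = huffman_code(indices)
compressed_bits = sum(len(book[s]) for s in indices)
print(f"{compressed_bits} bits vs {len(indices) * 5} bits fixed-width")
```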

Combined Results and Practical Implications

The three stages of Deep Compression work synergistically. Pruning reduces the number of weights, making quantization more effective as there are fewer values to cluster. Quantization reduces the bits per remaining weight. Huffman coding provides a final lossless squeeze. The paper shows that pruning and quantization combined achieve much higher compression before accuracy drops compared to applying either method alone (Figure 6).

The resulting compressed models are significantly smaller: AlexNet shrinks from 240MB to 6.9MB, and VGG-16 from 552MB to 11.3MB. These sizes are small enough to potentially fit into on-chip SRAM caches, drastically reducing energy consumption during inference by avoiding costly DRAM accesses. This is particularly beneficial for real-time, low-latency applications (batch size = 1) common in embedded and mobile contexts.
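As a quick consistency check on the reported figures (no new measurements, just the arithmetic implied by the text):

```python
# Reported model sizes imply the stated end-to-end compression factors.
print(240 / 6.9)     # AlexNet: ~34.8x, reported as 35x
print(552 / 11.3)    # VGG-16:  ~48.8x, reported as 49x
```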

The paper benchmarks the pruned network (as off-the-shelf libraries don't fully support the quantized sparse format) and shows notable speedups (3× to 4×) and energy efficiency improvements (3× to 7×) for batch size = 1 on CPU, GPU, and mobile GPU (Tegra K1). This demonstrates the practical performance benefits of a reduced memory footprint and sparsity, even without fully exploiting the weight sharing at the hardware/library level.

Table 4: Time (µs) for the FC6 layer, batch = 1

                 AlexNet FC6 (dense)   AlexNet FC6 (sparse)   VGG-16 FC6 (dense)   VGG-16 FC6 (sparse)
Titan X                 541.5                 134.8                 1467.8                 167.0
Core i7-5930k          7516.2                3066.5                35022.8                3774.3
Tegra K1              12437.2                2879.3                35427.0                4377.2

Table 5: Power (Watts) for the FC6 layer, batch = 1

                 AlexNet FC6 (dense)   AlexNet FC6 (sparse)   VGG-16 FC6 (dense)   VGG-16 FC6 (sparse)
Titan X                 157                   181                    166                   189
Core i7-5930k            83.5                  42.3                   70.6                  38.0
Tegra K1                  5.1                   5.9                    5.3                   5.6

Detailed timings and power consumption values for specific layers and batch sizes can be found in Appendix A of the paper (Han et al., 2015); the FC6 figures are reproduced in Table 4 and Table 5 above.

Implementation Considerations and Future Work

The paper notes that fully realizing the speedup and energy efficiency potential of the quantized, sparse network requires specialized hardware or optimized software libraries that can efficiently handle the indirect lookup via indices and the compressed index representations. Standard libraries like cuSPARSE or MKL SPBLAS primarily support basic sparse formats (such as CSR/CSC) but may not optimize for the specific structures resulting from trained quantization and index-difference encoding. This motivated subsequent work on dedicated hardware accelerators such as EIE (Han et al., 2016), tailored to such compressed models.

In summary, Deep Compression (Han et al., 2015) provides a powerful, multi-stage pipeline for compressing deep neural networks for deployment on resource-constrained platforms. Its practical impact lies in enabling large models to fit into limited memory, reducing bandwidth requirements, and improving energy efficiency, particularly crucial for real-time mobile and embedded AI applications. The techniques described are practical steps that can be implemented using existing deep learning frameworks with careful management of sparse data structures and retraining phases.

Authors (3)
  1. Song Han (155 papers)
  2. Huizi Mao (13 papers)
  3. William J. Dally (21 papers)
Citations (8,372)