- The paper introduces Quant-Noise, a training approach that quantizes only a random subset of weights in each forward pass, so unbiased gradients still flow through the remaining weights and the model becomes robust to quantization.
- It combines Quant-Noise with advanced techniques like Product Quantization to achieve high compression rates while preserving accuracy.
- The study demonstrates that extreme model compression can be practically achieved for resource-constrained devices without significant performance loss.
Training with Quantization Noise for Extreme Model Compression
The paper presents an innovative approach to extreme model compression by introducing a method called Quant-Noise during training. This method specifically addresses the challenge of maintaining model accuracy after high levels of quantization, which is critical for deploying deep learning models in resource-constrained environments like mobile devices or embedded systems.
Technical Overview
Quant-Noise extends Quantization Aware Training (QAT) by quantizing only a randomly selected subset of weights during each forward pass. This stochastic selection lets unbiased gradients flow through the weights left unquantized in that pass, so the network learns to tolerate severe quantization noise rather than degrading under it. The goal is to control the noise injected during training so that the model is inherently resilient to the quantization method applied at inference time.
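The mechanism can be summarized in a few lines of PyTorch. The following is a minimal sketch under stated assumptions, not the authors' released implementation: the helper names (`fake_int8`, `quant_noise`), the noise rate `p`, and the per-weight masking are illustrative, and a real implementation targeting PQ would mask structured blocks of weights rather than individual entries.

```python
# Minimal sketch of Quant-Noise with a scalar int8-style quantizer
# (illustrative only, not the authors' released code).
import torch

def fake_int8(w: torch.Tensor) -> torch.Tensor:
    """Simulated int8 quantization: map to 256 levels, keep float dtype."""
    scale = w.abs().max() / 127.0 + 1e-8
    return torch.round(w / scale).clamp(-128, 127) * scale

def quant_noise(w: torch.Tensor, p: float = 0.5) -> torch.Tensor:
    """Quantize a random fraction p of the weights during this forward pass."""
    mask = (torch.rand_like(w) < p).to(w.dtype)   # subset quantized this pass
    noise = (fake_int8(w) - w).detach()           # quantization error, no gradient
    return w + mask * noise                       # quantized where mask=1, exact elsewhere

# Training-time usage: gradients reach every weight, but only a random subset
# sees quantization noise in any given forward pass.
weight = torch.randn(512, 512, requires_grad=True)
out = quant_noise(weight, p=0.5).sum()
out.backward()
```

At inference time, the quantizer is applied to all weights; the noise rate only matters during training.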
The authors notably apply Quant-Noise to high-performing quantization methods such as Product Quantization (PQ). In doing so, they mitigate the large approximation errors that typically accompany high-compression regimes. Their results set new state-of-the-art trade-offs between model size and accuracy, reaching compression rates that were previously impractical with only a modest loss in performance. Specifically, the paper reports a 16-layer Transformer reaching a perplexity (PPL) of 21.8 on WikiText-103 when compressed to only 38 MB, and an EfficientNet-B3 reaching 80.0% top-1 accuracy on ImageNet when compressed to a mere 3.3 MB.
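To make the PQ side concrete, the sketch below compresses a single weight matrix with off-the-shelf k-means. The block size of 8 and the 256-entry codebook are illustrative assumptions rather than the paper's exact configuration, and the centroid fine-tuning the authors perform is omitted.

```python
# Sketch of Product Quantization (PQ) applied to one weight matrix
# (illustrative settings, not the paper's exact configuration).
import numpy as np
from sklearn.cluster import KMeans

def pq_compress(weight: np.ndarray, block_size: int = 8, n_centroids: int = 256):
    """Split rows into contiguous blocks, cluster all blocks with k-means,
    and store only the shared codebook plus one index per block."""
    out_dim, in_dim = weight.shape
    assert in_dim % block_size == 0
    blocks = weight.reshape(-1, block_size)               # all sub-vectors
    km = KMeans(n_clusters=n_centroids, n_init=4, random_state=0).fit(blocks)
    codebook = km.cluster_centers_                         # (n_centroids, block_size)
    codes = km.labels_.astype(np.uint8)                    # one byte per block
    return codebook, codes, weight.shape

def pq_decompress(codebook, codes, shape):
    return codebook[codes].reshape(shape)

# A float32 weight of shape (512, 512) costs 1 MiB; after PQ it is roughly
# 32768 bytes of codes + 8192 bytes of codebook, about 40 KB (~26x smaller).
w = np.random.randn(512, 512).astype(np.float32)
codebook, codes, shape = pq_compress(w)
w_hat = pq_decompress(codebook, codes, shape)
```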
Numerical Results and Claims
The experimental results show that Quant-Noise is effective across quantization schemes, including int8, int4, and PQ. Combining Quant-Noise with PQ yields the largest gains in compression rate while maintaining strong accuracy: for instance, it compresses the Transformer model roughly 25x with minimal impact on its perplexity. The paper attributes this to the regularization effect of Quant-Noise, which it likens to DropConnect and LayerDrop, helping the network generalize well despite the quantization noise introduced during training.
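The masking from the earlier sketch is agnostic to the target quantizer, which is what makes the method applicable to int8, int4, and PQ alike. The hypothetical int4 helper and generic wrapper below are illustrative assumptions, and the 25x arithmetic in the comment is a back-of-the-envelope estimate rather than the paper's exact accounting.

```python
# Quant-Noise with an arbitrary target quantizer; here a hypothetical int4 variant.
import torch

def fake_int4(w: torch.Tensor) -> torch.Tensor:
    """Simulated 4-bit quantization: 16 symmetric levels."""
    scale = w.abs().max() / 7.0 + 1e-8
    return torch.round(w / scale).clamp(-8, 7) * scale

def quant_noise_with(w: torch.Tensor, quantizer, p: float = 0.5) -> torch.Tensor:
    """Apply Quant-Noise masking around any fake quantizer."""
    mask = (torch.rand_like(w) < p).to(w.dtype)
    return w + mask * (quantizer(w) - w).detach()

# Back-of-the-envelope for the ~25x figure: float32 stores 32 bits per weight,
# while PQ with 8-dimensional blocks and a 256-entry codebook stores 8 bits per
# block (1 bit per weight) plus a small shared codebook, i.e. a 25-30x reduction.
w = torch.randn(256, 256)
w_int4 = quant_noise_with(w, fake_int4, p=0.5)
```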
Implications and Future Directions
Practically, the findings of this paper have significant implications for deploying AI models in environments where computational resources and storage capacities are limited. The ability to achieve high compression rates with minimal accuracy loss opens up new avenues for AI applications in mobile computing, IoT, and edge devices.
Theoretically, the introduction of Quant-Noise as a means of promoting robustness to quantization noise presents an interesting paradigm: architectures can be prepared during training for extreme quantization and pruning, reducing the need for extensive post-training adjustments. This shifts part of the work of model optimization and efficiency into the training phase itself.
Future work could explore integrating Quant-Noise with other advanced model compression techniques like automated neural architecture search (NAS) or dynamic neural network reconfiguration. Moreover, examining the compatibility of Quant-Noise with other forms of network regularization could yield insights into creating more flexible and adaptive machine learning models capable of dynamic adjustments based on resource availability.
In summary, by introducing Quant-Noise, the authors offer a simple and practically effective method for extreme model compression, making significant strides toward efficient deployment of deep learning models in constrained environments.