Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks (1909.13144v2)

Published 28 Sep 2019 in cs.LG and stat.ML

Abstract: We propose Additive Powers-of-Two~(APoT) quantization, an efficient non-uniform quantization scheme for the bell-shaped and long-tailed distribution of weights and activations in neural networks. By constraining all quantization levels as the sum of Powers-of-Two terms, APoT quantization enjoys high computational efficiency and a good match with the distribution of weights. A simple reparameterization of the clipping function is applied to generate a better-defined gradient for learning the clipping threshold. Moreover, weight normalization is presented to refine the distribution of weights to make the training more stable and consistent. Experimental results show that our proposed method outperforms state-of-the-art methods, and is even competitive with the full-precision models, demonstrating the effectiveness of our proposed APoT quantization. For example, our 4-bit quantized ResNet-50 on ImageNet achieves 76.6% top-1 accuracy without bells and whistles; meanwhile, our model reduces 22% computational cost compared with the uniformly quantized counterpart. The code is available at https://github.com/yhhhli/APoT_Quantization.

Additive Powers-of-Two Quantization: An Efficient Non-Uniform Discretization for Neural Networks

The paper presents a quantization approach that is both computationally efficient and adapted to the bell-shaped, long-tailed distributions of weights and activations observed in real neural networks. The proposed method, termed Additive Powers-of-Two (APoT) quantization, aims to improve both computational efficiency and accuracy relative to existing quantization techniques.

APoT quantization works by mapping weights and activations to quantization levels that are sums of Powers-of-Two (PoT) terms. This yields a non-uniform distribution of quantization levels that better fits the typical distribution of weights, thereby reducing quantization error. Notably, APoT achieves approximately twice the multiplication speed-up of traditional uniform quantization, leveraging the computational simplicity of arithmetic on powers of two.
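
To make the level construction concrete, the snippet below is a minimal sketch of building an APoT-style level grid as sums of PoT terms and projecting values onto it. The specific term sets, the handling of signs, and the scaling are illustrative assumptions for a 4-bit case; the exact level sets used in the paper and its repository may differ.

```python
# Minimal sketch of APoT-style level construction and nearest-level projection.
# The term sets below are assumptions for illustration, not the authors' released code.
import itertools
import torch

def apot_levels(term_sets, clip=1.0):
    """Enumerate quantization levels as sums of one entry (zero or a power of two)
    drawn from each term set, then scale by the clipping threshold."""
    levels = {round(sum(combo), 8) for combo in itertools.product(*term_sets)}
    return torch.tensor(sorted(levels)) * clip

def quantize_to_levels(x, levels):
    """Project each entry of x onto its nearest quantization level."""
    idx = torch.argmin((x.unsqueeze(-1) - levels).abs(), dim=-1)
    return levels[idx]

# Assumed example: two additive terms, each either zero or a power of two.
term_sets = [
    [0.0, 2**0, 2**-2, 2**-4],   # first PoT term
    [0.0, 2**-1, 2**-3, 2**-5],  # second PoT term
]
levels = apot_levels(term_sets, clip=1.0)

w = torch.randn(6).clamp(-1.0, 1.0)   # pretend these are already-clipped weights
w_q = quantize_to_levels(w.abs(), levels) * w.sign()  # quantize magnitudes, reapply sign
print(levels)
print(w, w_q)
```

Because every level is a short sum of powers of two, the multiplications in a convolution can be decomposed into shifts and adds, which is the source of the efficiency gain over generic fixed-point arithmetic.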

Key contributions of the paper include:

  1. APoT Quantization Scheme: By formulating quantization levels as sums of PoT terms, the authors provide a scheme that accommodates the non-uniform distribution of weights, allocating finer granularity where it is needed. This reduces computational cost significantly and improves accuracy, bringing quantized models closer to their full-precision counterparts.
  2. Reparameterized Clipping Function (RCF): The authors introduce a modified clipping function that yields a better-defined gradient for optimizing the clipping threshold, the parameter that sets the range of values retained during discretization. This allows the threshold to be learned more accurately during training.
  3. Weight Normalization: Weights are normalized to zero mean and unit variance before quantization, which stabilizes training by keeping the weight distribution consistent across iterations and makes learning the clipping threshold smoother and more effective. A minimal sketch of RCF combined with weight normalization follows this list.

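The sketch below shows one plausible way to wire these two training-side ideas together: normalize the weights, divide by a learnable clipping threshold, project onto the level grid, and rescale. The function names, the placeholder level grid, and the straight-through-estimator wiring are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: weight normalization plus a reparameterized clip-and-quantize step.
# The straight-through estimator and the placeholder level grid are illustrative assumptions.
import torch

def normalize_weights(w, eps=1e-5):
    """Shift/scale weights to zero mean and unit variance before quantization."""
    return (w - w.mean()) / (w.std() + eps)

def reparameterized_clip_quantize(w, alpha, levels):
    """Divide by the learnable clipping threshold alpha, clip to [-1, 1],
    project onto the level grid, then rescale by alpha. Because alpha enters
    both the clipping and the rescaling, it receives a useful gradient."""
    w_clipped = torch.clamp(w / alpha, -1.0, 1.0)
    idx = torch.argmin((w_clipped.unsqueeze(-1) - levels).abs(), dim=-1)
    w_q = levels[idx]
    # Straight-through estimator: forward pass uses the quantized values,
    # backward pass treats the projection as the identity.
    w_q = w_clipped + (w_q - w_clipped).detach()
    return alpha * w_q

# Usage with a placeholder grid (an APoT grid as in the earlier sketch would be used in practice):
levels = torch.tensor([-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0])
alpha = torch.tensor(2.0, requires_grad=True)
w = torch.randn(8)

w_q = reparameterized_clip_quantize(normalize_weights(w), alpha, levels)
w_q.sum().backward()
print(w_q, alpha.grad)   # alpha receives a gradient through the clip/rescale path
```
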
The paper provides empirical evidence supporting the efficacy of these methods, with experimental results showing that APoT quantized models outperform several state-of-the-art quantization techniques. For instance, a 4-bit quantized ResNet-50 on ImageNet achieves Top-1 accuracy of 76.6%, while reducing computational cost by 22% compared to uniformly quantized models. These performance metrics indicate the potential for APoT quantization to facilitate the deployment of neural networks on resource-constrained devices without substantial loss in accuracy.

Implications and Future Directions

From a theoretical perspective, the paper adds to the body of work on non-uniform quantization methods, highlighting the importance of aligning quantization levels with the inherent distribution of weights in neural networks. Practically, the proposed APoT quantization presents a viable solution for deploying deep learning models in edge computing scenarios where computational resources and power consumption are at a premium.

Looking forward, the paper opens several avenues for future research. Extending the framework to other neural network architectures, including recurrent and transformer models, might reveal further insights into the applicability and generalization of APoT quantization. Additionally, exploring adaptive methods for dynamically tuning the bit-width during training could enhance efficiency even further, particularly in heterogeneous computing environments. The integration of APoT with advanced hardware accelerators could also optimize latency and energy efficiency, broadening the scope for industrial applications.

Authors (3)
  1. Yuhang Li (102 papers)
  2. Xin Dong (90 papers)
  3. Wei Wang (1793 papers)
Citations (241)