
Learned Step Size Quantization (1902.08153v3)

Published 21 Feb 2019 in cs.LG and stat.ML

Abstract: Deep networks run with low precision operations at inference time offer power and space advantages over high precision alternatives, but need to overcome the challenge of maintaining high accuracy as precision decreases. Here, we present a method for training such networks, Learned Step Size Quantization, that achieves the highest accuracy to date on the ImageNet dataset when using models, from a variety of architectures, with weights and activations quantized to 2-, 3- or 4-bits of precision, and that can train 3-bit models that reach full precision baseline accuracy. Our approach builds upon existing methods for learning weights in quantized networks by improving how the quantizer itself is configured. Specifically, we introduce a novel means to estimate and scale the task loss gradient at each weight and activation layer's quantizer step size, such that it can be learned in conjunction with other network parameters. This approach works using different levels of precision as needed for a given system and requires only a simple modification of existing training code.

Learned Step Size Quantization

The paper "Learned Step Size Quantization" by Esser et al. introduces a novel method for training deep networks using low precision quantization of weights and activations. This approach, referred to as Learned Step Size Quantization (LSQ), demonstrates significant improvements in maintaining high accuracy on the ImageNet dataset, even with reduced precision levels of 2, 3, and 4 bits.

Key Contributions

  1. Quantizer Step Size Learning: LSQ learns each quantizer's step size during training by estimating and scaling the task loss gradient with respect to it. Because the step size is trained jointly with the other network parameters, the quantization mapping adapts as the weight and activation distributions change (see the equations after this list).
  2. Gradient Approximation: LSQ uses an approximation of the gradient through the quantizer that is sensitive to how close a value is to a transition between quantized states, whereas prior methods largely ignored these transitions.
  3. Gradient Scale Optimization: The paper proposes a heuristic that keeps the magnitude of step size updates commensurate with weight updates, which aids stable convergence. Each step size is initialized, and its gradient scaled, according to the layer's size and the number of quantization levels.
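
Concretely, writing out the quantizer from the paper's definitions: given a value v and a learned step size s, with clipping levels Q_N and Q_P (for b-bit signed data, Q_N = 2^(b-1) and Q_P = 2^(b-1) - 1; for unsigned data, Q_N = 0 and Q_P = 2^b - 1), the quantizer, its step size gradient, the gradient scales, and the step size initialization are:

```latex
\bar{v} = \left\lfloor \operatorname{clip}(v/s,\; -Q_N,\; Q_P) \right\rceil,
\qquad
\hat{v} = \bar{v} \, s

\frac{\partial \hat{v}}{\partial s} =
\begin{cases}
-v/s + \lfloor v/s \rceil & \text{if } -Q_N < v/s < Q_P \\
-Q_N & \text{if } v/s \le -Q_N \\
Q_P & \text{if } v/s \ge Q_P
\end{cases}

g_{\text{weight}} = \frac{1}{\sqrt{N_W \, Q_P}}, \qquad
g_{\text{activation}} = \frac{1}{\sqrt{N_F \, Q_P}}, \qquad
s_{\text{init}} = \frac{2 \, \langle |v| \rangle}{\sqrt{Q_P}}
```

Here N_W is the number of weights in a layer and N_F the number of features in an activation layer; the gradient scale g multiplies each step size's loss gradient so that its updates stay in proportion to the weight updates.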

Numerical Results and Comparison

LSQ achieves the highest reported accuracies for various network architectures at 2, 3, and 4 bits on ImageNet. Notably, it reaches full precision baseline accuracy for 3-bit models, marking a significant milestone in model quantization:

  • ResNet-18: Achieved 67.6% top-1 accuracy at 2-bit precision, outperforming previous approaches like QIL and LQ-Nets.
  • ResNet-34 and ResNet-50: Exhibited top-1 accuracies of 71.6% and 76.7% at 3-bit precision, respectively.

Methodological Insights

The LSQ method trains with standard backpropagation and stochastic gradient descent, using a custom gradient to handle the discontinuities introduced by rounding and clipping in the quantization process. Quantizing both weights and activations allows inference to use low precision integer operations, reducing computation and memory requirements.
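
To make the training mechanics concrete, below is a minimal PyTorch-style sketch of this setup using straight-through estimators. It is illustrative rather than the authors' code; the helper names grad_scale, round_pass, and lsq_quantize are assumptions:

```python
import math
import torch

def grad_scale(x, scale):
    # Forward: identity. Backward: incoming gradient is multiplied by `scale`.
    return (x - x * scale).detach() + x * scale

def round_pass(x):
    # Forward: round to nearest. Backward: straight-through (identity) gradient.
    return (x.round() - x).detach() + x

def lsq_quantize(v, s, q_n, q_p, g):
    # v: tensor to quantize; s: learnable step size (scalar Parameter);
    # q_n, q_p: negative/positive clipping levels; g: gradient scale.
    # Autograd through clamp and round_pass reproduces the paper's step size
    # gradient: -v/s + round(v/s) inside the clip range, -q_n or q_p outside.
    s = grad_scale(s, g)
    v_bar = round_pass(torch.clamp(v / s, -q_n, q_p))
    return v_bar * s  # "fake quantized" value used in the forward pass

# Example: quantize a weight tensor to 3 bits (signed).
b = 3
q_n, q_p = 2 ** (b - 1), 2 ** (b - 1) - 1          # clip levels: 4 and 3
w = torch.randn(64, 128, requires_grad=True)
s = torch.nn.Parameter(2 * w.detach().abs().mean() / math.sqrt(q_p))  # paper's init
g = 1.0 / math.sqrt(w.numel() * q_p)               # per-layer gradient scale
w_hat = lsq_quantize(w, s, q_n, q_p, g)
w_hat.sum().backward()                              # populates w.grad and s.grad
```

In practice, w_hat replaces w in the layer's forward pass, so the step size receives gradient signal from the task loss exactly as the paper describes; this is the "simple modification of existing training code" the abstract refers to.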

Implications and Future Directions

The success of LSQ in achieving competitive accuracy with low precision models underscores its potential for widespread adoption. The approach not only reduces memory footprints but also aligns with increasing industrial demand for energy-efficient AI models. Future developments could explore extending the LSQ framework to other tasks and model architectures. Additionally, integration within edge devices and real-time systems could provide further insights into its practical applications.

LSQ's ability to break through prior accuracy limitations sets a precedent for further research in gradient-aware quantization methods, potentially driving innovations in low precision network deployment for industrial applications. The capability of 3-bit models to reach full precision accuracy, especially with the aid of knowledge distillation, suggests a promising trajectory for future advances in the quantization of neural networks.

Authors (5)
  1. Steven K. Esser
  2. Jeffrey L. McKinstry
  3. Deepika Bablani
  4. Rathinakumar Appuswamy
  5. Dharmendra S. Modha