Scalable Methods for 8-bit Training of Neural Networks (1805.11046v3)

Published 25 May 2018 in cs.LG and stat.ML

Abstract: Quantized Neural Networks (QNNs) are often used to improve network efficiency during the inference phase, i.e. after the network has been trained. Extensive research in the field suggests many different quantization schemes. Still, the number of bits required, as well as the best quantization scheme, are yet unknown. Our theoretical analysis suggests that most of the training process is robust to substantial precision reduction, and points to only a few specific operations that require higher precision. Armed with this knowledge, we quantize the model parameters, activations and layer gradients to 8-bit, leaving at a higher precision only the final step in the computation of the weight gradients. Additionally, as QNNs require batch-normalization to be trained at high precision, we introduce Range Batch-Normalization (BN) which has significantly higher tolerance to quantization noise and improved computational complexity. Our simulations show that Range BN is equivalent to the traditional batch norm if a precise scale adjustment, which can be approximated analytically, is applied. To the best of the authors' knowledge, this work is the first to quantize the weights, activations, as well as a substantial volume of the gradients stream, in all layers (including batch normalization) to 8-bit while showing state-of-the-art results over the ImageNet-1K dataset.

Scalable Methods for 8-bit Training of Neural Networks

This paper addresses the challenge of implementing efficient and scalable methods for low-bit quantization in neural network training, with a particular focus on 8-bit training. The authors meticulously analyze various quantization approaches, including binary, ternary, and n-bit quantization, providing theoretical foundations for understanding the geometric implications of these methods through cosine similarity analysis between quantized and original weight vectors.

Theoretical Foundations of Quantization

The paper begins with a mathematical formulation of cosine similarity as a measure of correlation between the original weights and their quantized versions. For binary quantization, where weights are mapped onto {-1, 1}, the analysis establishes that the expected angle between the original and quantized vectors is upper bounded by approximately 37 degrees. Ternary quantization introduces a third state, zero, determined by a threshold t. The analysis shows that the angle is minimized, i.e., the correlation is strongest, when the threshold is set to about 0.6 standard deviations of the weight distribution. This is a significant insight, as it identifies a practical balance between quantization efficiency and loss of information.
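
As a quick numerical check (not from the paper; a minimal NumPy sketch assuming Gaussian-distributed weights and a simple sign/threshold quantizer), one can quantize a synthetic weight vector and measure the angle directly; the binary case lands near 37 degrees, and the ternary angle is smallest for thresholds around 0.6 standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000)  # synthetic Gaussian weight vector

def angle_deg(a, b):
    """Angle between two vectors, in degrees."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(cos))

# Binary quantization: w -> sign(w). For Gaussian weights the expected
# angle is arccos(sqrt(2/pi)) ~ 37 degrees.
print(f"binary: {angle_deg(w, np.sign(w)):.2f} deg")

# Ternary quantization: zero out weights whose magnitude is below t*std.
# Sweeping t shows the angle is minimized near t ~ 0.6.
sigma = w.std()
for t in (0.2, 0.4, 0.6, 0.8, 1.0):
    q = np.where(np.abs(w) > t * sigma, np.sign(w), 0.0)
    print(f"ternary, t = {t:.1f}*std: {angle_deg(w, q):.2f} deg")
```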

For n-bit quantization, the paper considers quantizing weights with a uniform error distribution. By modeling the quantization error, the authors show that additional precision bits yield a more favorable error distribution and, consequently, a smaller angle between the original and quantized weights. This theoretically substantiates why higher-precision quantization mitigates errors due to accumulated noise during training.
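
The same kind of check extends to n-bit quantization; the hypothetical sketch below (again assuming Gaussian weights and a simple min/max uniform quantizer, not the paper's exact scheme) shows the angle between original and quantized weights shrinking as precision bits are added.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000)

def angle_deg(a, b):
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(cos))

def uniform_quantize(x, n_bits):
    """Uniform quantizer over the dynamic range of x (2**n_bits levels)."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2 ** n_bits - 1)
    return lo + np.round((x - lo) / step) * step

# More bits -> smaller, roughly uniform quantization error -> smaller angle.
for n_bits in (2, 4, 8):
    print(f"{n_bits}-bit: {angle_deg(w, uniform_quantize(w, n_bits)):.3f} deg")
```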

Practical Implementations and Implications

The authors adopt the GEMMLOWP quantization scheme in their implementation, demonstrating that this well-established integer quantization method achieves satisfactory results when integrated into the back-propagation algorithm. The work also underscores the importance of stochastic rounding during quantization: unlike deterministic rounding, an unbiased rounding scheme avoids systematic errors that would otherwise accumulate over training and degrade model accuracy.
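
To illustrate why unbiased rounding matters, the sketch below (a simplified stand-in, not the paper's GEMMLOWP kernel) applies a uniform affine quantizer with either round-to-nearest or stochastic rounding; averaged over many samples, the stochastic variant recovers the true mean while deterministic rounding leaves a systematic offset.

```python
import numpy as np

rng = np.random.default_rng(0)

def affine_quantize(x, n_bits, stochastic=False):
    """Simplified uniform affine quantizer (scale derived from min/max of x)."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2 ** n_bits - 1)
    r = (x - lo) / scale                             # real-valued level index
    if stochastic:
        q = np.floor(r + rng.uniform(size=x.shape))  # round up with prob frac(r)
    else:
        q = np.round(r)                              # deterministic nearest level
    return lo + q * scale                            # dequantize back to float

# One value repeated many times: stochastic rounding is unbiased in expectation,
# while round-to-nearest carries a systematic per-value offset.
x = np.full(100_000, 0.123)
x[0], x[-1] = 0.0, 1.0                               # pin the dynamic range to [0, 1]
print("true mean          :", x.mean())
print("nearest-rounded    :", affine_quantize(x, 8).mean())
print("stochastic-rounded :", affine_quantize(x, 8, stochastic=True).mean())
```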

For experimental validation, the paper tests the aggressive Quantized Back-Propagation (QBP) approach on standard benchmarks such as the CIFAR-10 and ImageNet datasets. These experiments demonstrate the feasibility of QBP and of its variant, Ternarized Back-Propagation (TBP), which notably reduces computational cost by substituting multiply-accumulate (MAC) operations with cheap XOR-based operations. The results show that adequately widening a model's layers can mitigate the accuracy loss typically associated with lower-precision models.
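
To make the MAC-to-XOR substitution concrete, the toy example below (a generic illustration for the fully binarized case, not the paper's kernel) computes a dot product between two ±1 vectors using a single XOR and a population count instead of multiplications.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
a = rng.choice([-1, 1], size=n)
b = rng.choice([-1, 1], size=n)

# Pack each +/-1 vector into a bit mask (bit 1 encodes +1, bit 0 encodes -1).
a_bits = int("".join("1" if v > 0 else "0" for v in a), 2)
b_bits = int("".join("1" if v > 0 else "0" for v in b), 2)

# a_i * b_i is +1 when the bits agree and -1 when they differ, so
# dot(a, b) = n - 2 * popcount(a XOR b) -- no multiplications needed.
xor_dot = n - 2 * bin(a_bits ^ b_bits).count("1")
print(xor_dot, int(a @ b))  # both print the same value
```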

Results and Future Prospects

The numerical results show narrow accuracy gaps between low-precision and full-precision models when widened architectures are used, suggesting that the approach scales without significant computational overhead. Training a TBP network shows that increasing filter sizes satisfactorily compensates for the precision loss, making this a viable strategy for AI systems designed for resource-constrained environments.

Overall, this paper paves the way for future research into optimizing quantization methods for neural networks, with theoretical insights backing its practical implementation. While the proposed strategy reduces the bit-width of training without considerably sacrificing performance, advances in adaptive quantization mechanisms may offer further efficiency gains. Future exploration of mixed-precision training and hardware-aware quantization algorithms, building on this theoretical foundation, could drive the development of more efficient AI models adaptable to increasingly complex tasks and environments.

Authors (4)
  1. Ron Banner (20 papers)
  2. Itay Hubara (19 papers)
  3. Elad Hoffer (23 papers)
  4. Daniel Soudry (76 papers)
Citations (315)