Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets (1903.05662v4)

Published 13 Mar 2019 in cs.LG, math.OC, and stat.ML

Abstract: Training activation quantized neural networks involves minimizing a piecewise constant function whose gradient vanishes almost everywhere, which is undesirable for the standard back-propagation or chain rule. An empirical way around this issue is to use a straight-through estimator (STE) (Bengio et al., 2013) in the backward pass only, so that the "gradient" through the modified chain rule becomes non-trivial. Since this unusual "gradient" is certainly not the gradient of the loss function, the following question arises: why does searching in its negative direction minimize the training loss? In this paper, we provide the theoretical justification of the concept of STE by answering this question. We consider the problem of learning a two-linear-layer network with binarized ReLU activation and Gaussian input data. We shall refer to the unusual "gradient" given by the STE-modified chain rule as coarse gradient. The choice of STE is not unique. We prove that if the STE is properly chosen, the expected coarse gradient correlates positively with the population gradient (not available during training), and its negation is a descent direction for minimizing the population loss. We further show the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem. Moreover, we show that a poor choice of STE leads to instability of the training algorithm near certain local minima, which is verified with CIFAR-10 experiments.

Overview of "Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets"

The paper "Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets" explores the theoretical underpinnings of using the Straight-Through Estimator (STE) when training activation quantized neural networks. This work navigates the complex landscape of optimizing piecewise constant functions that characterize quantized networks. The inherent challenge in such functions is that their gradient vanishes almost everywhere, rendering traditional back-propagation techniques ineffective.

Key Contributions

The authors provide a detailed theoretical justification for employing STEs, focusing on why moving in the negative direction of what they term the "coarse gradient"—a substitute for the true gradient—leads to minimized training loss. They rigorously analyze learning in a model comprising a two-linear-layer network with a binarized ReLU activation and Gaussian input data.
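
The sketch below writes out one version of that setup together with a single coarse-gradient step. The dimensions, teacher parameters, and the choice of the vanilla-ReLU surrogate are illustrative assumptions; the notation only loosely follows the paper's two-layer model with Gaussian data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 16  # hidden width and input dimension (illustrative sizes)

def binarized_relu(x):
    # sigma(x) = 1{x > 0}; its true derivative is zero almost everywhere
    return (x > 0).astype(float)

def coarse_grad_w(Z, v, w, y, surrogate_deriv):
    """Coarse gradient of the squared loss w.r.t. w: the STE-modified chain rule
    replaces sigma'(Zw) with surrogate_deriv(Zw)."""
    pre = Z @ w                          # pre-activations, shape (m,)
    err = v @ binarized_relu(pre) - y    # scalar prediction error
    return err * (Z.T @ (v * surrogate_deriv(pre)))

# Teacher network generating the label for one Gaussian sample Z.
w_star, v_star = rng.standard_normal(n), rng.standard_normal(m)
Z = rng.standard_normal((m, n))
y = v_star @ binarized_relu(Z @ w_star)

# One coarse-gradient step on the learner's first-layer weights,
# using the derivative of the vanilla ReLU as the surrogate.
w, v = rng.standard_normal(n), rng.standard_normal(m)
relu_deriv = lambda x: (x > 0).astype(float)
w = w - 0.1 * coarse_grad_w(Z, v, w, y, relu_deriv)
```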

Key theoretical contributions of the paper include:

  • Coarse Gradient as a Descent Direction: The paper proves that with a properly chosen STE, the expected coarse gradient aligns positively with the population gradient, thereby serving as a valid direction for minimizing the population loss.
  • Convergence Analysis: They demonstrate that the associated coarse gradient descent algorithm converges to a critical point of the population loss minimization problem, provided that the STE is appropriately selected.
  • Assessment of STE Choices: The analysis shows that a poor choice of STE can destabilize training, especially near certain local minima, a claim supported by experiments on CIFAR-10 (common surrogate candidates are sketched after this list).
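
To make the last point concrete, the helper below writes out three surrogate derivatives that are natural candidates for the binarized ReLU sigma(x) = 1{x > 0}; which of them yield a well-behaved coarse gradient is exactly what the paper's analysis distinguishes. The function and its naming are illustrative, not the paper's code.

```python
import numpy as np

def ste_derivative(x, kind):
    """Surrogate derivatives commonly substituted for sigma'(x),
    where sigma(x) = 1{x > 0} is the binarized ReLU."""
    if kind == "identity":       # mu(x) = x                 -> mu'(x) = 1
        return np.ones_like(x)
    if kind == "relu":           # mu(x) = max(x, 0)         -> mu'(x) = 1{x > 0}
        return (x > 0).astype(float)
    if kind == "clipped_relu":   # mu(x) = min(max(x, 0), 1) -> mu'(x) = 1{0 < x < 1}
        return ((x > 0) & (x < 1)).astype(float)
    raise ValueError(f"unknown STE choice: {kind}")
```

Swapping one of these for another changes only the backward pass of the STE; the forward computation, and hence the training loss being measured, stays the same.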

Numerical Results and Empirical Validation

Through a series of experiments, the authors validate their theoretical claims. For instance, they show that poor choices of STE can destabilize training, as evidenced by divergence near local minima in CIFAR-10 experiments. Conversely, they report reliable convergence when employing appropriate STEs, underscoring the practical relevance of their theoretical findings.

Implications and Future Directions

The implications of this research are significant for the design of quantized networks. Understanding the efficacy of STEs enhances the ability to optimize networks with reduced precision, resulting in lower memory and energy consumption. This is particularly impactful for deployment in resource-constrained environments, such as mobile devices.

The paper lays the groundwork for continued exploration of STE-based optimization for quantized neural networks. Future research could examine how different STEs behave in deeper, more complex network architectures, and could explore stochastic variants for broader applicability.

Conclusion

In summary, the work presented provides a crucial theoretical framework for understanding and applying STEs in the training of quantized neural networks. The convergence guarantees and insights into the impact of different STE choices deepen the understanding of neural network training in constrained environments. This paper contributes a meaningful step towards more efficient neural network models and sets the stage for future advances in computational efficiency through quantization techniques.

Authors (6)
  1. Penghang Yin (25 papers)
  2. Jiancheng Lyu (15 papers)
  3. Shuai Zhang (319 papers)
  4. Stanley Osher (104 papers)
  5. Yingyong Qi (20 papers)
  6. Jack Xin (85 papers)
Citations (276)