
Beyond Discreteness: Finite-Sample Analysis of Straight-Through Estimator for Quantization (2505.18113v1)

Published 23 May 2025 in cs.LG and math.OC

Abstract: Training quantized neural networks requires addressing the non-differentiable and discrete nature of the underlying optimization problem. To tackle this challenge, the straight-through estimator (STE) has become the most widely adopted heuristic, allowing backpropagation through discrete operations by introducing surrogate gradients. However, its theoretical properties remain largely unexplored, with the few existing works simplifying the analysis by assuming an infinite amount of training data. In contrast, this work presents the first finite-sample analysis of STE in the context of neural network quantization. Our theoretical results highlight the critical role of sample size in the success of STE, a key insight absent from existing studies. Specifically, by analyzing the quantization-aware training of a two-layer neural network with binary weights and activations, we derive a sample complexity bound in terms of the data dimensionality that guarantees the convergence of STE-based optimization to the global minimum. Moreover, in the presence of label noise, we uncover an intriguing recurrence property of the STE-gradient method, where the iterates repeatedly escape from and return to the optimal binary weights. Our analysis leverages tools from compressed sensing and dynamical systems theory.

Summary

Finite-Sample Analysis of the Straight-Through Estimator for Quantization

Quantization is a key technique for deploying deep neural networks (DNNs) on resource-constrained devices such as smartphones and IoT hardware, where computational power and memory are limited. However, training quantized neural networks is mathematically challenging: the discrete weights and activations make the loss non-differentiable, so standard gradient-based optimization does not apply directly. The straight-through estimator (STE) is a widely used heuristic that sidesteps this issue by introducing surrogate gradients for the discrete operations. Despite its empirical success, theoretical understanding of STE, especially in finite-sample settings, has remained limited.
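As a concrete illustration, below is a minimal PyTorch sketch of the STE idea for binary quantization. This is a generic, assumed implementation rather than the paper's construction: the forward pass applies the non-differentiable sign function, while the backward pass substitutes a clipped-identity surrogate gradient.

```python
import torch

class BinarySTE(torch.autograd.Function):
    """Sign quantizer with a straight-through surrogate gradient."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)  # non-differentiable binarization

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Straight-through: pass the incoming gradient through unchanged,
        # zeroed outside |x| <= 1 (the common "clipped identity" surrogate).
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)

binarize = BinarySTE.apply  # usable like any differentiable op in a forward pass
```

With such a surrogate, the latent full-precision weights receive gradients as if the quantization step were (approximately) the identity, which is precisely the heuristic whose finite-sample behavior the paper analyzes.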

This paper introduces the first finite-sample analysis of the STE in the context of neural network quantization. The authors focus on quantization-aware training (QAT) of a two-layer neural network with binary weights and activations, offering a rigorous analysis that departs from the infinite-data assumption common in prior works (a schematic training sketch for this setup follows the list below). The key contributions can be summarized as follows:

  1. Sample Complexity Bound: The analysis demonstrates that O(n^2) samples are sufficient for the convergence of ergodic (averaged) iterates to the optimal solution, while O(n^4) samples suffice for non-ergodic (last-iterate) convergence. Here, n is the data dimension. The authors validate the O(n^2) bound for ergodic convergence empirically, suggesting its tightness.
  2. Recurrence Behavior: The paper reveals an unexpected recurrence property of the STE-gradient method when label noise is present. The algorithm repeatedly reaches and escapes the optimal binary weights in a cyclical pattern, in contrast to typical settings such as linear regression, where noise prevents exact recovery.
  3. Analytical Tools: Supporting the analysis, the authors employ methods drawn from compressed sensing and dynamical systems theory. In particular, techniques from 1-bit compressed sensing reveal structural parallels with quantization challenges, while occupation time analysis traditionally used in dynamical systems provides insights into iterative dynamics.
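To make the analyzed setting concrete, here is a hypothetical end-to-end sketch of quantization-aware training for a two-layer network with binary weights and activations, using a straight-through sign function. The sizes, teacher model, loss, and learning rate are illustrative assumptions, not the paper's exact specification.

```python
import torch
import torch.nn.functional as F

def ste_sign(x):
    # Forward: sign(x); backward: identity gradient (straight-through trick).
    return x + (torch.sign(x) - x).detach()

# Illustrative sizes; the paper's bounds are stated in terms of the data dimension n.
n, hidden = 32, 64
num_samples = 4 * n ** 2                          # on the order of n^2 samples
X = torch.randn(num_samples, n)
teacher = torch.sign(torch.randn(n))              # planted binary teacher weights (assumption)
y = X @ teacher                                   # noiseless labels for this sketch

W1 = torch.randn(n, hidden, requires_grad=True)   # latent full-precision weights
W2 = torch.randn(hidden, 1, requires_grad=True)

opt = torch.optim.SGD([W1, W2], lr=0.05)
for _ in range(500):
    h = ste_sign(X @ ste_sign(W1))                # binary weights and binary activations
    pred = (h @ ste_sign(W2)).squeeze(-1)
    loss = F.mse_loss(pred, y)
    opt.zero_grad()
    loss.backward()                               # surrogate gradients flow to W1 and W2
    opt.step()

w_hat = torch.sign(W1.detach())                   # quantized weights after training
```

In the paper's setting, which this sketch only approximates, roughly n^2 samples suffice for the ergodic iterates of STE-based training to converge to the global minimum; adding label noise to y corresponds to the regime in which the recurrence behavior described in item 2 arises.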

The paper's findings have significant implications for understanding and optimizing quantization-aware training in practice. By establishing a finite-sample theoretical foundation for STE, the results clarify how much data is needed for STE-based training to converge reliably. Future work motivated by this research could extend the finite-sample analytic framework to more complex neural architectures and explore ways to further reduce sample complexity while maintaining robust convergence guarantees.

This research presents a move toward bridging the gap between empirical utility and theoretical fundamentals in quantized neural network training, offering valuable insights and tools for the continued evolution of artificial intelligence in resource-efficient environments.
