Straight-through Estimators (STE)
- STE is a surrogate-gradient method that enables gradient-based optimization for non-differentiable or discrete operations in neural networks.
- It uses true non-differentiable functions in the forward pass and continuous proxies in the backward pass to facilitate end-to-end training.
- Variants of STE, including proxy derivatives and hybrid techniques, improve bias, stability, and accuracy in tasks like quantization-aware training and binarized networks.
A straight-through estimator (STE) is a surrogate-gradient method that enables gradient-based optimization over non-differentiable or discrete operations, most prominently in quantized, binarized, or stochastically discretized neural networks. The STE works by using the true, non-differentiable function in the forward pass (such as quantization or binarization), but in the backward pass, it replaces the intractable or zero Jacobian by a tractable surrogate—often the identity or another continuous proxy. This enables end-to-end backpropagation and stochastic gradient descent in neural networks with non-differentiable constraints, even though the obtained gradient is strictly an estimator, not the true gradient of the objective. Extensive research has formalized, analyzed, and extended STEs, yielding both foundational understanding and advanced practical methodologies (Shekhovtsov et al., 2020).
1. Theoretical Foundations of the Straight-Through Estimator
The classic STE was introduced for training neural nets with non-differentiable steps such as sign, binary quantization, or piecewise-constant functions. Formally, given a deterministic or stochastic quantizer $b = Q(a)$ (e.g., $Q(a) = \operatorname{sign}(a)$ or $Q(a) = \operatorname{round}(a)$), the chain rule in backpropagation fails because $\partial b / \partial a$ is zero almost everywhere or undefined. The STE circumvents this by declaring
$$\frac{\partial b}{\partial a} := 1,$$
or by choosing another continuous surrogate. In vector form, for $b = Q(a)$ and loss $L$, the backward pass sets
$$\frac{\partial L}{\partial a} := \frac{\partial L}{\partial b}.$$
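As a concrete illustration of this definition, the following minimal sketch (assuming PyTorch) implements the identity-STE for the sign quantizer with a custom autograd function: the forward pass applies the true sign, and the backward pass returns the incoming gradient unchanged.

```python
# Minimal sketch (assumes PyTorch): identity-STE through the sign function.
# Forward applies the true non-differentiable quantizer; backward passes the
# incoming gradient through unchanged, i.e. it declares db/da := 1.
import torch


class SignSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a):
        # True forward: b = sign(a), with sign(0) mapped to +1 for convenience.
        return torch.where(a >= 0, torch.ones_like(a), -torch.ones_like(a))

    @staticmethod
    def backward(ctx, grad_b):
        # Surrogate backward: dL/da := dL/db (identity Jacobian).
        return grad_b


a = torch.randn(4, requires_grad=True)
b = SignSTE.apply(a)
loss = (b * torch.arange(4.0)).sum()
loss.backward()
print(a.grad)  # gradient w.r.t. b, passed straight through to a
```

The common idiom `a + (torch.sign(a) - a).detach()` produces the same forward value and the same straight-through gradient without defining a custom autograd function.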
For binary stochastic units, the STE can be systematically derived within the stochastic binary network (SBN) framework, where the expectation over discrete samples is approximated by a finite-difference and chain-rule surrogate, leading to closed-form expressions for the estimator (Shekhovtsov et al., 2020).
The STE's theoretical properties depend on the precise surrogate and the loss function. If the loss is multilinear in the discrete variables, the STE provides an unbiased estimator. With certain surrogates and loss structures (e.g., linear), the variance can be zero. However, for general nonlinear losses, the estimator is biased (Shekhovtsov et al., 2020, Yin et al., 2019).
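To make the bias statement concrete, the following numerical sketch (a hand-constructed single-unit example, not taken from the cited papers; assumes PyTorch) compares the expected straight-through gradient with the true gradient of the expected loss for a Bernoulli unit. The two agree for a loss that is linear in the sample but diverge for a quadratic one, because the STE differentiates the continuous loss at the sample instead of taking the finite difference $f(1) - f(0)$.

```python
# Illustrative single-unit bias check for the straight-through estimator.
import torch

torch.manual_seed(0)
p = 0.8          # Bernoulli parameter of the unit
t = 0.3          # target used inside the quadratic loss


def linear_loss(z):
    return 2.0 * z - 1.0


def quadratic_loss(z):
    return (z - t) ** 2


def true_grad(f):
    # d/dp E[f(z)] for z ~ Bernoulli(p) is the finite difference f(1) - f(0).
    return float(f(torch.tensor(1.0)) - f(torch.tensor(0.0)))


def expected_ste_grad(f, n=200_000):
    # STE backward declares dz/dp := 1, so the per-sample gradient passed to p
    # is just f'(z) evaluated at the binary sample.
    z = torch.bernoulli(torch.full((n,), p)).requires_grad_(True)
    f(z).sum().backward()
    return float(z.grad.mean())


for name, f in [("linear", linear_loss), ("quadratic", quadratic_loss)]:
    print(name, true_grad(f), expected_ste_grad(f))
# linear: both ~2.0 (unbiased); quadratic: true 1 - 2t = 0.4 vs STE 2(p - t) = 1.0
```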
2. Variants of STE and Extensions
Several generalizations and variants of the original STE have been developed for improved stability, bias-variance trade-off, and flexibility:
- Proxy Derivatives: Instead of the identity, surrogate derivatives such as the hard-tanh ("clipped-linear") derivative, the tanh derivative, or a triangular derivative are used; a minimal sketch of such proxy backward passes appears after this list. Each corresponds, under a stochastic-noise view, to a specific noise distribution injected into the quantizer (uniform, logistic, triangular, etc.) (Shekhovtsov et al., 2020, Spallanzani et al., 2022). The Additive Noise Annealing (ANA) framework shows that most STEs can be interpreted as expectations over quantizer outputs perturbed by structured noise, allowing for annealing and stochastic regularization (Spallanzani et al., 2022).
- Rectified Straight-Through Estimator (ReSTE): Balances estimating error and gradient stability by parameterizing the backward surrogate as a power function of the input. As the power increases, the estimating error decreases but gradient instability grows; tuning this exponent yields the desired trade-off (Wu et al., 2023).
- Generalized STE (G-STE): For nonuniform and learnable quantization thresholds, G-STE employs a piecewise-linear surrogate whose gradient depends on the segment length, enabling backpropagation into threshold parameters. G-STE is necessary when learning nonuniform quantizers; it reduces to STE in the uniform case (Liu et al., 2021).
- Gumbel-Softmax STE & Decoupled ST-GS: In discrete latent-variable models, the STE is used in tandem with the Gumbel-Softmax relaxation. Decoupled ST-GS introduces separate temperatures for the forward (sampling) and backward (gradient) passes, optimizing the fidelity-bias trade-off and reducing the gradient gap (Shah et al., 2024); a sketch of the underlying ST-GS operator also follows this list.
- Zeroth-Order and Hybrid Techniques: To reduce STE bias in extremely low-precision settings, hybrid approaches such as FOGZO combine the efficiency of STE with the unbiasedness of finite-difference (zeroth-order) gradients, using a small number of stochastic directions to correct the STE's bias (Yang et al., 27 Oct 2025).
- Continuous Pruning Functions for Structured Sparsity: In N:M sparsity settings, discontinuous STEs are shown to cause optimization pathologies (direction errors, unpredictability, mask oscillation). Replacing them with continuous blockwise projections and appropriate scaling (as in S-STE) yields strictly improved convergence and stability (Hu et al., 2024).
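For concreteness, the sketch below (assuming PyTorch; the proxy functions are illustrative choices, not code from the cited works) shows how a proxy derivative replaces the identity in the backward pass while the forward pass still applies the sign quantizer.

```python
# Illustrative proxy-derivative STE: forward is still sign(a), but the backward
# pass multiplies the incoming gradient by a chosen surrogate derivative.
import torch


def hard_tanh_deriv(a):
    # "Clipped-linear" proxy: 1 on |a| <= 1, 0 outside (uniform-noise view).
    return (a.abs() <= 1.0).to(a.dtype)


def triangular_deriv(a):
    # Triangular proxy: max(0, 1 - |a|) (triangular-noise view).
    return (1.0 - a.abs()).clamp(min=0.0)


class SignProxySTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, proxy):
        ctx.save_for_backward(a)
        ctx.proxy = proxy
        return torch.where(a >= 0, torch.ones_like(a), -torch.ones_like(a))

    @staticmethod
    def backward(ctx, grad_b):
        (a,) = ctx.saved_tensors
        # Surrogate Jacobian is diagonal: dL/da := dL/db * proxy'(a).
        return grad_b * ctx.proxy(a), None


a = torch.randn(5, requires_grad=True)
b = SignProxySTE.apply(a, hard_tanh_deriv)
b.sum().backward()
print(a.grad)   # zero outside [-1, 1], one inside
```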
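The next sketch (assuming PyTorch) shows the standard straight-through Gumbel-Softmax operator that Decoupled ST-GS builds on: the forward pass emits a hard one-hot sample while gradients flow through the tempered softmax. The decoupled variant additionally uses separate temperatures for the two paths; only a single tau is shown here.

```python
# Minimal straight-through Gumbel-Softmax sketch (single temperature).
import torch
import torch.nn.functional as F


def st_gumbel_softmax(logits, tau=1.0):
    # Gumbel(0, 1) noise via inverse transform sampling.
    u = torch.rand_like(logits).clamp_(1e-10, 1.0 - 1e-10)
    g = -torch.log(-torch.log(u))
    y_soft = F.softmax((logits + g) / tau, dim=-1)               # backward path
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)   # forward path
    # Straight-through: forward value is y_hard, gradient is that of y_soft.
    return y_hard - y_soft.detach() + y_soft


logits = torch.randn(2, 5, requires_grad=True)
sample = st_gumbel_softmax(logits, tau=0.5)
(sample * torch.arange(5.0)).sum().backward()   # gradients reach the logits
print(sample, logits.grad.shape)
```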
3. Applications Across Quantized and Discrete Network Training
STE is foundational in a wide range of applications:
- Quantization-Aware Training (QAT): Training with quantized weights or activations (often 1–4 bits), where standard backpropagation cannot pass gradients through rounding. STE enables training of such low-precision networks and remains dominant due to its efficiency and simplicity; a fake-quantization sketch follows this list (Schoenbauer et al., 2024, Ichikawa et al., 12 Oct 2025).
- Binarized Neural Networks (BNN): Both weights and activations are constrained to $\{-1, +1\}$. STE enables training by passing surrogate gradients through the sign or binary quantization operation. Specialized surrogates and equilibrium analyses optimize stability and accuracy (Wu et al., 2023, Yin et al., 2019).
- Discrete Latent Variable Models: In VAEs and other generative models with categorical or binary latents, STE-type surrogates allow end-to-end differentiability by treating the sample operation as if it were a continuous, differentiable function during backpropagation (Fan et al., 2022).
- Sparse and Structured Model Compression: In structured pruning (e.g., 2:4 or N:M sparsity), STE is used to propagate gradients through mask or blockwise sparsification; recent advances focus on replacing discontinuous pruning with continuous approximations (Hu et al., 2024, Mohamed et al., 2023). A hard-mask 2:4 sketch also follows this list.
- Neuro-Symbolic Learning: STE enables injection of logical constraints into neural networks by providing surrogate gradients for non-differentiable logic-based operators in constraint loss functions (Yang et al., 2023).
- Latent Structure Learning: In structured prediction with hard argmax or combinatorial latent variables, the STE provides a simple pulled-back gradient that approximates the effect of the discrete selection in the backward pass, often outperforming more elaborate surrogates in some unstructured domains (Mihaylova et al., 2020).
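As a concrete example of the QAT usage above, the following sketch (assuming PyTorch; the k-bit uniform quantizer and its clipping range are illustrative choices) implements fake quantization with the straight-through idiom `x + (q(x) - x).detach()`.

```python
# Illustrative QAT sketch: a k-bit uniform "fake quantizer" whose rounding step
# is bypassed in the backward pass by the straight-through trick, so the
# forward pass sees quantized values while gradients treat the quantizer as
# the identity inside the clipping range.
import torch


def fake_quantize(x, num_bits=4, x_min=-1.0, x_max=1.0):
    levels = 2 ** num_bits - 1
    scale = (x_max - x_min) / levels
    x_clamped = x.clamp(x_min, x_max)            # clipping is differentiable
    q = torch.round((x_clamped - x_min) / scale) * scale + x_min
    # Straight-through: forward = q, backward = identity w.r.t. x_clamped,
    # so gradients are 1 inside the clipping range and 0 outside it.
    return x_clamped + (q - x_clamped).detach()


w = torch.randn(8, requires_grad=True)
loss = (fake_quantize(w, num_bits=2) ** 2).sum()
loss.backward()
print(w.grad)   # passes straight through round(), zeroed outside the range
```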
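The following sketch (assuming PyTorch) illustrates basic straight-through 2:4 sparsification as referenced above: the forward pass keeps the two largest-magnitude weights in each group of four, while the backward pass passes the dense gradient to all weights. Continuous variants such as S-STE replace this hard top-k projection with a smooth one, which is not shown here.

```python
# Minimal hard-mask 2:4 sparsity with a straight-through backward pass.
import torch


def two_to_four_ste(w):
    # w is a weight matrix whose row length is a multiple of 4.
    groups = w.reshape(-1, 4)
    # Mask of the top-2 magnitudes within each group of 4.
    topk = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, topk, 1.0)
    sparse = (groups * mask).reshape_as(w)
    # Straight-through: forward = sparse weights, backward = dense gradient.
    return w + (sparse - w).detach()


w = torch.randn(4, 8, requires_grad=True)
out = two_to_four_ste(w)
out.pow(2).sum().backward()
print((out.reshape(-1, 4) != 0).sum(dim=-1))  # two nonzeros per group
print(w.grad.shape)                           # dense gradient for all weights
```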
4. Theoretical Analysis: Bias, Sample Complexity, and Optimization Properties
Formal studies have rigorously analyzed STE's bias, efficiency, and conditions for descent:
- Bias and Ascent Direction: If the loss is multilinear in the discrete variables (e.g., sums, products), STE is unbiased. For higher-order loss functions, or when the correct scaling is omitted (e.g., the factor 2 for a $\pm 1$ encoding), STE can introduce bias. Ascent-direction theorems show that, with suitable Lipschitzness and gradient dominance, surrogate gradients still yield descent in expectation (Shekhovtsov et al., 2020, Yin et al., 2019).
- Sample Complexity: Recent finite-sample analyses characterize how the sample size required to guarantee correct convergence in quantized models scales with the model dimensionality. Under label noise, iterates exhibit stochastic recurrence, escaping and revisiting the optimum infinitely often. These results link compressed sensing, dynamical systems, and quantized network theory (Jeong et al., 23 May 2025).
- High-Dimensional Dynamics: In the high-dimensional limit, STE-based training dynamics concentrate on deterministic ODEs, exhibiting typical plateau phases followed by sharp drops in generalization error. The length of these plateaus and steady-state errors are explicitly governed by quantization range and bit-width (Ichikawa et al., 12 Oct 2025).
- Mirror Descent and Wasserstein Flow: STE updates correspond to mirror descent steps on the probability simplex (using natural gradients or logit-space updates) in Bernoulli or categorical models and can be interpreted as projected Wasserstein gradient flows in the space of probability measures. These viewpoints provide formal global convergence guarantees under suitable conditions (Shekhovtsov et al., 2020, Cheng et al., 2019).
5. Empirical Performance, Limitations, and Improved Alternatives
Experimental validations cover a broad spectrum of settings:
- Standard QAT and BNNs: STE baselines reach state-of-the-art or near-optimal accuracy in a variety of quantization and binarization tasks, often within 1–3% of full-precision benchmarks. Their competitive performance is robust across architectures and domains (Wu et al., 2023, Hu et al., 2024).
- Limitations: At extreme quantization (1–2 bits per parameter) and in highly discrete settings, STE can stall, oscillate, or fail to guarantee descent. These effects manifest as optimization plateaus, direction misalignment, and unpredictable loss behavior (Malinovskii et al., 2024, Hu et al., 2024). Bias is more pronounced at low bit-widths or with poor surrogate choices (Yang et al., 27 Oct 2025, Yin et al., 2019).
- Improved Variants and STE-Free Alternatives: Methods such as PV-Tuning, FOGZO, Gapped STE, and continuous blockwise projections demonstrably outperform vanilla STE, especially under stringent quantization or sparsity constraints (Malinovskii et al., 2024, Yang et al., 27 Oct 2025, Hu et al., 2024). In some cases, STE-free alternatives such as Alpha-Blending can further improve final accuracy by maintaining valid gradient paths (Liu et al., 2019). However, many so-called "custom gradient" estimators are STE-equivalent under small learning rates or adaptive optimizers, as clarified in formal equivalence results (Schoenbauer et al., 2024).
6. Practical Guidelines and Best Practices
Practice-focused studies synthesize the following recommendations:
- Match Surrogate to Noise Model: Choose the surrogate gradient to correspond to the noise model injected in forward computation (e.g., hard-tanh' for uniform noise, tanh' for logistic) (Shekhovtsov et al., 2020, Spallanzani et al., 2022).
- Scaling Factors: When using $\pm 1$ encodings, retain the correct scaling (e.g., the factor 2 in the Jacobian) in the backward pass; omitting it can significantly bias gradient estimates (Shekhovtsov et al., 2020).
- Stochastic STE Preference: Prefer stochastic STE in early training, where units are not yet confident; deterministic STE is appropriate only in the limit of high certainty (Shekhovtsov et al., 2020).
- Bias Correction and Hybrid Approaches: Correct accumulated bias by periodically switching to unbiased estimators (e.g., ARM, Monte Carlo) or fine-tuning with theoretically motivated mirror descent updates (Yang et al., 27 Oct 2025, Malinovskii et al., 2024).
- Synchronize Surrogate Schedules: When mixing multiple surrogates or smoothing/annealing schedules across layers, properly synchronize annealing to guarantee layer-wise convergence to the quantized target (Spallanzani et al., 2022).
- Continuous Relaxations for Structured Sparsity: For N:M or block-structured pruning, use continuous projections (e.g., soft-thresholding) and fixed scaling to correct the three main pathologies of discontinuous STE: direction errors, unpredictability, and mask oscillation (Hu et al., 2024).
References:
- "Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks" (Shekhovtsov et al., 2020)
- "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (Malinovskii et al., 2024)
- "Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training" (Wu et al., 2023)
- "Improving the Straight-Through Estimator with Zeroth-Order Information" (Yang et al., 27 Oct 2025)
- "Training Discrete Deep Generative Models via Gapped Straight-Through Estimator" (Fan et al., 2022)
- "Nonuniform-to-Uniform Quantization: Towards Accurate Quantization via Generalized Straight-Through Estimation" (Liu et al., 2021)
- "Injecting Logical Constraints into Neural Networks via Straight-Through Estimators" (Yang et al., 2023)
- "Understanding Straight-Through Estimator in Training Activation Quantized Neural Nets" (Yin et al., 2019)
- "Learning low-precision neural networks without Straight-Through Estimator(STE)" (Liu et al., 2019)
- "Improving Discrete Optimisation Via Decoupled Straight-Through Gumbel-Softmax" (Shah et al., 2024)
- "Training Quantised Neural Networks with STE Variants: the Additive Noise Annealing Algorithm" (Spallanzani et al., 2022)
- "Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning" (Mihaylova et al., 2020)
- "Beyond Discreteness: Finite-Sample Analysis of Straight-Through Estimator for Quantization" (Jeong et al., 23 May 2025)
- "Straight-Through meets Sparse Recovery: the Support Exploration Algorithm" (Mohamed et al., 2023)
- "Straight-Through Estimator as Projected Wasserstein Gradient Flow" (Cheng et al., 2019)
- "Extending Straight-Through Estimation for Robust Neural Networks on Analog CIM Hardware" (Feng et al., 16 Aug 2025)
- "High-Dimensional Learning Dynamics of Quantized Models with Straight-Through Estimator" (Ichikawa et al., 12 Oct 2025)
- "Custom Gradient Estimators are Straight-Through Estimators in Disguise" (Schoenbauer et al., 2024)
- "S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training" (Hu et al., 2024)