Surrogate Gradient Backpropagation
- Surrogate gradient backpropagation is a method that replaces non-differentiable components with smooth derivative approximations, enabling gradient-based optimization in spiking neural networks.
- It leverages various surrogate functions—such as sigmoid, piecewise-linear, and parametric variants—to improve convergence, robustness, and efficiency in neuromorphic and discrete optimization tasks.
- The approach underpins practical advances in neuroscience-inspired computing, real-time hardware implementations, and combinatorial problem-solving, despite inherent biases in the surrogate estimations.
Surrogate gradient backpropagation refers to a class of techniques for training neural networks with non-differentiable components—most notably, spiking neural networks (SNNs) with hard thresholding nonlinearities—by substituting the intractable analytic gradients with smooth, tractable “surrogate” derivatives during the backward pass. These approaches are foundational for enabling gradient-based learning in both deterministic and stochastic discrete models and have broad implications for neuroscience-inspired computing, neuromorphic systems, and discrete optimization.
1. Fundamental Principles and Theoretical Formulation
Classic backpropagation relies on differentiable activation functions. In spiking neuron models, the output is often a binary function of the membrane potential, $S = \Theta(U - \vartheta)$, where $\Theta$ is the Heaviside step function and $\vartheta$ the firing threshold. The true derivative $\Theta'$ is zero almost everywhere and undefined at the threshold, completely blocking the gradient signal during backpropagation.
Surrogate gradient (SG) methods circumvent this by replacing $\Theta'$ with a smooth function $\sigma'$—for example, the derivative of a sigmoid or fast sigmoid centered on the threshold—only in the backward computation. The forward dynamics, spiking events, and resets remain discrete and unaltered (Neftci et al., 2019, Gygax et al., 2024). Thus, gradient-based optimization algorithms (e.g., SGD, Adam) can be applied as usual.
Mathematically, in a loss-backpropagation step for a deterministic SNN, one computes $\partial \mathcal{L} / \partial U \approx \delta\, \sigma'(U - \vartheta)$, where $\delta$ is the error propagated from upstream and $\sigma'$ replaces the intractable derivative $\Theta'$ (Gygax et al., 2024). In the presence of stochasticity—such as in escape-noise neuron models with firing probability $f(U - \vartheta)$—the surrogate derivative matches exactly the derivative of the escape-noise function, connecting empirical SG practice to a rigorous theoretical foundation (Gygax et al., 2024).
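As a concrete illustration, here is a minimal PyTorch sketch of this forward/backward substitution; the class name, the fast-sigmoid surrogate, and the slope `beta` are illustrative choices, not prescribed by the cited works.

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    """Exact Heaviside spike forward; fast-sigmoid surrogate derivative backward."""

    @staticmethod
    def forward(ctx, v, beta=10.0):
        ctx.save_for_backward(v)
        ctx.beta = beta
        return (v > 0).float()                     # Theta(U - threshold), unaltered

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # sigma'(v) = 1 / (1 + beta*|v|)^2 stands in for the intractable Theta'(v)
        sg = 1.0 / (1.0 + ctx.beta * v.abs()) ** 2
        return grad_output * sg, None              # None: no gradient w.r.t. beta

spike_fn = SurrogateSpike.apply                    # reused in later sketches
```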
2. Probabilistic and Stochastic Underpinnings
The theoretical justification for surrogate gradients is twofold:
a. Smoothed Probabilistic Models: In stochastic neuron models, spike emission is modeled as $S \sim \mathrm{Bernoulli}\big(f(U - \vartheta)\big)$. The expected loss gradient formalizes to $\partial\,\mathbb{E}[\mathcal{L}] / \partial U = \delta\, f'(U - \vartheta)$, where $f'$ is the exact derivative of the escape-noise sigmoid (Gygax et al., 2024). In single-neuron systems, the surrogate derivative matches this gradient exactly.
b. Stochastic Automatic Differentiation (stochAD): stochAD provides a principled way to propagate “stochastic derivatives” through discrete random sampling steps. Through affine combinations of stochastic derivative triples and chain-rule compositions, one can analytically derive that for Bernoulli nodes the “smoothed” stochastic gradient matches the surrogate gradient computation in SNNs at the node level (Gygax et al., 2024).
These equivalences formally justify the use of surrogate gradients in stochastic SNNs and demonstrate that choosing $\sigma'$ to match the escape-noise slope $f'$ is theoretically optimal in these settings.
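As a hedged companion to the deterministic sketch above, the following PyTorch fragment draws the Bernoulli spike in the forward pass and uses the exact derivative of the escape-noise sigmoid in the backward pass; the class name and the sigmoid escape function with slope `beta` are illustrative assumptions.

```python
import torch

class EscapeNoiseSpike(torch.autograd.Function):
    """Stochastic spike S ~ Bernoulli(f(v)); backward uses the exact f'(v)."""

    @staticmethod
    def forward(ctx, v, beta=5.0):
        p = torch.sigmoid(beta * v)       # escape-noise firing probability f(v)
        ctx.save_for_backward(p)
        ctx.beta = beta
        return torch.bernoulli(p)         # discrete stochastic spike sample

    @staticmethod
    def backward(ctx, grad_output):
        (p,) = ctx.saved_tensors
        # f'(v) = beta * f(v) * (1 - f(v)): the theoretically matched surrogate
        # slope per Gygax et al. (2024).
        return grad_output * ctx.beta * p * (1.0 - p), None
```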
3. Methodological Instantiations and Variants
Surrogate gradient backpropagation encompasses a variety of implementations, both within SNNs and for general discrete/black-box optimization:
- Choice of Surrogate Function: Common surrogates include the fast sigmoid ($\sigma'(x) = 1/(1 + \beta|x|)^2$), piecewise-linear proxies ($\sigma'(x) = \max(0,\, 1 - \beta|x|)$, nonzero only within a window around the threshold), exponential or Gaussian windows, and more recently parameterized families with trainable slope/shape (Wang et al., 2023, Neftci et al., 2019).
- Learnable Surrogates: Recent advances optimize the parameters of the surrogate function (e.g., the slope) during training, improving convergence and robustness. Parametric surrogates can be learned jointly with network weights (Wang et al., 2023, Jiang et al., 2023).
- Forward Gradient Injection (FGI): FGI injects an arbitrary surrogate derivative into the computational graph at the forward stage using only standard operations and detachment, eliminating the need for custom backward overrides and enabling better symbolic optimization and JIT support (Otte, 2024).
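A hedged sketch of the FGI idea in PyTorch: the additive detach construction shown here is one common way to inject a surrogate gradient using only forward ops (Otte, 2024 also discusses multiplicative variants), and `beta` is an illustrative slope.

```python
import torch

def fgi_spike(v, beta=10.0):
    """Forward gradient injection: spike value forward, surrogate gradient
    backward, built only from standard ops and .detach(), with no custom
    autograd.Function or backward override."""
    sg = torch.sigmoid(beta * v)   # differentiable path carrying the gradient
    spike = (v > 0).float()        # exact, non-differentiable spike value
    # Forward value equals `spike`; backward gradient flows only through `sg`.
    # `beta` could itself be a learnable nn.Parameter, giving the trainable-slope
    # surrogates discussed above (Wang et al., 2023, Jiang et al., 2023).
    return spike.detach() + sg - sg.detach()
```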
Pseudocode for SG-based backpropagation in SNNs universally follows the same structure: perform forward spike-based updates, replace the backward gradient through spikes with the surrogate derivative, and propagate the error through time and network structure via BPTT or RTRL (Neftci et al., 2019, Yang, 2020, Wang et al., 2023, Otte, 2024).
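A minimal PyTorch sketch of this structure, reusing `spike_fn` from the earlier example; the LIF constants (`beta_mem`, `threshold`) and the rate-based readout are illustrative assumptions, not values from the cited papers.

```python
import torch

def lif_forward(x, w, beta_mem=0.9, threshold=1.0):
    """Unrolled leaky integrate-and-fire layer; BPTT differentiates through
    this graph, crossing every spike via the surrogate derivative."""
    # x: (T, batch, n_in) binary input spikes; w: (n_in, n_out) weights.
    u = torch.zeros(x.shape[1], w.shape[1])   # membrane potential
    out = []
    for x_t in x:                             # forward pass stays fully discrete
        u = beta_mem * u + x_t @ w            # leaky integration
        s = spike_fn(u - threshold)           # Heaviside forward, surrogate backward
        u = u - s * threshold                 # soft reset after each spike
        out.append(s)
    return torch.stack(out)                   # (T, batch, n_out)

# Typical training step (BPTT through the unrolled loop):
# spikes = lif_forward(inputs, w)
# loss = criterion(spikes.mean(0), targets)   # rate-based readout
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```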
4. Bias, Non-Conservativity, and Empirical Effectiveness
Surrogate gradient updates are inherently biased: except in trivial or single-neuron cases, they are not true gradients of any scalar "surrogate loss" function (Gygax et al., 2024). This is shown rigorously via path-integral arguments: integrating the "surrogate Jacobian" over closed paths yields a nonzero result, violating conservativity—a defining property of true gradient fields (Gygax et al., 2024). As a result, SG updates need not be sign-concordant with the true gradient everywhere, and pathological counterexamples can be constructed in which the SG update points in an adverse direction.
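Stated compactly (notation assumed here: $g(\theta)$ denotes the SG update field over parameters $\theta$), conservativity is the property

$$\oint_{\mathcal{C}} g(\theta)\cdot d\theta = 0 \ \text{ for every closed path } \mathcal{C} \quad\Longleftrightarrow\quad g = \nabla_\theta \mathcal{L}_{\text{surr}} \ \text{ for some scalar } \mathcal{L}_{\text{surr}},$$

and the path-integral argument of Gygax et al. (2024) exhibits closed paths on which this circulation is nonzero.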
Despite this, empirical validation across a wide array of SNN tasks demonstrates high effectiveness:
- Both deterministic and stochastic SNNs trained with SG achieve low losses on spike-timing and classification tasks (e.g., 96 ± 2% validation accuracy on SHD), with stochastic models preserving controlled neural variability (Fano factor in hidden layers of up to $\sim 2$) (Gygax et al., 2024).
- Learning performance is robust to the specific surrogate adopted, provided it is sufficiently smooth and appropriately scaled. Parametric or learnable surrogates provide moderate (~1–2%) gains in accuracy and reduced latency/memory requirements (Wang et al., 2023, Jiang et al., 2023).
- In rate-coding regimes, "rate-based" backpropagation collapses the entire temporal graph, using time-averaged activities and surrogate derivatives to approximate BPTT gradients to within 0.1–0.3% accuracy, while reducing memory and training time by a large factor (Yu et al., 2024).
5. Applications Beyond SNNs: Black-Box and Discrete Optimization
The surrogate gradient paradigm generalizes to settings where analytic gradients are unavailable:
- Black-box Optimization (ES): Surrogate gradients obtained from truncated backprop, critics, or heuristic directions can be optimally blended with random search directions to produce ES updates that converge faster to the true gradient than standard ES alone. This is formalized in hybrid estimators guaranteeing progress (Meier et al., 2019).
- Combinatorial Solvers: For discrete optimization layers (e.g., graph matching, subset selection), the "identity with projection" surrogate amounts to backpropagating the negative incoming gradient via the solver's input, projected onto the solution-relevant subspace. This approach is empirically competitive with more elaborate relaxation- or interpolation-based surrogates; like other SG updates it is non-conservative (not the gradient of any scalar function), yet it requires no hyperparameter tuning (Sahoo et al., 2022).
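A hedged sketch of the identity-with-projection idea in PyTorch; `solver` is a placeholder for any discrete routine mapping a cost matrix to an indicator solution, and the zero-mean row projection is an illustrative choice for one-hot row constraints, not the specific projection of Sahoo et al. (2022).

```python
import torch

class IdentityThroughSolver(torch.autograd.Function):
    """Exact combinatorial solve forward; (negative) identity backward."""

    @staticmethod
    def forward(ctx, cost, solver):
        return solver(cost)            # exact, non-differentiable solve

    @staticmethod
    def backward(ctx, grad_output):
        # Negative identity: solvers minimize cost, so decreasing a cost entry
        # should make its selection more likely.
        g = -grad_output
        # Illustrative projection onto the solution-relevant subspace:
        # remove the per-row mean (for one-hot row constraints).
        g = g - g.mean(dim=-1, keepdim=True)
        return g, None                 # no gradient for the solver argument

# Usage: solution = IdentityThroughSolver.apply(cost_matrix, my_solver)
```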
6. Algorithmic Details and Empirical Protocols
Surrogate gradient backpropagation is typically operationalized as follows:
- Forward Pass: Propagate binary spike computations using the exact Heaviside thresholding.
- Backward Pass: Replace $\Theta'$ with a smooth $\sigma'$ everywhere the gradient would traverse a spike event.
- Temporal Recursion: In BPTT, propagate the error using the temporal Jacobian, which may involve additional terms for state reset dependencies (e.g., for LIF neurons with soft reset) (Yang, 2020).
- Parameter Update: Accumulate gradients through layer weights and any surrogate parameters, and update via optimizer-specific policy (Neftci et al., 2019, Wang et al., 2023).
- Sparsity and Multi-Objective Losses: Surrogate-based backprop allows the explicit addition of spike-count penalties to the objective, optimizing both task accuracy and energy/spiking sparsity in physically constrained deployments (Allred et al., 2020).
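A hedged sketch of such a multi-objective loss, in the spirit of Allred et al. (2020); `lambda_sparsity` is an illustrative hyperparameter, not a value from the paper.

```python
def total_loss(task_loss, spikes, lambda_sparsity=1e-3):
    """Task loss plus an explicit spike-count penalty."""
    # `spikes` is the (T, batch, n_neurons) tensor produced with surrogate-
    # gradient spike functions, so the penalty is differentiable and the
    # optimizer can trade task accuracy against firing sparsity.
    return task_loss + lambda_sparsity * spikes.mean()
```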
Surrogate gradient pipelines are implemented in deep learning frameworks using either custom backward passes or, more efficiently, forward-side constructions such as FGI, which enable graph-level optimizations and JIT compilation (Otte, 2024).
7. Impact, Hardware Implications, and Limitations
Surrogate gradient backpropagation has revolutionized the training of SNNs and other discrete models, unlocking practical, scalable learning on neuromorphic hardware, real-time applications, and online few-shot learning (Stewart et al., 2019). SG-trained SNNs are deployed on chips such as Intel’s Loihi, leveraging local three-factor learning rules that inherit from the SG formalism.
However, limitations remain due to the bias and non-conservativity of surrogate updates. On tasks requiring extremely precise spike timing, or with pathological network architectures, the mismatch between the SG and the true gradient can cause suboptimal or unstable optimization (Gygax et al., 2024). Online and memory-efficient alternatives (e.g., feedforward SpikingGamma neurons) offer surrogate-free routes for temporally precise, scalable learning under high temporal resolutions, but may not supplant SG approaches in arbitrary architectures (Koopman et al., 2026). The theoretical analysis confirms that careful choice and scaling of the surrogate, matching the underlying stochastic process when possible, is critical for both convergence and stability (Gygax et al., 2024, Jiang et al., 2023).
References:
- (Gygax et al., 2024) Elucidating the theoretical underpinnings of surrogate gradient learning in spiking neural networks
- (Neftci et al., 2019) Surrogate Gradient Learning in Spiking Neural Networks
- (Wang et al., 2023) Membrane Potential Distribution Adjustment and Parametric Surrogate Gradient in Spiking Neural Networks
- (Jiang et al., 2023) KLIF: An optimized spiking neuron unit for tuning surrogate gradient slope and membrane potential
- (Yu et al., 2024) Advancing Training Efficiency of Deep Spiking Neural Networks through Rate-based Backpropagation
- (Yang, 2020) Temporal Surrogate Back-propagation for Spiking Neural Networks
- (Otte, 2024) Flexible and Efficient Surrogate Gradient Modeling with Forward Gradient Injection
- (Meier et al., 2019) Improving Gradient Estimation in Evolutionary Strategies With Past Descent Directions
- (Koopman et al., 2026) SpikingGamma: Surrogate-Gradient Free and Temporally Precise Online Training of Spiking Neural Networks with Smoothed Delays
- (Sahoo et al., 2022) Backpropagation through Combinatorial Algorithms: Identity with Projection Works
- (Stewart et al., 2019) On-chip Few-shot Learning with Surrogate Gradient Descent on a Neuromorphic Processor
- (Allred et al., 2020) Explicitly Trained Spiking Sparsity in Spiking Neural Networks with Backpropagation