
Theoretical Convergence to Backpropagation

Updated 17 March 2026
  • The paper establishes that alternative learning rules, including stochastic control and predictive coding, can achieve exact or asymptotically equivalent BP gradients.
  • It employs rigorous methods from stochastic analysis, variational inference, and local Hebbian rules to show how neural network updates match BP under specific conditions.
  • The work highlights that factors like smoothness, proper initialization, and Jacobian alignment are critical for ensuring convergence and practical algorithm efficiency.

Theoretical convergence to backpropagation (BP) encompasses formal results demonstrating that alternative learning rules, dynamical systems, or biologically motivated algorithms yield parameter updates that exactly or asymptotically match the gradients computed by standard error backpropagation in neural networks. This area integrates stochastic analysis, variational inference, dynamical systems, energy-based models, and the study of local learning rules. Rigorous proofs of convergence, conditions for exact correspondence, and quantification of approximation errors form the technical core of this field.

1. Formal Convergence and Exact Equivalence Results

Multiple independent frameworks establish either exact or asymptotic convergence to the gradients computed by BP, under specific technical conditions:

  • Sample-wise Backpropagation in Neural SDEs: In stochastic optimal control settings for neural SDEs, a high-order discretization of the forward–backward stochastic differential equation (FBSDE) system enables the associated SGD scheme to achieve global first-order convergence in network depth, under convexity and smoothness assumptions. The principal result states that with $K \sim O(N^3)$ SGD steps for network depth $N$, the iterates satisfy

$$\mathbb{E}\left[\left\|u^{K+1} - u^*\right\|_2^2\right] \leq C\left(\frac{N}{K} + \frac{1}{N^2}\right)$$

matching the theoretical rate of discretized BP for deterministic ODEs, with only modest added per-step cost due to increased order in the discretization (Sheng et al., 8 Sep 2025).

  • Emergence of BP from Biologically Motivated Plasticity: A three-factor Hebbian-style rule, incorporating neural firing rates, retrograde "credit" diffusion, and postsynaptic E–I balance, is shown to yield updates mathematically identical to BP in layered networks, provided that the balance function satisfies $f(s) = \sigma'(s)$ and input/output clamping is employed. The credit redistribution along weight/Jacobian chains precisely reproduces BP gradients at every layer and synapse, with no approximation (Fan et al., 2024).
  • Predictive Coding Networks (PCNs): Predictive coding algorithms, under the so-called Z-IL ("zero free energy–instantaneous learning") regime, produce weight updates

$$\Delta\Theta_{l+1} \propto \varepsilon_l\, f(x_{l+1})^T$$

that coincide exactly with BP gradients in both feedforward and recurrent architectures. This holds after a finite sequence of local inference steps and under feedforward initialization (Salvatori et al., 2021, Millidge et al., 2022).

  • Difference Target Propagation (DTP): When local feedback pathways are trained to enforce the Jacobian Matching Condition (JMC), DTP yields gradients

$$\frac{\partial\mathcal{L}_{\mathrm{pred}}}{\partial\theta^n} = \lim_{\beta \to 0} \frac{1}{2\beta}\, \frac{\partial}{\partial\theta^n} \left\| t^n_\beta - s^n \right\|^2$$

which match exactly the BP gradients. Local feedback training via Difference Reconstruction Loss ensures this exactness, restoring both biological plausibility and full theoretical guarantees (Ernoult et al., 2022).

  • Belief Propagation as a Superclass: Loopy belief propagation on a "lifted" factor graph, constructed from a computation graph with delta and Boltzmann potentials, produces downward messages whose log-derivatives at the data-clamping points coincide with the backpropagation adjoints. This establishes a formal equivalence between belief propagation and backpropagation in deterministic neural networks under precise graphical conditions (Eaton, 2022).
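
The predictive-coding correspondence above can be illustrated numerically. The following is a minimal sketch of my own construction (not code from the cited papers): on a small feedforward network, errors propagated by one local inference step per layer under feedforward initialization (the Z-IL conditions) yield local weight updates $\varepsilon_l\, f(x)^T$ that match a finite-difference estimate of the true loss gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 3]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
x0 = rng.standard_normal((sizes[0], 1))
target = rng.standard_normal((sizes[-1], 1))
f, df = np.tanh, lambda s: 1.0 - np.tanh(s) ** 2

# Forward pass -- in a PCN this is the feedforward initialization, which
# makes every internal prediction error start at zero.
pres, acts = [], [x0]
for W in Ws:
    pres.append(W @ acts[-1])
    acts.append(f(pres[-1]))

# Z-IL-style error propagation: clamp the output to the target, then take
# one local inference step per layer with unit step size.  Each eps_l uses
# only the adjacent layer's error -- a strictly local rule.
eps = [None] * len(Ws)
eps[-1] = (acts[-1] - target) * df(pres[-1])
for l in range(len(Ws) - 2, -1, -1):
    eps[l] = (Ws[l + 1].T @ eps[l + 1]) * df(pres[l])
pc_updates = [e @ a.T for e, a in zip(eps, acts[:-1])]

# Independent check: finite-difference gradient of the actual loss.
def loss(weights):
    a = x0
    for W in weights:
        a = f(W @ a)
    return 0.5 * float(np.sum((a - target) ** 2))

h, fd_grads = 1e-6, []
for i in range(len(Ws)):
    G = np.zeros_like(Ws[i])
    for idx in np.ndindex(*Ws[i].shape):
        Wp = [W.copy() for W in Ws]
        Wp[i][idx] += h
        G[idx] = (loss(Wp) - loss(Ws)) / h
    fd_grads.append(G)

for u, g in zip(pc_updates, fd_grads):
    assert np.allclose(u, g, atol=1e-4)   # local updates = BP gradients
```

The finite-difference comparison is what makes the check non-circular: the local updates are validated against the loss itself, not against a second copy of the chain rule.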

2. Conditions for Convergence and Exactness

Rigorous convergence to BP requires specific structural and dynamical constraints, dependent on the underlying model:

  • Smoothness and Convexity: Sufficient smoothness of the activation functions, cost, and network mappings, together with convexity in the control/loss for SGD-based schemes, is generally required for global convergence rates and stability (e.g., Sheng et al., 8 Sep 2025).
  • Initialization and Scheduling: Exact equivalence for predictive coding and related schemes (e.g., Z-IL protocol) demands feedforward initialization of activities (zero local errors), careful scheduling of inference and weight updates, and fixed integration steps equal to the network depth (Salvatori et al., 2021, Millidge et al., 2022).
  • Jacobian Alignment: For target propagation and DTP, exact matching to BP occurs only when feedback pathways’ Jacobians match the transposes of their forward counterparts. This can be enforced via local Jacobian-matching losses using random perturbations (Ernoult et al., 2022).
  • Energy-Based Model Structure: For energy-based learning (predictive coding, equilibrium propagation, contrastive Hebbian learning), a decomposed energy $E = I + \lambda L$ with $C^2$ smoothness, existence and uniqueness of the free-phase equilibrium, and the infinitesimal-inference limit $\|s^*(\lambda) - \bar{s}\| = O(\lambda)$ together ensure that the contrastive weight update converges to the BP gradient (Millidge et al., 2022).
  • Delta-Function Graphical Structure: The BP vs. belief propagation correspondence demands "single-output, at-most-one-parent" computation graphs and variable clamping via delta functions, to ensure that upward and downward messages collapse to exact function (forward) and gradient (backward) propagation (Eaton, 2022).
  • Additional Cases: For special algorithms, such as Front-Contribution, structural conditions such as piecewise-linear activations and feedforward architecture without skip connections or normalization layers are required for exact equivalence to BP (Mishra et al., 2021).
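
The Jacobian-alignment condition is easy to verify on a single layer. Below is an illustrative sketch (my own construction, with an idealized feedback pathway rather than a trained one): when the feedback Jacobian equals the transpose of the forward Jacobian, the propagated error coincides with the BP adjoint, while an unmatched random pathway generally does not.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
x = rng.standard_normal((4, 1))
df = lambda s: 1.0 - np.tanh(s) ** 2

s = W @ x
J_fwd = df(s) * W                      # Jacobian of y = tanh(W x) w.r.t. x

delta_y = rng.standard_normal((3, 1))  # error arriving at the layer output

# BP sends the error back through the transpose Jacobian:
delta_x_bp = J_fwd.T @ delta_y

# A DTP-style feedback pathway propagates target differences through its
# own Jacobian J_fb.  Under the JMC (J_fb = J_fwd^T) the result is exact:
J_fb = J_fwd.T                         # idealized feedback satisfying the JMC
delta_x_dtp = J_fb @ delta_y
assert np.allclose(delta_x_bp, delta_x_dtp)

# An unmatched (random) feedback pathway generally points elsewhere; its
# cosine alignment with the BP direction can be anywhere in [-1, 1]:
J_rand = rng.standard_normal((4, 3))
v = J_rand @ delta_y
cosine = float((v.T @ delta_x_bp) /
               (np.linalg.norm(v) * np.linalg.norm(delta_x_bp)))
```

In the actual DTP results cited above, the feedback weights are trained toward this condition via a local reconstruction loss rather than set analytically.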

3. Convergence Rates and Quantitative Bounds

Convergence rates to the BP solution or gradients are frequently established in terms of the network depth, step size, or order of discretization:

| Method/Class | Convergence Rate | Key Constraint |
| --- | --- | --- |
| High-order BSDE backprop (neural SDEs) | $O(1/N)$ in depth $N$ (steps/layers) | $K = O(N^3)$ SGD steps |
| Euler-type BSDE backprop | $O(1/N^{1/2})$ (half-order) | $K = O(N^3)$ SGD steps |
| Predictive coding (Z-IL) | Exact after $L$ inference steps ($L$ = depth/layers) | Feedforward/Z-IL initialization |
| DTP (with JMC via L-DRL) | Exact, up to numerical error in JMC enforcement | Local feedback matched |

These rates are typically proved via variance bounds, discretization error analysis, and recursion inequalities. For energy-based models, first-order Taylor expansions in the inference error parameter provide the analytical basis for convergence proofs (Millidge et al., 2022).
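
To make the first rate concrete, one can plug the prescribed step count $K = N^3$ into the stated MSE bound $C(N/K + 1/N^2)$. With the constant $C$ set to 1 for illustration (my assumption), the bound collapses to $2/N^2$, i.e. an $O(1/N)$ root-mean-square error in the depth $N$:

```python
# Evaluate the stated MSE bound C * (N/K + 1/N**2) with K = N**3 and C = 1
# (illustrative unit constant), showing the resulting first-order RMS rate.
for N in (10, 100, 1000):
    K = N ** 3
    mse_bound = N / K + 1 / N ** 2     # = 2 / N**2
    rms = mse_bound ** 0.5             # scales like sqrt(2)/N: first order
    print(f"N={N:5d}  MSE bound={mse_bound:.2e}  RMS={rms:.2e}")
```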

4. Impact of Local vs. Non-local Learning Rules

A central thread in the literature is demonstrating that local, spatially and temporally constrained plasticity rules or inference steps yield the same parameter update direction as non-local, chain-rule–based BP:

  • Three-Factor Hebbian Rules: Credit redistribution along synaptic paths, coupled with local firing rates and balance functions, reconstructs the adjoint chain of derivatives implicit in BP (Fan et al., 2024).
  • Predictive Coding and Contrastive Energy-Based Updates: Both local activity-error interactions and local energy contrasts at equilibrium capture the backpropagated error signals, provided the free-phase equilibrium removes internal energy gradients (Millidge et al., 2022).
  • Local Feedback Alignment and Target Propagation: When local feedback modules optimize a Jacobian matching loss—rather than global pseudoinverse or random alignment—the resulting update direction provably matches BP (Ernoult et al., 2022).
  • Spiking and Compartmental Models: Sign-concordant feedback alignment, backed by appropriate plasticity and circuit constraints (e.g., Dale’s law, local normalization), allows even spike-based microcircuits to align their error signal propagation with BP (Yang et al., 2022).
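
The local-vs-non-local distinction can be stated in a few lines of code. The following sketch (my own, not from the cited works) computes the BP adjoint at each layer in two ways: as an explicit product of all downstream Jacobian matrices (the non-local chain-rule object), and by a recursion in which each layer touches only its own pre-activation derivative and the error handed over by its immediate successor (the locality structure shared by three-factor and predictive-coding rules). The two agree at every layer.

```python
import numpy as np

rng = np.random.default_rng(2)
sizes = [4, 6, 5, 3]
Ws = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((sizes[0], 1))
t = rng.standard_normal((sizes[-1], 1))
f, df = np.tanh, lambda s: 1.0 - np.tanh(s) ** 2

pres, acts = [], [x]
for W in Ws:
    pres.append(W @ acts[-1])
    acts.append(f(pres[-1]))
out_err = acts[-1] - t                 # dLoss/d(output) for 0.5 * ||y - t||^2

# Non-local view: apply the full chain of downstream Jacobians to the
# output error, layer by layer from the top.
def adjoint_via_chain(l):
    g = out_err
    for k in range(len(Ws) - 1, l - 1, -1):
        Jk = df(pres[k]) * Ws[k]       # Jacobian of layer k, shape (out, in)
        g = Jk.T @ g
    return g

# Local view: each layer combines only its own activation derivative with
# the error from its immediate successor.
local = [None] * (len(Ws) + 1)
local[len(Ws)] = out_err
for l in range(len(Ws) - 1, -1, -1):
    local[l] = Ws[l].T @ (df(pres[l]) * local[l + 1])

for l in range(len(Ws)):
    assert np.allclose(local[l], adjoint_via_chain(l))
```

The recursion is, of course, just the chain rule factored layer-wise; the point of the cited results is that biologically constrained dynamics can realize exactly this factoring with local signals.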

5. Algorithmic and Computational Implications

Theoretical convergence to BP has direct practical consequences for algorithm efficiency, parallelism, and biological plausibility:

  • Sample-wise High-Order Backpropagation: Achieving first-order accuracy with respect to network depth significantly reduces computational complexity: a target error $\epsilon$ is attained with $N = O(\epsilon^{-1})$ steps and overall work $O(\epsilon^{-4})$, versus $O(\epsilon^{-6})$ for half-order methods (Sheng et al., 8 Sep 2025).
  • Predictive Coding and Z-IL: Exact BP is realized with only a linear number of local inference steps per layer, enabling energy-efficient or hardware-friendly implementations with strictly local plasticity (Salvatori et al., 2021).
  • Layerwise and Parallel Training: Local feedback alignment not only restores convergence and scaling but enables fully layerwise or parallel updating, disrupting the backward locking inherent in classical BP and related non-local update rules (Huo et al., 2018, Ernoult et al., 2022).
  • Front-Contribution Algorithm: By collapsing the contribution of all earlier layers into a closed-form reparameterization of the last layer weights, it is possible to bypass all internal parameter updates while guaranteeing outputs identical to BP at every step, under certain restrictions (Mishra et al., 2021).
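
A back-of-envelope calculation (unit constants assumed, my own arithmetic on the quoted exponents) shows how quickly the first-order vs half-order work gap grows with the target accuracy:

```python
# Compare the quoted total-work scalings: O(eps**-4) for the first-order
# scheme vs O(eps**-6) for half-order methods, at several target errors.
for eps in (1e-1, 1e-2, 1e-3):
    first_order = eps ** -4
    half_order = eps ** -6
    print(f"eps={eps:g}: first-order ~{first_order:.0e}, "
          f"half-order ~{half_order:.0e}, ratio {half_order / first_order:.0e}")
```

At $\epsilon = 10^{-3}$ the half-order scheme is already a factor of $\epsilon^{-2} = 10^6$ more expensive, which is why the order of the discretization matters in practice.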

6. Limitations, Open Problems, and Outlook

While the theoretical results are extensive and often striking in their precision, several limitations and open questions remain:

  • Many schemes rely on restrictive hypotheses: smoothness, fixed activation regimes, exactness of initialization, or precise scheduling.
  • Practical convergence in the presence of stochasticity, non-smooth activations (e.g., ReLU, whose activation pattern changes during training), or non-idealized noise has not been uniformly established.
  • In nonconvex and large-scale deep networks, ensuring all structural conditions (e.g., JMC in DTP, equilibrium uniqueness in energy-based methods) may be challenging.
  • The incorporation of batch normalization, skip connections, or adaptive gradient methods in these rigorous settings remains largely open.

Continued investigation aims to relax assumptions while preserving, or at least quantifying, the convergence and gradient-matching guarantees. The theoretical unification of biological plausibility, algorithmic parallelism, and gradient-exactness continues to guide the development of both machine learning systems and computational neuroscience models.
