Beyond-Backpropagation Training
- Beyond-backpropagation training methods are optimization techniques that update neural network parameters locally rather than via global chain-rule based gradients.
- They enable parallel, energy-efficient, and hardware-friendly training by decoupling parameter updates across layers and supporting non-differentiable modules.
- Surveyed techniques like Forward Thinking, feedback alignment, and target propagation demonstrate practical improvements in training speed, scalability, and biological plausibility.
A beyond-backpropagation training method refers to any neural network optimization procedure that avoids global chain-rule-based error propagation (as in standard backpropagation and its derivatives) when updating network parameters. Instead, these approaches employ local, often layer- or module-specific, objectives and error signals. This facilitates independent parameter updates, enhanced parallelizability, support for non-differentiable components, increased biological plausibility, and, depending on the method, substantial hardware or energy efficiency advantages. This article surveys representative beyond-backpropagation techniques, their principles, formalism, performance, and computational implications, referencing contemporary research in the field.
1. Principles and Motivations for Beyond-Backpropagation Approaches
The core driver for beyond-backpropagation algorithms is the elimination of global credit assignment via backpropagation, which is hampered by several issues:
- Sequential "backward locking", impeding parallel and pipelined training;
- Sensitivity to vanishing/exploding gradients, especially in deep or recurrent architectures;
- Dependence on differentiability, precluding integration of black-box or non-differentiable modules like decision trees;
- Incompatibility with large-scale distributed or specialized hardware due to extensive inter-layer communication;
- Biological implausibility, as evidence suggests no analog in natural brains for weight transport or global backward error signaling.
This led to the development of methods where optimization decouples parameter updates across layers or modules, leverages local learning rules, or propagates synthetic or target signals independently from the gradient chain. These paradigms include greedy layer-wise training, target propagation, feedback alignment, direct prediction-based approaches, and evolutionary rule discovery (Hettinger et al., 2017, Kao et al., 2019, Launay et al., 2020, Short, 2023, Alber et al., 2018).
2. Local and Greedy Training Schemes
A major class of beyond-backpropagation methods decomposes the global training problem into layer- or block-wise subproblems, each solved without knowledge of, or gradients from, later layers.
a. Forward Thinking
The "Forward Thinking" method grows a deep network one layer at a time. At each step, a new layer is introduced, trained on the output of the previous layer (with labels), and then "frozen". The entire dataset is mapped through the newly trained layer, and the architecture is augmented by retraining on this mapped data (Hettinger et al., 2017). All previous layers remain immutable, and only the newly added layer is trained at each iteration, thereby avoiding error backpropagation over the stack.
Mathematically, at stage $k$ the new layer $f_{\theta_k}$ (together with a temporary output head $g_k$ that is discarded afterwards) is fitted on the frozen representation of the data,
$$(\theta_k, g_k) \;=\; \arg\min_{\theta,\, g} \sum_i \ell\!\big(g(f_{\theta}(x_i^{(k-1)})),\, y_i\big), \qquad x_i^{(k)} = f_{\theta_k}\!\big(x_i^{(k-1)}\big), \quad x_i^{(0)} = x_i,$$
after which $\theta_k$ is frozen and the remapped dataset $\{(x_i^{(k)}, y_i)\}$ becomes the training data for stage $k+1$.
Forward Thinking supports arbitrary learners per layer, including non-differentiable ones, e.g., decision trees, due to the independence of local objectives.
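A minimal Python sketch of this greedy loop is shown below; the `train_layer` and `tree_layer` helpers are illustrative placeholders, not the reference implementation, and the random-forest layer assumes scikit-learn is available.

```python
import numpy as np

def forward_thinking(X, y, n_layers, train_layer):
    """Greedy layer-wise construction in the spirit of Forward Thinking.

    train_layer(H, y) is any per-layer learner: it fits a new hidden mapping
    on the current representation H using the labels y and returns a callable
    implementing that mapping. Layers may be non-differentiable.
    """
    H = X                           # current representation of the whole data set
    frozen_layers = []
    for _ in range(n_layers):
        layer = train_layer(H, y)   # local, label-supervised fit of one layer
        frozen_layers.append(layer) # freeze it: it is never updated again
        H = layer(H)                # remap the data through the new layer
    return frozen_layers, H         # H feeds a final classifier of any kind

def tree_layer(H, y, n_trees=32):
    """Illustrative non-differentiable layer: per-tree class probabilities of
    a small random forest become the new features (assumes scikit-learn)."""
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators=n_trees).fit(H, y)
    return lambda Z: np.hstack([t.predict_proba(Z) for t in forest.estimators_])
```

Because each stage sees only frozen features and labels, the same loop accepts gradient-trained dense layers, convolutional blocks, or tree ensembles interchangeably.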
b. Backprojection and Kernelized Layerwise Learning
The "Backprojection" method alternates between projecting inputs up through the network and reconstructing labels down, aligning each hidden-layer's representation to a label reconstruction at that depth. Each layer's weights are then tuned to minimize the discrepancy between its forward-projected data and its local label reconstruction, without global backpropagation. The kernelized version extends this approach to a reproducing kernel Hilbert space, ensuring generalized representation power while retaining layerwise independence (Ghojogh et al., 2020).
c. Layer-Wise Kernel Machines
Layerwise training is also formalized via kernel machinery, where each layer is defined as a “kernel machine” fitted through an explicit loss (e.g., supervised representation-similarity, hinge loss) that is constructed so the optimal hidden mapping coincides with the component-minimizer of the end-to-end objective, guaranteeing that the greedy solution is globally optimal for suitable losses (Duan et al., 2018).
3. Local Objectives and Target Decoupling
Several frameworks decompose neural networks into semi-autonomous learning machines or modules, each governed by a local error signal or trainable target.
a. Associated Learning and Modularization
Associated Learning (AL) divides a deep network into independent modules, each with its own forward mapping and a local objective, including bridge mappings and independent autoencoders. Each module minimizes a sum of a target-matching loss and an autoencoder reconstruction loss, and modules are trained in parallel, without crossing gradient dependencies (Kao et al., 2019). Schematically, for module $i$ with forward map $f_i$, bridge map $b_i$, and label-side autoencoder $(e_i, d_i)$, the local objective is
$$\mathcal{L}_i \;=\; \big\| b_i\big(f_i(s_{i-1})\big) - t_i \big\|^2 \;+\; \big\| d_i(t_i) - t_{i-1} \big\|^2, \qquad t_i = e_i(t_{i-1}),$$
where $s_{i-1}$ is the incoming activation and $t_{i-1}$ the incoming target code.
This folded structure eliminates "backward locking", admitting pipelined, parallel module updates in place of the strictly sequential backward pass of backpropagation.
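The per-module computation can be sketched in a few lines of Python; the names `f`, `b`, `enc`, and `dec` below are generic placeholders for the module's forward, bridge, encoder, and decoder maps, not identifiers from the paper's code.

```python
import numpy as np

def al_module_losses(s_prev, t_prev, f, b, enc, dec):
    """Local losses for one Associated Learning module (schematic sketch).

    s_prev : activations arriving from the previous module (input side)
    t_prev : target code arriving from the previous module (label side)
    f      : this module's forward map,        s_i = f(s_prev)
    b      : bridge map into the target space, b(s_i) should match t_i
    enc/dec: this module's label-side autoencoder, t_i = enc(t_prev)
    """
    s_i = f(s_prev)
    t_i = enc(t_prev)
    target_matching = np.mean((b(s_i) - t_i) ** 2)      # bridge / association loss
    reconstruction = np.mean((dec(t_i) - t_prev) ** 2)  # autoencoder loss
    # Each module minimizes this sum locally; since no term depends on other
    # modules' parameters, all modules can be optimized in parallel.
    return target_matching + reconstruction, s_i, t_i
```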
b. Zenkai and Semi-Autonomous Layer Abstractions
The Zenkai framework exposes each layer as a "LearningMachine" with definable update and target-propagation rules: the parameter-update step and the target-propagation step can each be implemented as conventional gradient descent, or as evolutionary search, hill climbing, or any other procedure compatible with per-layer optimization. This unlocks support for non-differentiable layers, hybrid feedback alignment, stochastic search, or population-based methods at the layer level (Short, 2023).
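The layer-as-learning-machine idea can be captured by a small interface; the sketch below is a generic illustration of that abstraction, not Zenkai's actual API or class names.

```python
from abc import ABC, abstractmethod

class LocalLearner(ABC):
    """A layer that owns both its parameter update and its target propagation."""

    @abstractmethod
    def forward(self, x):
        """Map inputs to outputs."""

    @abstractmethod
    def update(self, x, t):
        """Adjust this layer's parameters so forward(x) moves toward target t.
        Could be a gradient step, hill climbing, an evolutionary step, etc."""

    @abstractmethod
    def propagate_target(self, x, t):
        """Return a target for the preceding layer, replacing gradient flow."""
```

A stack of such learners is trained by walking from the output toward the input, calling `propagate_target` to hand each preceding layer its own target and `update` to act on it, with no requirement that any individual layer be differentiable.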
4. Feedback and Synthetic Gradients
Layer- or module-wise updates can also be driven by proxy or synthetic gradients that do not require full backpropagation.
a. Direct Feedback Alignment and Hardware Implications
Direct Feedback Alignment (DFA) eschews the sequential backprop chain by projecting the global output error back to each hidden layer via fixed, random feedback matrices. This breaks the dependency on upstream gradients and enables layer-parallel computation. The per-layer updates no longer require upstream weights: with global output error $e$, fixed random matrix $B_l$, pre-activations $a_l$, nonlinearity $\phi$, and layer inputs $h_{l-1}$, each hidden layer updates as
$$\delta a_l = (B_l\, e) \odot \phi'(a_l), \qquad \Delta W_l = -\eta\, \delta a_l\, h_{l-1}^{\top},$$
so every layer can compute its update as soon as the output error is available.
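A minimal NumPy sketch of this update for a stack of dense layers follows; the tanh nonlinearity and the convention that the output layer consumes the raw error (as with softmax plus cross-entropy) are assumptions for illustration.

```python
import numpy as np

def dfa_update(error, inputs, pre_acts, feedback, weights, lr=0.1):
    """One Direct Feedback Alignment step over all layers (in-place update).

    error    : global output error e = y_hat - y, shape (batch, n_out)
    inputs   : inputs[l] is the input h_{l-1} seen by layer l in the forward pass
    pre_acts : pre_acts[l] is the pre-activation a_l of layer l
    feedback : fixed random matrices; feedback[l] has shape (n_out, n_l)
    weights  : list of [W_l, b_l] parameter arrays
    """
    batch = error.shape[0]
    last = len(weights) - 1
    for l, (W, b) in enumerate(weights):
        if l == last:
            delta = error                    # output layer uses the true error
        else:
            # Project the global error straight to layer l through a fixed
            # random matrix; no upstream weights or gradients are needed.
            delta = (error @ feedback[l]) * (1.0 - np.tanh(pre_acts[l]) ** 2)
        W -= lr * inputs[l].T @ delta / batch
        b -= lr * delta.mean(axis=0)
```

Because each layer's delta depends only on the shared output error and its own fixed feedback matrix, the loop body can execute for all layers in parallel once the forward pass and the output error are available.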
On distributed architectures, especially at trillion-parameter scale, DFA’s fixed feedback reduces all-to-all communication, offering orders-of-magnitude hardware savings if implemented in, e.g., optical co-processors (Launay et al., 2020). However, scaling laws indicate DFA fails to close the performance gap with BP for large transformers, with strictly worse loss–compute scaling and no regime in which compute savings offset accuracy loss (Filipovich et al., 2022).
b. Cascaded Forward and Block-wise Local Prediction
The "Cascaded Forward" (CaFo) algorithm attaches independent predictors to each cascaded block (e.g., a CNN block), each trained to predict labels directly from its local features. Training can use closed-form regression, cross-entropy, or sparsemax losses, with optional pre-training via DFA-style feedback. No chain-rule gradients between blocks are used, and predictors can be trained in parallel after a single forward pass (Zhao et al., 2023).
5. Evolutionary and Alternative Local Update Rule Discovery
Automated search over non-standard backward equations ("Backprop Evolution") formalizes the search for faster-converging surrogates of classical backprop by representing layerwise update rules as expressions in a domain-specific language composed of forward activations, backward signals, and random feedback matrices. The evolutionary process discovers rules involving normalization, clipping, and stochastic feedback that can accelerate early training, though they ultimately match BP's asymptotic convergence (Alber et al., 2018).
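As a concrete (hand-written, not discovered) example of the kind of expression this search space contains, the surrogate backward rule below mixes the transported gradient with a fixed random-feedback term and then normalizes and clips the result.

```python
import numpy as np

def surrogate_backward(delta_up, W, pre_act, B, clip=1.0, eps=1e-8):
    """Example surrogate for the standard backward equation of one dense layer.

    delta_up : error signal from the layer above, shape (batch, n_up)
    W        : this layer's forward weights,       shape (n_this, n_up)
    pre_act  : this layer's pre-activations,       shape (batch, n_this)
    B        : fixed random feedback matrix,       shape (n_this, n_up)
    """
    g = delta_up @ W.T                 # standard transported-gradient term
    r = delta_up @ B.T                 # random-feedback term
    h = (g + r) * (pre_act > 0)        # ReLU derivative assumed
    h /= np.linalg.norm(h, axis=1, keepdims=True) + eps   # per-example normalization
    return np.clip(h, -clip, clip)     # clipping, as in several discovered rules
```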
6. Empirical Performance, Efficiency, and Limitations
A range of beyond-backpropagation methods have shown competitive, and in some cases superior, performance and efficiency compared to classical BP, particularly in terms of:
- Wall-clock speed per epoch and earlier convergence for Forward Thinking and Mono-Forward on fully connected and convolutional benchmarks (Hettinger et al., 2017, Gong et al., 2025).
- Pipeline and energy efficiency for Mono-Forward, with up to 41% less energy usage and 34% less training time while maintaining or surpassing BP accuracy in multi-layer perceptrons (Spyra, 2025).
- Near parity of accuracy between modular methods (e.g., Associated Learning, Zenkai, Cascaded Forward) and BP on MNIST, Fashion-MNIST, and CIFAR-10, with some methods slightly surpassing BP on specific architectures or in robustness/noisy-data regimes (Kao et al., 2019, Short, 2023, Zhao et al., 2023, Gong et al., 2025).
However, limitations persist:
- Purely greedy or local methods may overfit early layers or inadequately coordinate cross-layer representations, necessitating careful regularization, stopping criteria, or possible hybridization with BP for final fine-tuning (Hettinger et al., 2017, Gong et al., 2025).
- Random feedback alignment methods generally display inferior scaling, and attempts to close the performance gap at scale (e.g., in Transformers) have not succeeded (Filipovich et al., 2022).
- The lack of a full gradient of the global objective can lead to suboptimal overall solutions in architectures with strong inter-layer dependencies unless the local proxies are tightly aligned with global task requirements.
- Some methods introduce auxiliary parameters (e.g., projection matrices, local predictors) or require extra computation for local objectives, though typically still less than global BP.
7. Theoretical and Practical Implications
The adoption of beyond-backpropagation methods has profound consequences for the design of neural network training:
- Decoupling updates unlocks massive pipeline and device-parallelism, especially on distributed or neuromorphic hardware (Launay et al., 2020, Gong et al., 2025).
- Local learning rules enhance interpretability and facilitate modular architecture design, supporting non-differentiable and black-box components (Short, 2023, Hettinger et al., 2017, Zhao et al., 2023).
- Biological plausibility is significantly improved: local objectives, absence of backward weight transport, and the opportunity for continual, online, or on-device learning respond to critiques of conventional BP from computational neuroscience (Short, 2023, Millidge et al., 2022).
- On recurrent and dynamical systems, unbiased stochastic gradient estimators built via rank-one approximations (e.g., NoBackTrack) allow online, memory-efficient training that outperforms truncated BPTT on tasks with long-range dependencies (Ollivier et al., 2015).
- Certified optimality (in certain settings, e.g., layerwise kernels) is possible for specific losses, guaranteeing that locally optimal solutions are globally optimal for the network as formulated (Duan et al., 2018).
- Energy and memory cost reductions for hardware-constrained training, as well as prospects for lower environmental impact via sustainable AI practices, are realized in Mono-Forward's energy/CO₂e savings (Spyra, 2025).
Open questions remain about scaling laws for all local methods and their generalization when extended to transformers or large multi-modal models. Progress in hardware-aware algorithm design, cross-layer coordination, hybrid strategies, and extension to unsupervised and semi-supervised settings is ongoing.
References:
- "Forward Thinking: Building and Training Neural Networks One Layer at a Time" (Hettinger et al., 2017)
- "Associated Learning: Decomposing End-to-end Backpropagation based on Auto-encoders and Target Propagation" (Kao et al., 2019)
- "Hardware Beyond Backpropagation: a Photonic Co-Processor for Direct Feedback Alignment" (Launay et al., 2020)
- "Zenkai -- Framework For Exploring Beyond Backpropagation" (Short, 2023)
- "Backprop Evolution" (Alber et al., 2018)
- "Mono-Forward: Backpropagation-Free Algorithm for Efficient Neural Network Training Harnessing Local Errors" (Gong et al., 16 Jan 2025)
- "Predictive Coding: Towards a Future of Deep Learning beyond Backpropagation?" (Millidge et al., 2022)
- "Backprojection for Training Feedforward Neural Networks in the Input and Feature Spaces" (Ghojogh et al., 2020)
- "On Kernel Method-Based Connectionist Models and Supervised Deep Learning Without Backpropagation" (Duan et al., 2018)
- "The Cascaded Forward Algorithm for Neural Network Training" (Zhao et al., 2023)
- "Beyond Backpropagation: Exploring Innovative Algorithms for Energy-Efficient Deep Neural Network Training" (Spyra, 23 Sep 2025)
- "Training recurrent networks online without backtracking" (Ollivier et al., 2015)