Adjoint Method for Backpropagation

Updated 2 May 2026

The adjoint method for backpropagation is a framework that computes gradients by solving an auxiliary differential equation backward in time, bypassing the need for full Jacobian computations.
It delivers memory- and compute-efficient gradient evaluation in neural ODEs, graph convolutional models, and simulation-based inference.
The same formulation covers classical backpropagation and extends to hardware implementations such as photonic networks and memristor arrays.

The adjoint method for backpropagation is a principled framework for computing gradients in dynamical systems, neural networks (including ODE-based architectures), and many other domains where parameter sensitivities must be efficiently evaluated. It generalizes and unifies the conceptual basis of backpropagation by formulating gradient computation as the solution to an adjoint (dual) system—typically an ordinary or stochastic differential equation integrated backward in time. This approach enables memory- and compute-efficient gradient evaluation, avoids explicit computation of full Jacobians, and, in modern contexts, underpins neural ODEs, hardware-in-the-loop training protocols, and simulation-based inference.

1. Mathematical Foundations and Core Formalism

Given a forward evolution equation such as a neural ODE

$\frac{dh}{dt} = f(h(t), t, \theta),\quad h(t_0) = h_0,$

the adjoint method introduces an auxiliary variable, the adjoint state $a(t)$, defined as the instantaneous gradient of the loss

$a(t) \equiv \frac{\partial L}{\partial h(t)}.$

The adjoint dynamics follow from Pontryagin’s maximum principle or the calculus of variations:

$\frac{da}{dt} = -\left[\frac{\partial f}{\partial h}\right]^\top a, \quad a(t_1) = \frac{\partial L}{\partial h(t_1)}.$

The gradient with respect to parameters $\theta$ is then

$\frac{\partial L}{\partial \theta} = -\int_{t_1}^{t_0} a(t)^\top \frac{\partial f}{\partial \theta} \, dt.$

For practical computation, this results in an augmented system in which the forward state, the adjoint, and the accumulated parameter gradient $z$ are solved in one backward sweep from $t_1$ to $t_0$, starting from $z(t_1) = 0$ so that $z(t_0) = \partial L / \partial \theta$:

$\frac{d}{dt}\begin{pmatrix} h \\ a \\ z \end{pmatrix} = \begin{pmatrix} f(h, t, \theta) \\ -\left[\frac{\partial f}{\partial h}\right]^\top a \\ -\left[\frac{\partial f}{\partial \theta}\right]^\top a \end{pmatrix}.$

This continuous formulation is essential in neural ODEs, physics-based simulations, and when backpropagating through complex non-discrete computation graphs (Cai, 2022, Li et al., 2022).
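The backward sweep of the augmented system can be sketched on a toy problem. The example assumes a scalar vector field $f = \theta h$, the loss $L = \tfrac{1}{2} h(t_1)^2$, and a hand-written RK4 stepper. None of these choices come from the cited works, and the helper names (`rk4_step`, `adjoint_gradient`) are hypothetical. For this toy problem the exact gradient is $h_0^2 (t_1 - t_0) e^{2\theta (t_1 - t_0)}$, which gives a reference value to compare against.

```python
import math

def f(h, t, theta):
    # Toy vector field dh/dt = theta * h (assumed for illustration only).
    return theta * h

def df_dh(h, t, theta):
    return theta  # partial f / partial h

def df_dtheta(h, t, theta):
    return h      # partial f / partial theta

def rk4_step(rhs, y, t, dt):
    # One classical Runge-Kutta step for a tuple-valued state; dt may be negative.
    def shift(y, k, c):
        return tuple(yi + c * ki for yi, ki in zip(y, k))
    k1 = rhs(y, t)
    k2 = rhs(shift(y, k1, 0.5 * dt), t + 0.5 * dt)
    k3 = rhs(shift(y, k2, 0.5 * dt), t + 0.5 * dt)
    k4 = rhs(shift(y, k3, dt), t + dt)
    return tuple(yi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for yi, a, b, c, d in zip(y, k1, k2, k3, k4))

def adjoint_gradient(theta, h0, t0, t1, n_steps=200):
    dt = (t1 - t0) / n_steps

    # Forward pass: only the final state is kept, not the trajectory.
    state, t = (h0,), t0
    for _ in range(n_steps):
        state = rk4_step(lambda s, tt: (f(s[0], tt, theta),), state, t, dt)
        t += dt
    h1 = state[0]
    loss = 0.5 * h1 ** 2

    def augmented(s, tt):
        # Right-hand side of the augmented system (h, a, z).
        h, a, z = s
        return (f(h, tt, theta),
                -df_dh(h, tt, theta) * a,
                -df_dtheta(h, tt, theta) * a)

    # Backward sweep from t1 to t0 with a(t1) = dL/dh(t1) = h1 and z(t1) = 0.
    aug, t = (h1, h1, 0.0), t1
    for _ in range(n_steps):
        aug = rk4_step(augmented, aug, t, -dt)
        t -= dt
    h0_reconstructed, _, dL_dtheta = aug
    return loss, dL_dtheta, h0_reconstructed

loss, grad, h0_rec = adjoint_gradient(theta=0.5, h0=1.0, t0=0.0, t1=1.0)
exact = 1.0 ** 2 * 1.0 * math.exp(2.0 * 0.5 * 1.0)
print(grad, exact)
```

The forward state is re-integrated backward alongside the adjoint, which is what lets the sketch avoid storing the forward trajectory. For stiff or chaotic dynamics this reconstruction can drift from the true forward states.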

2. Algorithmic Realizations: From Classical Backpropagation to Vectorized Adjoint Methods

Classical backpropagation in feedforward deep networks is an application of the adjoint operator sequence to the chain of affine and nonlinear activations. The “F-adjoint” framework formalizes this by treating the backward pass as the adjoint (transpose) of the forward propagation (Boughammoura, 2023, Boughammoura, 2024). Explicitly, for layers $Y^h = W^h X^{h-1}$ and activations $X^h = \sigma(Y^h)$, the backward recursion is

$Y_*^h = X_*^h \odot \sigma'(Y^h), \quad X_*^{h-1} = (W^h)^\top Y_*^h,$

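with $X_*$ at the output layer set to the gradient of the loss with respect to the network output and, crucially, $\frac{\partial L}{\partial W^h} = Y_*^h (X^{h-1})^\top$.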
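A minimal NumPy sketch of this backward recursion follows. It assumes $\sigma = \tanh$, a squared-error loss, and the standard weight gradient $\partial L / \partial W^h = Y_*^h (X^{h-1})^\top$ that follows from $Y^h = W^h X^{h-1}$. The function names are hypothetical and are not taken from the F-adjoint papers.

```python
import numpy as np

def forward(weights, x0):
    # Forward pass: Y^h = W^h X^{h-1}, X^h = tanh(Y^h).
    xs, ys = [x0], []
    for W in weights:
        ys.append(W @ xs[-1])
        xs.append(np.tanh(ys[-1]))
    return xs, ys

def f_adjoint_backward(weights, xs, ys, x_star_out):
    # Backward recursion: Y_*^h = X_*^h * sigma'(Y^h), X_*^{h-1} = (W^h)^T Y_*^h.
    grads = [None] * len(weights)
    x_star = x_star_out
    for h in reversed(range(len(weights))):
        y_star = x_star * (1.0 - np.tanh(ys[h]) ** 2)
        grads[h] = np.outer(y_star, xs[h])  # dL/dW^h = Y_*^h (X^{h-1})^T
        x_star = weights[h].T @ y_star
    return grads

def loss(weights, x0, target):
    xs, _ = forward(weights, x0)
    return 0.5 * np.sum((xs[-1] - target) ** 2)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x0, target = rng.normal(size=3), rng.normal(size=2)

xs, ys = forward(weights, x0)
# For the squared-error loss, the output-layer adjoint is X^out - target.
grads = f_adjoint_backward(weights, xs, ys, xs[-1] - target)
print([g.shape for g in grads])
```

Each backward step reuses only the stored pre-activation and the transposed weight matrix, so no layer Jacobian is ever formed explicitly.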

In more general settings, such as graph convolutional neural ODEs (GCDEs), the adjoint method is vectorized to support efficient implementation and hardware specialization. When a graph convolution is the core operation, the adjoint update is written directly in matrix form. For GCDEs, this enables adjoint updates for both node features and graph weights without instantiating large Jacobians, using only matrix products and pointwise operations (Cai, 2022).
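As an illustration of a Jacobian-free adjoint update, the sketch below assumes a graph-convolution vector field of the form $f(H) = \tanh(\hat{A} H W)$. This specific form, the symbols $\hat{A}$, $H$, $W$, $G$, and the function names are assumptions for illustration and are not taken from (Cai, 2022). Given an incoming adjoint $G$ with the same shape as $H$, the vector-Jacobian products reduce to $\hat{A}^\top (G \odot \sigma') W^\top$ for the node features and $(\hat{A} H)^\top (G \odot \sigma')$ for the weights.

```python
import numpy as np

def gcn_field(A, H, W):
    # Assumed graph-convolution vector field f(H) = tanh(A H W).
    return np.tanh(A @ H @ W)

def gcn_adjoint_products(A, H, W, G):
    # Vector-Jacobian products of gcn_field contracted with G, using only
    # matrix products and pointwise operations (no Jacobian is materialized).
    P = A @ H @ W
    S = G * (1.0 - np.tanh(P) ** 2)   # G * sigma'(P), pointwise
    adj_H = A.T @ S @ W.T             # contraction with df/dH
    adj_W = (A @ H).T @ S             # contraction with df/dW
    return adj_H, adj_W

rng = np.random.default_rng(1)
A = 0.3 * rng.normal(size=(5, 5))     # stand-in for a normalized adjacency matrix
H = rng.normal(size=(5, 3))           # node features
W = rng.normal(size=(3, 3))           # graph-convolution weights
G = rng.normal(size=(5, 3))           # incoming adjoint, same shape as H

adj_H, adj_W = gcn_adjoint_products(A, H, W, G)
print(adj_H.shape, adj_W.shape)
```

In an adjoint ODE of the form $\frac{da}{dt} = -[\partial f / \partial h]^\top a$, `-adj_H` with `G` set to the current adjoint state would play the role of the right-hand side, and `-adj_W` the role of the weight-gradient integrand.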

3. Specialized Extensions: Fractional, Stochastic, and Constrained Dynamics

The adjoint framework extends beyond deterministic ODEs:

Neural Fractional-Order Differential Equations: When system dynamics involve Caputo derivatives of fractional order, the backward pass involves a right-sided Caputo derivative on the adjoint state. The parameter gradient is then assembled from the adjoint trajectory, with significant memory advantages over direct differentiation through the solver (Kang et al., 20 Mar 2025).

Stochastic Differential Equations (SDEs): The adjoint method is generalized to systems with multiplicative noise by formulating a backward SDE for the co-state. The approach (adjoint path-kernel) includes kernel-based damping to mitigate gradient explosion in unstable or chaotic regimes (Ni, 29 Jul 2025).
Chaos and Long-Time Statistics: For uniformly hyperbolic systems, an adjoint shadowing operator on covector fields is constructed, enabling the computation of linear responses via adjoint backward iteration combined with projection onto stable/unstable bundles (Ni, 2022).

4. Memory-Efficient Implementations and Discretization Considerations

A core strength of the adjoint method is its much smaller memory footprint compared with backpropagating through an unrolled autograd graph or with checkpointing. For neural ODEs or particle-mesh (PM) N-body simulations, the backward pass stores only the state at the current integration point, optionally reusing recomputed forward trajectories rather than storing the full forward history (Li et al., 2022, Matsubara et al., 2021). In high-resolution simulations or models with thousands of time steps or parameters, this makes training and inference practical within a fixed GPU memory budget.

However, for consistency, the discretization of the adjoint system must be carefully matched to the forward solver. Only when the backward integrator uses the exact (operator) adjoint of the forward integrator can the adjoint gradient be guaranteed to match the gradients produced by backpropagation through the unrolled computation (Hu, 2024, Matsubara et al., 2021). Symplectic integrators resolve this for explicit Runge–Kutta methods in neural ODEs, ensuring conservation of the relevant variational bilinear form and hence unbiased gradients at any step-size (Matsubara et al., 2021).
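The matching requirement can be seen on a toy problem. The sketch assumes the scalar dynamics $\frac{dh}{dt} = \theta h$, forward Euler as the forward solver, and the loss $L = \tfrac{1}{2} h_N^2$. All function names are hypothetical. It compares the exact discrete adjoint of the Euler map against a backward Euler-in-reverse-time integration of the continuous adjoint. For simplicity the discrete adjoint here stores the forward trajectory, so it illustrates consistency, not memory savings.

```python
def euler_forward(theta, h0, dt, n_steps):
    # Forward Euler: h_{k+1} = h_k + dt * theta * h_k. Returns all states.
    hs = [h0]
    for _ in range(n_steps):
        hs.append(hs[-1] + dt * theta * hs[-1])
    return hs

def unrolled_gradient(theta, h0, dt, n_steps):
    # Closed-form derivative of L = 0.5 * h_N^2 with h_N = h0 * (1 + dt*theta)^N.
    h_n = h0 * (1.0 + dt * theta) ** n_steps
    return h_n * h0 * n_steps * dt * (1.0 + dt * theta) ** (n_steps - 1)

def discrete_adjoint_gradient(theta, h0, dt, n_steps):
    # Exact adjoint (transpose) of each forward Euler step, applied in reverse.
    hs = euler_forward(theta, h0, dt, n_steps)
    a, grad = hs[-1], 0.0                  # a_N = dL/dh_N = h_N
    for k in reversed(range(n_steps)):
        grad += a * dt * hs[k]             # d h_{k+1} / d theta = dt * h_k
        a = a * (1.0 + dt * theta)         # d h_{k+1} / d h_k = 1 + dt * theta
    return grad

def continuous_adjoint_euler_gradient(theta, h0, dt, n_steps):
    # Continuous adjoint ODE discretized on its own with Euler in reverse time,
    # with h reconstructed backward; this is NOT the adjoint of the forward map.
    h = euler_forward(theta, h0, dt, n_steps)[-1]
    a, grad = h, 0.0
    for _ in range(n_steps):
        grad += dt * a * h
        h, a = h - dt * theta * h, a + dt * theta * a
    return grad

args = (0.5, 1.0, 0.1, 10)
print(unrolled_gradient(*args),
      discrete_adjoint_gradient(*args),
      continuous_adjoint_euler_gradient(*args))
```

With these settings the discrete adjoint agrees with the unrolled gradient to rounding error. The separately discretized continuous adjoint differs by a step-size-dependent amount that shrinks as `dt` decreases.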

5. Adjoint Methods in Hardware, Photonics, and Gauge-Consistent Learning

Adjoint formulations are especially suited to hardware implementations:

Photonic Neural Networks: The adjoint variable method enables in situ backpropagation: forward and backward field propagation, with physical time-reversal and interferometric measurement, recovers exact parameter gradients via the physical adjoint of the Maxwell operator. This requires only three optical passes per layer with parallel intensity measurements, which fits experimental constraints well (Hughes et al., 2018).
Memristor and Crossbar Arrays: The vectorized adjoint is designed to avoid materializing Jacobians, instead mapping all updates to low-rank matrix products. This makes it well suited to two-dimensional crossbar accelerators, with massive parallelism and minimized data movement (Cai, 2022).
Unit-Consistent and Gauge-Invariant Learning: For positively homogeneous networks (e.g., ReLU MLPs), standard backpropagation is not equivariant to node-wise diagonal rescalings (“gauge transformations”). The unit-consistent (UC) adjoint replaces the conventional transpose with a gauge-covariant adjoint, yielding gauge-consistent steepest descent updates and invariance to affine basis choice. The update is parameterization-agnostic and maintains equivalence classes under node-wise scaling (Uhlmann, 15 Jan 2026).

6. Generalizations and Theoretical Insights

The adjoint method gives a unified operator-theoretic and variational interpretation of backpropagation. In Hilbert-space terms, the backward pass applies the adjoint of each composition in the forward graph, using the inner-product structure to represent the chain rule as a sequence of adjoint operators. This perspective clarifies the relation to classical adjoint methods in PDE-constrained optimization, least squares, and control theory—establishing backpropagation as the canonical adjoint computation under the computational graph (Bui-Thanh, 2023).
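The operator view suggests a simple numerical check, often called a dot-product test: for the linearization $J$ of the forward graph and its adjoint $J^\top$, the identity $\langle J v, w \rangle = \langle v, J^\top w \rangle$ should hold for arbitrary $v$ and $w$. The sketch below assumes the Euclidean inner product and $\tanh$ layers, with hypothetical helper names. The forward linearization applies per-layer Jacobians in forward order, and the backward pass applies their transposes in reverse order.

```python
import numpy as np

def make_layer(W):
    # Returns the layer map x -> tanh(W x) with its linearization and adjoint.
    def apply(x):
        return np.tanh(W @ x)
    def jvp(x, v):
        return (1.0 - np.tanh(W @ x) ** 2) * (W @ v)      # J v
    def vjp(x, w):
        return W.T @ ((1.0 - np.tanh(W @ x) ** 2) * w)    # J^T w
    return apply, jvp, vjp

def chain_jvp(layers, x, v):
    # Forward-mode: push a tangent through the layers in forward order.
    for apply, jvp, _ in layers:
        v = jvp(x, v)
        x = apply(x)
    return v

def chain_vjp(layers, x, w):
    # Reverse-mode: record layer inputs, then apply adjoints in reverse order.
    inputs = []
    for apply, _, _ in layers:
        inputs.append(x)
        x = apply(x)
    for (_, _, vjp), x_in in zip(reversed(layers), reversed(inputs)):
        w = vjp(x_in, w)
    return w

rng = np.random.default_rng(3)
layers = [make_layer(rng.normal(size=(4, 3))), make_layer(rng.normal(size=(2, 4)))]
x, v, w = rng.normal(size=3), rng.normal(size=3), rng.normal(size=2)

lhs = np.dot(chain_jvp(layers, x, v), w)   # <J v, w>
rhs = np.dot(v, chain_vjp(layers, x, w))   # <v, J^T w>
print(lhs, rhs)
```

Under a different inner product the adjoint is no longer the plain transpose, which is one way to read the gauge-covariant variants discussed earlier.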

In biologically motivated schemes, the Lagrangian (saddle-point) framework provides local learning rules where adjoint (multiplier) variables propagate error signals with only neighbor and local information, potentially circumventing the weight transport problem and achieving asynchronous convergence (Betti et al., 2018, Boughammoura, 2024).

7. Applications and Empirical Impact

Adjoint backpropagation is foundational for:

Neural ODEs and continuous normalizing flows: Enabling continuous-depth architectures with scalable training (Matsubara et al., 2021).
Graph models and spatiotemporal networks: Supporting efficient learning in settings with structured transformations (Cai, 2022).
Physical and scientific simulations: Allowing field-level gradient-based inference and high-precision parameter estimation, notably for cosmology, molecular dynamics, and physical systems (Li et al., 2022, Kang et al., 20 Mar 2025).
Spiking neural networks: Permitting exact event-based backpropagation via adjoint jumps at spike times (Wunderlich et al., 2020).
Diffusion models: Making large-scale customization and optimization of diffusion probabilistic models (DPMs) feasible with constant memory via augmented ODE adjoints (Pan et al., 2023).

The adjoint method’s flexibility, operator-theoretic clarity, and resource efficiency have made it a widely used approach for scalable, physically motivated, and hardware-efficient learning systems in scientific computing and modern machine learning.