Jacobian-Free Backpropagation (JFB)

Updated 7 February 2026

Jacobian-Free Backpropagation (JFB) is a gradient estimation method that bypasses costly Jacobian inversion by substituting it with a zeroth-order (identity) approximation.
It enables efficient training of implicit and equilibrium models, reducing memory usage and computational overhead in bilevel, control, and inverse-problem applications.
Under appropriate contractivity conditions, JFB guarantees a descent direction and competitive empirical performance while approximating the true implicit gradient.

Jacobian-Free Backpropagation (JFB) is a class of gradient estimation techniques designed to enable efficient training of models defined via fixed-point equations or equilibrium conditions, especially in scenarios where standard reverse-mode automatic differentiation is computationally prohibitive due to the necessity of forming or inverting large Jacobian matrices. JFB leverages an approximation that circumvents the bottleneck of implicit differentiation by replacing the expensive Jacobian-inverse term with a computationally lightweight surrogate, offering scalable and memory-efficient training for implicit networks, equilibrium models, and a broad range of bilevel, control, and inverse-problem applications (Liu et al., 2024).

1. Mathematical Foundation and Core Principle

In implicit models, the output $x^*$ is defined as the solution of a fixed-point equation

$T_\Theta(x^*) = x^*$

where $T_\Theta$ is a parameterized operator. Training proceeds by minimizing a scalar loss $\ell(x^*, x_{\mathrm{true}})$ with respect to parameters $\Theta$ . The exact implicit gradient, derived via the Implicit Function Theorem, is

$\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$

where $J_{x}T_\Theta(x^*)$ is the Jacobian $\partial T_\Theta/\partial x$ at $x^*$ . Computing or inverting the $n \times n$ Jacobian ( $T_\Theta(x^*) = x^*$ 0 often being the number of output features) is prohibitively expensive in high-dimensional regimes.

JFB replaces $T_\Theta(x^*) = x^*$ 1 with the identity matrix, yielding the surrogate gradient: $T_\Theta(x^*) = x^*$ 2 This corresponds to taking only the zeroth term in the Neumann expansion of the inverse, i.e., $T_\Theta(x^*) = x^*$ 3 (Liu et al., 2024). This approximation can be justified under contractivity conditions on $T_\Theta(x^*) = x^*$ 4, under which JFB still produces a descent direction for the original loss.

2. Algorithmic Implementation and Computational Benefits

The practical implementation of JFB comprises the following steps:

Fixed-Point Solve: For each training datum (e.g., measurement $T_\Theta(x^*) = x^*$ 5), find the equilibrium state $T_\Theta(x^*) = x^*$ 6 such that $T_\Theta(x^*) = x^*$ 7. This is typically accomplished with a fixed-point iteration or acceleration technique (e.g., Anderson acceleration).
Loss Computation: Evaluate the scalar loss $T_\Theta(x^*) = x^*$ 8.
Jacobian-Free Gradient: Compute the JFB update direction $T_\Theta(x^*) = x^*$ 9 as the product of the gradient of the loss with respect to $T_\Theta$ 0 and the Jacobian of $T_\Theta$ 1 with respect to $T_\Theta$ 2 at $T_\Theta$ 3. This can be realized using two passes in standard autodiff frameworks—one reverse-mode (for $T_\Theta$ 4), one forward-mode (for evaluating $T_\Theta$ 5).
Parameter Update: Perform the gradient descent step $T_\Theta$ 6.

This workflow completely avoids explicit formation, storage, or inversion of large Jacobians. The per-sample memory cost is $T_\Theta$ 7 with no iteration-dependent growth, and the runtime per iteration matches that of a single forward and backward autodiff pass (Liu et al., 2024, Jaffe et al., 2023, Fung et al., 2021). The contrast with unrolled backpropagation and exact implicit differentiation methods is summarized below:

Method	Time Complexity	Memory Complexity
Unrolled Backprop ( $T_\Theta$ 8 steps)	$T_\Theta$ 9	$\ell(x^*, x_{\mathrm{true}})$ 0
Implicit Differentiation	$\ell(x^*, x_{\mathrm{true}})$ 1	$\ell(x^*, x_{\mathrm{true}})$ 2 (requires inversion)
JFB	$\ell(x^*, x_{\mathrm{true}})$ 3	$\ell(x^*, x_{\mathrm{true}})$ 4

( $\ell(x^*, x_{\mathrm{true}})$ 5 is the cost per application of $\ell(x^*, x_{\mathrm{true}})$ 6, $\ell(x^*, x_{\mathrm{true}})$ 7 is the number of unrolled layers or iterations, $\ell(x^*, x_{\mathrm{true}})$ 8 is the solver tolerance) (Liu et al., 2024).

3. Theoretical Guarantees and Approximation Error

The theoretical validity of JFB is underpinned by contractivity assumptions. If $\ell(x^*, x_{\mathrm{true}})$ 9 is a $\Theta$ 0-Lipschitz map in $\Theta$ 1 at equilibrium with $\Theta$ 2, the error between the exact implicit gradient and the JFB approximation is bounded as: $\Theta$ 3 Thus, as $\Theta$ 4 becomes more contractive ( $\Theta$ 5), or as the mapping becomes superlinear (e.g., in Newton-type updates), the JFB approximation approaches the exact implicit gradient (Davy et al., 16 Jun 2025, Bolte et al., 2023).

In supervised problems and bilevel optimization, limiting cases can be made arbitrarily accurate by increasing the contractivity (often controlled by spectral normalization or careful step-size selection). For Newton or strongly convex solvers, the one-step Jacobian converges quadratically as iterates approach the solution.

Provable descent is retained: under contraction, the JFB update is guaranteed to be a descent direction for the true loss (Liu et al., 2024, Fung et al., 2021, Davy et al., 16 Jun 2025).

4. Applications Across Domains

JFB has been adopted in a range of settings where models are defined implicitly or where memory efficiency is crucial:

Inverse Imaging and Image Deblurring: DEQ models using JFB achieve competitive PSNR and SSIM, outperforming total variation and Plug-and-Play baselines and are within reach of full implicit DEQ models at a fraction of the computation and memory overhead (Liu et al., 2024).
Neural Network Quantization: IDKM-JFB enables soft- $\Theta$ 6-means quantization of large models such as ResNet-18 on constrained hardware, with backward time and memory independent of the number of $\Theta$ 7-means iterations and minimal performance loss compared to full DKM (Jaffe et al., 2023).
Bilevel and Hyperparameter Optimization: JFB has been used to differentiate through optimization-based inner loops for learning hyperparameters, stepsizes, and weights in imaging (Davy et al., 16 Jun 2025).
Differentiating Through Optimization Layers in Combinatorial Problems: JFB enables end-to-end learning with decision-focused ILP surrogates solved via Davis-Yin splitting, allowing regularized shortest paths and knapsack layers to scale efficiently (McKenzie et al., 2023).
Optimal Control with Implicit Hamiltonians: JFB supports training of high-dimensional value-function networks when the control law is defined implicitly via the maximum principle, both with sample-wise and stochastic descent guarantees (Gelphman et al., 1 Oct 2025, Gelphman et al., 31 Jan 2026).

In all these applications, JFB enables learning in regimes where memory consumption, regular autodiff, or solver inversion are prohibitive.

5. Empirical Evidence and Computational Impact

Experimental results uniformly show that JFB drastically reduces memory and runtime overhead, with modest or negligible loss in solution quality:

In image deblurring ( $\Theta$ 8 RGB images), JFB achieves $\Theta$ 9 PSNR and $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 0 SSIM, on par with Plug-and-Play ( $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 1/ $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 2) and much faster than full implicit models ( $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 3/ $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 4). JFB remains $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 5– $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 6 faster than full-implicit backprop at large image sizes (Liu et al., 2024).
In quantization, IDKM-JFB enables 100-epoch training of a 2-layer conv net on MNIST in $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 7 s (vs. $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 8 s for implicit, $\frac{\partial \ell}{\partial \Theta} = \frac{\partial \ell}{\partial x^*} \left( I - J_{x}T_\Theta(x^*) \right)^{-1} \frac{\partial T_\Theta(x^*)}{\partial \Theta}$ 9 s for full), with top-1 accuracy of $J_{x}T_\Theta(x^*)$ 0 (vs. $J_{x}T_\Theta(x^*)$ 1) (Jaffe et al., 2023).
For optimal control, JFB matches or outperforms AD and KKT-based methods in sample efficiency and memory, and remains feasible at state dimensions where other methods run out of memory (>1000 dimensions in multi-agent control) (Gelphman et al., 31 Jan 2026, Gelphman et al., 1 Oct 2025).
In combinatorial settings, DYS-JFB enables training on grid graphs with nearly $J_{x}T_\Theta(x^*)$ 2 variables (shortest-path) in $J_{x}T_\Theta(x^*)$ 3 day (vs. a week for alternative methods), and delivers the lowest normalized regret on path and knapsack benchmarks (McKenzie et al., 2023).

JFB’s memory remains constant in the number of fixed-point iterations or network depth; computation is dominated by one forward-mode and one reverse-mode autodiff pass (Liu et al., 2024, Fung et al., 2021).

6. Limitations and Considerations

Despite its practical advantages, JFB presents several theoretical and functional limitations:

Quality of Approximation: JFB is a first-order approximation; its accuracy is directly tied to the contractivity of the underlying operator. For maps with $J_{x}T_\Theta(x^*)$ 4 near $J_{x}T_\Theta(x^*)$ 5, the error may become significant, and solution quality may lag behind exact implicit differentiation, as observed in small PSNR/SSIM gaps in imaging applications (Liu et al., 2024, Davy et al., 16 Jun 2025).
Convergence Guarantees: Theoretical descent guarantees require the operator to be sufficiently contractive and the iteration to be well-conditioned. Less contractive maps or nonexpansive fixed-point operators may not guarantee alignment between the JFB direction and the true gradient (Davy et al., 16 Jun 2025, Fung et al., 2021).
Restricted Applicability: For some non-monotonic or highly nonconvex models, the omission of $J_{x}T_\Theta(x^*)$ 6 can induce bias that prevents convergence to the optimum; theory confirms only local descent directions, rather than global optimality or stationarity.
Trade-off Tuning: There is an explicit trade-off between computational speed and gradient accuracy; higher-order truncations of the Neumann expansion can be considered but reintroduce cost (Davy et al., 16 Jun 2025, Nguyen et al., 2023).

7. Extensions and Future Directions

Extensions of JFB include but are not limited to:

Restarted and Hybrid Schemes: Combining truncated unrolled iteration with JFB (e.g., the ReTune method) allows equilibrium approximation and gradient estimation to be balanced for higher accuracy, shrinking the gap to exact DEQ training (Davy et al., 16 Jun 2025).
Operator-Splitting Algorithms: JFB applies to a range of contractive fixed-point solvers, such as those in ADMM, forward-backward splitting, and Davis-Yin schemes for constrained and combinatorial optimization (McKenzie et al., 2023).
Variance-Bias Trade-off Strategies: In broader backpropagation-free or forward-mode gradient estimation, JFB-inspired approaches manipulate the structure of upstream Jacobians to yield low-variance, low-bias gradient proxies in settings where ordinary forward random projections fail to scale (Wang et al., 5 Nov 2025).
Forward-Gradient and One-Step Differentiation: The philosophy underlying JFB is closely connected to one-step (stop-gradient) differentiation, which generalizes to many iterative, bilevel, and control algorithms, conferring both computational benefits and theoretical clarity (Bolte et al., 2023, Baydin et al., 2022).

Ongoing research aims to relax contractivity requirements, analyze stochastic and mini-batch limits, and integrate adaptive or low-rank approximations to further bridge computational and statistical performance (Gelphman et al., 31 Jan 2026, Davy et al., 16 Jun 2025).

References:

(Liu et al., 2024): Training Implicit Networks for Image Deblurring using Jacobian-Free Backpropagation
(Jaffe et al., 2023): IDKM: Memory Efficient Neural Network Quantization via Implicit, Differentiable k-Means
(Davy et al., 16 Jun 2025): Restarted contractive operators to learn at equilibrium
(Bolte et al., 2023): One-step differentiation of iterative algorithms
(Fung et al., 2021): JFB: Jacobian-Free Backpropagation for Implicit Networks
(McKenzie et al., 2023): Differentiating Through Integer Linear Programs with Quadratic Regularization and Davis-Yin Splitting
(Gelphman et al., 1 Oct 2025): End-to-end Training of High-Dimensional Optimal Control with Implicit Hamiltonians via Jacobian-Free Backpropagation
(Gelphman et al., 31 Jan 2026): On the Convergence of Jacobian-Free Backpropagation for Optimal Control Problems with Implicit Hamiltonians
(Baydin et al., 2022): Gradients without Backpropagation
(Wang et al., 5 Nov 2025): Towards Scalable Backpropagation-Free Gradient Estimation