Truncated Back-propagation for Bilevel Optimization
- The paper introduces truncated back-propagation as a method to approximate hypergradients using partial inner-loop unrolling, reducing computational burden.
- It details two paradigms—iterative differentiation with trajectory truncation and truncated Neumann adjoint—highlighting trade-offs between bias, memory, and runtime.
- Empirical results demonstrate that moderate truncation yields near-exact performance across meta-learning, vision, and hyperparameter tasks.
Truncated back-propagation for bilevel optimization is a family of algorithmic and analytical techniques that address the computational bottleneck of differentiating through costly lower-level optimization procedures within hierarchical models. Rather than differentiating through every step of the lower-level solver—whether via reverse-mode automatic differentiation or implicit function methods—truncated back-propagation leverages partial or approximate trajectories along the inner dynamics to construct biased, but computationally efficient, estimators of the hypergradient. This approach enables scalable meta-optimization and hyperparameter learning in complex and large-scale machine learning tasks, often with mild trade-offs between memory usage, runtime and convergence accuracy.
1. Bilevel Problem Formulation and the Hypergradient
The bilevel optimization structure appears in diverse learning applications, formalized as

$$
\min_{\lambda}\; f\bigl(\lambda, w^*(\lambda)\bigr) + g(\lambda)
\qquad\text{s.t.}\qquad
w^*(\lambda) \in \arg\min_{w}\; h(\lambda, w),
$$

where $\lambda$ is the upper-level variable, $w$ is the lower-level variable, $g$ is a convex (possibly nonsmooth) regularizer, $f$ is the upper-level loss, and $h$ is the lower-level objective. The key computational step is evaluating the hypergradient of the smooth term,

$$
\nabla_\lambda\bigl[f(\lambda, w^*(\lambda))\bigr]
= \nabla_\lambda f + \Bigl(\tfrac{\partial w^*}{\partial \lambda}\Bigr)^{\!\top}\nabla_w f
= \nabla_\lambda f - \nabla^2_{\lambda w} h \,\bigl(\nabla^2_{ww} h\bigr)^{-1}\nabla_w f,
$$

evaluated at $(\lambda, w^*(\lambda))$. In high-dimensional settings, and when the lower-level solution is approximated by an iterative solver, exact hypergradient evaluation is computationally prohibitive, driving the development of truncated back-propagation methods (Shaban et al., 2018, Suonperä et al., 2022, Giovannelli et al., 2021).
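To make the formula concrete, the following minimal PyTorch sketch computes the implicit-function hypergradient for a toy strongly convex quadratic lower-level problem. The names (`lower_obj`, `upper_loss`) and the explicit Hessian solve are illustrative assumptions for a small example, not identifiers or procedures from the cited papers.

```python
import torch

torch.manual_seed(0)
d = 5

# Toy problem (illustrative, not from the cited papers):
#   lower level : h(lam, w) = 0.5 * w^T A w - lam^T w   ->  w*(lam) = A^{-1} lam
#   upper level : f(lam, w) = 0.5 * ||w - w_target||^2
A = torch.randn(d, d); A = A @ A.T + d * torch.eye(d)   # strongly convex LL
w_target = torch.randn(d)
lam = torch.randn(d, requires_grad=True)

def lower_obj(lam, w):
    return 0.5 * w @ A @ w - lam @ w

def upper_loss(lam, w):
    return 0.5 * torch.sum((w - w_target) ** 2)

# Exact lower-level solution for this quadratic (stands in for an inner solver).
w_star = torch.linalg.solve(A, lam.detach())
w_star.requires_grad_(True)

# Implicit-function hypergradient:
#   grad_lam F = grad_lam f - (grad^2_{lam w} h) [ (grad^2_{ww} h)^{-1} grad_w f ]
grad_w_f = torch.autograd.grad(upper_loss(lam, w_star), w_star)[0]

# Solve (grad^2_{ww} h) q = grad_w f with the explicit Hessian (fine at toy scale;
# larger problems would use conjugate gradients with Hessian-vector products).
H_ww = torch.autograd.functional.hessian(lambda w: lower_obj(lam.detach(), w), w_star.detach())
q = torch.linalg.solve(H_ww, grad_w_f)

# Cross term (grad^2_{lam w} h) q via a vector-Jacobian product through lam.
grad_w_h = torch.autograd.grad(lower_obj(lam, w_star), w_star, create_graph=True)[0]
cross = torch.autograd.grad(grad_w_h, lam, grad_outputs=q)[0]

hypergrad = -cross   # grad_lam f = 0 for this toy f; otherwise add it here
# Sanity check: for this quadratic the hypergradient equals A^{-1} (w* - w_target).
print(torch.allclose(hypergrad, torch.linalg.solve(A, w_star.detach() - w_target), atol=1e-5))
```

At realistic scale, the explicit Hessian and direct solve above are exactly what the truncated schemes discussed next are designed to avoid.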
2. Truncated Back-Propagation Algorithms
Truncated back-propagation proceeds by restricting sensitivity propagation to a finite number of inner-optimization steps, yielding an approximate hypergradient based on partial trajectory information. Two main algorithmic paradigms have emerged:
- Iterative Differentiation with Trajectory Truncation: Rather than unrolling the entire inner loop of $T$ updates $w_t = \Xi_t(w_{t-1}, \lambda)$ for reverse-mode differentiation, one truncates back-propagation to the last $K$ steps, forming the estimator
$$
h_K = \nabla_\lambda f(\lambda, w_T) + \sum_{t=T-K+1}^{T} B_t A_{t+1} \cdots A_T \,\nabla_w f(\lambda, w_T),
$$
where $A_t = \partial_w \Xi_t(w_{t-1}, \lambda)$ and $B_t = \partial_\lambda \Xi_t(w_{t-1}, \lambda)$ are Jacobians of the lower-level update mapping (Shaban et al., 2018). This reduces memory from $O(T)$ to $O(K)$ stored iterates, with bias decaying exponentially in $K$ (a minimal code sketch follows this list).
- Neumann Series Truncated Adjoint: The Hessian inversion in the implicit gradient formula is approximated using the Neumann series truncated to $K$ terms,
$$
\bigl(\nabla^2_{ww} h\bigr)^{-1} \approx \eta \sum_{j=0}^{K-1} \bigl(I - \eta\,\nabla^2_{ww} h\bigr)^{j},
$$
leading to an adjoint step that is both lower-complexity and tunable by $K$ for the bias/accuracy trade-off (Giovannelli et al., 2021, Suonperä et al., 2022); a Hessian-vector-product sketch of this truncation appears after the next paragraph.
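As a concrete illustration of the first paradigm, the sketch below performs $T$ gradient-descent steps on a toy quadratic lower-level objective but detaches the autograd graph $K$ steps before the end, so reverse mode only traverses the last $K$ updates. The helper names (`truncated_hypergrad`, `lower_grad`, `upper_loss`) are illustrative assumptions, not identifiers from the cited papers.

```python
import torch

def truncated_hypergrad(lam, w0, lower_grad, upper_loss, T, K, alpha):
    """K-step truncated reverse-mode hypergradient (illustrative sketch).

    lam        : upper-level variable with requires_grad=True
    w0         : initial lower-level iterate
    lower_grad : callable (lam, w) -> grad_w h(lam, w)
    upper_loss : callable (lam, w) -> f(lam, w)
    """
    w = w0
    for t in range(T):
        if t == T - K:
            w = w.detach()                      # cut the graph: keep only the last K steps
        w = w - alpha * lower_grad(lam, w)      # inner gradient-descent update
    return torch.autograd.grad(upper_loss(lam, w), lam)[0]

# Toy usage on a strongly convex quadratic lower level.
torch.manual_seed(0)
d = 5
A = torch.randn(d, d); A = A @ A.T + d * torch.eye(d)
w_target = torch.randn(d)
lam = torch.randn(d, requires_grad=True)
alpha = float(1.0 / torch.linalg.eigvalsh(A).max())     # step size below 1/L

lower_grad = lambda lam, w: A @ w - lam                  # grad_w of 0.5 w^T A w - lam^T w
upper_loss = lambda lam, w: 0.5 * torch.sum((w - w_target) ** 2)

g_trunc = truncated_hypergrad(lam, torch.zeros(d), lower_grad, upper_loss, T=100, K=10, alpha=alpha)
g_full  = truncated_hypergrad(lam, torch.zeros(d), lower_grad, upper_loss, T=100, K=100, alpha=alpha)
print(torch.norm(g_trunc - g_full))                      # small: bias decays exponentially in K
```

With $K \ll T$, the backward pass stores only the last $K$ iterates, matching the $O(K)$ memory figure above.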
Algorithmic schemes such as FEFB (Forward–Exact–Forward–Backward) and FIFB (Forward–Inexact–Forward–Backward) exemplify truncated back-propagation by executing only a single inner update and an adjoint step per outer update, with the adjoint solved either exactly or via a single gradient step, corresponding to truncating the Neumann expansion at zero or one term (Suonperä et al., 2022).
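The Neumann truncation of the adjoint solve can likewise be written in a few lines. The sketch below approximates $(\nabla^2_{ww} h)^{-1} v$ using only Hessian-vector products; the function name and the explicit test matrix are illustrative assumptions, and very small $K$ corresponds to the FEFB/FIFB-style one-step adjoint described above.

```python
import torch

def neumann_inverse_hvp(hvp, v, eta, K):
    """Approximate (grad^2_{ww} h)^{-1} v by the K-term Neumann series
       eta * sum_{j=0}^{K-1} (I - eta*H)^j v,
    using only Hessian-vector products (hvp). K=1 returns eta*v; keeping only
    zero or one extra term mirrors the FEFB/FIFB-style adjoint above."""
    p = v.clone()
    acc = v.clone()
    for _ in range(K - 1):
        p = p - eta * hvp(p)        # p <- (I - eta*H) p
        acc = acc + p
    return eta * acc

# Toy check against a direct solve; an explicit SPD matrix stands in for grad^2_{ww} h.
torch.manual_seed(0)
d = 5
H = torch.randn(d, d); H = H @ H.T + d * torch.eye(d)
v = torch.randn(d)
eta = float(1.0 / torch.linalg.eigvalsh(H).max())        # ensures the series converges

approx = neumann_inverse_hvp(lambda p: H @ p, v, eta, K=200)
print(torch.norm(approx - torch.linalg.solve(H, v)))      # shrinks as K grows
```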
3. Theoretical Guarantees and Bias–Accuracy Tradeoff
Truncated back-propagation introduces a bias in the hypergradient estimate, but under standard regularity—strong convexity and smoothness—a range of guarantees are established:
- Exponential Bias Decay: The deviation between the truncated and exact hypergradient decreases as $O\bigl((1 - \alpha\mu)^{K}\bigr)$, where $\alpha$ is the inner step size and $\mu$ the strong convexity constant of the lower-level problem (Shaban et al., 2018); a numerical illustration follows this list.
- Convergence Rates: Provided the truncation depth (or number of Neumann terms) is taken $K = O(\log(1/\epsilon))$, one obtains convergence to an $\epsilon$-stationary meta-parameter with either SGD-style or forward–backward splitting meta-updates (Shaban et al., 2018, Suonperä et al., 2022, Giovannelli et al., 2021).
- Linear Local Convergence: For proximal-splitting-based schemes (FEFB/FIFB), linear convergence to a local solution is proven under strong convexity and prox-contractivity conditions, with explicit contraction constants for both inner and adjoint updates (Suonperä et al., 2022).
- Control of Truncation Error: In truncated Neumann adjoint schemes, choosing the number of terms so that the adjoint residual is on the order of the meta-stepsize suffices for optimality, with asymptotic or near-optimal rates recovered in the strongly convex and convex regimes (Giovannelli et al., 2021).
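As a sanity check on the exponential bias decay, the following self-contained snippet considers a quadratic lower level solved by gradient descent, where the hypergradient reduces to a geometric sum; keeping only the first $K$ terms reproduces the $K$-step truncation, and the printed bias is the tail of that series. The setup is an illustrative assumption, not an experiment from the cited papers.

```python
import torch

torch.manual_seed(0)
d, T = 5, 200
A = torch.randn(d, d); A = A @ A.T + d * torch.eye(d)
alpha = float(1.0 / torch.linalg.eigvalsh(A).max())
g = torch.randn(d)                      # stands in for grad_w f at the final iterate

# For gradient descent on the quadratic LL, the hypergradient applies
#   alpha * sum_{t=0}^{T-1} (I - alpha*A)^t
# to grad_w f; K-step truncation keeps only the first K terms, so the bias is
# the tail of a geometric series and shrinks roughly like (1 - alpha*mu)^K.
M = torch.eye(d) - alpha * A
full = torch.zeros(d)
term = g.clone()
partials = []
for t in range(T):
    full = full + alpha * term
    partials.append(full.clone())
    term = M @ term

for K in (1, 5, 10, 20, 40):
    bias = torch.norm(partials[K - 1] - full)
    print(f"K={K:3d}  bias={bias:.3e}")
```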
Notably, in the absence of strong convexity or when the lower-level problem is nonconvex, the pessimistic trajectory truncation framework augments truncated back-propagation by selecting the worst-case lower-level iterate for differentiation, and jointly learning the initialization to ensure stationary-point coverage (Liu et al., 2023, Liu et al., 2021). This enables global convergence proofs under minimal regularity.
4. Computational Complexity and Practical Implementation
The major motivation for truncation is computational tractability in high-dimensional or large-scale bilevel models. The complexity regimes are as follows:
| Method | Time per UL iter. | Memory Usage | Bias Source |
|---|---|---|---|
| Full Reverse Unrolling | $O(T)$ inner gradients (forward + backward) | $O(Td)$ (full trajectory) | None |
| Implicit Function | One inner solve + one Hessian linear solve | $O(d)$ | Inner/KKT approx |
| Truncated RMD ($K$ steps) | $O(T)$ forward + $O(K)$ backward | $O(Kd)$ (last $K$ iterates) | Trajectory truncation |
| Neumann Series ($K$ terms) | $O(K)$ Hessian–vector products | $O(d)$ | Hessian truncation |
| FEFB/FIFB | One inner step + exact or one-step adjoint | $O(d)$ | One-step truncation |
| FG²U | $O(T)$ per sampled direction (parallelizable) | $O(d)$ per direction | Only variance |

Here $T$ is the number of inner iterations, $K$ the truncation or Neumann depth, and $d$ the lower-level dimension.
In practice, truncated back-propagation with moderate $K$ (5–10) yields comparable meta-optimization results to exact methods, reducing runtime and memory usage by orders of magnitude across hyperparameter learning, meta-learning, and vision tasks (Shaban et al., 2018, Suonperä et al., 2022). For large-scale deep learning, strict memory constraints favor directional-unrolling methods such as FG²U, which are strictly unbiased, embarrassingly parallel, and readily implementable in JAX/PyTorch (Shen et al., 20 Jun 2024).
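For context, the snippet below sketches the general forward-gradient idea underlying such directional-unrolling estimators: each random direction is pushed through the unrolled inner loop as a tangent, so no trajectory is stored, and averaging $(\nabla F \cdot v)\,v$ over Gaussian directions gives an unbiased estimate. This is a rough illustration on a toy quadratic under assumed names, not the FG²U algorithm of Shen et al.

```python
import torch

# Toy quadratic lower level, as in the earlier sketches.
torch.manual_seed(0)
d, T, n_dirs = 5, 100, 500
A = torch.randn(d, d); A = A @ A.T + d * torch.eye(d)
alpha = float(1.0 / torch.linalg.eigvalsh(A).max())
w_target = torch.randn(d)
lam = torch.randn(d)

def directional_hypergrad(v):
    """Directional derivative d/d(eps) of F(lam + eps*v) at eps=0, with the tangent
    u = (dw/dlam) v propagated forward alongside the inner iterates (no stored trajectory)."""
    w = torch.zeros(d)
    u = torch.zeros(d)
    for _ in range(T):
        w, u = w - alpha * (A @ w - lam), u - alpha * (A @ u - v)
    grad_w_f = w - w_target                 # grad_w of 0.5 * ||w - w_target||^2
    return grad_w_f @ u                     # plus grad_lam f @ v, which is zero here

# Unbiased Monte Carlo estimate: average (dF . v) v over random Gaussian directions.
est = torch.zeros(d)
for _ in range(n_dirs):
    v = torch.randn(d)
    est = est + directional_hypergrad(v) * v
print(est / n_dirs)
```

Each direction is independent, which is what makes this style of estimator embarrassingly parallel at the cost of Monte Carlo variance.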
5. Extensions to Nonconvexity, Initialization Auxiliary, and Trajectory Selection
Standard truncated back-propagation assumes strong convexity in the lower-level problem. To relax this, recent frameworks augment the inner trajectory by:
- Initialization Auxiliary (IA): Jointly learning the starting point for the inner solver (rather than a fixed initialization), yielding tighter LL residuals and enhanced global convergence, especially when the LL is nonconvex (Liu et al., 2021, Liu et al., 2023).
- Pessimistic Trajectory Truncation (PTT): Selecting the iterate along the inner loop that maximizes the upper-level objective, and differentiating only through the last $K$ steps (a code sketch follows this list), which mitigates the harm of poor early iterates and avoids oscillatory outer updates in highly nonconvex settings (Liu et al., 2023, Liu et al., 2021).
- Accelerated Inner Dynamics: Adopting Nesterov acceleration for convex LL, which accelerates the ergodic gap decay from $O(1/k)$ to $O(1/k^2)$, and specialized iterative mappings for nonconvex LL (Liu et al., 2023).
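A minimal sketch of the pessimistic selection step is given below: the last $K$ iterates stay on the autograd graph, and the hypergradient is taken through the iterate with the largest upper-level loss. The function names and the plain gradient-descent inner loop are illustrative assumptions; see Liu et al. for the precise scheme.

```python
import torch

def ptt_hypergrad(lam, w0, lower_grad, upper_loss, T, K, alpha):
    """Pessimistic trajectory truncation (illustrative sketch): keep the last K inner
    iterates on the graph and differentiate through the worst one for the UL loss."""
    w = w0
    tail = []
    for t in range(T):
        if t == T - K:
            w = w.detach()                  # truncate back-propagation to the last K steps
        w = w - alpha * lower_grad(lam, w)
        if t >= T - K:
            tail.append(w)
    losses = torch.stack([upper_loss(lam, wt) for wt in tail])
    worst = int(losses.argmax())            # pessimistic choice of iterate
    return torch.autograd.grad(losses[worst], lam)[0]
```

In the cited frameworks this selection is combined with a learned initialization `w0` (the IA component) rather than a fixed starting point.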
These strategies collectively ensure asymptotic consistency of outer solutions for nonconvex and challenging LL landscapes, extending the practical applicability of truncated back-propagation.
6. Empirical Performance and Practical Recommendations
Extensive empirical evaluation across meta-learning (Omniglot, CIFAR), vision (TV denoising, blind deconvolution), and hypercleaning tasks demonstrates that truncated back-propagation achieves near-exact performance with much lower compute/memory requirements (Shaban et al., 2018, Suonperä et al., 2022, Shen et al., 20 Jun 2024).
Practical recommendations include:
- Choosing the truncation window $K$ via exponential bias decay heuristics, increasing $K$ only when the hypergradient error stalls.
- Monitoring the cosine similarity of the truncated hypergradient to the full one, to ensure it remains a descent direction (a helper sketch follows this list).
- Employing a small $K$, which captures most of the bias reduction at minimal cost.
- For the Neumann adjoint, a number of terms of order $\log(1/\epsilon)$ (as in the convergence guarantees above) is sufficient for convergence.
- For FG²U, balancing the number of sampled directions to control variance while leveraging parallel hardware.
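The descent-direction check mentioned above is a one-liner in practice; the helper below (an illustrative name, not from any cited codebase) flags when the truncated hypergradient stops being positively aligned with a reference full-unroll estimate.

```python
import torch
import torch.nn.functional as F

def aligned_with_full(g_trunc: torch.Tensor, g_full: torch.Tensor, tol: float = 0.0) -> bool:
    """True if the truncated hypergradient is still descent-compatible,
    i.e. its cosine similarity with the reference (full) hypergradient exceeds tol."""
    cos = F.cosine_similarity(g_trunc.flatten(), g_full.flatten(), dim=0)
    return bool(cos > tol)
```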
The practical tractability and flexibility of truncated back-propagation have made it a foundational tool for bilevel optimization in modern large-scale machine learning (Shaban et al., 2018, Suonperä et al., 2022, Shen et al., 20 Jun 2024, Giovannelli et al., 2021, Liu et al., 2023, Liu et al., 2021).