Anderson-Accelerated Iterations

Updated 25 December 2025
  • Anderson-accelerated iterations are a fixed-point extrapolation method that blends previous iterates and residuals to accelerate convergence.
  • They effectively boost the linear convergence of iterative maps and, in linear cases, relate to restarted GMRES with near-optimal reduction in error.
  • Variants incorporating damping, regularization, and preconditioning extend AA’s robust application in optimization, PDEs, machine learning, and simulation.

Anderson acceleration (AA), introduced by D. G. Anderson in 1965, is a fixed-point extrapolation technique that accelerates the convergence of iterative maps by combining information from several previous iterates and residuals. The method underpins modern strategies for nonlinear solver acceleration and is used in diverse domains including numerical PDEs, optimization, machine learning, and scientific computing. AA generalizes classical scalar sequence acceleration, such as Aitken’s Δ² process, to vector-valued fixed-point problems. It is especially effective at boosting the convergence of linearly (contractively) convergent fixed-point methods, but generally not for quadratically convergent (Newton-like) iterations.

1. Mathematical Formulation and Algorithmic Structure

Given a fixed-point iteration $x_{k+1} = F(x_k)$ in $\mathbb{R}^n$, Anderson acceleration of depth $m$ stores the last $m+1$ iterates and their residuals $r_i = F(x_i) - x_i$. At iteration $k$, AA constructs the new iterate as an affine combination of $F(x_{k-m}),\dots,F(x_k)$, choosing coefficients $\{\alpha_i\}_{i=0}^m$ to minimize the norm of the weighted residual sum subject to $\sum_{i=0}^m \alpha_i = 1$:
$$\alpha^{k+1} = \arg\min_{\alpha\in\mathbb{R}^{m+1},\ \sum_i \alpha_i = 1} \|\Delta_k \alpha\|_2,$$
where $\Delta_k = [r_{k-m},\dots,r_k]\in\mathbb{R}^{n\times (m+1)}$. The next iterate is set as
$$x_{k+1} = \sum_{i=0}^m \alpha_i^{k+1}\, F(x_{k-m+i}).$$
This minimization is a constrained least-squares problem and is often regularized (Tikhonov or damping), particularly in ill-conditioned settings:
$$\alpha^{k+1} = \frac{(\Delta_k^\top\Delta_k + \lambda I)^{-1}\mathbf{1}}{\mathbf{1}^\top(\Delta_k^\top\Delta_k + \lambda I)^{-1}\mathbf{1}},$$
with an optional damping parameter $\beta\in(0,1]$, so that $x_{k+1} = \beta\sum_i \alpha_i F(x_{k-m+i}) + (1-\beta)\,x_k$ (Geist et al., 2018, Saad, 15 Jul 2025).

The update step requires solving only an $(m+1)\times(m+1)$ linear system, making it computationally light for moderate $m$ (usually $m \lesssim 10$).
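As a concrete illustration of the scheme above, here is a minimal Python sketch (the name `anderson_accelerate`, its defaults, and the stopping rule are this summary's choices, not taken from the cited papers). It keeps the last $m+1$ iterate/image pairs, solves the regularized $(m+1)\times(m+1)$ system for the mixing weights, and applies the damped update.

```python
# A minimal sketch of Anderson acceleration of depth m for a fixed-point map F.
# The function name and defaults are illustrative, not taken from the cited papers.
import numpy as np

def anderson_accelerate(F, x0, m=5, lam=1e-8, beta=1.0, tol=1e-10, max_iter=200):
    x = np.asarray(x0, dtype=float)
    X_hist, FX_hist = [], []                       # last m+1 iterates x_i and images F(x_i)
    for k in range(max_iter):
        Fx = F(x)
        r = Fx - x                                 # residual r_k = F(x_k) - x_k
        if np.linalg.norm(r) < tol:
            break
        X_hist.append(x)
        FX_hist.append(Fx)
        if len(X_hist) > m + 1:                    # keep only the last m+1 pairs
            X_hist.pop(0)
            FX_hist.pop(0)
        # Residual matrix Delta_k = [r_{k-m}, ..., r_k]  (n x p, with p <= m+1)
        Delta = np.column_stack([fx - xi for xi, fx in zip(X_hist, FX_hist)])
        p = Delta.shape[1]
        ones = np.ones(p)
        # alpha = (Delta^T Delta + lam I)^{-1} 1  /  1^T (Delta^T Delta + lam I)^{-1} 1
        y = np.linalg.solve(Delta.T @ Delta + lam * np.eye(p), ones)
        alpha = y / (ones @ y)
        # Damped mixing of the stored images F(x_i), as in the update formula above
        x = beta * (np.column_stack(FX_hist) @ alpha) + (1.0 - beta) * x
    return x
```

As a quick smoke test, `anderson_accelerate(np.cos, np.array([0.0]))` should return the fixed point of the cosine map (≈ 0.739) in only a few iterations, far fewer than plain fixed-point iteration.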

2. Convergence Theory and Optimality Mechanisms

If $F$ is a contraction, i.e. $\|F(x)-F(y)\| \leq \gamma \|x-y\|$ with $\gamma<1$, AA guarantees global convergence to the unique fixed point. In the linear case $F(x)=Ax+b$, Anderson acceleration with full history is equivalent to restarted GMRES, which minimizes the residual norm over the Krylov subspace generated by the recent residuals (Pollock et al., 2018, Saad, 15 Jul 2025). This equivalence yields near-optimal convergence rates, including Chebyshev-optimal bounds for symmetric positive definite operators:
$$\|r^{AA}_{k+1}\|_2 \leq 2\,\|I-\beta A\|_2 \left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^{k} \|r_0\|_2,$$
where $\kappa(A)$ is the condition number (Tang et al., 22 Mar 2024).
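To illustrate the bound numerically (an illustrative calculation, assuming the un-accelerated map is a damped Richardson iteration for an SPD system, so that $I-\beta A$ is its error propagator): for $\kappa(A)=100$,
$$\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}=\frac{9}{11}\approx 0.82
\qquad\text{versus}\qquad
\min_{\beta}\,\rho(I-\beta A)=\frac{\kappa(A)-1}{\kappa(A)+1}=\frac{99}{101}\approx 0.98,$$
so the accelerated iteration needs roughly an order of magnitude fewer steps per digit of accuracy than the optimally damped unaccelerated one.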

For general nonlinear contractive $F$, AA preserves linear (R-linear) convergence, with the effective contraction rate at each step reduced by the gain $\theta_k = \|\sum_j \alpha_j r_j\| / \|r_k\| < 1$:
$$\|w_{k+1}\| \leq \theta_k\bigl[(1-\beta_{k-1})+\beta_{k-1}\kappa\bigr]\,\|w_k\| + O\!\left(\|w_k\|^2 + \|w_{k-1}\|^2\right).$$
This theoretical result establishes that AA provably improves the linear rate by the gain at each step and, combined with moderate damping, can also enlarge the domain of attraction for noncontractive problems (Evans et al., 2018, Pollock et al., 2018).
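As a worked instance of this bound (illustrative numbers): with contraction constant $\kappa = 0.9$, no damping ($\beta_{k-1}=1$), and an observed gain $\theta_k = 0.5$, the leading-order term gives
$$\|w_{k+1}\|\ \lesssim\ \theta_k\,\kappa\,\|w_k\| = 0.5\cdot 0.9\,\|w_k\| = 0.45\,\|w_k\|,$$
halving the contraction factor of the unaccelerated iteration at that step.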

In contrast, for quadratically convergent (Newton-like) iterations, AA’s higher-order terms typically degrade quadratic convergence to linear convergence, which explains why AA offers no benefit in those settings (Evans et al., 2018).

3. Robustness, Practical Implementation, and Variants

Anderson acceleration is robust under modest regularization and damping. In stiff or noncontractive problems, resetting the history (adaptive restart) and regularizing the least-squares step prevent divergence. For ill-conditioned systems, a Tikhonov regularizer $\lambda I$ stabilizes the Gram matrix $\Delta_k^\top\Delta_k$ of the least-squares problem.
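A minimal sketch of these safeguards (hypothetical helper functions, not taken from the cited papers): Tikhonov-regularized mixing weights and an adaptive-restart test based on the gain $\theta_k$ from Section 2.

```python
# Illustrative safeguards for an AA driver: regularized weights plus a restart test.
import numpy as np

def regularized_weights(Delta, lam=1e-6):
    # Constrained least-squares weights with Tikhonov term lam*I stabilizing Delta^T Delta
    p = Delta.shape[1]
    ones = np.ones(p)
    y = np.linalg.solve(Delta.T @ Delta + lam * np.eye(p), ones)
    return y / (ones @ y)

def should_restart(Delta, alpha, r_k, max_gain=0.999, max_coeff=1e3):
    # Clear the stored history when the AA step stops reducing the residual
    # (gain theta_k = ||Delta alpha|| / ||r_k|| near or above 1) or the weights blow up.
    gain = np.linalg.norm(Delta @ alpha) / np.linalg.norm(r_k)
    return gain >= max_gain or np.linalg.norm(alpha, 1) > max_coeff
```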

Variants of the basic scheme include:

  • Preconditioned AA (PAA): Inserting a preconditioner $P_k$ into the residuals improves convergence; full-Jacobian PAA recovers Newton’s method, while diagonal or block-diagonal approximations balance cost and speedup (Chen et al., 2023). A minimal sketch follows this list.
  • Low-synchronization and truncated orthogonalization: To reduce parallel communication and memory, the history of residual differences can be maintained with low-synchronization QR kernels or by Truncated Gram-Schmidt (AATGS) with minimal additional cost, yielding three-term recurrences for symmetric linear systems (Lockhart et al., 2021, Tang et al., 22 Mar 2024).
  • Approximate and reduced AA: Accuracy-tuned approximations to the least-squares step and dimension reduction via random sketching or row sampling enable application in extreme-scale contexts while preserving convergence guarantees (Pasini et al., 2022).
  • Norm modification: For operator spectra with challenging mode structure, AA steps computed in a weighted Sobolev norm (e.g., $\mathcal{H}^{-2}$ for elliptic operators) can yield superior convergence over the standard $L^2$ norm (Yang et al., 2020).
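As referenced in the preconditioned-AA item above, here is a minimal sketch of one common way to realize the idea (an illustrative construction based on the description above, not necessarily the exact formulation of Chen et al., 2023): wrap the residual with a user-supplied operator `apply_P` approximating an inverse-Jacobian action, then accelerate the wrapped map, e.g. with the `anderson_accelerate` sketch from Section 1.

```python
# Illustrative preconditioned fixed-point wrapper; F and apply_P are supplied by the caller.
def preconditioned_map(F, apply_P):
    def G(x):
        r = F(x) - x             # original residual r(x) = F(x) - x
        return x + apply_P(r)    # with apply_P ~ (I - F'(x))^{-1}, this is a Newton step for r(x) = 0
    return G

# Usage sketch:  x_star = anderson_accelerate(preconditioned_map(F, apply_P), x0, m=5)
```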

4. Applications and Impact

Anderson acceleration is employed in a broad array of domains:

  • Reinforcement Learning: Applied to value iteration for Markov Decision Processes (MDPs), AA accelerates the Bellman fixed-point iteration, with empirical results showing 2–4× reductions in value-function error at a fixed iteration budget. Integration with deep RL (e.g., DQN) applies standard AA to target-network updates for improved sample efficiency (Geist et al., 2018). A toy sketch follows this list.
  • Computer Graphics and Simulation: In geometry optimization/physics simulation, local-global solvers and projective dynamics benefit from AA, achieving 3–10× reductions in iteration count and 2–4× savings in wall-clock time, with monotonic energy safeguards guaranteeing global convergence (Peng et al., 2018).
  • Clustering and Statistical Estimation: Lloyd’s k-Means algorithm, interpreted as a fixed-point map, is greatly accelerated by AA; with dynamic depth adjustment, it yields consistent 20–50% speedups across real and synthetic datasets (Zhang et al., 2018).
  • Nonsmooth Composite Optimization: AA establishes local R-linear convergence in nonsmooth problems characterized by active manifold identification, such as proximal- or reweighted 1\ell_1 methods, Douglas-Rachford splitting, and coordinate descent for SVMs, without requiring KL-type assumptions (Li et al., 12 Oct 2024, Li, 12 Mar 2024).
  • PDEs and Scientific Computing: AA is systematically used to accelerate iterative solvers for nonlinear PDEs (Bratu, Navier-Stokes), transport equations, and large-scale seismic inversion, sometimes enabling convergence in cases where Newton or Picard fail or are too slow (Pollock et al., 2018, Yang, 2020).
  • Optimization and Machine Learning: Embedded into gradient descent, energy-adaptive gradient methods (AEGD), and even classical linear SVM training, AA enables reduced iteration counts (typically by factors of 2–5), with robust behavior in both convex and nonconvex landscapes (Liu et al., 2022, Ali et al., 2023, Ouyang et al., 2022).
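As mentioned in the reinforcement-learning item above, value iteration is itself a fixed-point iteration on the optimal Bellman operator (a $\gamma$-contraction in the sup norm), so it can be handed directly to an AA driver. The toy example below uses made-up random MDP data and assumes the `anderson_accelerate` sketch from Section 1 is in scope.

```python
# Toy AA-accelerated value iteration on a random MDP (illustrative data, not from the cited papers).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 30, 4, 0.95
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)     # P[a, s, s'] : transition probabilities
R = rng.random((nA, nS))              # R[a, s]     : immediate rewards

def bellman(V):
    # Optimal Bellman operator: (T V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]
    return np.max(R + gamma * (P @ V), axis=0)

V_star = anderson_accelerate(bellman, np.zeros(nS), m=5)   # assumes the Section 1 sketch is defined
```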

5. Algorithmic Parameters and Recommendations

The practical performance of AA depends on several tunable parameters (an illustrative default configuration follows the list):

  • Window/depth $m$: Values in the range $3 \leq m \leq 10$ typically yield substantial speedups without ill-conditioning; in highly symmetric problems (e.g., SPD linear systems), $m \approx 3$ suffices when using truncated Gram-Schmidt.
  • Regularization $\lambda$: Small values ($10^{-6}$ to $10^{-3}$) prevent ill-conditioning of the least-squares problem for the mixing weights.
  • Damping $\beta$: Full mixing ($\beta = 1$) is effective for well-behaved systems; $\beta \in [0.5, 1]$ improves stability in more challenging or noncontractive settings.
  • Restart/Reset: To avoid loss of effectiveness or instability, periodically reset the stored history, especially if the gain $\theta_k$ approaches or exceeds $1$ or if the coefficients grow unreasonably.
  • Computational Overhead: Each AA step requires at most $\mathcal{O}(n m^2 + m^3)$ flops, negligible for small $m$ compared to the primary operator evaluation.
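Collected in one place, the dictionary below is an illustrative starting configuration reflecting the recommendations above (the names and exact values are this summary's, not prescriptions from the cited papers).

```python
# Illustrative AA defaults (hypothetical names/values) mirroring the recommendations above.
AA_DEFAULTS = {
    "m": 5,                 # history depth; 3-10 typical, m ~ 3 often suffices for SPD systems
    "lam": 1e-6,            # Tikhonov regularization for the mixing-weight least-squares problem
    "beta": 1.0,            # damping; move toward 0.5 for noncontractive or stiff problems
    "restart_gain": 0.999,  # reset history if the gain theta_k approaches or exceeds 1
    "max_coeff": 1e3,       # reset if the mixing coefficients grow unreasonably large
}
```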

6. Performance, Extensions, and Quantitative Results

Extensive empirical results validate the broad applicability and efficiency gain of Anderson acceleration:

  • For value iteration in MDPs: with $m=5$, the normalized $\ell_1$-error at iteration $k=50$ is reduced by a factor of roughly 3× compared to vanilla value iteration ($2.5\times 10^{-1}$ vs. $8\times 10^{-2}$) (Geist et al., 2018).
  • For geometry optimization and simulation: AA reduces iteration counts by 3–10× and wall-clock time by 2–4× on standard local-global solvers (Peng et al., 2018).
  • For Lloyd’s k-Means: AA reduces total CPU time by over 33% averaged across 120 test cases, with dynamic $m$ yielding 20–30% additional time reductions compared to static $m$ (Zhang et al., 2018).
  • For iteratively reweighted $\ell_1$ methods: AA achieves R-linear convergence in theory and a 2× speedup over Nesterov-accelerated IRL1 in practice, without KL assumptions (Li, 12 Mar 2024).
  • In seismic inversion, AA yields 3–5× faster convergence than steepest descent and is competitive with L-BFGS and GMRES at equivalent memory cost (Yang, 2020).

AA is compatible with preconditioning, composite-acceleration (AA inside quasi-Newton), and stochastic settings when equipped with regularization and robust updating (e.g., moving-average smoothing in deep learning to cope with noise) (Pasini et al., 2021).

Anderson acceleration generalizes simple mixing, multisecant quasi-Newton methods, and Pulay/DIIS (direct inversion in the iterative subspace) schemes. In the linear, full-history limit, AA is algebraically equivalent to restarted or truncated GMRES and, in the SPD setting, to polynomial accelerators (e.g., Chebyshev iteration). The key theoretical distinction is that AA constructs an implicit low-memory multisecant approximation of the inverse Jacobian, enforcing all secant conditions in a single linear least-squares projection, rather than sequentially as in L-BFGS or classical Broyden updates (Saad, 15 Jul 2025, Peng et al., 2018, Chen et al., 2023).
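To make the multisecant view concrete (a standard reformulation stated here for the undamped, unregularized case; the difference matrices $\mathcal{X}_k$, $\mathcal{R}_k$ and the operator $G_k$ are notation introduced for this sketch), collect the most recent iterate and residual differences and write the AA update as a quasi-Newton step:
$$\mathcal{X}_k = [\,x_{k-m+1}-x_{k-m},\ \dots,\ x_k-x_{k-1}\,],\qquad \mathcal{R}_k = [\,r_{k-m+1}-r_{k-m},\ \dots,\ r_k-r_{k-1}\,],$$
$$x_{k+1} = x_k - G_k\,r_k,\qquad G_k = -I + (\mathcal{X}_k + \mathcal{R}_k)\,(\mathcal{R}_k^\top \mathcal{R}_k)^{-1}\mathcal{R}_k^\top.$$
A direct computation gives $G_k\mathcal{R}_k = \mathcal{X}_k$, i.e., $G_k$ satisfies all $m$ secant conditions at once and acts as an approximate inverse Jacobian of the residual map $r(x)=F(x)-x$ on the span of the stored differences, whereas Broyden-type or L-BFGS updates enforce one secant condition per step.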

AA is also amenable to norm modification for nonstandard spectral distributions; e.g., the use of the $\mathcal{H}^{-2}$ norm in elliptic PDEs focuses the acceleration on slow, smooth modes (Yang et al., 2020).

In summary, Anderson-accelerated iterations provide a simple, highly general, and powerful device for boosting convergence rates throughout numerical analysis and machine learning. The method’s low overhead, ease of implementation, and strong theoretical guarantees in the contractive regime (and, with safeguards, beyond) make it a default choice for accelerating black-box or operator-based fixed-point solvers (Saad, 15 Jul 2025, Evans et al., 2018, Li et al., 12 Oct 2024, Geist et al., 2018).
