Anderson-Accelerated Iterations

Updated 25 December 2025
  • Anderson-accelerated iterations are a fixed-point extrapolation method that blends previous iterates and residuals to accelerate convergence.
  • They effectively boost the linear convergence of iterative maps and, in linear cases, relate to restarted GMRES with near-optimal reduction in error.
  • Variants incorporating damping, regularization, and preconditioning extend AA’s robust application in optimization, PDEs, machine learning, and simulation.

Anderson acceleration (AA), introduced by D. G. Anderson in 1965, is a fixed-point extrapolation technique that accelerates the convergence of iterative maps by combining information from several previous iterates and residuals. The method underpins modern strategies for nonlinear solver acceleration and is used in diverse domains including numerical PDEs, optimization, machine learning, and scientific computing. AA generalizes classical scalar sequence acceleration, such as Aitken’s Δ² process, to vector-valued fixed-point problems. It is especially effective at boosting the convergence of linearly (contractively) convergent fixed-point methods, but generally not for quadratically convergent (Newton-like) iterations.

1. Mathematical Formulation and Algorithmic Structure

Given a fixed-point iteration $x_{k+1} = F(x_k)$ in $\mathbb{R}^n$, Anderson acceleration of depth $m$ stores the last $m+1$ iterates and their residuals $r_i = F(x_i) - x_i$. At iteration $k$, AA constructs the new iterate as an affine combination of $F(x_{k-m}),\dots,F(x_k)$, choosing coefficients $\{\alpha_i\}_{i=0}^m$ to minimize the norm of the weighted residual sum subject to $\sum_{i=0}^m \alpha_i = 1$:
$$\alpha^{k+1} = \arg\min_{\alpha\in\mathbb{R}^{m+1},\ \sum_i \alpha_i = 1} \|\Delta_k \alpha\|_2,$$
where $\Delta_k = [r_{k-m},\dots,r_k]\in\mathbb{R}^{n\times (m+1)}$. The next iterate is set as
$$x_{k+1} = \sum_{i=0}^m \alpha_i^{k+1}\, F(x_{k-m+i}).$$
This minimization is a constrained least-squares problem and is often regularized (Tikhonov or damping), particularly in ill-conditioned settings:
$$\alpha^{k+1} = \frac{(\Delta_k^\top\Delta_k + \lambda I)^{-1}\mathbf{1}}{\mathbf{1}^\top(\Delta_k^\top\Delta_k + \lambda I)^{-1}\mathbf{1}},$$
with an optional damping parameter $\beta\in(0,1]$, so that $x_{k+1} = \beta\sum_i \alpha_i F(x_{k-m+i}) + (1-\beta)\,x_k$ (Geist et al., 2018, Saad, 15 Jul 2025).

The update step requires solving only an $(m+1)\times(m+1)$ linear system, making it computationally light for moderate $m$ (usually $m \lesssim 10$).
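As a concrete illustration of the scheme above, here is a minimal Python sketch (the name `anderson_accelerate`, its defaults, and the stopping rule are this summary's choices, not taken from the cited papers). It keeps the last $m+1$ iterate/image pairs, solves the regularized $(m+1)\times(m+1)$ system for the mixing weights, and applies the damped update.

```python
# A minimal sketch of Anderson acceleration of depth m for a fixed-point map F.
# The function name and defaults are illustrative, not taken from the cited papers.
import numpy as np

def anderson_accelerate(F, x0, m=5, lam=1e-8, beta=1.0, tol=1e-10, max_iter=200):
    x = np.asarray(x0, dtype=float)
    X_hist, FX_hist = [], []                       # last m+1 iterates x_i and images F(x_i)
    for k in range(max_iter):
        Fx = F(x)
        r = Fx - x                                 # residual r_k = F(x_k) - x_k
        if np.linalg.norm(r) < tol:
            break
        X_hist.append(x)
        FX_hist.append(Fx)
        if len(X_hist) > m + 1:                    # keep only the last m+1 pairs
            X_hist.pop(0)
            FX_hist.pop(0)
        # Residual matrix Delta_k = [r_{k-m}, ..., r_k]  (n x p, with p <= m+1)
        Delta = np.column_stack([fx - xi for xi, fx in zip(X_hist, FX_hist)])
        p = Delta.shape[1]
        ones = np.ones(p)
        # alpha = (Delta^T Delta + lam I)^{-1} 1  /  1^T (Delta^T Delta + lam I)^{-1} 1
        y = np.linalg.solve(Delta.T @ Delta + lam * np.eye(p), ones)
        alpha = y / (ones @ y)
        # Damped mixing of the stored images F(x_i), as in the update formula above
        x = beta * (np.column_stack(FX_hist) @ alpha) + (1.0 - beta) * x
    return x
```

As a quick smoke test, `anderson_accelerate(np.cos, np.array([0.0]))` should return the fixed point of the cosine map (≈ 0.739) in only a few iterations, far fewer than plain fixed-point iteration.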

2. Convergence Theory and Optimality Mechanisms

If $F$ is a contraction, i.e. $\|F(x)-F(y)\| \leq \gamma \|x-y\|$ with $\gamma<1$, AA guarantees global convergence to the unique fixed point. In the linear case $F(x)=Ax+b$, Anderson acceleration with full history is equivalent to restarted GMRES, which minimizes the residual norm over the Krylov subspace generated by the recent residuals (Pollock et al., 2018, Saad, 15 Jul 2025). This equivalence yields near-optimal convergence rates, including Chebyshev-optimal bounds for symmetric positive definite operators:
$$\|r^{AA}_{k+1}\|_2 \leq 2\,\|I-\beta A\|_2 \left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^{k} \|r_0\|_2,$$
where $\kappa(A)$ is the condition number (Tang et al., 22 Mar 2024).
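To illustrate the bound numerically (an illustrative calculation, assuming the un-accelerated map is a damped Richardson iteration for an SPD system, so that $I-\beta A$ is its error propagator): for $\kappa(A)=100$,
$$\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}=\frac{9}{11}\approx 0.82
\qquad\text{versus}\qquad
\min_{\beta}\,\rho(I-\beta A)=\frac{\kappa(A)-1}{\kappa(A)+1}=\frac{99}{101}\approx 0.98,$$
so the accelerated iteration needs roughly an order of magnitude fewer steps per digit of accuracy than the optimally damped unaccelerated one.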

For general nonlinear contractive $F$, AA preserves linear (R-linear) convergence, with the effective contraction rate at each step reduced by the gain $\theta_k = \|\sum_j \alpha_j r_j\| / \|r_k\| < 1$:
$$\|w_{k+1}\| \leq \theta_k\bigl[(1-\beta_{k-1})+\beta_{k-1}\kappa\bigr]\,\|w_k\| + O\!\left(\|w_k\|^2 + \|w_{k-1}\|^2\right).$$
This theoretical result establishes that AA provably improves the linear rate by the gain at each step and, combined with moderate damping, can also enlarge the domain of attraction for noncontractive problems (Evans et al., 2018, Pollock et al., 2018).
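As a worked instance of this bound (illustrative numbers): with contraction constant $\kappa = 0.9$, no damping ($\beta_{k-1}=1$), and an observed gain $\theta_k = 0.5$, the leading-order term gives
$$\|w_{k+1}\|\ \lesssim\ \theta_k\,\kappa\,\|w_k\| = 0.5\cdot 0.9\,\|w_k\| = 0.45\,\|w_k\|,$$
halving the contraction factor of the unaccelerated iteration at that step.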

In contrast, for quadratically convergent (Newton-like) iterations, AA’s higher-order terms typically degrade quadratic convergence to linear convergence, which explains why AA offers no benefit in those settings (Evans et al., 2018).

3. Robustness, Practical Implementation, and Variants

Anderson acceleration is robust under modest regularization and damping. In stiff or noncontractive problems, resetting the history (adaptive restart) and regularizing the least-squares step prevent divergence. For ill-conditioned systems, a Tikhonov regularizer $\lambda I$ stabilizes the Gram matrix $\Delta_k^\top\Delta_k$ of the least-squares problem.
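A minimal sketch of these safeguards (hypothetical helper functions, not taken from the cited papers): Tikhonov-regularized mixing weights and an adaptive-restart test based on the gain $\theta_k$ from Section 2.

```python
# Illustrative safeguards for an AA driver: regularized weights plus a restart test.
import numpy as np

def regularized_weights(Delta, lam=1e-6):
    # Constrained least-squares weights with Tikhonov term lam*I stabilizing Delta^T Delta
    p = Delta.shape[1]
    ones = np.ones(p)
    y = np.linalg.solve(Delta.T @ Delta + lam * np.eye(p), ones)
    return y / (ones @ y)

def should_restart(Delta, alpha, r_k, max_gain=0.999, max_coeff=1e3):
    # Clear the stored history when the AA step stops reducing the residual
    # (gain theta_k = ||Delta alpha|| / ||r_k|| near or above 1) or the weights blow up.
    gain = np.linalg.norm(Delta @ alpha) / np.linalg.norm(r_k)
    return gain >= max_gain or np.linalg.norm(alpha, 1) > max_coeff
```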

Variants of the basic scheme include:

  • Preconditioned AA (PAA): Inserting a preconditioner $P_k$ into the residuals improves convergence; full-Jacobian PAA recovers Newton’s method, while diagonal or block-diagonal approximations balance cost and speedup (Chen et al., 2023). A minimal sketch follows this list.
  • Low-synchronization and truncated orthogonalization: To reduce parallel communication and memory, the history of residual differences can be maintained with low-synchronization QR kernels or by Truncated Gram-Schmidt (AATGS) with minimal additional cost, yielding three-term recurrences for symmetric linear systems (Lockhart et al., 2021, Tang et al., 22 Mar 2024).
  • Approximate and reduced AA: Accuracy-tuned approximations to the least-squares step and dimension reduction via random sketching or row sampling enable application in extreme-scale contexts while preserving convergence guarantees (Pasini et al., 2022).
  • Norm modification: For operator spectra with challenging mode structure, AA steps computed in a weighted Sobolev norm (e.g., $\mathcal{H}^{-2}$ for elliptic operators) can yield superior convergence over the standard $L^2$ norm (Yang et al., 2020).
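As referenced in the preconditioned-AA item above, here is a minimal sketch of one common way to realize the idea (an illustrative construction based on the description above, not necessarily the exact formulation of Chen et al., 2023): wrap the residual with a user-supplied operator `apply_P` approximating an inverse-Jacobian action, then accelerate the wrapped map, e.g. with the `anderson_accelerate` sketch from Section 1.

```python
# Illustrative preconditioned fixed-point wrapper; F and apply_P are supplied by the caller.
def preconditioned_map(F, apply_P):
    def G(x):
        r = F(x) - x             # original residual r(x) = F(x) - x
        return x + apply_P(r)    # with apply_P ~ (I - F'(x))^{-1}, this is a Newton step for r(x) = 0
    return G

# Usage sketch:  x_star = anderson_accelerate(preconditioned_map(F, apply_P), x0, m=5)
```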

4. Applications and Impact

Anderson acceleration is employed in a broad array of domains:

  • Reinforcement Learning: Applied to value iteration for Markov Decision Processes (MDPs), AA accelerates the Bellman fixed-point iteration, with empirical results showing 2–4× reductions in value-function error at a fixed iteration budget. Integration with deep RL (e.g., DQN) applies standard AA to target-network updates for improved sample efficiency (Geist et al., 2018). A toy sketch follows this list.
  • Computer Graphics and Simulation: In geometry optimization/physics simulation, local-global solvers and projective dynamics benefit from AA, achieving 3–10× reductions in iteration count and 2–4× savings in wall-clock time, with monotonic energy safeguards guaranteeing global convergence (Peng et al., 2018).
  • Clustering and Statistical Estimation: Lloyd’s k-Means algorithm, interpreted as a fixed-point map, is greatly accelerated by AA; with dynamic depth adjustment, it yields consistent 20–50% speedups across real and synthetic datasets (Zhang et al., 2018).
  • Nonsmooth Composite Optimization: AA establishes local R-linear convergence in nonsmooth problems characterized by active manifold identification, such as proximal- or reweighted 1\ell_1 methods, Douglas-Rachford splitting, and coordinate descent for SVMs, without requiring KL-type assumptions (Li et al., 12 Oct 2024, Li, 12 Mar 2024).
  • PDEs and Scientific Computing: AA is systematically used to accelerate iterative solvers for nonlinear PDEs (Bratu, Navier-Stokes), transport equations, and large-scale seismic inversion, sometimes enabling convergence in cases where Newton or Picard fail or are too slow (Pollock et al., 2018, Yang, 2020).
  • Optimization and Machine Learning: Embedded into gradient descent, energy-adaptive gradient methods (AEGD), and even classical linear SVM training, AA enables reduced iteration counts (typically by factors of 2–5), with robust behavior in both convex and nonconvex landscapes (Liu et al., 2022, Ali et al., 2023, Ouyang et al., 2022).
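As mentioned in the reinforcement-learning item above, value iteration is itself a fixed-point iteration on the optimal Bellman operator (a $\gamma$-contraction in the sup norm), so it can be handed directly to an AA driver. The toy example below uses made-up random MDP data and assumes the `anderson_accelerate` sketch from Section 1 is in scope.

```python
# Toy AA-accelerated value iteration on a random MDP (illustrative data, not from the cited papers).
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 30, 4, 0.95
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)     # P[a, s, s'] : transition probabilities
R = rng.random((nA, nS))              # R[a, s]     : immediate rewards

def bellman(V):
    # Optimal Bellman operator: (T V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]
    return np.max(R + gamma * (P @ V), axis=0)

V_star = anderson_accelerate(bellman, np.zeros(nS), m=5)   # assumes the Section 1 sketch is defined
```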

5. Algorithmic Parameters and Recommendations

The practical performance of AA depends on several tunable parameters (an illustrative default configuration follows the list):

  • Window/depth $m$: Values in the range $3 \leq m \leq 10$ typically yield substantial speedups without ill-conditioning; in highly symmetric problems (e.g., SPD linear systems), $m \approx 3$ suffices when using truncated Gram-Schmidt.
  • Regularization $\lambda$: Small values ($10^{-6}$ to $10^{-3}$) prevent ill-conditioning of the least-squares problem for the mixing weights.
  • Damping $\beta$: Full mixing ($\beta = 1$) is effective for well-behaved systems; $\beta \in [0.5, 1]$ improves stability in more challenging or noncontractive settings.
  • Restart/Reset: To avoid loss of effectiveness or instability, periodically reset the stored history, especially if the gain $\theta_k$ approaches or exceeds $1$ or if the coefficients grow unreasonably.
  • Computational Overhead: Each AA step requires at most $\mathcal{O}(n m^2 + m^3)$ flops, negligible for small $m$ compared to the primary operator evaluation.
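Collected in one place, the dictionary below is an illustrative starting configuration reflecting the recommendations above (the names and exact values are this summary's, not prescriptions from the cited papers).

```python
# Illustrative AA defaults (hypothetical names/values) mirroring the recommendations above.
AA_DEFAULTS = {
    "m": 5,                 # history depth; 3-10 typical, m ~ 3 often suffices for SPD systems
    "lam": 1e-6,            # Tikhonov regularization for the mixing-weight least-squares problem
    "beta": 1.0,            # damping; move toward 0.5 for noncontractive or stiff problems
    "restart_gain": 0.999,  # reset history if the gain theta_k approaches or exceeds 1
    "max_coeff": 1e3,       # reset if the mixing coefficients grow unreasonably large
}
```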

6. Performance, Extensions, and Quantitative Results

Extensive empirical results validate the broad applicability and efficiency gain of Anderson acceleration:

  • For value iteration in MDPs: with $m=5$, the normalized $\ell_1$-error at iteration $k=50$ is reduced by a factor of roughly 3× compared to vanilla value iteration ($2.5\times 10^{-1}$ vs. $8\times 10^{-2}$) (Geist et al., 2018).
  • For geometry optimization and simulation: AA reduces iteration counts by 3–10× and wall-clock time by 2–4× on standard local-global solvers (Peng et al., 2018).
  • For Lloyd’s k-Means: AA reduces total CPU time by over 33% averaged across 120 test cases, with dynamic $m$ yielding 20–30% additional time reductions compared to static $m$ (Zhang et al., 2018).
  • For iteratively reweighted $\ell_1$ methods: AA achieves R-linear convergence in theory and a 2× speedup over Nesterov-accelerated IRL1 in practice, without KL assumptions (Li, 12 Mar 2024).
  • In seismic inversion, AA yields 3–5× faster convergence than steepest descent and is competitive with L-BFGS and GMRES at equivalent memory cost (Yang, 2020).

AA is compatible with preconditioning, composite-acceleration (AA inside quasi-Newton), and stochastic settings when equipped with regularization and robust updating (e.g., moving-average smoothing in deep learning to cope with noise) (Pasini et al., 2021).

Anderson acceleration generalizes simple mixing, multisecant quasi-Newton methods, and Pulay/DIIS (direct inversion in the iterative subspace) schemes. In the linear, full-history limit, AA is algebraically equivalent to restarted or truncated GMRES and, in the SPD setting, to polynomial accelerators (e.g., Chebyshev iteration). The key theoretical distinction is that AA constructs an implicit low-memory multisecant approximation of the inverse Jacobian, enforcing all secant conditions in a single linear least-squares projection, rather than sequentially as in L-BFGS or classical Broyden updates (Saad, 15 Jul 2025, Peng et al., 2018, Chen et al., 2023).
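To make the multisecant view concrete (a standard reformulation stated here for the undamped, unregularized case; the difference matrices $\mathcal{X}_k$, $\mathcal{R}_k$ and the operator $G_k$ are notation introduced for this sketch), collect the most recent iterate and residual differences and write the AA update as a quasi-Newton step:
$$\mathcal{X}_k = [\,x_{k-m+1}-x_{k-m},\ \dots,\ x_k-x_{k-1}\,],\qquad \mathcal{R}_k = [\,r_{k-m+1}-r_{k-m},\ \dots,\ r_k-r_{k-1}\,],$$
$$x_{k+1} = x_k - G_k\,r_k,\qquad G_k = -I + (\mathcal{X}_k + \mathcal{R}_k)\,(\mathcal{R}_k^\top \mathcal{R}_k)^{-1}\mathcal{R}_k^\top.$$
A direct computation gives $G_k\mathcal{R}_k = \mathcal{X}_k$, i.e., $G_k$ satisfies all $m$ secant conditions at once and acts as an approximate inverse Jacobian of the residual map $r(x)=F(x)-x$ on the span of the stored differences, whereas Broyden-type or L-BFGS updates enforce one secant condition per step.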

AA is also amenable to norm modification for nonstandard spectral distributions; e.g., the use of the $\mathcal{H}^{-2}$ norm in elliptic PDEs focuses the acceleration on slow, smooth modes (Yang et al., 2020).

In summary, Anderson-accelerated iterations provide a simple, highly general, and powerful device for boosting convergence rates throughout numerical analysis and machine learning. The method’s low overhead, ease of implementation, and strong theoretical guarantees in the contractive regime (and, with safeguards, beyond) make it a default choice for accelerating black-box or operator-based fixed-point solvers (Saad, 15 Jul 2025, Evans et al., 2018, Li et al., 12 Oct 2024, Geist et al., 2018).
