FISTA: Accelerated Proximal Gradient Method
- FISTA is a fast iterative shrinkage-thresholding algorithm designed for composite convex optimization, achieving an optimal O(1/k²) convergence rate.
- It accelerates traditional proximal gradient methods using Nesterov’s momentum and supports adaptive strategies like backtracking and block-coordinate updates.
- FISTA has proven effective in applications such as LASSO, image deblurring, and matrix completion, backed by rigorous theoretical and empirical results.
Accelerated Proximal Gradient (FISTA)
The Accelerated Proximal Gradient method, most commonly referred to as FISTA (Fast Iterative Shrinkage-Thresholding Algorithm), is a first-order algorithm designed to minimize composite objectives consisting of a smooth differentiable function and a possibly nonsmooth, proximable convex penalty. FISTA is distinguished by an optimal convergence rate for convex objectives, and it constitutes the canonical Nesterov-type acceleration of the basic proximal gradient (ISTA) scheme. The method has extensive applications across signal processing, statistics, machine learning, and scientific computing, delivering both theoretical guarantees and practical efficiency (Bollapragada et al., 19 Jul 2025).
1. Problem Setting and Basic Algorithm
FISTA addresses problems of the form

$$\min_{x \in \mathbb{R}^n} \; F(x) = f(x) + g(x),$$

where $f$ is $L$-smooth (i.e., $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$), convex (or strongly convex), and $g$ is closed, convex, and proximal-friendly (i.e., $\mathrm{prox}_{\alpha g}(y) = \arg\min_x \{ g(x) + \tfrac{1}{2\alpha}\|x - y\|^2 \}$ is easily computable).
The standard FISTA iteration is:

$$x_k = \mathrm{prox}_{\alpha g}\big(y_k - \alpha \nabla f(y_k)\big), \qquad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}, \qquad y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}\,(x_k - x_{k-1}),$$

with initial $y_1 = x_0$, $t_1 = 1$, step size $\alpha = 1/L$ (Bollapragada et al., 19 Jul 2025, Kim et al., 2016, Kong et al., 2021).
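As a concrete instantiation, the following is a minimal Python sketch of this iteration for the LASSO objective $\tfrac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$; the helper names and the fixed step size $1/L$ are illustrative choices, not drawn from the cited papers.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_lasso(A, b, lam, num_iters=500):
    """FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
    Here f(x) = 0.5*||Ax - b||^2 is L-smooth with L = ||A||_2^2."""
    L = np.linalg.norm(A, 2) ** 2              # smoothness constant of f
    alpha = 1.0 / L                            # fixed step size 1/L
    x_prev = np.zeros(A.shape[1])
    y, t = x_prev.copy(), 1.0
    for _ in range(num_iters):
        grad = A.T @ (A @ y - b)               # gradient of f at y
        x = soft_threshold(y - alpha * grad, alpha * lam)  # prox step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0  # momentum recursion
        y = x + ((t - 1.0) / t_next) * (x - x_prev)        # extrapolation
        x_prev, t = x, t_next
    return x_prev
```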
For cases where $L$ is not known, a backtracking line-search scheme is adopted, using a local smoothness bound to adaptively select the step size (Bollapragada et al., 19 Jul 2025, Huang et al., 2024).
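A sketch of how backtracking can be grafted onto the prox step is below; it uses the standard sufficient-decrease test against the local quadratic upper bound, and the parameter names (alpha0, eta) are assumptions.

```python
def backtracking_prox_step(f, grad_f, prox_g, y, alpha0=1.0, eta=0.5):
    """Shrink alpha until the quadratic upper bound
    f(x+) <= f(y) + <grad_f(y), x+ - y> + ||x+ - y||^2 / (2*alpha)
    holds at the proximal gradient step x+; return x+ and alpha."""
    g, alpha = grad_f(y), alpha0
    while True:
        x_plus = prox_g(y - alpha * g, alpha)
        diff = x_plus - y
        if f(x_plus) <= f(y) + g @ diff + (diff @ diff) / (2.0 * alpha):
            return x_plus, alpha
        alpha *= eta                  # step rejected: shrink and retry
```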
2. Acceleration and Convergence Rates
The choice of momentum parameters $t_k$, set recursively via $t_{k+1} = \tfrac{1}{2}\big(1 + \sqrt{1 + 4t_k^2}\big)$, underpins the acceleration. This is rooted in Nesterov's estimate-sequence framework, ensuring the surrogate sequence decays as $O(1/k^2)$, with practical extrapolation weights $\beta_k = (t_k - 1)/t_{k+1}$ (Bollapragada et al., 19 Jul 2025, Kim et al., 2016, Kong et al., 2021).
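A one-line induction makes the mechanism concrete: since $\sqrt{1 + 4t_k^2} \ge 2t_k$,

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2} \;\ge\; t_k + \frac{1}{2} \quad\Longrightarrow\quad t_k \ge \frac{k+1}{2},$$

so the potential-function argument, which bounds the suboptimality by a constant times $t_k^{-2}$, yields the $O(1/k^2)$ rate.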
Convex regime
If $f$ is convex and $L$-smooth, and $g$ is convex and proximable, then for all $k \ge 1$:

$$F(x_k) - F(x^\star) \le \frac{2L\|x_0 - x^\star\|^2}{(k+1)^2}.$$

This $O(1/k^2)$ function-value convergence is optimal for first-order methods (Bollapragada et al., 19 Jul 2025, Kim et al., 2016, Kong et al., 2021).
Strongly convex regime
When $f$ is additionally $\mu$-strongly convex (with $g$ convex or strongly convex), FISTA with fixed parameters achieves a global geometric (linear) rate

$$F(x_k) - F(x^\star) \le \Big(1 - \sqrt{\mu/L}\Big)^{k}\Big(F(x_0) - F(x^\star) + \frac{\mu}{2}\|x_0 - x^\star\|^2\Big).$$
Modifications such as fixed-momentum or adaptive restarts can further stabilize the linear phase; generalized variants admit more flexible (constant or adaptive) inertial weights without degrading this exponential convergence (Bollapragada et al., 19 Jul 2025, Li et al., 2023, Li et al., 9 Apr 2025, Kong et al., 2021).
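A sketch of the fixed-parameter strongly convex variant: the adaptive $t_k$ recursion is replaced by the constant extrapolation weight $\beta = (\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})$; the function handles below are assumed inputs.

```python
import numpy as np

def fista_strongly_convex(grad_f, prox_g, x0, L, mu, num_iters=500):
    """Accelerated proximal gradient with constant momentum, used when f
    is mu-strongly convex and L-smooth; attains a linear rate."""
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    alpha = 1.0 / L
    x_prev, y = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x = prox_g(y - alpha * grad_f(y), alpha)   # proximal gradient step
        y = x + beta * (x - x_prev)                # constant extrapolation
        x_prev = x
    return x_prev
```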
Proximal gradient norm rates
Notably, FISTA also attains an $O(1/k^3)$ rate of convergence in the squared norm of the proximal subgradient, which leads to sharper stationarity certificates compared to the $O(1/k^2)$ rate of ISTA (Li et al., 2022, Kong et al., 2021).
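For reference, the stationarity certificate here is built from the composite (proximal) gradient mapping; a standard definition, assumed here to match the cited papers' notation, is:

$$G_\alpha(y) \;=\; \frac{1}{\alpha}\Big(y - \mathrm{prox}_{\alpha g}\big(y - \alpha \nabla f(y)\big)\Big), \qquad G_\alpha(y) = 0 \iff y \in \arg\min_x F(x).$$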
3. Variants and Extensions
Stochastic and Inexact Gradients
In scenarios where $f$ has finite-sum or expectation structure ($f(x) = \tfrac{1}{n}\sum_{i=1}^{n} f_i(x)$ or $f(x) = \mathbb{E}_{\xi}[f(x;\xi)]$), using unbiased or biased stochastic gradient estimators is standard. To maintain accelerated rates, the variance/bias and batch size grow adaptively to track the reduced gradient $G_\alpha(y_k)$, enforcing accuracy conditions that tie the gradient-estimation error to the current reduced-gradient norm.
Target rates of $O(1/k^2)$ for convex and linear (geometric) for strongly convex problems are preserved, with optimal total counts of stochastic gradient evaluations (Bollapragada et al., 19 Jul 2025).
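One common way to realize such conditions is a norm test that grows the batch until the empirical variance is dominated by the estimated gradient norm; the threshold theta and the doubling policy below are illustrative, not the precise rules of Bollapragada et al.

```python
import numpy as np

def adaptive_batch_gradient(sample_grad, y, batch_size, theta=0.5, rng=None):
    """Average `batch_size` stochastic gradients at y; grow the batch until
    the norm test  (sample variance)/batch <= theta^2 * ||g_bar||^2  holds."""
    rng = rng or np.random.default_rng()
    while True:
        grads = np.stack([sample_grad(y, rng) for _ in range(batch_size)])
        g_bar = grads.mean(axis=0)
        total_var = grads.var(axis=0).sum()    # summed per-coordinate variance
        if total_var / batch_size <= (theta * np.linalg.norm(g_bar)) ** 2:
            return g_bar, batch_size
        batch_size *= 2                        # grow the sample size
```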
Multiobjective and Block-Coordinate Acceleration
Multiobjective FISTA extensions solve a subproblem involving the maximum of the linearizations of all component objectives; duality-based reductions often enable efficient solution (Tanabe et al., 2022, Nishimura et al., 2022, Huang et al., 2024). Block-coordinate FISTA generalizations update only a subset of variables per iteration, with adaptive momenta tailored to nonconvex/convex structure and exploiting block separability (Lau et al., 2017).
Monotonic and Adaptive Variants
Monotone FISTA (MFISTA) enforces non-increasing objective sequences by rejecting iterates that do not reduce the function value. This stabilizes convergence, particularly under inexact subproblem solutions or in nonconvex/nonmonotonic regimes (Nishimura et al., 2022, Florea et al., 2017).
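A minimal sketch of the monotone safeguard, following the standard MFISTA pattern (variable names are illustrative):

```python
import math

def mfista_step(F, prox_grad_step, x_prev, y, t):
    """One monotone-FISTA step: take the usual proximal gradient step from y,
    but keep the previous iterate if the candidate raises the objective;
    the momentum sequence is updated regardless."""
    z = prox_grad_step(y)                        # candidate iterate
    x = z if F(z) <= F(x_prev) else x_prev       # enforce monotone objective
    t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
    # extrapolate using both the candidate z and the accepted x
    y_next = x + (t / t_next) * (z - x) + ((t - 1.0) / t_next) * (x - x_prev)
    return x, y_next, t_next
```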
Adaptive momentum and extrapolation strategies, such as parameterized inertial coefficients or line-search-based updates, provide further robustness and can guarantee convergence of iterates, not just objective values (Tanabe et al., 2022, Ochs et al., 2017).
RAPID and Hybrid Acceleration
RAPID introduces a univariate line-search in the direction of the iterates, modifying the auxiliary variable update and reducing the worst-case constant in the convergence bound relative to classical FISTA (Zhang et al., 2014).
Two-phase approaches such as NIDAAREM combine FISTA in an initial phase with Anderson acceleration (DAAREM) for local rapid convergence, leveraging monotonicity-based switching rules and restart techniques (Henderson et al., 16 Aug 2025).
4. Theoretical Foundations and Proof Techniques
FISTA’s global rates are established via estimate sequences or Lyapunov potential techniques, telescoping descent inequalities derived from $L$-smoothness and proximal optimality. In the convex case, the proof builds a quadratic-majorization-based energy function, demonstrating $O(1/k^2)$ decay in function value (Bollapragada et al., 19 Jul 2025, Kim et al., 2016, Kong et al., 2021).
In the strongly convex regime, standard linearization is extended by adjusting momentum or by integrating restarts when empirical oscillation is detected. High-resolution ODE and phase-space analyses have clarified how adaptive Lyapunov functions deliver geometric decay even without knowledge of the strong convexity parameter (Li et al., 2023).
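A common gradient-based restart trigger, in the spirit of the adaptive restarts discussed above (the exact rule in the cited works may differ):

```python
import numpy as np

def maybe_restart_momentum(y, x, x_prev, t):
    """Reset the momentum parameter when the extrapolation direction opposes
    progress: <y - x, x - x_prev> > 0 signals overshoot/oscillation."""
    return 1.0 if np.dot(y - x, x - x_prev) > 0 else t
```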
Performance estimation problem (PEP) frameworks rigorously characterize worst-case rates and guide algorithmic design for composite gradient norm minimization as well as cost function decrease (Kim et al., 2016).
Multiobjective and inexact variants rely on augmented merit functions, auxiliary sequence recursions, and suitably adapted Lyapunov functions to handle nontrivial subproblems and error control (Huang et al., 2024, Bello-Cruz et al., 2020).
5. Inexact and Stochastic Proximal Subproblems
FISTA variants with inexact proximal operators, such as I-FISTA and IE-FISTA, use relative error rules tied to the subproblem solution norm and ensure rates without requiring summable error tolerances. These rules are incorporated seamlessly into the Lyapunov and telescoping proof apparatus, allowing subproblem accuracy to decrease as iterates approach stationarity (Bello-Cruz et al., 2020).
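A sketch of such a relative-error inner loop; the residual oracle and the tolerance sigma are illustrative stand-ins for the precise rules of Bello-Cruz et al.

```python
import numpy as np

def inexact_prox(inner_step, residual, z0, y, sigma=0.1, max_inner=100):
    """Run an inner solver on the prox subproblem until the *relative* rule
    residual(z, y) <= sigma * ||z - y|| holds; no summable tolerance
    sequence is required, and accuracy tightens as ||z - y|| shrinks."""
    z = z0
    for _ in range(max_inner):
        if residual(z, y) <= sigma * np.linalg.norm(z - y):
            break
        z = inner_step(z, y)              # one step of the inner solver
    return z
```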
Stochastic accelerated PG methods increase batch sizes or adjust estimator variance per iteration, ensuring unbiased or controlled bias gradients such that the total work for both convergence and gradient computation matches the fundamental lower bounds for first-order stochastic convex minimization (Bollapragada et al., 19 Jul 2025).
6. Practical Considerations, Implementation, and Empirical Behavior
Implementation recommendations include:
- Use backtracking line-search if $L$ is not known; adapt $\alpha_k$ per iteration as in the FISTA step.
- For separable regularizers (ℓ1, group penalties), the proximal map is efficiently computable (e.g., soft-thresholding), maintaining low per-iteration cost (Bollapragada et al., 19 Jul 2025); see the group-penalty sketch after this list.
- In large-scale or stochastic settings, begin with small gradient batches and increase as the iterate approaches stationarity, guided by reduced-gradient norm tests.
- Alternatives such as monotonicity checks, adaptive damping, or block selection strategies may enhance robustness or exploit problem structure, especially in ill-conditioned or nonconvex regimes (Florea et al., 2017, Lau et al., 2017).
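As an example of the separable-prox point above, a sketch of the group-penalty proximal map (block soft-thresholding; the group-index layout is an assumption):

```python
import numpy as np

def prox_group_lasso(v, tau, groups):
    """Proximal operator of tau * sum_g ||v_g||_2 over non-overlapping
    groups: shrink each group's norm by tau, zeroing small groups."""
    out = np.zeros_like(v)
    for g in groups:                      # g: index array for one group
        norm_g = np.linalg.norm(v[g])
        if norm_g > tau:
            out[g] = (1.0 - tau / norm_g) * v[g]
    return out
```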
Empirically, FISTA robustly outperforms classical proximal gradient/ISTA for composite minimization, achieving orders-of-magnitude reductions in iteration counts for representative problems such as LASSO, image deblurring, convex quadratic programming, and regularized matrix completion (Bollapragada et al., 19 Jul 2025, Treskatis et al., 2015, Henderson et al., 16 Aug 2025). Near the solution, however, ISTA may locally outperform FISTA in some sparse regimes, since once the active support is identified ISTA's local linear rate can dominate; hybrid strategies or dynamic switching can exploit this transition (Tao et al., 2015).
7. Generalizations, Connections, and Open Directions
FISTA is a special instance of the broader family of accelerated composite gradient methods, all of which share the key components: extrapolated iterates, parameterized Lyapunov potentials, and recursions tuned to curvature structure (Florea et al., 2017).
Recent work has shown that the fundamental acceleration mechanism extends—without requiring knowledge of the strong convexity modulus—by using iteration-varying kinetic coefficients in the Lyapunov analysis (Li et al., 2023, Li et al., 9 Apr 2025).
Adaptive, flexible FISTA-type schemes with generalized momenta (e.g., parameter pairs or weakly coupled sequences) provably retain rates and, with controlled damping, guarantee convergence of the entire iterate sequence, not just function values (Tanabe et al., 2022). Further, a spectrum of monotone variants balance stability and acceleration across diverse applications.
The ODE and PEP perspectives continue to clarify the intrinsic optimality of FISTA and variants, connecting continuous and discrete-time dynamics and providing tools for tailored algorithm design in composite, stochastic, nonconvex, and multiobjective frameworks.
References
- "On the Convergence and Complexity of Proximal Gradient and Accelerated Proximal Gradient Methods under Adaptive Gradient Estimation" (Bollapragada et al., 19 Jul 2025)
- "Linear convergence of forward-backward accelerated algorithms without knowledge of the modulus of strong convexity" (Li et al., 2023)
- "Monotonicity for Multiobjective Accelerated Proximal Gradient Methods" (Nishimura et al., 2022)
- "Another look at the fast iterative shrinkage/thresholding algorithm (FISTA)" (Kim et al., 2016)
- "A Generalized Accelerated Composite Gradient Method: Uniting Nesterov's Fast Gradient Method and FISTA" (Florea et al., 2017)
- "Proximal Subgradient Norm Minimization of ISTA and FISTA" (Li et al., 2022)
- "Relaxed Weak Accelerated Proximal Gradient Method: a Unified Framework for Nesterov's Accelerations" (Li et al., 9 Apr 2025)
- "On Inexact Accelerated Proximal Gradient Methods with Relative Error Rules" (Bello-Cruz et al., 2020)
- "Accelerating Proximal Gradient-type Algorithms using Damped Anderson Acceleration with Restarts and Nesterov Initialization" (Henderson et al., 16 Aug 2025)
- "Local Linear Convergence of ISTA and FISTA on the LASSO Problem" (Tao et al., 2015)
- "FISTA and Extensions -- Review and New Insights" (Kong et al., 2021)
- "Accelerated Block Coordinate Proximal Gradients with Applications in High Dimensional Statistics" (Lau et al., 2017)
- "A globally convergent fast iterative shrinkage-thresholding algorithm with a new momentum factor for single and multi-objective convex optimization" (Tanabe et al., 2022)
- "RAPID: Rapidly Accelerated Proximal Gradient Algorithms for Convex Minimization" (Zhang et al., 2014)
- "Adaptive FISTA for Non-convex Optimization" (Ochs et al., 2017)
- "An accelerated proximal gradient method for multiobjective optimization" (Tanabe et al., 2022)
- "Accelerated Proximal Gradient Method with Backtracking for Multiobjective Optimization" (Huang et al., 2024)
- "An Accelerated Dual Proximal Gradient Method for Applications in Viscoplasticity" (Treskatis et al., 2015)