
Proximal Gradient Methods: ISTA & FISTA

Updated 29 January 2026
  • Proximal gradient methods are first-order optimization algorithms for composite problems whose objective is the sum of a smooth loss term and a nonsmooth regularizer.
  • ISTA performs iterative shrinkage with a proximal step achieving O(1/k) convergence, while FISTA accelerates this process to O(1/k²) using momentum-based extrapolation.
  • Recent extensions include adaptive, stochastic, and deep unfolding approaches that enhance performance in high-dimensional imaging, signal recovery, and machine learning applications.

Proximal Gradient Methods (ISTA, FISTA)

Proximal gradient methods are first-order optimization algorithms for composite problems of the form minimize $F(x) = f(x) + g(x)$, where $f$ is smooth (typically with an $L$-Lipschitz continuous gradient) and $g$ is convex but not necessarily smooth or differentiable. These techniques have become central in large-scale imaging inverse problems, signal recovery, machine learning, and sparse estimation, due to their simplicity, rigorous theoretical guarantees, and capacity to handle nonsmooth regularization structures via fast proximal operators.

1. Core Algorithms: ISTA and FISTA

The Iterative Shrinkage-Thresholding Algorithm (ISTA) is the archetype of proximal gradient methods. At each iteration, ISTA computes a gradient step for $f$ and then applies the proximal operator of $g$:

$$x^{k+1} = \mathrm{prox}_{\lambda g}\left(x^k - t\,\nabla f(x^k)\right),$$

where $\mathrm{prox}_{\lambda g}(v) = \arg\min_{x}\left\{g(x) + \frac{1}{2\lambda}\|x - v\|^2\right\}$.
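As a concrete instance, the update specializes to the lasso, where $f(x)=\frac12\|Ax-b\|^2$ and $g(x)=\lambda\|x\|_1$, whose proximal operator is elementwise soft-thresholding. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1: shrinks each entry toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, n_iter=500):
    # ISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of grad f
    t = 1.0 / L                        # fixed step size t = 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                   # gradient step on f
        x = soft_threshold(x - t * grad, t * lam)  # prox step on g
    return x
```

The same template handles any $g$ with a cheap proximal operator; only the `soft_threshold` call changes.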

FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) accelerates ISTA by incorporating a momentum term $y^k$ with extrapolation factor $\beta^k = (t_k - 1)/t_{k+1}$, where the sequence $t_k$ obeys $t_{k+1} = (1 + \sqrt{1 + 4t_k^2})/2$:

$$y^k = x^k + \beta^k (x^k - x^{k-1}), \qquad x^{k+1} = \mathrm{prox}_{\lambda g}\left(y^k - t\,\nabla f(y^k)\right).$$

This modification yields the optimal $O(1/k^2)$ convergence rate for the objective gap in convex settings, compared to ISTA's $O(1/k)$ rate (Zhang et al., 2014; Kong et al., 2021; Kim et al., 2016).
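A sketch of the accelerated scheme for the lasso ($f(x)=\frac12\|Ax-b\|^2$, $g(x)=\lambda\|x\|_1$), with the standard $t_k$ recursion and the soft-thresholding prox written inline:

```python
import numpy as np

def fista(A, b, lam, n_iter=500):
    # FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
    L = np.linalg.norm(A, 2) ** 2
    t = 1.0 / L
    x = x_prev = np.zeros(A.shape[1])
    tk = 1.0
    for _ in range(n_iter):
        tk_next = (1.0 + np.sqrt(1.0 + 4.0 * tk ** 2)) / 2.0
        y = x + ((tk - 1.0) / tk_next) * (x - x_prev)  # momentum extrapolation
        v = y - t * (A.T @ (A @ y - b))                # gradient step at y
        x_prev, x = x, np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)
        tk = tk_next
    return x
```

Note the prox and gradient are evaluated at the extrapolated point $y^k$, not at $x^k$; this is the only structural change relative to ISTA.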

2. Convergence Theory and Rate Statements

In the generic convex regime (ff convex):

  • ISTA: $F(x_k) - F^* \leq \frac{L}{2k}\|x_0 - x^*\|^2$ (Kong et al., 2021; Zhang et al., 2014).
  • FISTA: $F(x_k) - F^* \leq \frac{2L\|x_0 - x^*\|^2}{(k+1)^2}$ (Kong et al., 2021; Kim et al., 2016).

For strongly convex ff (μ>0\mu > 0), linear rates are provable:

  • ISTA: $F(x_k)-F^* = O\left((1-2\mu/L)^k\right)$,
  • FISTA: $F(x_k)-F^* = O\left((1-\sqrt{\mu/L})^k\right)$, with matching behavior in the squared proximal subgradient norm $\|G_s(y_k)\|^2$ (Li et al., 2022; Li et al., 2023).

Modern Lyapunov-based analyses generalize these statements: FISTA exhibits linear convergence under strong convexity without requiring explicit knowledge of the modulus, via high-resolution ODE and phase-space representations (Li et al., 2023).

From a worst-case analysis and performance estimation framework, FISTA is also "cost-function-optimal" among fixed-step first-order methods, with alternative step-coefficient rules providing refined bounds for composite gradient mapping (Kim et al., 2016).

Table: Summary of Key Iteration Complexities

| Method | Convex Rate | Strongly Convex Rate | Reference |
|---|---|---|---|
| ISTA | $O(1/k)$ | $O((1-2\mu/L)^k)$ | (Kong et al., 2021) |
| FISTA | $O(1/k^2)$ | $O((1-\sqrt{\mu/L})^k)$ | (Kong et al., 2021) |
| FISTA (ODE var.) | $O(1/k^2)$, linear without knowing $\mu$ | linear ($O(\rho^k)$, $\rho<1$) | (Li et al., 2023) |

3. Structural Extensions and Recent Algorithmic Variants

Adaptive, Dual, Multiobjective, and Banach-space Generalizations

  • Adaptive FISTA: Dynamically optimizes the extrapolation parameter $\beta_k$ via a per-iteration line search, in some cases equivalent to identity-minus-rank-1 SR1 proximal quasi-Newton steps. Convergence to stationary points is proved for nonconvex objectives (Ochs et al., 2017).
  • RAPID: Incorporates an auxiliary line search along the scalar ray $\theta x$ after the proximal step, ensuring a tighter per-iteration bound and faster empirical convergence than standard APG or FISTA (Zhang et al., 2014).
  • FISTA* (Dual formulation): In viscoplastic flow, solving the dual problem (in stress space) by FISTA achieves $O(1/k)$ rates and reduces per-iteration complexity, bypassing the penalty-parameter tuning required by ADMM/ALG2 (Treskatis et al., 2015).
  • Multiobjective FISTA: Targets weak Pareto optimality for $F(x)=(F_1(x),\dots,F_m(x))$, solving a composite max-of-convex subproblem via dualization at each iteration. $O(1/k^2)$ convergence is established in merit functions measuring Pareto optimality (Tanabe et al., 2022).
  • Banach-space FISTA: Adaptive discretization over an enlarging sequence of subsets $U^n$ recasts FISTA for problems with minimizers in Banach spaces (e.g., $L^1$, $BV$, spaces of measures), achieving $O(n^{-2(1-\kappa)})$ energy convergence under suitable conditions (Chambolle et al., 2021).

Inexact and Stochastic Gradient Extensions

  • Inexact FISTA with Relative Error: Proximal steps can be solved inexactly as long as a relative error rule is satisfied, without summable or vanishing error terms; the same optimal $O(1/k^2)$ complexity is guaranteed (Bello-Cruz et al., 2020).
  • Adaptive Gradient Estimation: ISTA/FISTA variants equipped with biased or unbiased stochastic gradient estimators (with controlled variance) achieve optimal iteration and sample complexity: sublinear in general, linear in strongly convex settings, with adaptive sample sizes (Bollapragada et al., 19 Jul 2025).
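To illustrate the stochastic direction only, here is a generic minibatch proximal SGD for the lasso with unbiased gradient estimates and a decaying step size; this is a simplified sketch, not the adaptive-sample-size scheme of the cited work:

```python
import numpy as np

def prox_sgd_lasso(A, b, lam, batch=4, n_iter=3000, seed=0):
    # Proximal stochastic gradient for min_x 0.5*||Ax - b||^2 + lam*||x||_1.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(n)
    for k in range(n_iter):
        idx = rng.choice(m, size=batch, replace=False)
        # Rescaled minibatch gradient: E[g] = A^T (A x - b).
        g = (m / batch) * A[idx].T @ (A[idx] @ x - b[idx])
        t = 0.5 / (L * (1.0 + 0.01 * k))               # decaying step size
        v = x - t * g
        x = np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)  # prox step
    return x
```

The decaying step controls the gradient-noise variance; adaptive schemes instead grow the batch size to keep a fixed step while retaining the deterministic rates.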

4. Advanced Analysis: Subgradient Norms and High-Resolution Behavior

Phase-space and high-resolution differential equation frameworks provide refined convergence characterizations:

  • ISTA: the squared proximal-subgradient norm converges at an inverse-square rate, i.e., $O(1/k^2)$.
  • FISTA: the squared norm decays at an inverse-cubic rate, i.e., $O(1/k^3)$ (Li et al., 2022). Consequently, FISTA attains a subgradient $\epsilon$-stationary point in $O(\epsilon^{-2/3})$ iterations, far fewer than ISTA's $O(\epsilon^{-1})$, for high-accuracy requirements.
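The quantity tracked by these analyses is the composite gradient mapping $G_t(x) = \big(x - \mathrm{prox}_{t g}(x - t\nabla f(x))\big)/t$, which vanishes exactly at minimizers of $F$. A sketch for the lasso (function name illustrative):

```python
import numpy as np

def gradient_mapping(x, A, b, lam, t):
    # G_t(x) = (x - prox_{t*g}(x - t*grad f(x))) / t for the lasso;
    # ||G_t(x)|| = 0 iff x minimizes 0.5*||Ax - b||^2 + lam*||x||_1.
    grad = A.T @ (A @ x - b)
    v = x - t * grad
    x_plus = np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)  # prox step
    return (x - x_plus) / t
```

For smooth unconstrained problems ($g=0$) this reduces to the ordinary gradient, so it plays the role of a stationarity residual in the composite setting.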

The mechanism behind acceleration is rigorously traced to the interaction of momentum and gradient-correction terms, not merely the extrapolation coefficients (Li et al., 2022; Li et al., 2023).

5. Deep Learning and Algorithm Unfolding: FISTA-Net

Recent work has unfolded FISTA into trainable network architectures, most notably FISTA-Net (Xiang et al., 2020). In FISTA-Net:

  • Each layer mimics a FISTA update, but the linear gradient step is replaced by a learnable weight $W^{(k)}$, the proximal operator by a small shared CNN with trainable threshold $\theta^{(k)}$, and the momentum coefficient $\rho^{(k)}$ is also learned.
  • Key parameters ($\mu^{(k)}$, $\theta^{(k)}$, $\rho^{(k)}$) are constrained via softplus parametrizations to enforce positivity and monotonicity, mirroring the theoretical requirements for convergence.
  • The architecture is tuning-free; all essential parameters are learned from data.
  • Empirically, on Electromagnetic Tomography (nonlinear) and sparse-view X-ray CT (linear), FISTA-Net surpasses classical FISTA-TV and deep post-processing models, with similar convergence in far fewer (e.g., 7 vs 200+) forward stages (Xiang et al., 2020).
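The structure of one unfolded layer can be sketched in NumPy. This is a hypothetical simplification, not the actual FISTA-Net architecture: a scalar step weight `w` and a plain soft-threshold `theta` stand in for the learned matrix step and the CNN-based prox, and `rho` is the per-layer momentum coefficient:

```python
import numpy as np

def unfolded_fista_layer(y, x_prev, A, b, w, theta, rho):
    # One FISTA-style layer with "learned" parameters:
    #   w     -- gradient-step weight (role of W^(k))
    #   theta -- soft-threshold level (stand-in for the CNN prox, theta^(k))
    #   rho   -- momentum coefficient (rho^(k))
    r = y - w * (A.T @ (A @ y - b))                      # learned gradient step
    x = np.sign(r) * np.maximum(np.abs(r) - theta, 0.0)  # learned shrinkage
    y_next = x + rho * (x - x_prev)                      # learned momentum
    return x, y_next
```

With `w = 1/L`, `theta = lam/L`, and the classical momentum schedule, the layer reduces to an ordinary FISTA iteration; training fits these parameters per layer instead of fixing them analytically.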

6. Implementation Paradigms and Practical Considerations

  • Step Size Selection: $t=1/L$ is safest if $L$ is known or estimated; adaptive backtracking and curvature tracking improve practical performance, as in Generalized ACGM (Florea et al., 2017).
  • Monotonicity Enforcement: Monotone FISTA variants (MFISTA, MACGM) can prevent oscillations, with negligible overhead when overshoots are rare (Florea et al., 2017).
  • Stopping Criteria: One can monitor the norm of the gradient mapping or composite subgradient, rather than the function-value gap; inverse-cubic or linear rates enable precise iteration counts for reaching given tolerances (Li et al., 2022).
  • Extensions to ADMM: Proximal gradient/FISTA can be recovered as a direct reformulation or reduction of ADMM for composite problems, yielding significant runtime reductions and scaling robustness (Shimmura et al., 2021).
  • Tensor/Multidimensional Generalizations: Proximal gradient methods and FISTA have been extended to tensor-valued spaces, with extrapolation via tensor least-squares/HOSVD-MPE acceleration. In image restoration tasks, such tensor extrapolation reduces iterations and improves PSNR measurably over vanilla ISTA/TDPG (Bentbib et al., 2024).
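A minimal backtracking rule for the step size, sketched for the lasso; the sufficient-decrease test is the standard quadratic upper-bound condition (a generic scheme, not the specific ACGM rule):

```python
import numpy as np

def prox_grad_backtracking(A, b, lam, n_iter=200, t0=1.0, eta=0.5):
    # Proximal gradient with backtracking line search: no knowledge of L is
    # needed up front; t is shrunk until the quadratic upper bound holds.
    f = lambda z: 0.5 * np.sum((A @ z - b) ** 2)
    x = np.zeros(A.shape[1])
    t = t0
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)
        while True:
            v = x - t * grad
            x_plus = np.sign(v) * np.maximum(np.abs(v) - t * lam, 0.0)
            d = x_plus - x
            # Accept t once f(x+) <= f(x) + <grad f(x), d> + ||d||^2 / (2t).
            if f(x_plus) <= f(x) + grad @ d + (d @ d) / (2.0 * t):
                break
            t *= eta  # shrink step and retry
        x = x_plus
    return x
```

The inner loop terminates as soon as $t \le 1/L$, so the total number of backtracks is bounded; the accepted $t$ is kept for the next iteration rather than reset.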

7. Summary and Impact

Proximal gradient methods, notably ISTA and FISTA, constitute the foundation for large-scale optimization of composite problems. They feature transparent updates and rigorous complexity bounds, extend readily to inexact, adaptive, stochastic, dual, and multiobjective settings, and form the basis for modern deep algorithmic architectures. Recent theoretical advances from Lyapunov and high-resolution ODE analyses remove prior technical barriers (such as the requirement of explicit modulus knowledge for linear convergence), and empirical performance has been greatly extended by algorithm unfolding, curvature-adaptive steps, and domain-specific extrapolation, notably in imaging and sparse estimation.

For best practices, acceleration, adaptive step sizes, monotonicity safeguards, and appropriate stopping rules should be adopted as standard, with problem structure exploited wherever possible. Proximal gradient methods continue to be a mainstay for both theory-driven and application-centric developments in convex and nonconvex optimization (Kong et al., 2021; Li et al., 2022; Kim et al., 2016; Xiang et al., 2020; Bollapragada et al., 19 Jul 2025).
