
Accelerated Proximal-Gradient Steps

Updated 16 December 2025
  • Accelerated Proximal-Gradient (APG) methods are first-order algorithms that combine momentum-based extrapolation with proximal mappings to accelerate convergence in composite optimization problems.
  • They achieve an O(1/k²) convergence rate for convex scenarios and can be tailored with adaptive restart and high-order strategies for strongly convex, nonconvex, and stochastic settings.
  • APG steps are widely applied in control, statistical learning, and imaging, efficiently addressing large-scale problems through structured momentum scheduling and proximal operations.

The Accelerated Proximal-Gradient (APG) method encompasses a class of first-order optimization algorithms designed to efficiently solve large-scale convex (and, in extensions, nonconvex) composite minimization problems of the form

$$\min_x\, F(x) = f(x) + g(x),$$

where $f$ is typically smooth and $g$ is convex (possibly nonsmooth) or an indicator function for constraints. APG methods, often termed Fast Iterative Shrinkage-Thresholding Algorithms (FISTA), employ momentum-based extrapolation combined with proximal mappings to achieve an accelerated convergence rate over standard proximal-gradient schemes. APG steps have been generalized to structured settings, higher-order polynomial schedules, dual and multiobjective problems, and have inspired variants for stochastic, nonconvex, Riemannian, and block-coordinate frameworks.
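
For instance, with $g(x) = \lambda \|x\|_1$ (the LASSO regularizer), the proximal map has a closed-form soft-thresholding expression; the following minimal NumPy sketch is illustrative only (the function name and test vector are not taken from the cited papers):

import numpy as np

def prox_l1(v, t):
    # Soft-thresholding: the proximal map of t * ||x||_1 evaluated at v.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Each entry is shrunk toward zero by t; entries with |v_i| <= t become exactly zero.
v = np.array([3.0, -0.2, 0.5, -1.5])
print(prox_l1(v, t=1.0))   # approximately [ 2.  -0.   0.  -0.5]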

1. Formal Structure of the APG Step

The canonical APG step operates by generating a sequence of auxiliary (extrapolated) iterates using carefully calibrated momentum coefficients, followed by a proximal-gradient update. Let

  • $\eta$ denote the step-size (commonly $\eta = 1/L$, with $L$ the Lipschitz constant of $\nabla f$);
  • $\{\theta_k\}$, or sequences derived from roots of high-order polynomial recurrence relations, serve as momentum parameters.

The standard APG/FISTA iteration (for convex $f$ and closed proper convex $g$) is

$$\begin{align*} y_k &= x_k + \beta_k (x_k - x_{k-1}), \\ x_{k+1} &= \operatorname{prox}_{\eta g}\!\big(y_k - \eta \nabla f(y_k)\big), \end{align*}$$

where $\beta_k$ is the momentum weight. In the FISTA variant, $\beta_k = \frac{t_{k-1} - 1}{t_k}$ with $t_k = \frac{1 + \sqrt{1 + 4 t_{k-1}^2}}{2}$ and initialization $t_0 = 1$ (Henderson et al., 16 Aug 2025, Driggs et al., 25 Mar 2024).
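
As a concrete illustration of this schedule, the first few FISTA weights can be printed with a short script (a minimal sketch; variable names are illustrative):

import math

t_prev = 1.0                                   # t_0 = 1
for k in range(1, 6):
    t = (1.0 + math.sqrt(1.0 + 4.0 * t_prev**2)) / 2.0
    beta = (t_prev - 1.0) / t                  # beta_k = (t_{k-1} - 1) / t_k
    print(f"k={k}: t_k={t:.4f}, beta_k={beta:.4f}")
    t_prev = t
# beta_k is approximately 0, 0.2818, 0.4340, ... and approaches 1, which drives the acceleration.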

Advanced variants (e.g., in model predictive control QP) employ momentum parameters $\mu_p$ drawn from $\alpha$-order polynomial recurrences,

$$\mu_{p+1}^\alpha - \mu_{p+1}^{-1} - \mu_p^\alpha = 0, \qquad \mu_1 = 1,$$

enabling an iterate-extrapolated APG update with an improved convergence bound (Wang et al., 2021).
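
In practice the roots $\mu_p$ can be precomputed offline into a lookup table by solving the scalar recurrence numerically, e.g. by bisection. The sketch below assumes the recurrence exactly as stated above; the function name and tolerances are illustrative, not taken from (Wang et al., 2021):

def mu_table(P, alpha):
    # Returns [mu_1, ..., mu_P] with mu_1 = 1 and
    # mu_{p+1}^alpha - mu_{p+1}^(-1) - mu_p^alpha = 0 for each p.
    mus = [1.0]
    for _ in range(P - 1):
        mu_p = mus[-1]
        g = lambda m: m**alpha - 1.0 / m - mu_p**alpha   # increasing for m > 0 when alpha >= 1
        lo, hi = mu_p, mu_p + 1.0
        while g(hi) < 0.0:                               # expand until the root is bracketed
            hi += 1.0
        for _ in range(60):                              # plain bisection
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
        mus.append(0.5 * (lo + hi))
    return mus

print(mu_table(P=5, alpha=2))   # e.g. [1.0, 1.3247..., ...]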

2. Convergence Rates and Momentum Scheduling

Convergence rate analysis is central to APG methodology. For classical convex objectives,

$$F(x_k) - F(x^*) = O(1/k^2).$$

If $f$ is strongly convex, a geometric rate is recovered (Bollapragada et al., 19 Jul 2025, Zhu et al., 24 Jul 2025). In the “high-order” APG scheme for quadratic programs,

$$\| \lambda^p - \lambda^* \|_2^2 \leq C \cdot \frac{L\, \| \lambda^0 - \lambda^* \|_2^2}{(p + \alpha - 1)^\alpha},$$

with $\alpha \geq 2$ a user parameter and $\mu_p$ selected accordingly (Wang et al., 2021). Practically, increasing $\alpha$ tightens the theoretical rate to $O(1/p^\alpha)$, but marginal wall-time improvements saturate at moderate $\alpha$ (see empirical thresholds in Fig. 4 of (Wang et al., 2021)).
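
Rearranging the bound gives a rough iteration estimate for a target dual accuracy $\epsilon$ (a direct consequence of the stated rate, not a separate result from the cited paper):

$$\| \lambda^p - \lambda^* \|_2^2 \leq \epsilon \quad \text{whenever} \quad p \;\geq\; \left( \frac{C\, L\, \| \lambda^0 - \lambda^* \|_2^2}{\epsilon} \right)^{1/\alpha} - \alpha + 1,$$

so the required number of iterations scales like $\epsilon^{-1/\alpha}$, which is the sense in which larger $\alpha$ tightens the theoretical rate.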

Adaptive and restart schemes have been developed to deal with unknown regularity (e.g., Hölderian error bounds), strong convexity parameters, or to avoid “overshooting” (oscillatory behavior) (Henderson et al., 16 Aug 2025, Liu et al., 2016, Zhou et al., 2020). Typical restart triggers are based on objective nonmonotonicity, inner-product (gradient mapping) conditions, or directly on stationary measure decreases.
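
A common concrete instance is the inner-product (gradient-mapping) test: momentum is reset whenever the extrapolation direction opposes the latest proximal-gradient step. A minimal sketch of that check (names and the reset convention are illustrative):

import numpy as np

def should_restart(y_k, x_next, x_k):
    # Restart when (y_k - x_{k+1}) . (x_{k+1} - x_k) > 0, i.e. when the momentum
    # has carried the iterate past the point where progress stalls or oscillates.
    return float(np.dot(y_k - x_next, x_next - x_k)) > 0.0

# On a restart, a typical policy resets t_k to 1 (so the next beta_k is 0) and continues.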

3. Generalizations and Specializations of APG

APG-type steps have been extended to:

  • Dual problems and quadratic programming: Primal MPC QPs are solved efficiently by dualizing to an unconstrained form and conducting APG in the dual multipliers, with projection onto the nonnegative orthant as the prox-map (Wang et al., 2021).
  • Stochastic and adaptive-gradient settings: APG retains optimal rates when gradient oracles are replaced by adaptive sample-average estimators, provided the variance and bias are controlled according to a state-dependent policy (e.g., Condition 2.1) (Bollapragada et al., 19 Jul 2025, Zhu et al., 24 Jul 2025).
  • Multiobjective and vector optimization: APG has been formulated for vector-valued objectives, using scalar merit functions (e.g., $\sup_z \min_i \{ F_i(x) - F_i(z) \}$) and subproblems involving max-over-indices, with efficient dual representations (Tanabe et al., 2022, Huang, 9 Jul 2025).
  • Riemannian optimization: Geodesic convexity and retraction-convexity conditions permit Riemannian APG steps, with all momentum and proximal operations mapped via exponential/retraction and parallel transport (Feng et al., 26 Sep 2025).
  • Block-coordinate/nonconvex settings: Acceleration and monotonicity principles extend to block-updates and to nonconvex settings, often requiring adaptively tamed momentum and rigorous monotonicity checks (Lau et al., 2017, Li et al., 2017, Yao et al., 2016).

4. Implementation and Pseudocode Structures

The algorithmic structure of APG depends on the application context. For the model predictive control QP (Wang et al., 2021), the APG pseudocode is as follows:

initialize: λ⁰ = 0, λ¹ = 0, μ₁ = 1; precompute lookup table T[1..P] of the momentum roots μₚ
for p = 1, 2, ..., P:
    μₚ₊₁ = T[p+1]
    βₚ = (μₚ − 1) / μₚ₊₁
    yᵖ = λᵖ + βₚ * (λᵖ − λᵖ⁻¹)
    grad = A H⁻¹ (Aᵀ yᵖ + G) + B
    λᵖ⁺¹ = max{0, yᵖ − (1/L) * grad}
    if ‖λᵖ⁺¹ − λᵖ‖ < ε: break
uᵖ = H⁻¹ (−Aᵀ λᵖ − G)
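
The gradient expression in the loop is the negated dual gradient of the primal QP; assuming the primal has the form $\min_u \tfrac{1}{2} u^\top H u + G^\top u$ subject to $A u \leq B$ with $H \succ 0$ (a form inferred from the symbols above rather than quoted from (Wang et al., 2021)), the inner minimization and dual gradient are

$$u(\lambda) = -H^{-1}(A^\top \lambda + G), \qquad -\nabla d(\lambda) = A H^{-1}(A^\top \lambda + G) + B,$$

so the proximal map reduces to the projection $\max\{0, \cdot\}$ onto the nonnegative orthant and $L$ can be taken as $\|A H^{-1} A^\top\|_2$.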

For composite convex problems, generic APG pseudocode is:

initialize: x₁ = x₀ (starting point); t₀ = 1
for k = 1, 2, 3, ...:
    tₖ = (1 + sqrt(1 + 4 tₖ₋₁²)) / 2
    βₖ = (tₖ₋₁ − 1) / tₖ
    yₖ = xₖ + βₖ * (xₖ − xₖ₋₁)
    xₖ₊₁ = prox_{ηg}(yₖ − η ∇f(yₖ))

The actual implementation nuances—momentum selection, adaptive restart, high-order updates—depend on the domain and convergence regime.
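
As one concrete, runnable instance of the generic scheme, the following NumPy sketch applies the APG/FISTA step to a small LASSO problem $\min_x \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$; the problem data, iteration budget, and parameter choices are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = rng.standard_normal(5)
b = A @ x_true + 0.01 * rng.standard_normal(40)
lam = 0.1

L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad f for f(x) = 0.5*||Ax - b||^2
eta = 1.0 / L                              # step size eta = 1/L

def grad_f(x):                             # gradient of the smooth part
    return A.T @ (A @ x - b)

def prox_l1(v, t):                         # prox of t*||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x, x_prev, t_prev = np.zeros(100), np.zeros(100), 1.0
for k in range(500):
    t = (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2)) / 2.0
    beta = (t_prev - 1.0) / t                               # FISTA momentum weight
    y = x + beta * (x - x_prev)                             # extrapolated point
    x_prev, x = x, prox_l1(y - eta * grad_f(y), eta * lam)  # proximal-gradient step at y
    t_prev = t

obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x, 1)
print(f"objective {obj:.4f}, nonzeros {int((np.abs(x) > 1e-6).sum())}")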

5. Assumptions and Theoretical Guarantees

Classical APG analysis is predicated on:

  • Convexity of $f$ and $g$ (or indicator structure for constraints);
  • Lipschitz continuity of $\nabla f$, with known $L$, or access to local backtracking for the step size (see the sketch after this list);
  • For dual QP: strict feasibility (Slater's condition), positive definite quadratic term in the primal, and standard primal-dual relations (Wang et al., 2021).
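
When $L$ is unknown, a standard backtracking test accepts a trial step only if the local quadratic upper bound holds at the candidate point and shrinks $\eta$ otherwise. A minimal sketch of that test (the helper signatures, in particular prox_g(v, t), are assumptions for illustration):

import numpy as np

def backtracking_prox_step(y, f, grad_f, prox_g, eta, shrink=0.5):
    # Shrink eta until f(x_new) <= f(y) + <grad f(y), x_new - y> + ||x_new - y||^2 / (2*eta),
    # then return the accepted proximal-gradient point and the step size used.
    g = grad_f(y)
    while True:
        x_new = prox_g(y - eta * g, eta)
        d = x_new - y
        if f(x_new) <= f(y) + g @ d + (d @ d) / (2.0 * eta):
            return x_new, eta
        eta *= shrink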

For nonconvex or inexact settings, ensuring global convergence requires objective coercivity and enforcing sufficient decrease or duality-gap control in each step (Yao et al., 2016, Li et al., 2017). APG with adaptive inexactness (e.g., shadow-point or stationarity-based stopping for prox-maps) can preserve $O(1/k^2)$ rates with only summable absolute errors (Yang et al., 29 Apr 2025). In the affine-quadratic regime, APG enjoys weak or strong convergence to the best approximation of the initial point in the solution set (Moursi et al., 9 Nov 2025).

6. Practical Considerations and Extensions

Selecting the momentum order (in the polynomial-APG), restart policies, gradient sampling rules, and projection/inexactness thresholds involves empirical and application-specific tuning:

  • For “high-order” APG in QP, $\alpha \in [5, 20]$ empirically provides the best wall-time reduction without overhead (Wang et al., 2021).
  • Adaptive restart based on monotonicity or gradient mappings is essential for robustness, particularly when parameters (e.g., strong convexity) are unknown (Henderson et al., 16 Aug 2025, Zhou et al., 2020).
  • In block-coordinate and high-dimensional contexts, greedy coordinate selection (e.g., the Gauss–Southwell rule) plus adaptive damping of momentum is effective (Lau et al., 2017); a minimal selection sketch follows this list.
  • For stochastic, splitting, and multiobjective settings, APG frameworks generalize seamlessly, maintaining theoretical and practical efficiency by coupling acceleration with sampling control or dual-variable updates (Bollapragada et al., 19 Jul 2025, Tanabe et al., 2022, Driggs et al., 25 Mar 2024).
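
As a small illustration of the greedy rule mentioned above, the Gauss–Southwell selection simply updates the coordinate with the largest gradient magnitude (a minimal sketch; the surrounding block-APG loop is omitted):

import numpy as np

def gauss_southwell_index(grad):
    # Gauss-Southwell rule: pick the coordinate whose partial derivative is largest in magnitude.
    return int(np.argmax(np.abs(grad)))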

7. Applications and Impact

APG steps underpin a wide array of large-scale optimization applications, especially in control (MPC), statistical learning (LASSO, SVMs, matrix completion), signal processing, and computational imaging, domains where composite structure and high problem dimensionality are ubiquitous (Wang et al., 2021, Zhang et al., 2014, Lau et al., 2017, Driggs et al., 25 Mar 2024). APG-based solvers routinely outperform classical first-order and even some state-of-the-art interior-point methods (e.g., MOSEK and ECOS on small-to-moderate QPs), especially when the problem is amenable to proximal mappings and fast matrix-vector operations (Wang et al., 2021). Numerous recent studies have introduced further variants leveraging adaptive inexactness, high-order acceleration, manifold geometry, and multiobjective coupling, reflecting the breadth and ongoing evolution of APG theory and practice.
