
Accelerated Proximal-Gradient Steps

Updated 16 December 2025
  • Accelerated Proximal-Gradient (APG) methods are first-order algorithms that combine momentum-based extrapolation with proximal mappings to accelerate convergence in composite optimization problems.
  • They achieve an O(1/k²) convergence rate for convex scenarios and can be tailored with adaptive restart and high-order strategies for strongly convex, nonconvex, and stochastic settings.
  • APG steps are widely applied in control, statistical learning, and imaging, efficiently addressing large-scale problems through structured momentum scheduling and proximal operations.

The Accelerated Proximal-Gradient (APG) method encompasses a class of first-order optimization algorithms designed to efficiently solve large-scale convex (and, in extensions, nonconvex) composite minimization problems of the form

$$\min_x\, F(x) = f(x) + g(x),$$

where $f$ is typically smooth and $g$ is convex (possibly nonsmooth) or an indicator function for constraints. APG methods, often termed Fast Iterative Shrinkage-Thresholding Algorithms (FISTA), employ momentum-based extrapolation combined with proximal mappings to achieve an accelerated convergence rate over standard proximal-gradient schemes. APG steps have been generalized to structured settings, higher-order polynomial schedules, dual and multiobjective problems, and have inspired variants for stochastic, nonconvex, Riemannian, and block-coordinate frameworks.
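
For instance, with $g(x) = \lambda \|x\|_1$ (the LASSO regularizer), the proximal map has a closed-form soft-thresholding expression; the following minimal NumPy sketch is illustrative only (the function name and test vector are not taken from the cited papers):

import numpy as np

def prox_l1(v, t):
    # Soft-thresholding: the proximal map of t * ||x||_1 evaluated at v.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Each entry is shrunk toward zero by t; entries with |v_i| <= t become exactly zero.
v = np.array([3.0, -0.2, 0.5, -1.5])
print(prox_l1(v, t=1.0))   # approximately [ 2.  -0.   0.  -0.5]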

1. Formal Structure of the APG Step

The canonical APG step operates by generating a sequence of auxiliary (extrapolated) iterates using carefully calibrated momentum coefficients, followed by a proximal-gradient update. Let

  • $\eta$ denote the step-size (commonly $\eta = 1/L$, with $L$ the Lipschitz constant of $\nabla f$);
  • $\{\theta_k\}$, or sequences derived from roots of high-order polynomial recurrence relations, serve as momentum parameters.

The standard APG/FISTA iteration (for convex $f$ and closed proper convex $g$) is

$$\begin{align*} y_k &= x_k + \beta_k (x_k - x_{k-1}), \\ x_{k+1} &= \operatorname{prox}_{\eta g}\!\big(y_k - \eta \nabla f(y_k)\big), \end{align*}$$

where $\beta_k$ is the momentum weight. In the FISTA variant, $\beta_k = \frac{t_{k-1} - 1}{t_k}$ with $t_k = \frac{1 + \sqrt{1 + 4 t_{k-1}^2}}{2}$ and initialization $t_0 = 1$ (Henderson et al., 16 Aug 2025, Driggs et al., 25 Mar 2024).
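
As a concrete illustration of this schedule, the first few FISTA weights can be printed with a short script (a minimal sketch; variable names are illustrative):

import math

t_prev = 1.0                                   # t_0 = 1
for k in range(1, 6):
    t = (1.0 + math.sqrt(1.0 + 4.0 * t_prev**2)) / 2.0
    beta = (t_prev - 1.0) / t                  # beta_k = (t_{k-1} - 1) / t_k
    print(f"k={k}: t_k={t:.4f}, beta_k={beta:.4f}")
    t_prev = t
# beta_k is approximately 0, 0.2818, 0.4340, ... and approaches 1, which drives the acceleration.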

Advanced variants (e.g., in model predictive control QP) employ momentum parameters $\mu_p$ drawn from $\alpha$-order polynomial recurrences,

$$\mu_{p+1}^\alpha - \mu_{p+1}^{-1} - \mu_p^\alpha = 0, \qquad \mu_1 = 1,$$

enabling an iterate-extrapolated APG update with an improved convergence bound (Wang et al., 2021).
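
In practice the roots $\mu_p$ can be precomputed offline into a lookup table by solving the scalar recurrence numerically, e.g. by bisection. The sketch below assumes the recurrence exactly as stated above; the function name and tolerances are illustrative, not taken from (Wang et al., 2021):

def mu_table(P, alpha):
    # Returns [mu_1, ..., mu_P] with mu_1 = 1 and
    # mu_{p+1}^alpha - mu_{p+1}^(-1) - mu_p^alpha = 0 for each p.
    mus = [1.0]
    for _ in range(P - 1):
        mu_p = mus[-1]
        g = lambda m: m**alpha - 1.0 / m - mu_p**alpha   # increasing for m > 0 when alpha >= 1
        lo, hi = mu_p, mu_p + 1.0
        while g(hi) < 0.0:                               # expand until the root is bracketed
            hi += 1.0
        for _ in range(60):                              # plain bisection
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if g(mid) < 0.0 else (lo, mid)
        mus.append(0.5 * (lo + hi))
    return mus

print(mu_table(P=5, alpha=2))   # e.g. [1.0, 1.3247..., ...]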

2. Convergence Rates and Momentum Scheduling

Convergence rate analysis is central to APG methodology. For classical convex objectives,

$$F(x_k) - F(x^*) = O(1/k^2).$$

If $f$ is strongly convex, a geometric rate is recovered (Bollapragada et al., 19 Jul 2025, Zhu et al., 24 Jul 2025). In the “high-order” APG scheme for quadratic programs,

$$\| \lambda^p - \lambda^* \|_2^2 \leq C \cdot \frac{L\, \| \lambda^0 - \lambda^* \|_2^2}{(p + \alpha - 1)^\alpha},$$

with $\alpha \geq 2$ a user parameter and $\mu_p$ selected accordingly (Wang et al., 2021). Practically, increasing $\alpha$ tightens the theoretical rate to $O(1/p^\alpha)$, but marginal wall-time improvements saturate at moderate $\alpha$ (see empirical thresholds in Fig. 4 of (Wang et al., 2021)).
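
Rearranging the bound gives a rough iteration estimate for a target dual accuracy $\epsilon$ (a direct consequence of the stated rate, not a separate result from the cited paper):

$$\| \lambda^p - \lambda^* \|_2^2 \leq \epsilon \quad \text{whenever} \quad p \;\geq\; \left( \frac{C\, L\, \| \lambda^0 - \lambda^* \|_2^2}{\epsilon} \right)^{1/\alpha} - \alpha + 1,$$

so the required number of iterations scales like $\epsilon^{-1/\alpha}$, which is the sense in which larger $\alpha$ tightens the theoretical rate.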

Adaptive and restart schemes have been developed to deal with unknown regularity (e.g., Hölderian error bounds), strong convexity parameters, or to avoid “overshooting” (oscillatory behavior) (Henderson et al., 16 Aug 2025, Liu et al., 2016, Zhou et al., 2020). Typical restart triggers are based on objective nonmonotonicity, inner-product (gradient mapping) conditions, or directly on stationary measure decreases.
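
A common concrete instance is the inner-product (gradient-mapping) test: momentum is reset whenever the extrapolation direction opposes the latest proximal-gradient step. A minimal sketch of that check (names and the reset convention are illustrative):

import numpy as np

def should_restart(y_k, x_next, x_k):
    # Restart when (y_k - x_{k+1}) . (x_{k+1} - x_k) > 0, i.e. when the momentum
    # has carried the iterate past the point where progress stalls or oscillates.
    return float(np.dot(y_k - x_next, x_next - x_k)) > 0.0

# On a restart, a typical policy resets t_k to 1 (so the next beta_k is 0) and continues.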

3. Generalizations and Specializations of APG

APG-type steps have been extended to:

  • Dual problems and quadratic programming: Primal MPC QPs are solved efficiently by dualizing to an unconstrained form and conducting APG in the dual multipliers, with projection onto the nonnegative orthant as the prox-map (Wang et al., 2021).
  • Stochastic and adaptive-gradient settings: APG retains optimal rates when gradient oracles are replaced by adaptive sample-average estimators, provided the variance and bias are controlled according to a state-dependent policy (e.g., Condition 2.1) (Bollapragada et al., 19 Jul 2025, Zhu et al., 24 Jul 2025).
  • Multiobjective and vector optimization: APG has been formulated for vector-valued objectives, using scalar merit functions (e.g., $\sup_z \min_i \{ F_i(x) - F_i(z) \}$) and subproblems involving max-over-indices, with efficient dual representations (Tanabe et al., 2022, Huang, 9 Jul 2025).
  • Riemannian optimization: Geodesic convexity and retraction-convexity conditions permit Riemannian APG steps, with all momentum and proximal operations mapped via exponential/retraction and parallel transport (Feng et al., 26 Sep 2025).
  • Block-coordinate/nonconvex settings: Acceleration and monotonicity principles extend to block-updates and to nonconvex settings, often requiring adaptively tamed momentum and rigorous monotonicity checks (Lau et al., 2017, Li et al., 2017, Yao et al., 2016).

4. Implementation and Pseudocode Structures

The algorithmic structure of APG depends on the application context. For the model predictive control QP (Wang et al., 2021), the APG pseudocode is as follows:

initialize: λ⁰ = 0, λ¹ = 0, μ₁ = 1; precompute lookup table T[1..P] of the momentum roots μₚ
for p = 1, 2, ..., P:
    μₚ₊₁ = T[p+1]
    βₚ = (μₚ − 1) / μₚ₊₁
    yᵖ = λᵖ + βₚ * (λᵖ − λᵖ⁻¹)
    grad = A H⁻¹ (Aᵀ yᵖ + G) + B
    λᵖ⁺¹ = max{0, yᵖ − (1/L) * grad}
    if ‖λᵖ⁺¹ − λᵖ‖ < ε: break
uᵖ = H⁻¹ (−Aᵀ λᵖ − G)
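
The gradient expression in the loop is the negated dual gradient of the primal QP; assuming the primal has the form $\min_u \tfrac{1}{2} u^\top H u + G^\top u$ subject to $A u \leq B$ with $H \succ 0$ (a form inferred from the symbols above rather than quoted from (Wang et al., 2021)), the inner minimization and dual gradient are

$$u(\lambda) = -H^{-1}(A^\top \lambda + G), \qquad -\nabla d(\lambda) = A H^{-1}(A^\top \lambda + G) + B,$$

so the proximal map reduces to the projection $\max\{0, \cdot\}$ onto the nonnegative orthant and $L$ can be taken as $\|A H^{-1} A^\top\|_2$.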

For composite convex problems, generic APG pseudocode is:

initialize: x₁ = x₀ (starting point); t₀ = 1
for k = 1, 2, 3, ...:
    tₖ = (1 + sqrt(1 + 4 tₖ₋₁²)) / 2
    βₖ = (tₖ₋₁ − 1) / tₖ
    yₖ = xₖ + βₖ * (xₖ − xₖ₋₁)
    xₖ₊₁ = prox_{ηg}(yₖ − η ∇f(yₖ))

The actual implementation nuances—momentum selection, adaptive restart, high-order updates—depend on the domain and convergence regime.
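
As one concrete, runnable instance of the generic scheme, the following NumPy sketch applies the APG/FISTA step to a small LASSO problem $\min_x \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$; the problem data, iteration budget, and parameter choices are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 100))
x_true = np.zeros(100); x_true[:5] = rng.standard_normal(5)
b = A @ x_true + 0.01 * rng.standard_normal(40)
lam = 0.1

L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad f for f(x) = 0.5*||Ax - b||^2
eta = 1.0 / L                              # step size eta = 1/L

def grad_f(x):                             # gradient of the smooth part
    return A.T @ (A @ x - b)

def prox_l1(v, t):                         # prox of t*||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x, x_prev, t_prev = np.zeros(100), np.zeros(100), 1.0
for k in range(500):
    t = (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2)) / 2.0
    beta = (t_prev - 1.0) / t                               # FISTA momentum weight
    y = x + beta * (x - x_prev)                             # extrapolated point
    x_prev, x = x, prox_l1(y - eta * grad_f(y), eta * lam)  # proximal-gradient step at y
    t_prev = t

obj = 0.5 * np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x, 1)
print(f"objective {obj:.4f}, nonzeros {int((np.abs(x) > 1e-6).sum())}")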

5. Assumptions and Theoretical Guarantees

Classical APG analysis is predicated on:

  • Convexity of $f$ and $g$ (or indicator structure for constraints);
  • Lipschitz continuity of $\nabla f$, with known $L$, or access to local backtracking for the step size (see the sketch after this list);
  • For dual QP: strict feasibility (Slater's condition), positive definite quadratic term in the primal, and standard primal-dual relations (Wang et al., 2021).
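
When $L$ is unknown, a standard backtracking test accepts a trial step only if the local quadratic upper bound holds at the candidate point and shrinks $\eta$ otherwise. A minimal sketch of that test (the helper signatures, in particular prox_g(v, t), are assumptions for illustration):

import numpy as np

def backtracking_prox_step(y, f, grad_f, prox_g, eta, shrink=0.5):
    # Shrink eta until f(x_new) <= f(y) + <grad f(y), x_new - y> + ||x_new - y||^2 / (2*eta),
    # then return the accepted proximal-gradient point and the step size used.
    g = grad_f(y)
    while True:
        x_new = prox_g(y - eta * g, eta)
        d = x_new - y
        if f(x_new) <= f(y) + g @ d + (d @ d) / (2.0 * eta):
            return x_new, eta
        eta *= shrink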

For nonconvex or inexact settings, ensuring global convergence requires objective coercivity and enforcing sufficient decrease or duality-gap control in each step (Yao et al., 2016, Li et al., 2017). APG with adaptive inexactness (e.g., shadow-point or stationarity-based stopping for prox-maps) can preserve $O(1/k^2)$ rates with only summable absolute errors (Yang et al., 29 Apr 2025). In the affine-quadratic regime, APG enjoys weak or strong convergence to the best approximation of the initial point in the solution set (Moursi et al., 9 Nov 2025).

6. Practical Considerations and Extensions

Selecting the momentum order (in the polynomial-APG), restart policies, gradient sampling rules, and projection/inexactness thresholds involves empirical and application-specific tuning:

  • For “high-order” APG in QP, $\alpha \in [5, 20]$ empirically provides the best wall-time reduction without overhead (Wang et al., 2021).
  • Adaptive restart based on monotonicity or gradient mappings is essential for robustness, particularly when parameters (e.g., strong convexity) are unknown (Henderson et al., 16 Aug 2025, Zhou et al., 2020).
  • In block-coordinate and high-dimensional contexts, greedy coordinate selection (e.g., the Gauss–Southwell rule) plus adaptive damping of momentum is effective (Lau et al., 2017); a minimal selection sketch follows this list.
  • For stochastic, splitting, and multiobjective settings, APG frameworks generalize seamlessly, maintaining theoretical and practical efficiency by coupling acceleration with sampling control or dual-variable updates (Bollapragada et al., 19 Jul 2025, Tanabe et al., 2022, Driggs et al., 25 Mar 2024).
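
As a small illustration of the greedy rule mentioned above, the Gauss–Southwell selection simply updates the coordinate with the largest gradient magnitude (a minimal sketch; the surrounding block-APG loop is omitted):

import numpy as np

def gauss_southwell_index(grad):
    # Gauss-Southwell rule: pick the coordinate whose partial derivative is largest in magnitude.
    return int(np.argmax(np.abs(grad)))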

7. Applications and Impact

APG steps underpin a wide array of large-scale optimization applications, especially in control (MPC), statistical learning (LASSO, SVMs, matrix completion), signal processing, and computational imaging, domains where composite structure and high problem dimensionality are ubiquitous (Wang et al., 2021, Zhang et al., 2014, Lau et al., 2017, Driggs et al., 25 Mar 2024). APG-based solvers routinely outperform classical first-order and even some state-of-the-art interior-point methods (e.g., MOSEK and ECOS on small-to-moderate QPs), especially when the problem is amenable to proximal mappings and fast matrix-vector operations (Wang et al., 2021). Numerous recent studies have introduced further variants leveraging adaptive inexactness, high-order acceleration, manifold geometry, and multiobjective coupling, reflecting the breadth and ongoing evolution of APG theory and practice.
