Momentum-Accelerated Proximal AGD (AltGDAm)

Updated 13 April 2026
  • Momentum-Accelerated Proximal AGD (AltGDAm) is a block-coordinate method that integrates Nesterov-type and adaptive momentum for tackling nonconvex, nonsmooth composite optimization problems.
  • The algorithm employs block-wise splitting, adaptive extrapolation, and monotonicity enforcement under the Kurdyka–Łojasiewicz framework to guarantee global convergence and offer local linear rates.
  • It extends classical APG techniques to applications like high-dimensional sparse regression, matrix/tensor factorization, and adversarial learning, providing provable convergence and improved empirical performance.

Momentum-Accelerated Proximal AGD (AltGDAm), also referred to as accelerated block coordinate proximal gradient with adaptive momentum, denotes a class of block-coordinate algorithms for nonconvex, nonsmooth composite optimization that incorporate Nesterov-type, adaptive, or generalized momentum into the proximal gradient or proximal linear framework. These methods combine block-wise splitting, adaptive extrapolation, and monotonicity enforcement within the structure of the Kurdyka–Łojasiewicz (KL) convergence analysis, yielding provable global convergence and, under appropriate settings, local linear rates for a broad range of high-dimensional statistical learning and estimation problems. The approach subsumes and extends the classical accelerated proximal gradient (APG/FISTA) updates to block-separable, nonconvex, and even minimax or saddle-point settings.

1. Formal Problem Statement and Assumptions

Momentum-accelerated proximal AGD (AltGDAm) addresses composite minimization problems of the form

$$\min_{x \in \mathbb{R}^n} \; F(x) := f(x) + \sum_{i=1}^{s} g_i(x_i),$$

where the variable $x$ is partitioned into $s$ blocks, $f$ is continuous and differentiable (possibly nonconvex), and each $g_i$ is proper, lower semicontinuous, and possibly nonsmooth or nonconvex, acting only on its own block $x_i$ (Lau et al., 2017).

The key assumptions are:

  • $F$ is bounded below and admits at least one critical point $x^*$, i.e., $0 \in \partial F(x^*)$.
  • Each block partial gradient $x_i \mapsto \nabla_{x_i} f(x_{\neg i}, x_i)$ is $L_i$-Lipschitz.
  • $F$ satisfies the KL property at every cluster point of the iterate sequence.
  • Every block is updated at least once within any window of $T$ successive steps, for some fixed $T \ge 1$ (essentially cyclic updates).

This framework accommodates important regularized regression objectives (Lasso, group Lasso, capped-$\ell_1$, SCAD), matrix/tensor factorization with nonsmooth sparsity-inducing penalties, robust minimax learning, and regularized image reconstruction.
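As a concrete instance of this template, the Lasso objective $\tfrac12\|Ax-b\|^2 + \lambda\|x\|_1$ fits the composite form above, with smooth $f$ and a block-separable $\ell_1$ term whose proximal operator is soft-thresholding. A minimal sketch (all function names are illustrative, not from the cited papers):

```python
import numpy as np

def make_lasso(A, b, lam, blocks):
    """Composite objective F(x) = f(x) + sum_i g_i(x_i) for the Lasso,
    split into coordinate blocks (each block an index array)."""
    f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)                 # smooth part f
    grad_block = lambda x, i: A[:, blocks[i]].T @ (A @ x - b)    # block partial gradient
    g = lambda x: lam * np.sum(np.abs(x))                        # nonsmooth part, sum_i g_i
    # prox of alpha * lam * ||.||_1 is componentwise soft-thresholding
    prox_block = lambda v, alpha: np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)
    # block Lipschitz constants L_i = ||A_i||_2^2 (squared spectral norm of the sub-matrix)
    L = [np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks]
    return f, grad_block, g, prox_block, L
```

Each $g_i$ here touches only its own block, so the proximal map decomposes block-wise, which is exactly what the alternating updates in Section 2 exploit.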

2. Algorithmic Structure and Update Rules

AltGDAm iterates alternating, block-wise, momentum-accelerated proximal-gradient steps. Writing $x^k$ for the iterate at step $k$ and $i = i_k$ for the active block, the generic iteration includes:

Block selection (Gauss–Southwell, uniform random, or cyclic rules): choose the block $i_k$ to update; under the Gauss–Southwell rule, the block with the largest gradient-mapping norm.

Momentum extrapolation (for the selected block $i$): $\hat{x}_i^k = x_i^k + \beta_i^k (x_i^k - x_i^{k-1})$, where $\beta_i^k \in [0, 1)$ is the block momentum parameter.

Proximal-gradient step (current block $i$): $x_i^{k+1} = \mathrm{prox}_{\alpha_i g_i}\big(\hat{x}_i^k - \alpha_i \nabla_{x_i} f(x_{\neg i}^k, \hat{x}_i^k)\big)$, with all other blocks held fixed.

Accelerated candidate: the extrapolated update above is compared against the plain (non-extrapolated) proximal-gradient step taken from $x_i^k$.

Momentum adaptation (monotonicity enforcement): the extrapolated candidate is accepted only if it decreases the objective (or a surrogate); otherwise the algorithm falls back to the plain step. The Nesterov-type growth or shrinkage rule for the block-momentum parameter $\beta_i^k$ prevents divergence due to momentum overshoot.

Typical step-size selection is $\alpha_i = 1/L_i$, or a backtracking rule when $L_i$ is unknown (Lau et al., 2017).
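The update rules above can be sketched end-to-end for a Lasso instance, with cyclic block selection, monotone acceptance, and momentum shrinkage on rejection. This is a minimal illustrative sketch, not the exact algorithm of any cited paper; names, constants, and the halving rule are assumptions:

```python
import numpy as np

def alt_gdam(A, b, lam, blocks, iters=200, beta0=0.9):
    """Block-wise momentum-accelerated proximal gradient (sketch).

    Cyclic block selection; the extrapolated step is accepted only if it
    does not increase F, otherwise we fall back to the plain prox step
    and shrink that block's momentum parameter."""
    n = A.shape[1]
    x = np.zeros(n)
    x_prev = x.copy()
    beta = [beta0] * len(blocks)                 # per-block momentum parameters
    L = [np.linalg.norm(A[:, idx], 2) ** 2 for idx in blocks]
    F = lambda z: 0.5 * np.sum((A @ z - b) ** 2) + lam * np.sum(np.abs(z))
    prox = lambda v, a: np.sign(v) * np.maximum(np.abs(v) - a * lam, 0.0)

    for k in range(iters):
        i = k % len(blocks)                      # cyclic block selection
        idx, a = blocks[i], 1.0 / L[i]
        # momentum extrapolation on block i
        y = x.copy()
        y[idx] = x[idx] + beta[i] * (x[idx] - x_prev[idx])
        # prox-gradient step at the extrapolated point
        cand = x.copy()
        cand[idx] = prox(y[idx] - a * (A[:, idx].T @ (A @ y - b)), a)
        # plain (non-extrapolated) prox-gradient step as fallback
        plain = x.copy()
        plain[idx] = prox(x[idx] - a * (A[:, idx].T @ (A @ x - b)), a)
        # monotonicity enforcement with momentum shrinkage
        if F(cand) <= F(plain):
            x_prev, x = x, cand
        else:
            x_prev, x = x, plain
            beta[i] *= 0.5                       # shrink momentum after overshoot
    return x
```

With $A = I$ the method reduces to exact block soft-thresholding, which makes the monotone-acceptance logic easy to sanity-check.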

3. Theoretical Guarantees: Global and Local Rates

Under the above assumptions, AltGDAm exhibits the following convergence properties (Lau et al., 2017, Yang et al., 2023, Li et al., 2017):

  • Global convergence: The iterates converge to a critical point $x^*$, i.e., $x^k \to x^*$ and $0 \in \partial F(x^*)$.
  • Subgradient residual decay: There exists $w^{k+1} \in \partial F(x^{k+1})$ with $\|w^{k+1}\| \le C\,\|x^{k+1} - x^k\|$ for a constant $C > 0$, and the telescoping argument yields $\min_{1 \le k \le K} \|w^k\| = O(1/\sqrt{K})$.
  • KL-based local convergence rate: If $F$ satisfies the KL property at $x^*$ with exponent $\theta \in (0, 1/2]$ (as for Lasso, group Lasso, SCAD), then $\|x^k - x^*\| \le c\,\rho^k$ for some $\rho \in (0,1)$, i.e., local R-linear convergence (Lau et al., 2017). For $\theta \in (1/2, 1)$, sublinear rates $\|x^k - x^*\| = O\big(k^{-(1-\theta)/(2\theta-1)}\big)$ are implied (Yang et al., 2023, Li et al., 2017).
| Regime | KL exponent $\theta$ | Rate |
|---|---|---|
| Finite steps | $\theta = 0$ | termination in finitely many steps |
| Linear | $\theta \in (0, 1/2]$ | $O(\rho^k)$ for some $\rho \in (0,1)$ |
| Sublinear | $\theta \in (1/2, 1)$ | $O\big(k^{-(1-\theta)/(2\theta-1)}\big)$ |
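These regimes all stem from the standard KL (desingularization) inequality with desingularizing function $\phi(t) = c\,t^{1-\theta}$, which for reference reads:

```latex
% KL property at x^* with exponent \theta \in [0,1):
% there exist c > 0 and a neighborhood U of x^* such that, for all x in U
% with F(x^*) < F(x),
\phi'\big(F(x) - F(x^*)\big)\,\operatorname{dist}\big(0, \partial F(x)\big) \ge 1,
\qquad \phi(t) = c\, t^{1-\theta}.
```

Combining this inequality with the sufficient-decrease and subgradient bounds of the algorithm gives a one-step recursion on $F(x^k) - F(x^*)$ whose solution produces exactly the finite/linear/sublinear trichotomy above.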

In block-minimax contexts, momentum-accelerated alternating GDA with proximal steps achieves improved iteration complexity, with a better dependence on the problem's condition number, for nonconvex-strongly-concave saddle-point problems than prior alternating schemes (Chen et al., 2021).
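A minimal sketch of the alternating GDA-with-momentum pattern on a toy quadratic saddle problem $\min_x \max_y\; xy + \tfrac{x^2}{2} - \tfrac{y^2}{2}$ (strongly concave in $y$, saddle at the origin). The heavy-ball-style momentum terms, step sizes, and function names are illustrative assumptions, not the exact scheme of Chen et al. (2021):

```python
import numpy as np

def altgda_momentum(grad_x, grad_y, x0, y0, eta=0.2, m=0.1, iters=500):
    """Alternating gradient descent-ascent with heavy-ball momentum (sketch).
    x takes a descent step first; y's ascent step then uses the fresh x."""
    x, y = x0, y0
    x_prev, y_prev = x0, y0
    for _ in range(iters):
        x_new = x - eta * grad_x(x, y) + m * (x - x_prev)   # descent + momentum
        y_new = y + eta * grad_y(x_new, y) + m * (y - y_prev)  # ascent + momentum
        x_prev, y_prev = x, y
        x, y = x_new, y_new
    return x, y

# Toy saddle problem: f(x, y) = x*y + x^2/2 - y^2/2
gx = lambda x, y: x + y          # df/dx
gy = lambda x, y: x - y          # df/dy
```

The key structural point is the *alternation*: the ascent step sees the already-updated `x_new`, which is what distinguishes AltGDA-type schemes from simultaneous GDA.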

4. Connections, Variants, and Adaptive Extensions

Momentum-accelerated proximal AGD has several algorithmic variants and connections:

  • Adaptive Restart and Momentum Scheduling: Algorithms such as APG-restart couple momentum with flexible restart based on function-value or gradient-mapping tests; this prevents momentum-induced oscillations and enhances empirical convergence, maintaining global sublinear convergence rates even in the nonconvex, nonsmooth regime (Zhou et al., 2020).
  • Generalized Nesterov Momentum: Extensions to power-parameterized momentum schemes (interpolating between classical and slower-growing Nesterov momentum sequences) provide improved robustness and order-optimal convergence in convex smooth preconditioned settings (Lin et al., 2024).
  • Block-Coordinate Generality: AltGDAm encompasses both cyclic and random block ordering (see the ABPL family of methods (Yang et al., 2023)) and applies to fully nonconvex, nonsmooth matrix/tensor factorization, where adaptivity and monotonicity are preserved without loss of convergence guarantees.
  • Monotonicity Enforcement: Monotonicity checks on the objective (or a surrogate) after extrapolation—accepting the extrapolated point only if the function value decreases—are a universal, analytically critical safeguard (Lau et al., 2017, Li et al., 2017).
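The restart/monotonicity idea in the bullets above amounts to a simple acceptance test wrapped around any accelerated step. A sketch using a function-value test (illustrative only; not the exact criteria of Zhou et al., 2020):

```python
def restart_if_nonmonotone(F, x, x_cand, t, beta):
    """Function-value restart test: accept the accelerated candidate only if
    it decreases F; otherwise reset the momentum state (t -> 1, beta -> 0)
    so the next step is a plain, unaccelerated proximal-gradient step."""
    if F(x_cand) < F(x):
        return x_cand, t, beta       # keep acceleration
    return x, 1.0, 0.0               # restart: drop candidate, kill momentum
```

A gradient-mapping test can replace the function-value test when evaluating $F$ is expensive; both serve the same analytical purpose of preserving sufficient decrease.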

5. Empirical and Practical Behavior

Numerical experiments highlight AltGDAm's advantages:

  • High-dimensional sparse regression: For $\ell_1$-regularized least squares (Lasso), group Lasso, capped-$\ell_1$, and SCAD objectives partitioned into coordinate blocks, AltGDAm (with Gauss–Southwell or random block selection) consistently outperforms APG, mAPG, BPL, and APGnc in both early and late stages of convergence. The Gauss–Southwell rule, in particular, yields the fastest empirical decay of the optimality gap (Lau et al., 2017).
  • NMF/Tensor Decomposition: ABPL-type methods, a practical AltGDAm implementation, demonstrate superiority over PALM, iPALM, and related block methods for sparsity-constrained nonnegative matrix and tensor factorization (Yang et al., 2023).
  • Adversarial Learning: In robust classification minimax setups (e.g., Wasserstein robust MNIST), momentum-proximal AltGDAm accelerates both objective reduction and robust accuracy relative to GDA, prox-AltGDA, and primal-dual baselines (Chen et al., 2021).
  • Preconditioned Variants: In image reconstruction, AltGDAm with EM-type preconditioning and generalized Nesterov momentum achieves order-optimal rates in objective value, robust to both aggressive and conservative momentum parameters (Lin et al., 2024).

6. Extensions and Analytical Framework

  • Generalization to Multiple Nondifferentiable Terms: The methodology naturally extends to settings with multiple nonsmooth separable regularizers, via block-wise fixed-point and proximal splitting (Lin et al., 2024).
  • Inexact Proximal Computations: The proof template accommodates inexact ($\varepsilon$-approximate) proximal updates, with all main results unchanged provided the error is dominated by the update distance (Yang et al., 2023, Li et al., 2017).
  • KL-based Theory: All main convergence theorems fundamentally depend on the KL property, which bridges nonconvexity and non-smoothness; the method exploits sufficient decrease and subgradient bounds to invoke KL-based Lyapunov analysis, leading to precise asymptotic rates (Lau et al., 2017, Yang et al., 2023, Li et al., 2017).
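The KL-based Lyapunov analysis referenced above rests on two standard per-iteration estimates, which the monotonicity and momentum safeguards are designed to preserve:

```latex
% (H1) Sufficient decrease: for some a > 0,
F(x^{k+1}) \le F(x^k) - a\,\|x^{k+1} - x^k\|^2;
% (H2) Relative error: for some b > 0 there exists w^{k+1} \in \partial F(x^{k+1}) with
\|w^{k+1}\| \le b\,\|x^{k+1} - x^k\|.
% Together with the KL inequality at cluster points, (H1)-(H2) imply finite length
% of the iterate sequence and hence convergence to a single critical point.
```

Momentum overshoot can violate (H1), which is precisely why the acceptance/shrinkage rules are not optional heuristics but part of the convergence proof.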

7. Comparative Perspective

Relative to classic APG/FISTA frameworks, AltGDAm distinctively enables scalable, provably convergent optimization of high-dimensional nonconvex problems by:

  • Supporting block-wise and adaptive update schedules.
  • Allowing nonsmooth, nonconvex, and separable regularizers without bounded-domain constraints.
  • Unifying monotonicity enforcement and adaptive momentum via restart/shrinkage without complex tuning requirements (Lau et al., 2017, Zhou et al., 2020, Yang et al., 2023).
  • Delivering improved empirical and provable convergence over standard block proximal/alternating and non-accelerated block methods, particularly in ill-conditioned or overparameterized regimes.

The method therefore constitutes a rigorous foundation for large-scale, sparse, and structured learning in contemporary computational statistics and machine learning.
