Momentum-Accelerated Proximal AGD (AltGDAm)
- Momentum-Accelerated Proximal AGD (AltGDAm) is a block-coordinate method that integrates Nesterov-type and adaptive momentum for tackling nonconvex, nonsmooth composite optimization problems.
- The algorithm employs block-wise splitting, adaptive extrapolation, and monotonicity enforcement under the Kurdyka-Łojasiewicz framework to guarantee global convergence and offer local linear rates.
- It extends classical APG techniques to applications like high-dimensional sparse regression, matrix/tensor factorization, and adversarial learning, providing provable convergence and improved empirical performance.
Momentum-Accelerated Proximal AGD (AltGDAm), also referred to as accelerated block coordinate proximal gradient with adaptive momentum, denotes a class of block-coordinate algorithms for nonconvex, nonsmooth composite optimization that incorporate Nesterov-type, adaptive, or generalized momentum into the proximal gradient or proximal linear framework. These methods combine block-wise splitting, adaptive extrapolation, and monotonicity enforcement within the structure of the Kurdyka-Łojasiewicz (KL) convergence analysis, yielding provable global convergence and, under appropriate settings, local linear rates for a broad range of high-dimensional statistical learning and estimation problems. The approach subsumes and extends the classical accelerated proximal gradient (APG/FISTA) updates to block-separable, nonconvex, and even minimax or saddle-point settings.
1. Formal Problem Statement and Assumptions
Momentum-accelerated proximal AGD (AltGDAm) addresses composite minimization problems of the form

$$\min_{x = (x_1, \ldots, x_s)} \; F(x) = f(x_1, \ldots, x_s) + \sum_{j=1}^{s} r_j(x_j),$$

where the variable $x$ is partitioned into $s$ blocks, $f$ is continuous and differentiable (possibly nonconvex), and each $r_j$ is proper, lower semicontinuous, block-separable, and possibly nonsmooth or nonconvex (Lau et al., 2017).
The key assumptions are:
- $F$ is bounded below and admits at least one critical point $x^*$, i.e., $0 \in \partial F(x^*)$.
- Each block partial gradient $\nabla_{x_j} f$ is $L_j$-Lipschitz in $x_j$.
- $F$ satisfies the KL property at every cluster point.
- Every block is updated at least once within any window of $T$ successive steps (essentially cyclic updates).
This framework accommodates important regularized regression objectives (Lasso, group Lasso, capped-$\ell_1$, SCAD), matrix/tensor factorization with nonsmooth penalties, robust minimax learning, and regularized image reconstruction.
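As a concrete, purely illustrative instance of this problem class (names and data are hypothetical, not drawn from the cited papers), the following Python snippet evaluates a two-block composite objective with a smooth least-squares coupling, an $\ell_1$ penalty on one block, and a group ($\ell_2$-norm) penalty on the other:

```python
import numpy as np

def composite_objective(A1, A2, b, x1, x2, lam1, lam2):
    """F(x1, x2) = f(x1, x2) + r1(x1) + r2(x2) for a two-block example.

    f is a smooth least-squares coupling of the blocks; r1 is an l1 penalty,
    r2 is a group (l2-norm) penalty. Purely illustrative of the problem class.
    """
    residual = A1 @ x1 + A2 @ x2 - b       # smooth coupling term
    f = 0.5 * np.dot(residual, residual)   # differentiable part of the objective
    r1 = lam1 * np.sum(np.abs(x1))         # nonsmooth, separable penalty on block 1
    r2 = lam2 * np.linalg.norm(x2)         # nonsmooth group penalty on block 2
    return f + r1 + r2
```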
2. Algorithmic Structure and Update Rules
AltGDAm iterates alternating, block-wise, momentum-accelerated proximal-gradient steps. The generic iteration includes:
Block selection (Gauss–Southwell, uniform, or cyclic rules): choose the block $j_k$ to update, e.g., greedily by the largest partial-gradient norm (Gauss–Southwell), uniformly at random, or in a fixed cyclic order.
Momentum extrapolation (for the selected block $j_k$): $\hat{x}_{j_k}^{k} = x_{j_k}^{k} + \beta_{j_k}^{k}\,\big(x_{j_k}^{k} - x_{j_k}^{k-1}\big)$.
Proximal-gradient step (current block $j_k$, other blocks held fixed): $z_{j_k}^{k+1} = \operatorname{prox}_{\eta_{j_k} r_{j_k}}\!\big(\hat{x}_{j_k}^{k} - \eta_{j_k}\,\nabla_{x_{j_k}} f(\hat{x}^{k})\big)$, where $\hat{x}^{k}$ agrees with $x^{k}$ except that block $j_k$ takes the extrapolated value.
Accelerated candidate: $z_{j_k}^{k+1}$ serves as the accelerated candidate for the block update, and $x_{j_k}^{k+1} = z_{j_k}^{k+1}$ whenever the monotonicity test below passes.
Momentum adaptation (monotonicity enforcement): the accelerated candidate is accepted only if it does not increase the objective (or a suitable surrogate); otherwise the block falls back to a non-extrapolated proximal step and the momentum parameter is shrunk or restarted. This non-decreasing or shrinkage rule for the block-momentum parameter $\beta_{j_k}^{k}$ prevents divergence due to momentum overshoot.
Typical step size selection is $\eta_{j} \le 1/L_{j}$, the reciprocal of the block-wise Lipschitz constant (Lau et al., 2017).
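Below is a minimal, self-contained Python sketch of this iteration under simplifying assumptions (cyclic block selection, a fixed shrinkage factor on momentum failure, user-supplied callables `grad_block`, `prox_block`, and `objective`); it illustrates the generic scheme rather than reproducing the exact parameter schedules of the cited papers.

```python
def altgdam(x0, grad_block, prox_block, objective, step_sizes,
            beta0=0.9, shrink=0.5, n_iters=500):
    """Generic momentum-accelerated block proximal-gradient sketch.

    x0          : list of per-block arrays (initial point)
    grad_block  : grad_block(x, j) -> partial gradient of f w.r.t. block j
    prox_block  : prox_block(v, j, eta) -> prox of eta * r_j at v
    objective   : objective(x) -> F(x) = f(x) + sum_j r_j(x_j)
    step_sizes  : per-block step sizes eta_j (e.g., 1 / L_j)
    """
    x = [xj.copy() for xj in x0]
    x_prev = [xj.copy() for xj in x0]
    beta = [beta0] * len(x)                  # per-block momentum parameters
    F = objective(x)

    for k in range(n_iters):
        j = k % len(x)                       # cyclic block selection (could be Gauss-Southwell)

        # momentum extrapolation on block j
        x_hat = x[j] + beta[j] * (x[j] - x_prev[j])

        # proximal-gradient step at the extrapolated point, other blocks fixed
        trial = list(x)
        trial[j] = x_hat
        candidate = prox_block(x_hat - step_sizes[j] * grad_block(trial, j), j, step_sizes[j])

        trial[j] = candidate
        F_new = objective(trial)

        if F_new <= F:                       # monotone acceptance of the accelerated candidate
            x_prev[j], x[j], F = x[j], candidate, F_new
        else:                                # overshoot: shrink momentum, fall back to plain prox step
            beta[j] *= shrink
            plain = prox_block(x[j] - step_sizes[j] * grad_block(x, j), j, step_sizes[j])
            trial[j] = plain
            F_plain = objective(trial)
            if F_plain <= F:
                x_prev[j], x[j], F = x[j], plain, F_plain
    return x
```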
3. Theoretical Guarantees: Global and Local Rates
Under the above assumptions, AltGDAm exhibits the following convergence properties (Lau et al., 2017, Yang et al., 2023, Li et al., 2017):
- Global convergence: The iterates $\{x^k\}$ converge to a critical point $x^*$, i.e., $x^k \to x^*$ and $0 \in \partial F(x^*)$.
- Subgradient residual decay: There exists $c > 0$ such that $\operatorname{dist}\big(0, \partial F(x^{k+1})\big) \le c\,\|x^{k+1} - x^k\|$, and the telescoping argument yields $\min_{k \le K} \operatorname{dist}\big(0, \partial F(x^k)\big)^2 = O(1/K)$.
- KL-based local convergence rate: If $F$ satisfies the KL property at $x^*$ with exponent $\theta \in (0, 1/2]$ (as for Lasso, group Lasso, SCAD), then $F(x^k) - F(x^*) \le c\,\rho^k$ for some $\rho \in (0,1)$, i.e., local R-linear convergence (Lau et al., 2017). For $\theta \in (1/2, 1)$, sublinear rates $O\!\big(k^{-(1-\theta)/(2\theta-1)}\big)$ are implied (Yang et al., 2023, Li et al., 2017).
| Regime | KL Exponent $\theta$ | Rate |
|---|---|---|
| Finite steps | $\theta = 0$ | Finite termination |
| Linear | $\theta \in (0, 1/2]$ | $O(\rho^k)$, $\rho \in (0,1)$ |
| Sublinear | $\theta \in (1/2, 1)$ | $O\!\big(k^{-(1-\theta)/(2\theta-1)}\big)$ |
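For reference, the KL inequality with exponent $\theta$ at a critical point $x^*$, stated here in its standard generic form (not quoted from any single cited paper), reads

$$\big(F(x) - F(x^*)\big)^{\theta} \;\le\; c\,\operatorname{dist}\big(0, \partial F(x)\big)$$

for all $x$ in a neighborhood of $x^*$ with $F(x^*) < F(x) < F(x^*) + \epsilon$. Combining this inequality with the sufficient-decrease and subgradient-residual bounds of the iteration yields the finite, linear, or sublinear regimes above according to the value of $\theta$.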
In block-minimax contexts, momentum-accelerated alternating GDA with proximal steps achieves improved iteration complexity (a sharper dependence on the problem's condition number) for nonconvex-strongly-concave saddle point problems compared to prior non-accelerated alternating schemes (Chen et al., 2021).
4. Connections, Variants, and Adaptive Extensions
Momentum-accelerated proximal AGD has several algorithmic variants and connections:
- Adaptive Restart and Momentum Scheduling: Algorithms such as APG-restart couple momentum with flexible restarts based on function-value or gradient-mapping tests; this prevents momentum-induced oscillations and enhances empirical convergence, while maintaining global sublinear convergence guarantees even in the nonconvex, nonsmooth regime (Zhou et al., 2020). A minimal restart test is sketched after this list.
- Generalized Nesterov Momentum: Extensions to power-type (generalized) Nesterov momentum schemes, which interpolate parametrically between classical and slower-growth Nesterov momentum, provide improved robustness and high-order convergence in convex smooth preconditioned settings (Lin et al., 2024).
- Block-Coordinate Generality: AltGDAm encompasses both cyclic and random block ordering (see ABPL+ (Yang et al., 2023)) and applies to fully nonconvex, nonsmooth matrix/tensor factorization, where adaptivity and monotonicity are preserved without loss of convergence properties.
- Monotonicity Enforcement: Monotonicity checks on the objective (or a surrogate) after extrapolation, accepting the extrapolated point only if the function value decreases, are a universal and analytically critical tool (Lau et al., 2017, Li et al., 2017).
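As an illustration of this restart and monotonicity logic, the following Python sketch shows a simple function-value acceptance test with momentum shrinkage on failure; the specific threshold and shrinkage factor are illustrative assumptions, not the exact criteria of Zhou et al. (2020) or Lau et al. (2017).

```python
def accept_or_restart(F_candidate, F_current, beta, shrink=0.5, beta_min=0.0):
    """Function-value monotonicity test with momentum shrinkage on failure.

    Returns (accepted, new_beta): the extrapolated candidate is accepted only
    if it does not increase the objective; otherwise momentum is reduced
    (a simple restart/shrinkage rule; the exact schedule is scheme-dependent).
    """
    if F_candidate <= F_current:
        return True, beta                          # keep momentum, accept accelerated step
    return False, max(beta * shrink, beta_min)     # overshoot: shrink momentum / restart
```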
5. Empirical and Practical Behavior
Numerical experiments highlight AltGDAm's advantages:
- High-dimensional sparse regression: For $\ell_1$-regularized least squares, group Lasso, capped-$\ell_1$, and SCAD objectives (all with 2–3 blocks), AltGDAm (with Gauss–Southwell or random block selection) consistently outperforms APG, mAPG, BPL, and APGnc in both early and late stages of convergence. The Gauss–Southwell rule, in particular, yields the fastest empirical decay of the optimality gap (Lau et al., 2017); a sketch of this selection rule follows the list below.
- NMF/Tensor Decomposition: ABPL+, a practical AltGDAm implementation, outperforms PALM, iPALM, and related block methods for constrained nonnegative matrix and tensor factorization (Yang et al., 2023).
- Adversarial Learning: In robust classification minimax setups (e.g., Wasserstein-robust MNIST), momentum-proximal AltGDAm speeds up objective reduction and improves robust accuracy relative to GDA, prox-AltGDA, and primal-dual baselines (Chen et al., 2021).
- Preconditioned Variants: In image reconstruction, AltGDAm with EM-type preconditioning and generalized Nesterov momentum achieves order-optimal $O(1/k^2)$ rates in objective value, robust to both aggressive and conservative choices of the momentum parameter (Lin et al., 2024).
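The Gauss–Southwell rule referenced above can be sketched generically as greedy selection by partial-gradient norm; the Python snippet below is an illustrative, assumption-level sketch, not the exact rule of Lau et al. (2017).

```python
import numpy as np

def gauss_southwell_block(x_blocks, grad_block):
    """Greedy block selection: pick the block whose partial gradient is largest.

    x_blocks   : list of per-block arrays
    grad_block : grad_block(x_blocks, j) -> partial gradient w.r.t. block j
    """
    norms = [np.linalg.norm(grad_block(x_blocks, j)) for j in range(len(x_blocks))]
    return int(np.argmax(norms))   # index of the block to update next
```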
6. Extensions and Analytical Framework
- Generalization to Multiple Nondifferentiable Terms: The methodology naturally extends to settings with multiple nonsmooth separable regularizers, via block-wise fixed-point and proximal splitting (Lin et al., 2024).
- Inexact Proximal Computations: The proof template accommodates $\varepsilon$-approximate proximal updates, with all main results unchanged provided the error is dominated by the update distance (Yang et al., 2023, Li et al., 2017); a sketch of such an inexact update follows this list.
- KL-based Theory: All main convergence theorems fundamentally depend on the KL property, which bridges nonconvexity and non-smoothness; the method exploits sufficient decrease and subgradient bounds to invoke KL-based Lyapunov analysis, leading to precise asymptotic rates (Lau et al., 2017, Yang et al., 2023, Li et al., 2017).
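To make the inexactness notion concrete, here is a hedged Python sketch of an approximate proximal step for a differentiable regularizer, computed by a few inner gradient iterations and stopped by a crude error proxy; the inner step size and stopping rule are illustrative assumptions rather than the conditions used in the cited analyses.

```python
def inexact_prox(v, grad_r, eta, inner_steps=20, tol=1e-6, inner_lr=None):
    """Approximate prox of a differentiable regularizer r at point v.

    Solves min_u r(u) + ||u - v||^2 / (2*eta) by gradient descent on the
    subproblem, stopping early once inner progress falls below tol -- an
    epsilon-approximate proximal update in the sense used above.
    """
    if inner_lr is None:
        inner_lr = eta / 2.0                 # illustrative inner step size
    u = v.copy()
    for _ in range(inner_steps):
        g = grad_r(u) + (u - v) / eta        # gradient of the prox subproblem
        u_new = u - inner_lr * g
        if ((u_new - u) ** 2).sum() ** 0.5 <= tol:   # inner error proxy
            return u_new
        u = u_new
    return u
```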
7. Comparative Perspective
Relative to classic APG/FISTA frameworks, AltGDAm distinctively enables scalable, provably convergent optimization of high-dimensional nonconvex problems by:
- Supporting block-wise and adaptive update schedules.
- Allowing nonsmooth, nonconvex, and separable regularizers without bounded-domain constraints.
- Unifying monotonicity enforcement and adaptive momentum via restart/shrinkage without complex tuning requirements (Lau et al., 2017, Zhou et al., 2020, Yang et al., 2023).
- Delivering improved empirical and provable convergence over standard block proximal/alternating and non-accelerated block methods, particularly in ill-conditioned or overparameterized regimes.
The method therefore constitutes a rigorous foundation for large-scale, sparse, and structured learning in contemporary computational statistics and machine learning.