Alternating Gradient Descent (AGD)
- AGD is an optimization method that sequentially updates variable blocks, offering stability and improved convergence for nonconvex, minimax, and matrix factorization problems.
- It achieves superior iteration complexity in saddle-point and bilinear games by leveraging fresh gradient information in an alternating update framework.
- Variants of AGD incorporate momentum, adaptive step-sizes, and variance reduction to efficiently handle high correlation, heterogeneous data, and complex problem structures.
Alternating Gradient Descent (AGD) refers to a family of iterative optimization methods wherein two or more blocks of variables are updated in an alternating (sequential) manner using gradient-based steps. In contrast to simultaneous updates, each iteration of AGD optimizes one block of variables while keeping the others fixed, then proceeds to update the next block with the freshest available information. This scheme has become a central paradigm in non-convex optimization, matrix factorization, minimax problems, adversarial learning, and multi-agent games, where it is frequently preferred for both practical stability and theoretical convergence properties.
1. Core Algorithmic Structure
The canonical AGD update decomposes the optimization variable into blocks and performs sequential updates. For a minimax or saddle-point problem $\min_x \max_y f(x, y)$, one iteration takes the form
$$x_{t+1} = x_t - \alpha \nabla_x f(x_t, y_t), \qquad y_{t+1} = y_t + \beta \nabla_y f(x_{t+1}, y_t),$$
so the ascent step on $y$ (or the descent step on $x$) uses the most up-to-date $x$ (or $y$). This scheme can be extended to more blocks and to more general proximal or accelerated variants.
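As a point of reference, a minimal sketch of this alternating loop is shown below; the gradient callables, stepsizes $\alpha$ and $\beta$, and iteration count are generic placeholders rather than the choices of any cited algorithm.

```python
import numpy as np

def alt_gda(grad_x, grad_y, x0, y0, steps=1000, alpha=0.05, beta=0.05):
    """Generic alternating gradient descent-ascent loop (illustrative sketch).

    grad_x(x, y) and grad_y(x, y) return the partial gradients of f(x, y);
    alpha and beta are placeholder stepsizes. The ascent step deliberately
    uses the freshly updated x, which is what distinguishes alternating
    from simultaneous GDA.
    """
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad_x(x, y)   # descent step on the first block
        y = y + beta * grad_y(x, y)    # ascent step on the second block, at the new x
    return x, y
```

For a scalar bilinear toy objective $f(x, y) = xy$, for instance, one would pass `grad_x = lambda x, y: y` and `grad_y = lambda x, y: x`.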
In specific settings, such as non-negative matrix factorization (Li et al., 2017), AGD alternates between a "decode" phase via pseudoinverse and thresholding and an "update" phase for the feature matrix, leveraging the block-wise structure to tackle strong feature correlations and noise.
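A rough sketch of this decode/update alternation follows; the threshold, stepsize, and non-negativity projection are illustrative placeholders, not the exact choices of Li et al. (2017).

```python
import numpy as np

def agd_nmf(Y, r, iters=200, step=1e-2, tau=0.1, seed=0):
    """Illustrative alternating decode/update loop for non-negative matrix factorization.

    Y is a (d, n) non-negative data matrix and r the target number of features.
    The threshold tau and the stepsize are hypothetical constants for illustration.
    """
    rng = np.random.default_rng(seed)
    A = np.abs(rng.standard_normal((Y.shape[0], r)))    # feature matrix
    for _ in range(iters):
        # "Decode" phase: estimate weights via pseudoinverse plus thresholding.
        W = np.linalg.pinv(A) @ Y
        W = np.where(W > tau, W, 0.0)                   # keep confident non-negative entries
        # "Update" phase: one gradient step on the squared reconstruction error in A.
        grad_A = (A @ W - Y) @ W.T / Y.shape[1]
        A = np.maximum(A - step * grad_A, 0.0)          # project onto the non-negative orthant
    return A, W
```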
2. Convergence Rates and Theoretical Guarantees
The iteration complexity and convergence of AGD have received extensive analysis, especially in comparison to simultaneous gradient methods.
- Minimax and Saddle-Point Problems: For strongly-convex-strongly-concave objectives, AGD (Alt-GDA) attains an iteration complexity with a strictly better dependence on the condition number $\kappa$ than simultaneous GDA, which requires $\Theta(\kappa^2)$ iterations, and the advantage widens as the coupling between the blocks increases (Lee et al., 16 Feb 2024). The performance gap is justified theoretically via a fine-grained Lyapunov analysis.
- Linear Convergence in Bilinear Games: Alternating-Extrapolation GDA (Alex-GDA) achieves global linear convergence even in bilinear settings, while classical Alt-GDA cycles and Sim-GDA can diverge (Lee et al., 16 Feb 2024). In zero-sum bilinear games, AltGDA admits an ergodic convergence guarantee under a constant stepsize, whereas SimGDA is limited to a strictly slower ergodic rate (Nan et al., 4 Oct 2025); a toy comparison of the two dynamics is sketched after this list.
- Nonconvex Problems and Stationarity: Block-wise AGD provably recovers second-order stationary points using random perturbations to escape saddles, with an explicit iteration bound for perturbed AGD (Lu et al., 2018).
- Matrix Factorization: AGD achieves global linear convergence for rank-$r$ matrix factorization under an appropriate (possibly unbalanced) random initialization, with an iteration count whose dependence on the condition number is quadratic rather than cubic (Ward et al., 2023).
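To make the bilinear-game contrast referenced above concrete, the toy numpy comparison below runs simultaneous and alternating GDA on $f(x, y) = x^\top A y$; the random matrix, stepsize, and horizon are arbitrary illustrative choices, not the setting of any cited paper.

```python
import numpy as np

def bilinear_gda_demo(steps=500, eta=0.1, seed=0):
    """Compare simultaneous vs. alternating GDA on the bilinear game f(x, y) = x^T A y."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((3, 3))
    x_sim, y_sim = np.ones(3), np.ones(3)
    x_alt, y_alt = np.ones(3), np.ones(3)
    for _ in range(steps):
        # Simultaneous GDA: both players update from the same stale iterates.
        gx, gy = A @ y_sim, A.T @ x_sim
        x_sim, y_sim = x_sim - eta * gx, y_sim + eta * gy
        # Alternating GDA: the ascent step already sees the freshly updated x.
        x_alt = x_alt - eta * (A @ y_alt)
        y_alt = y_alt + eta * (A.T @ x_alt)
    return (np.linalg.norm(x_sim) + np.linalg.norm(y_sim),
            np.linalg.norm(x_alt) + np.linalg.norm(y_alt))

if __name__ == "__main__":
    sim_norm, alt_norm = bilinear_gda_demo()
    print(f"simultaneous iterate norm: {sim_norm:.2e}")  # blows up geometrically
    print(f"alternating iterate norm:  {alt_norm:.2e}")  # stays bounded (cycles)
```

In this regime the simultaneous iterates spiral outward while the alternating iterates remain bounded and cycle, mirroring the Hamiltonian-like conservation behavior noted in Section 4.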
3. Algorithmic Variants and Extensions
Alternating updates can be integrated with a wide range of enhancements:
- Adaptive and Momentum Schemes: Momentum-accelerated proximal AGD (AltGDAm) attains improved iteration complexity for regularized nonconvex minimax problems (Chen et al., 2021). AGDA+ further introduces nonmonotone adaptive step-size search via backtracking to eliminate global Lipschitz constant dependency and efficiently exploits local smoothness (Zhang et al., 20 Jun 2024).
- Variance Reduction and Zeroth-Order Methods: VR-AGDA combines alternating updates with SVRG-style variance reduction to handle finite-sum structures efficiently (Yang et al., 2020). Zeroth-order AGDA (ZO-AGDA/ZO-VRAGDA) leverages randomized smoothing to obtain function-value-based gradient estimates, with non-asymptotic complexity guarantees in both the deterministic and the stochastic setting for NC-PL minimax problems (Xu et al., 2022); a generic two-point estimator of this kind is sketched after this list.
- Alternating Gradient Flows (AGF): In feature learning analysis, AGF matches observed staircase loss curves in two-layer networks by alternating between utility-maximizing dormant neurons and loss-minimizing active ones, providing a mechanistic theory for sequential feature learning (Kunin et al., 6 Jun 2025).
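As referenced in the variance-reduction/zeroth-order item above, a generic two-point randomized-smoothing estimator plugged into an alternating loop might look as follows; this is a sketch only (placeholder stepsizes and smoothing radius, no variance reduction), not the exact ZO-AGDA/ZO-VRAGDA procedures.

```python
import numpy as np

def zo_grad(f, z, mu, rng):
    """Two-point randomized-smoothing gradient estimate of f at z (illustrative)."""
    u = rng.standard_normal(z.shape)
    return (f(z + mu * u) - f(z - mu * u)) / (2.0 * mu) * u

def zo_alt_gda(f, x0, y0, steps=1000, eta_x=0.05, eta_y=0.05, mu=1e-4, seed=0):
    """Zeroth-order alternating GDA sketch: only function values of f(x, y) are queried."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(steps):
        gx = zo_grad(lambda v: f(v, y), x, mu, rng)   # estimate grad_x with y held fixed
        x = x - eta_x * gx                            # descent step on x
        gy = zo_grad(lambda v: f(x, v), y, mu, rng)   # estimate grad_y at the fresh x
        y = y + eta_y * gy                            # ascent step on y
    return x, y
```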
4. Empirical and Practical Advantages
Empirical investigations repeatedly demonstrate that AGD provides robustness, accelerated convergence, and enhanced stability:
- Provable Recovery under Strong Correlation: In non-negative matrix factorization, AGD provably recovers the ground-truth feature matrix even with high correlations in the latent weights—a regime where previous methods fail (Li et al., 2017).
- Finite Regret and Bounded Dynamics in Games: Alternating updates eliminate linear regret growth and chaotic divergence observed in simultaneous updates, and instead produce bounded, cyclical trajectories that mirror Hamiltonian conservation properties (Bailey et al., 2019).
- Multimodal Model Training: AGD is leveraged in Integrated Multimodal Perception (IMP) to alternate updates over heterogeneous modalities and objectives, avoiding complex batching and improving efficiency. Coupled with mixture-of-experts, AGD enables superior zero-shot video classification at drastically reduced training cost (Akbari et al., 2023).
- Federated Learning with Statistical Heterogeneity: Alternating SGD steps decouple local and global objectives, reducing variance and stabilizing convergence in federated settings with non-i.i.d. data, outperforming traditional federated averaging techniques (Zhou et al., 2022).
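As a rough illustration of the alternating local/global updates in the last item, the sketch below alternates between a per-client personal block and a shared global block in a toy linear setting; this is a generic construction, not the specific algorithm of Zhou et al. (2022).

```python
import numpy as np

def alternating_personalized_sgd(clients, rounds=50, lr=0.1):
    """Hedged sketch of alternating updates over personal and shared blocks.

    Each client i holds (X, y) for a linear model  y ≈ X @ (w_global + w_personal_i).
    The learning rate and round count are placeholder choices.
    """
    d = clients[0][0].shape[1]
    w_global = np.zeros(d)
    w_personal = [np.zeros(d) for _ in clients]
    for _ in range(rounds):
        grads_global = []
        for i, (X, y) in enumerate(clients):
            # Step 1: personal-block update with the global block frozen.
            resid = X @ (w_global + w_personal[i]) - y
            w_personal[i] -= lr * X.T @ resid / len(y)
            # Step 2: global-block gradient evaluated at the *fresh* personal block.
            resid = X @ (w_global + w_personal[i]) - y
            grads_global.append(X.T @ resid / len(y))
        # Server averages the global-block gradients and takes one step.
        w_global -= lr * np.mean(grads_global, axis=0)
    return w_global, w_personal
```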
5. Applications Across Domains
Alternating Gradient Descent underpins critical methodologies in:
- Machine Learning: GAN training, adversarial learning, robust optimization, generative adversarial imitation learning, topic models, multimodal encoders, federated personalization.
- Signal Processing: Fair principal component analysis (FPCA), beamforming, hyperspectral image factorization using alternating Riemannian/projected gradient descent-ascent (Xu et al., 2022).
- Optimization and Game Theory: Zero-sum games, saddle-point problems, online regret minimization.
- Deep Learning Optimizers: AGD-inspired auto-switchable optimizers use gradient differences for Hessian approximation and per-parameter switching, improving generalization in NLP, CV, and RecSys tasks (Yue et al., 2023).
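The gradient-difference idea in the last item can be illustrated with a heavily simplified per-parameter rule; this is a loose sketch of the general idea only, and the actual AGD optimizer of Yue et al. (2023) uses a different preconditioner construction and switching criterion.

```python
import numpy as np

def gradient_difference_step(theta, g, state, lr=1e-3, beta=0.9, delta=1e-2, eps=1e-8):
    """Simplified sketch: use differences of consecutive gradients as a diagonal
    curvature proxy and switch per parameter between a plain and a preconditioned
    step. All constants are placeholders."""
    if "prev_g" not in state:
        state["prev_g"] = np.zeros_like(g)
        state["b"] = np.zeros_like(g)
    diff = g - state["prev_g"]                       # consecutive gradient difference
    state["b"] = beta * state["b"] + (1 - beta) * diff ** 2
    precond = np.sqrt(state["b"]) + eps              # diagonal curvature proxy
    adaptive = precond > delta                       # switch where the proxy is informative
    step = np.where(adaptive, g / precond, g)        # preconditioned vs. plain gradient step
    state["prev_g"] = g.copy()
    return theta - lr * step
```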
6. Analysis, Limitations, and Future Directions
Despite extensive validation of AGD’s effectiveness, several theoretical and practical frontiers remain:
- Worst-case Complexity and Coupling: Tight lower and upper bounds show AGD outperforms simultaneous update rules specifically as interaction or coupling increases, but pathological loss landscapes or boundary equilibria may present subtler behavior (Lee et al., 16 Feb 2024, Nan et al., 4 Oct 2025).
- Adaptive Step-Size Mechanisms: Nonmonotone backtracking-based step-size adaptation (AGDA+) enables more aggressive steps by capitalizing on local smoothness, a promising alternative to conservative monotone schedules (Zhang et al., 20 Jun 2024); a generic backtracking rule for a single block update is sketched after this list.
- Initialization Sensitivity: Appropriate (possibly asymmetric) initialization is sometimes crucial for AGD’s rapid convergence, especially in matrix factorization (Ward et al., 2023) and networks trained from small starts (Kunin et al., 6 Jun 2025).
- Higher-Order and Multi-Block AGD: Extensions to multi-block, higher-order, and manifold-constrained settings (ARPGDA) are yielding competitive complexity matching Euclidean analogs (Xu et al., 2022).
- Performance Estimation Programming (PEP): Analytical and numerical studies via PEP indicate AltGDA’s worst-case rates can be jointly optimized over step-size and iteration horizon, opening the door for further improvements in structured or constrained games (Nan et al., 4 Oct 2025).
- Open Problems: Future work includes refining step-size selection rules, exploring AGD’s empirical gains under stochasticity, extending convergence theory to broader classes of non-interior equilibria and nonconvex structures, and integrating AGD dynamics into modular deep learning training schedules.
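For the adaptive step-size item above, a generic (monotone) Armijo backtracking rule for one block update is sketched below; AGDA+ relies on a nonmonotone acceptance test that is not reproduced here, so this illustrates the mechanism rather than the algorithm itself.

```python
import numpy as np

def backtracking_step(f_block, x, g, eta0=1.0, shrink=0.5, c=1e-4, max_tries=30):
    """Armijo-style backtracking for one block (the other block held fixed).

    f_block is the objective restricted to this block and g its gradient at x;
    all constants are placeholder choices.
    """
    eta, fx, g_sq = eta0, f_block(x), float(np.vdot(g, g))
    for _ in range(max_tries):
        if f_block(x - eta * g) <= fx - c * eta * g_sq:   # sufficient-decrease test
            return eta                                    # accept the current stepsize
        eta *= shrink                                     # otherwise shrink and retry
    return eta
```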
Alternating Gradient Descent thus represents both a powerful unifying principle in modern optimization and an active research area distinguished by its blend of practical efficiency, theoretical expressiveness, and adaptability across the challenging regimes of nonconvex, minimax, multi-block, and multimodal problems.