An Improved Last-Iterate Convergence Rate for Anchored Gradient Descent Ascent

Published 4 Apr 2026 in math.OC and cs.AI | (2604.03782v1)

Abstract: We analyze the last-iterate convergence of the Anchored Gradient Descent Ascent algorithm for smooth convex-concave min-max problems. While previous work established a last-iterate rate of $\mathcal{O}(1/t^{2-2p})$ for the squared gradient norm, where $p \in (1/2, 1)$, it remained an open problem whether the improved exact $\mathcal{O}(1/t)$ rate is achievable. In this work, we resolve this question in the affirmative. This result was discovered autonomously by an AI system capable of writing formal proofs in Lean. The Lean proof can be accessed at https://github.com/google-deepmind/formal-conjectures/pull/3675/commits/a13226b49fd3b897f4c409194f3bcbeb96a08515

Summary

  • The paper introduces novel step-size schedules for Anchored GDA, achieving an optimal O(1/t) convergence rate for the squared gradient norm.
  • It employs a discrete, non-ergodic analysis with formal machine verification using Lean 4 to rigorously confirm the convergence proof.
  • The work offers practical benefits in robust and stochastic optimization settings by reducing variance and improving algorithm stability.

Improved Last-Iterate Rate for Anchored Gradient Descent Ascent in Convex-Concave Min-Max Problems

Problem Setting and Context

The paper addresses the last-iterate convergence behavior of the Anchored Gradient Descent Ascent (Anchored GDA) algorithm when applied to smooth convex-concave min-max objectives. Such problems, characterized by $\min_x \max_y L(x, y)$ with $L$ convex in $x$ and concave in $y$, appear pervasively in adversarial learning, domain adaptation, optimal transport, robust training, and fairness-aware learning. Traditional GDA suffers from inherent oscillations in this regime. Numerous alternatives (e.g., Extragradient [korpelevich1976extragradient], Optimistic GDA [popov1980modification], and Halpern iteration [Halpern1967FixedPO]) have been explored both empirically and theoretically, but robust step-size design for efficient last-iterate convergence with a single gradient call remains a challenge, especially under stochastic gradients.
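The oscillation problem is easy to see on the classic bilinear objective $L(x, y) = xy$, whose unique saddle point is the origin. The following minimal sketch (not from the paper) runs plain simultaneous GDA on this objective; the iterates spiral outward rather than converge:

```python
# Plain simultaneous GDA on the bilinear toy problem L(x, y) = x * y,
# with saddle point (0, 0). Each step applies the linear map
# (x, y) -> (x - eta*y, y + eta*x), which has spectral radius
# sqrt(1 + eta^2) > 1, so the distance to the saddle grows every step.
# The step size eta = 0.1 is an arbitrary illustrative choice.

def gda_step(x, y, eta):
    """One simultaneous GDA step: gradient descent in x, ascent in y."""
    return x - eta * y, y + eta * x

x, y, eta = 1.0, 0.0, 0.1
norms = []
for _ in range(100):
    x, y = gda_step(x, y, eta)
    norms.append((x * x + y * y) ** 0.5)

# The iterates spiral outward; final distance ~ 1.645 after 100 steps.
print(f"final distance from saddle: {norms[-1]:.3f}")
```

Since the update is an exact rotation-plus-scaling, the divergence is monotone rather than an artifact of the step size; shrinking $\eta$ only slows it down.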

Anchored GDA introduces an anchoring term in the gradient-based update, pulling iterates towards a fixed point—typically the initialization. Prior theoretical results [ryu2019ode] established an $\mathcal{O}(1/t^{2-2p})$ rate (for $p \in (1/2, 1)$) on the squared gradient norm at the last iterate, but the optimal non-ergodic $\mathcal{O}(1/t)$ rate remained unresolved. The current work settles this question affirmatively, demonstrating that Anchored GDA achieves the $\mathcal{O}(1/t)$ last-iterate rate under standard monotonicity and Lipschitz assumptions.

Algorithmic Framework and Theoretical Advances

Let $z_t = (x_t, y_t)$ denote the joint iterate at time $t$, $z_0$ the anchor (typically the initialization), and $F(z) = (\nabla_x L(x, y), -\nabla_y L(x, y))$ the saddle gradient operator. Anchored GDA updates take the form

$$z_{t+1} = z_t - \eta_t F(z_t) + \beta_t (z_0 - z_t),$$

with time-dependent step-sizes $\eta_t$ and anchoring parameters $\beta_t$.
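As a concrete (hypothetical) instantiation, the sketch below runs an anchored update of this form on the bilinear toy problem $L(x, y) = xy$, with a constant step size and a polynomially decaying anchor weight $\beta_t = (t+2)^{-3/4}$ in the style of the prior-work schedules; the exact schedules analyzed in the paper differ and are given there. Unlike plain GDA, the gradient norm at the last iterate decays well below its initial value of 1:

```python
# Anchored GDA sketch on L(x, y) = x * y (saddle point at the origin),
# using the generic update z_{t+1} = z_t - eta*F(z_t) + beta_t*(z0 - z_t).
# The schedules below (constant eta, beta_t = (t + 2)**-0.75, i.e. p = 3/4)
# are illustrative assumptions, not the schedules proposed in the paper.

def F(x, y):
    """Saddle gradient operator (dL/dx, -dL/dy) for L(x, y) = x * y."""
    return y, -x

def anchored_gda(z0, eta, steps):
    x0, y0 = z0
    x, y = x0, y0
    for t in range(steps):
        beta = (t + 2) ** -0.75            # anchoring weight (assumed schedule)
        gx, gy = F(x, y)
        x += -eta * gx + beta * (x0 - x)   # descent in x, pulled toward anchor
        y += -eta * gy + beta * (y0 - y)   # ascent in y, pulled toward anchor
    return x, y

x, y = anchored_gda((1.0, 0.0), eta=0.05, steps=2000)
grad_norm = (x * x + y * y) ** 0.5         # for this L, ||F(z)|| equals ||z||
print(f"gradient norm at last iterate: {grad_norm:.4f}")
```

The anchoring weight must decay slowly enough to keep damping the rotational dynamics but fast enough that the pull toward $z_0$ vanishes in the limit, which is exactly the trade-off the schedule design addresses.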

Key assumptions for the analysis are:

  • Monotonicity of the saddle operator $F$;
  • Lipschitz continuity of $F$;
  • Existence of at least one saddle point $z^*$ (equivalently, $F(z^*) = 0$).

Previous analyses (notably [ryu2019ode]) utilized polynomially decaying anchoring schedules, $\beta_t \propto t^{-p}$ with $p \in (1/2, 1)$, yielding a last-iterate convergence rate of $\mathcal{O}(1/t^{2-2p})$ on the squared gradient norm. This left unresolved whether the theoretically optimal $\mathcal{O}(1/t)$ rate is attainable.

This work proposes revised step-size and anchoring schedules that eliminate the $p$-dependent loss in the rate exponent.

Detailed recurrence analysis in the discrete setting—eschewing continuous-time ODE analogies—yields the main result:

Theorem: Under the assumptions above, Anchored GDA with the revised schedules satisfies, for all $t \geq 1$,

$$\|F(z_t)\|^2 \leq \frac{C}{t},$$

where $C$ depends on the Lipschitz constant of $F$, the initial distance $\|z_0 - z^*\|$, and the schedule parameters.

The proof structure is:

  1. Boundedness of iterates: establish a global bound on the distance $\|z_t - z^*\|$;
  2. Iterate stability: control the successive-iterate norm $\|z_{t+1} - z_t\|$ by exploiting the strong contraction induced by the anchoring term;
  3. Non-ergodic convergence: transfer these bounds to the gradient operator, yielding last-iterate guarantees.
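To illustrate the boundedness step, a standard opening move for anchored schemes (sketched here under the assumed generic update $z_{t+1} = z_t - \eta_t F(z_t) + \beta_t(z_0 - z_t)$; the paper's own recursion may differ) is to write $z_{t+1} - z^* = (1-\beta_t)(z_t - z^*) + \beta_t(z_0 - z^*) - \eta_t F(z_t)$, expand the square, and apply monotonicity together with $F(z^*) = 0$, which gives $\langle F(z_t), z_t - z^* \rangle \ge 0$:

```latex
\|z_{t+1} - z^*\|^2
  \le (1-\beta_t)\,\|z_t - z^*\|^2
    + \beta_t\,\|z_0 - z^*\|^2
    - 2\eta_t\beta_t\,\langle F(z_t),\, z_0 - z^* \rangle
    + \eta_t^2\,\|F(z_t)\|^2 .
```

Bounding the remaining inner product via Cauchy–Schwarz and inducting over $t$ then yields a uniform bound on $\|z_t - z^*\|$, provided the schedules keep the $\eta_t^2\|F(z_t)\|^2$ term under control relative to the $\beta_t$ decay.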

Notably, the results do not require uniqueness of the saddle point, nor do they rely on averaged/ergodic iterates, which are less useful in non-convex landscapes and in stochastic optimization settings.

Numerical and Analytical Implications

The established $\mathcal{O}(1/t)$ rate for the squared gradient norm is tight for the class of monotone variational inequalities with Lipschitz continuous operators and is competitive with the best known rates for single-call algorithms in deterministic settings. The result also demonstrates that commonly adopted, intuitively motivated schedules (polynomial or sublinear) may attain strictly suboptimal asymptotics.

Anchored GDA further offers practical advantages in stochastic scenarios: the anchoring term serves as a variance reduction mechanism, mitigating the instability and noise amplification inherent in methods requiring double-sampling (e.g., Extragradient). This makes the approach particularly compelling for applications with high-variance gradient oracles, including GAN training, robust optimization, and reinforcement learning [goodfellow2014generative, madry2018towards, du2017stochastic].
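A small, hypothetical experiment in this spirit: on the bilinear toy problem $L(x, y) = xy$ with additive Gaussian gradient noise, an anchored update (assumed form and schedule, not the paper's) stays near the saddle point while plain stochastic GDA drifts away:

```python
import random

# Stochastic bilinear toy: L(x, y) = x * y with additive Gaussian noise
# on the gradients. Compares plain stochastic GDA against an anchored
# update of the assumed form z_{t+1} = z_t - eta*F(z_t) + beta_t*(z0 - z_t);
# eta and beta_t are illustrative choices, not the paper's schedules.

def noisy_F(x, y, rng, sigma=0.1):
    """Noisy saddle gradient oracle for L(x, y) = x * y."""
    return y + rng.gauss(0, sigma), -x + rng.gauss(0, sigma)

def run(anchored, steps=2000, eta=0.05, seed=0):
    rng = random.Random(seed)
    x0, y0 = 1.0, 0.0
    x, y = x0, y0
    for t in range(steps):
        gx, gy = noisy_F(x, y, rng)
        beta = (t + 2) ** -0.75 if anchored else 0.0
        x += -eta * gx + beta * (x0 - x)
        y += -eta * gy + beta * (y0 - y)
    return (x * x + y * y) ** 0.5  # distance to the saddle point (0, 0)

plain = run(anchored=False)
anch = run(anchored=True)
print(f"plain GDA distance to saddle:    {plain:.3f}")
print(f"anchored GDA distance to saddle: {anch:.3f}")
```

Without the anchor the rotational dynamics amplify both the iterates and the injected noise, while the anchoring term damps them together; this is only a toy illustration of the variance-reduction claim, not a reproduction of the paper's experiments.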

Formal Verification and The Role of AI

A distinct aspect of this work is the formalization and machine-verification of the convergence proof using Lean 4, with the proof autonomously generated by a formal-mathematics AI agent developed at Google DeepMind. The analytic derivations (e.g., contraction lemmas, parameter schedule asymptotics) and non-asymptotic technical bounds are rigorously checked, advancing the state-of-the-art in verified optimization for non-ergodic convergence analysis.

This direction is likely to affect future theoretical work, both by setting higher standards for correctness in mathematical optimization, and by accelerating the discovery of intricate convergence results for new algorithmic schemes. The modularity of the analysis paves the way for extensions to structured monotone inclusions, tighter parameter tuning, and the formal exploration of stochastic rate results.

Conclusion

This work resolves the open question regarding the non-ergodic convergence rate of Anchored GDA for smooth convex-concave min-max objectives, establishing that the scheme attains the optimal $\mathcal{O}(1/t)$ rate on the squared norm of the gradient operator at the last iterate. The proof is notable for its direct, discrete analysis and for being fully machine-verified by an autonomous agent, highlighting both the enduring utility of anchoring and the emerging role of AI in formal mathematical discovery. Implications extend to the principled design of single-call algorithms for robust, efficient equilibrium computation in game-theoretic machine learning problems. Future research will likely extend these guarantees to stochastic, non-monotone, or structured settings, facilitated by formal methods and AI-driven proof assistants.
