Gradient Descent's Last Iterate is Often (slightly) Suboptimal

Published 15 Apr 2026 in math.OC and cs.LG | (2604.13870v1)

Abstract: We consider the well-studied setting of minimizing a convex Lipschitz function using either gradient descent (GD) or its stochastic variant (SGD), and examine the last iterate convergence. By now, it is known that standard stepsize choices lead to a last iterate convergence rate of $\log T/\sqrt{T}$ after $T$ steps. A breakthrough result of Jain et al. [2019] recovered the optimal $1/\sqrt{T}$ rate by constructing a non-standard stepsize sequence. However, this sequence requires choosing $T$ in advance, as opposed to common stepsize schedules which apply for any time horizon. Moreover, Jain et al. conjectured that without prior knowledge of $T$, no stepsize sequence can ensure the optimal error for SGD's last iterate, a claim which so far remained unproven. We prove this conjecture, and in fact show that even in the noiseless case of GD, it is impossible to avoid an excess poly-log factor in $T$ when considering an anytime last iterate guarantee. Our proof further suggests that such (slightly) suboptimal stopping times are unavoidably common.

Authors (2)

Summary

The paper demonstrates that any anytime stepsize schedule must suffer a poly-log overhead, establishing inherent last-iterate suboptimality in GD and SGD.
It employs novel lower-bound techniques in both the deterministic and stochastic settings to show that optimal convergence without horizon awareness is impossible.
Implications underscore the necessity for iterate averaging or horizon-aware scheduling to achieve minimax-optimal rates in online and streaming applications.

Introduction and Problem Context

The paper "Gradient Descent's Last Iterate is Often (slightly) Suboptimal" (2604.13870) investigates the convergence of the last iterate of (sub)gradient descent (GD) and stochastic gradient descent (SGD) in convex, Lipschitz optimization. While classical theory ensures that the average of the iterates achieves the minimax-optimal rate of $1/\sqrt{T}$ after $T$ iterations with standard stepsizes, common practice and software implementations typically return the last iterate $x_T$ rather than this average. Prior theoretical results showed a last-iterate rate of only $\log T/\sqrt{T}$ under typical anytime stepsize schedules, which raises the question of whether this $\log T$ overhead is intrinsic and unavoidable when $T$ is not fixed in advance.
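To make the two output rules concrete, here is a minimal sketch (not taken from the paper) that runs subgradient descent with the anytime stepsize $\eta_t = c/\sqrt{t}$ on the one-dimensional function $f(x) = |x - a|$ and reports both the last iterate and a stepsize-weighted average. The function name `run_gd`, the test function, and the constants `a`, `c`, and `x0` are illustrative assumptions, and the weighted average is only one standard averaging variant. A one-dimensional toy like this does not exhibit the worst-case $\log T$ gap; it only shows how the two outputs are computed.

```python
import math

def run_gd(T, a=0.3, c=1.0, x0=0.0):
    """Subgradient descent on f(x) = |x - a| with eta_t = c / sqrt(t).

    Returns (last_iterate, weighted_average, sum_eta, sum_eta_sq).
    All names and constants here are illustrative assumptions.
    """
    x = x0
    weighted_sum = 0.0   # sum of eta_t * x_t over the points where subgradients were taken
    sum_eta = 0.0
    sum_eta_sq = 0.0
    for t in range(1, T + 1):
        eta = c / math.sqrt(t)
        g = (x > a) - (x < a)          # a subgradient of |x - a| (0 at the kink)
        weighted_sum += eta * x
        sum_eta += eta
        sum_eta_sq += eta * eta
        x = x - eta * g                # plain step; no projection in this sketch
    return x, weighted_sum / sum_eta, sum_eta, sum_eta_sq

def f(x, a=0.3):
    return abs(x - a)

T = 1000
last, avg, s1, s2 = run_gd(T)
# Classical bound for the eta-weighted average with G = 1 and D = |x0 - a|:
# f(avg) - f* <= (D^2 + G^2 * sum eta_t^2) / (2 * sum eta_t)
bound = (0.3 ** 2 + s2) / (2 * s1)
print(f"last-iterate error: {f(last):.4f}")
print(f"averaged error:     {f(avg):.4f}  (classical bound {bound:.4f})")
```

Because the schedule never references `T`, the same loop can be stopped at any step, which is exactly the anytime setting the paper studies.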

Prior Work and Motivation

Jain et al. (2019) constructed a non-standard stepsize schedule, depending explicitly on the known time horizon $T$, that achieves the optimal $1/\sqrt{T}$ rate for the last iterate. However, this non-anytime approach is impractical in settings where $T$ is unknown or keeps growing. Jain et al. also conjectured that, without prior knowledge of $T$, no stepsize schedule can achieve the optimal $1/\sqrt{T}$ last-iterate rate for SGD, leaving open whether the log-factor gap is a methodological artifact or a fundamental limitation.
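For intuition about what "depending explicitly on $T$" means, the sketch below builds a phase-based schedule in the spirit of the Jain et al. (2019) construction: the stepsize starts at $c/\sqrt{T}$ and is halved over phases whose lengths shrink geometrically toward the end of the run. The phase boundaries, the constant `c`, and the function name `horizon_aware_stepsizes` are assumptions made for illustration and should be checked against the original paper before any real use.

```python
import math

def horizon_aware_stepsizes(T, c=1.0):
    """Phase-halving schedule that needs the horizon T up front (illustrative sketch)."""
    k = max(1, math.ceil(math.log2(T)))
    # Phase i covers steps T_i < t <= T_{i+1}, with T_i = T - ceil(T / 2**i) (assumed form).
    bounds = [T - math.ceil(T / 2 ** i) for i in range(k + 1)] + [T]
    etas = []
    for i in range(k + 1):
        phase_len = bounds[i + 1] - bounds[i]
        etas += [c * 2.0 ** (-i) / math.sqrt(T)] * phase_len
    return etas

etas = horizon_aware_stepsizes(16)
print(len(etas))                 # one stepsize per step
print(sorted(set(etas), reverse=True))
```

Every entry of the schedule changes when `T` changes, so a run planned for one horizon cannot simply be extended to another, which is the practical drawback described above.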

The theoretical literature on last-iterate performance has offered refinements, yet most results either impose strong convexity, require averaging, or do not provide anytime guarantees. Existing lower bounds for smooth and non-smooth SGD with anytime stepsizes had not established that last-iterate suboptimality is unavoidable under anytime guarantees.

Main Results and Technical Contributions

The principal result proves that for GD and, as a consequence, for SGD, no anytime stepsize schedule can guarantee $1/\sqrt{T}$ last-iterate convergence for all $T$. Formally, for every stepsize schedule that is independent of $T$ (i.e., anytime), the worst-case last-iterate error is at least the $1/\sqrt{T}$ rate multiplied by a poly-log factor (specifically, at least $\log(T)^{1/8}/\sqrt{T}$ up to constants).

This lower bound is established constructively. Even in the noiseless (deterministic) case of GD, where the absence of stochasticity might suggest more room for improvement, the overhead factor cannot be eliminated unless the stepsizes are designed with full knowledge of $T$. This both proves and strengthens the conjecture of Jain et al., showing that the limitation holds for both the stochastic and deterministic variants.

Further, the analysis indicates that the optimal rate can be achieved at a sequence of predetermined stopping times (e.g., through "doubling trick" schedules), but that stopping times suffering a poly-log overhead are unavoidably common in the sense of natural density. In other words, optimality at sparse, pre-planned stopping times does not extend to arbitrary stopping times.
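The sparsity of doubling-trick stopping times is easy to check numerically. The short sketch below (the function names are illustrative) counts the powers of two up to $n$ and reports the fraction of stopping times $1, \dots, n$ they cover; that fraction shrinks roughly like $\log_2(n)/n$.

```python
def doubling_times(n):
    """Return the stopping times 1, 2, 4, ... that are <= n."""
    times, t = [], 1
    while t <= n:
        times.append(t)
        t *= 2
    return times

def fraction_covered(n):
    """Fraction of the stopping times 1..n that are powers of two."""
    return len(doubling_times(n)) / n

for n in (10, 1_000, 1_000_000):
    print(n, len(doubling_times(n)), fraction_covered(n))
```

This only illustrates how thin the set of pre-planned checkpoints is; it says nothing by itself about the error at the remaining stopping times.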

Proof Techniques

The paper combines advanced lower bound techniques from optimization theory. The key tools are:

1D constructions demonstrating that the stepsize at step $T$ is an inherent lower bound for the possible last-iterate error at $T$.
Quadratic and worst-case convex function constructions ensuring that the sum of stepsizes cannot be too small, else excess error arises.
A high-dimensional lower bound drawing on and extending techniques from the analysis of "max-of-linear" functions where the behavior in different coordinate axes amplifies the inability to synchronize optimality with arbitrary stopping times.

The result leverages averaging arguments to show that for most $T$ large enough, a randomly chosen stopping time exhibits this unavoidable poly-log overhead, crystallizing the "commonality" of suboptimal stopping.
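As a toy experiment in the spirit of the first tool above (and not the paper's actual construction), one can run subgradient descent on $f(x) = |x - a|$ for many choices of the minimizer $a$ and compare the worst last-iterate error with the final stepsize $\eta_T$. Away from the kink the iterate moves by exactly $\eta_t$ per step, so how close it lands to $a$ is governed by the scale of the most recent stepsizes. The grid of targets, the schedule $\eta_t = c/\sqrt{t}$, and the function name are assumptions of this sketch.

```python
import math

def last_iterate_error(a, T, c=1.0):
    """Run T subgradient steps on f(x) = |x - a| from x = 0 with eta_t = c / sqrt(t)."""
    x = 0.0
    for t in range(1, T + 1):
        g = (x > a) - (x < a)              # subgradient of |x - a|
        x -= (c / math.sqrt(t)) * g
    return abs(x - a)

T = 200
eta_T = 1.0 / math.sqrt(T)
targets = [k / 500 for k in range(1, 501)]   # candidate minimizers a in (0, 1]
worst = max(last_iterate_error(a, T) for a in targets)
print(f"eta_T = {eta_T:.4f}, worst last-iterate error = {worst:.4f}, "
      f"ratio = {worst / eta_T:.2f}")
```

Changing `T` or the schedule and re-running gives a feel for how the worst-case error over the grid tracks the size of the last stepsize.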

Implications and Theoretical Impact

This work crystallizes a structural property of first-order methods (GD/SGD) in convex, Lipschitz settings. It formalizes that the inability to synchronize stepsize adaptation with unpredictable horizons is a fundamental barrier to optimal last-iterate convergence, even absent noise. Practically, this validates the widespread use of iterate averaging (or, in specific cases, schedule designs tied to an anticipated $T$) to achieve optimality and highlights that last-iterate solution quality can be inherently fragile in anytime/streaming contexts.

Theoretically, the result raises new questions regarding:

Strongly Convex Extension: The lower bound in the strongly convex case, where anytime rates remain separated from optimal, is open. The refined analysis of log factors vs. polynomial factors is potentially tractable via extensions of the present techniques.
Smoothness: In smooth optimization, the best-known anytime bounds involve (slightly worse) polynomial factors. The dichotomy between smooth and non-smooth cases—here, provably narrowed to a logarithmic exponent—invites further structural study.
Density of Optimal Stopping Times: The indication that suboptimal stopping times are common under any stepsize schedule suggests avenues for characterizing, in the sense of asymptotic density, how sparse optimal last-iterate events must be, and for exploring schedule designs that target denser optimality.
Algorithmic Directions: For practitioners, this indicates that in online or open-horizon learning, one should use iteration averaging or develop early stopping rules tied to computational budgets rather than relying on last-iterate output.

Numerical and Analytical Strengths

The lower bound is strong in that it is established within the general class of convex, Lipschitz objectives, without relying on smoothness, second-order curvature, or strong convexity. The gap between the lower bound ($\log(T)^{1/8}/\sqrt{T}$) and the best-known upper bound ($\log(T)/\sqrt{T}$) is confined to the exponent of the logarithm, suggesting the analysis is near-tight for worst-case instances.
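The size of the remaining gap can be made explicit by dividing the best-known anytime upper bound, $\log(T)/\sqrt{T}$, by the lower bound, $\log(T)^{1/8}/\sqrt{T}$. This is simple arithmetic on the two rates quoted in this summary, with constants ignored:

```latex
\frac{\log(T)/\sqrt{T}}{\log(T)^{1/8}/\sqrt{T}}
  \;=\; \log(T)^{1 - 1/8}
  \;=\; \log(T)^{7/8}
```

The $1/\sqrt{T}$ factors cancel, so what is left undetermined is a factor of at most $\log(T)^{7/8}$.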

Conclusion

This paper resolves a fundamental conjecture about anytime last-iterate convergence for GD and SGD, establishing that a poly-log overhead is inherent when the computation budget (or horizon $T$) is not predetermined. This result further sharpens the theory/practice gap regarding output rules in first-order methods and draws clear boundaries for what anytime stepsize schedules can achieve. The implications for the design of optimization routines further underline the necessity of averaging or horizon-aware scheduling in practical large-scale machine learning. Prospective work aims to close the remaining tightness gap and extend the analysis to broader classes, including strongly convex and smooth objectives.

Explain it Like I'm 14

Overview

This paper studies a very common learning method called gradient descent (and its noisy version, stochastic gradient descent). It asks a simple practical question: if you stop the algorithm at any time and just take the latest solution it produced, how good is that solution? The authors show that, in general, the last solution you get will be slightly worse than the best-possible guarantee, unless you knew in advance exactly how many steps you were going to take.

The main idea in plain terms

Imagine you’re walking downhill to the lowest point in a valley (the best solution). Each step you take is guided by the slope under your feet (the “gradient”). How big your steps are is controlled by a “step size” or “learning rate.”

If you plan your trip to be exactly T steps long, you can choose a clever schedule of step sizes that gets you very close to the bottom after step T.
But if you want to be “anytime”—able to stop at any step and still have a strong guarantee—this paper proves you can’t always be as close to the bottom as in the planned case. You pay a small extra penalty.
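The difference between a planned and an "anytime" recipe can be seen directly in code. In this sketch, `anytime_stepsizes` uses $\eta_t = c/\sqrt{t}$, while `fixed_horizon_stepsizes` uses the constant $c/\sqrt{T}$, a simple textbook horizon-dependent choice (not the clever schedule mentioned above). Both function names and the constant `c` are illustrative assumptions.

```python
import math

def anytime_stepsizes(n, c=1.0):
    """First n stepsizes of eta_t = c / sqrt(t); no horizon is needed."""
    return [c / math.sqrt(t) for t in range(1, n + 1)]

def fixed_horizon_stepsizes(T, c=1.0):
    """Constant stepsize c / sqrt(T); every step depends on the planned horizon T."""
    return [c / math.sqrt(T)] * T

# Extending an anytime schedule never changes the steps already taken ...
print(anytime_stepsizes(20)[:10] == anytime_stepsizes(10))              # True
# ... whereas changing the planned horizon changes every step.
print(fixed_horizon_stepsizes(20)[:10] == fixed_horizon_stepsizes(10))  # False
```

The first property is what lets you stop whenever you like; the second is why a planned schedule has to be redone if the trip length changes.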

What questions does the paper ask?

Can we make the last step (the “last iterate”) of gradient descent as good as the best theoretical rate, without knowing in advance how many steps we’ll take?
Is this possible even in the simplest, noiseless setting (plain gradient descent, not the stochastic/noisy version)?
If not, how much worse do we have to be?

How do they study this? (Methods explained simply)

To evaluate how close you get to the bottom, researchers look at “convergence rates,” which say how the error shrinks as the number of steps T grows. The gold standard in many convex problems is an error that shrinks like $1/\sqrt{T}$ .

Key ideas the authors use:

Step-size schedules: Think of a recipe telling you how big each step should be (big at the start, smaller later). Some schedules only work best if you know T beforehand.
Anytime setting: You don’t know when you’ll stop. You want a guarantee that’s good at every step t = 1, 2, 3, … This is realistic in practice.
Lower bounds: Instead of proposing a new algorithm, the authors prove impossibility results. They design “tricky terrains” (mathematical functions) that make any step-size schedule slip up a bit. They show that no matter what schedule you pick, there will be times when your last step can’t beat a certain small penalty factor.
Simple and high-dimensional tricks: They first show basic limits even in 1D (a “V”-shaped or quadratic hill), and then use a more advanced, high-dimensional construction (building on classic results) to get a stronger, general limit.

Analogy: It’s like making an obstacle course that is tailored to trip up any fixed walking strategy. No matter how you plan your pace, there’s a course that forces you to lose a tiny bit of speed compared to the ideal.

What did they find?

If you don’t know T in advance (the anytime setting), the error of the last iterate can’t always achieve the optimal $1/\sqrt{T}$ rate.
There must be at least a tiny extra factor, something like $\log(T)^{1/8}$, on top of $1/\sqrt{T}$ in the worst case.
- In symbols: you can’t generally do better than about $\log(T)^{1/8}/\sqrt{T}$. The ideal is $1/\sqrt{T}$, so you pay a very small “poly-log” penalty.
This impossibility holds even for plain, noiseless gradient descent (not just the noisy stochastic version). That makes the result stronger.
“Suboptimal stopping times are common.” In everyday terms, the times when you suffer this small extra penalty aren’t rare; they happen regularly if you look across many stopping times.
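To get a feel for how small the extra factor is, the sketch below evaluates $\log(T)^{1/8}$ next to $\log(T)$ for a few values of $T$. It uses the natural logarithm, which is an assumption: the base only changes constants, and the paper's bound holds up to constants anyway. The function name `penalty_factor` is illustrative.

```python
import math

def penalty_factor(T, exponent=1 / 8):
    """(ln T) ** exponent; the choice of log base only affects constants."""
    return math.log(T) ** exponent

for T in (10**3, 10**6, 10**9, 10**12):
    print(f"T = {T:>17,d}   log(T)^(1/8) = {penalty_factor(T):.3f}   "
          f"log(T) = {math.log(T):.1f}")
```

Even for a trillion steps the lower-bound factor stays below 2, while the $\log(T)$ factor in the known upper bound is more than an order of magnitude larger.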

Why this matters:

Previous work showed you can hit the optimal $1/\sqrt{T}$ rate for the last iterate if you pick a special step-size schedule that depends on knowing T in advance.
This paper proves that without knowing T ahead of time, you can’t generally achieve that perfect rate for the last iterate.

Why is this important?

Practical training often stops when you feel like it (anytime), and you usually keep the latest model (the last iterate).
This paper tells us there’s a fundamental trade-off: without pre-planning how long you’ll train, your last iterate will be just slightly worse than the absolute best theoretical rate.
The penalty is small (like $\log(T)^{1/8}$ ), but it’s unavoidable in the worst case.

What does this mean going forward? (Implications)

If you need perfect rates for the last model you return, you have two main options:
- Plan the number of steps T in advance and use a T-dependent step-size schedule.
- Or don’t return the last iterate: use the average of all iterates or other output tricks known to achieve the optimal $1/\sqrt{T}$ rate.
Open questions:
- Strongly convex problems: Can a similar impossibility be shown when the landscape curves upward more strongly?
- Tightness: The paper’s lower bound involves $\log(T)^{1/8}$ , while known upper bounds involve something like $\log(T)$ . Can we close this gap?
- How “often” can you get near-optimal last-iterate times without planning? Are the truly optimal times rare by necessity?

In short: If you want an algorithm that you can stop anytime and still keep the last model, expect to be extremely close to the best-possible error rate—but not exactly there. That tiny extra factor is the price of flexibility.

Knowledge Gaps

Below is a single, focused list of concrete knowledge gaps and open questions that remain unresolved by the paper. Each item is framed to guide actionable follow-up research.

Tightness of the poly-log factor: Close the gap between the lower bound Ω(log^{1/8}(T)/√T) and the best-known anytime upper bound O(log(T)/√T) for the last iterate. Either strengthen the lower bound exponent beyond 1/8 or design an anytime stepsize schedule that improves the upper bound toward log^{o(1)}(T)/√T.
Strongly convex objectives: Establish an anytime lower bound for last-iterate convergence in the strongly-convex setting, clarifying whether the gap between O(log T/T) (anytime) and O(1/T) (known T) is necessary and determining the sharp poly-log factor.
Smooth convex (deterministic) setting: Determine whether poly-log (or larger) overheads are necessary for anytime last-iterate GD in smooth convex optimization, paralleling (or refuting) the non-smooth result and reconciling with recent acceleration results that currently are non-anytime or have polynomial gaps.
Dimensionality requirements: Quantify the minimal dimension needed to realize the lower bound constructions. Can a similar anytime impossibility (with any poly-log factor) be proved in one dimension, or is high dimensionality essential?
General first-order methods: Extend the impossibility result beyond vanilla (stochastic) gradient descent to broader classes of first-order methods, including momentum (Polyak, Nesterov), proximal-gradient, mirror descent, and adaptive preconditioning (e.g., AdaGrad, RMSProp), with or without line-search.
Geometry and norms: Investigate whether analogous anytime lower bounds hold under general norms and mirror maps (e.g., ℓ1, ℓ∞, or entropy geometry), and how the poly-log factor depends on geometry-dependent Lipschitz constants and Bregman divergences.
Output rules other than the last iterate: Characterize the best possible anytime rates for simple, implementable single-iterate selection rules that do not need to know T (e.g., randomized suffix selection or “anytime choose-one” rules), and determine whether any such rule can achieve O(1/√T) without poly-log factors.
Density of near-optimal stopping times: Prove or refute that any stepsize schedule has optimal-rate stopping times of zero natural density and suboptimal stopping times of density 1; more generally, characterize the maximal achievable density of stopping times with error O(1/√t).
Instance- and noise-dependence: Derive instance-dependent lower bounds capturing problem structure (e.g., margin/interpolation, curvature, or variance), and test whether low-noise or interpolation regimes admit anytime last-iterate rates closer to O(1/√T) without poly-log overhead.
High-probability lower bounds for SGD: Move beyond worst-case guarantees to establish high-probability anytime lower bounds for the last iterate in stochastic settings, and contrast them with bounds in expectation.
Parameter-free methods: Assess whether parameter-free algorithms that do not require knowledge of G or D (e.g., coin-betting, scale-free AdaGrad variants) can avoid or reduce the poly-log overhead for anytime last-iterate guarantees.
Constraints and domain assumptions: Explore how the impossibility depends on projection and bounded domains. Does a similar lower bound hold in unconstrained settings or under different constraint geometries, and how does domain diameter interact with the poly-log factor?
Lower-bound constructions: Develop alternative hard instances (beyond max-of-linear or quadratic-based constructions) that could yield stronger exponents or match known upper bounds; identify structural properties of functions that are maximally adversarial for anytime last-iterate GD.
Trade-offs in schedule design: Formalize and optimize the trade-off between the magnitude of the poly-log overhead and the “coverage” (density) of near-optimal stopping times; design schedules that maximize the fraction of times t with error ≤ c/√t while keeping worst-case overhead small.
Practical constants and verification: Compute explicit constants in the lower bound for canonical objective families and verify empirically how often and how large the poly-log overhead appears in practice, guiding stepsize design and stopping heuristics.
Online-to-batch implications: Translate the anytime last-iterate impossibility into online learning (regret) and statistical generalization contexts, clarifying how last-iterate suboptimality impacts test performance and whether alternative output rules can bridge the gap in practice.

Practical Applications

Immediate Applications

Iteration-averaging as the default output rule in convex SGD/GD
- Sector: Software/ML platforms, finance, healthcare, advertising
- What to do: Replace “return last iterate” with Polyak–Ruppert averaging or suffix/tail averaging in training pipelines for convex (possibly non-smooth) models such as linear/logistic regression, SVMs with subgradient methods, or online convex optimization tasks.
- Tools/workflows: Add streaming running-average accumulators to optimizers (one extra parameter-sized buffer for simple averages; suffix windows when memory-limited). Update library defaults and fit() return values for convex solvers.
- Assumptions/dependencies: Most directly justified for convex Lipschitz objectives; effect on non-convex deep networks may differ (though tail averaging/SWA often helps). Requires tracking additional statistics during training.
Checkpoint and deployment at doubling times (the “doubling trick”)
- Sector: MLOps, edge/streaming systems, cloud training on preemptible instances
- What to do: Schedule learning-rate restarts and model snapshots at T ∈ {2^k}, and deploy only from those checkpoints to get near-optimal error at those times without pre-knowing the final horizon.
- Tools/workflows: Training schedulers that enforce powers-of-two checkpoints; CI/CD pipelines that only promote models at those checkpoints.
- Assumptions/dependencies: Guarantees are only at sparse (exponentially spaced) times; between them, last-iterate can be (poly-)log worse.
Horizon-aware stepsizes when T is known in advance
- Sector: Academia, benchmarking, production jobs with fixed budgets
- What to do: If the training budget is fixed, use horizon-dependent stepsize sequences (e.g., from Jain et al., 2019) to recover optimal 1/√T last-iterate performance.
- Tools/workflows: Configurable learning-rate schedules parameterized by target T; job schedulers that pass budget information to optimizers.
- Assumptions/dependencies: Requires reliable prior knowledge of T; not anytime.
Training dashboards that flag last-iterate risk under uncertain stop times
- Sector: MLOps observability
- What to do: Add indicators that warn when stopping is likely to be off a “good” checkpoint and recommend using averaged models or deferring snapshot to the next 2^k step.
- Tools/workflows: Plugins that compute and display proximity to the nearest doubling checkpoint; toggles to export averaged weights.
- Assumptions/dependencies: Theoretical lower bounds are worst-case; nevertheless, operational guardrails add robustness.
Procurement/benchmarking guidelines to report output rules
- Sector: Industry consortia, internal governance, academic benchmarks
- What to do: Require reporting whether results are last-iterate or averaged, and whether training horizon was pre-specified. Evaluate algorithms on random stop times in addition to fixed horizons.
- Tools/workflows: Benchmark protocols and leaderboards that include “anytime performance” metrics.
- Assumptions/dependencies: Simple to implement; helps avoid misleading last-iterate comparisons.
Online and edge learning with tail averaging for horizon-free performance
- Sector: Mobile personalization, recommender systems, IoT, smart grids
- What to do: Use suffix averaging or exponentially weighted moving averages (EWMAs) for models updated on-device or in streams, where stopping is unpredictable.
- Tools/workflows: Streaming aggregators with bounded memory; periodic export of averaged parameters for inference.
- Assumptions/dependencies: Convex tasks (e.g., online linear models) benefit most; for non-convex, test empirically.
AutoML and hyperparameter search that include output-rule as a choice
- Sector: AutoML platforms
- What to do: Treat “last iterate vs (tail) averaging” and “checkpoint cadence (e.g., 2^k)” as tunable knobs in hyperparameter search spaces.
- Tools/workflows: AutoML templates with output-rule selection and checkpoint schedules; evaluation at both fixed and random stop times.
- Assumptions/dependencies: Adds minor complexity; improves robustness under uncertain time budgets.
Library and API defaults for convex optimizers
- Sector: Open-source libraries (scikit-learn, PyTorch/TensorFlow wrappers for convex tasks), enterprise analytics stacks
- What to do: Default to returning an averaged solution for convex SGD/GD; expose a safe “anytime_export()” method that returns the current average or the most recent doubling checkpoint.
- Tools/workflows: Backward-compatible API changes with explicit flags.
- Assumptions/dependencies: Minimal computational overhead; clarifies expected performance.
Risk management in safety-critical training
- Sector: Healthcare diagnostics, credit scoring, industrial control
- What to do: Avoid relying on arbitrary stopping with last-iterate outputs. Mandate averaged outputs or horizon-aware schedules where budgets are fixed.
- Tools/workflows: Compliance checklists; model release gates that enforce these safeguards.
- Assumptions/dependencies: Aligns with worst-case guarantees; complements domain-specific validation.
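As one possible shape for the streaming accumulator mentioned in the first item, here is a minimal sketch of a uniform running average over parameter vectors, stored as plain Python lists. The class name `RunningAverage` and its methods are hypothetical and do not correspond to any existing library API; a real optimizer integration would likely operate on tensors instead.

```python
class RunningAverage:
    """Streaming uniform average of parameter vectors (plain Python lists)."""

    def __init__(self, dim):
        self.mean = [0.0] * dim
        self.count = 0

    def update(self, params):
        """Fold one new iterate into the average without storing past iterates."""
        self.count += 1
        w = 1.0 / self.count
        self.mean = [m + w * (p - m) for m, p in zip(self.mean, params)]

    def export(self):
        """Return a copy of the current average; callable at any stopping time."""
        return list(self.mean)

avg = RunningAverage(dim=2)
for iterate in ([1.0, 4.0], [2.0, 2.0], [3.0, 0.0]):
    avg.update(iterate)
print(avg.export())   # close to [2.0, 2.0], the mean of the three iterates
```

The accumulator keeps a single extra buffer of the same size as the parameters, so `export()` can be called whenever training is interrupted.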

Long-Term Applications

Densifying “good” stopping times beyond powers of two
- Sector: Optimization research, software tooling
- What to explore: Design restart/averaging schemes that yield a denser set of near-optimal stopping times while remaining horizon-free.
- Potential products: Optimizers with adaptive restart calendars that produce many safe-deploy checkpoints.
- Assumptions/dependencies: Requires theoretical advances beyond current doubling-trick practice.
Adaptive output rules that infer “effective horizons”
- Sector: ML research, AutoML
- What to explore: Meta-learning or control-based policies that use training signals (e.g., gradient norms, validation error curvature) to predict near-optimal snapshot times without explicit T.
- Potential products: Smart checkpointing modules integrated into trainers.
- Assumptions/dependencies: Heuristics may work well in practice but won’t beat worst-case lower bounds.
Smoothed/problem-transformed pipelines to improve anytime behavior
- Sector: Optimization in industry analytics
- What to explore: Apply smoothing (e.g., Moreau envelope, proximal regularization) to convert non-smooth convex problems into smooth ones where different anytime guarantees may apply, balancing bias vs. convergence.
- Potential products: Preprocessing modules that auto-tune smoothing strength.
- Assumptions/dependencies: Introduces approximation bias; needs careful validation.
Extensions to strongly-convex and smooth regimes with improved anytime gaps
- Sector: Academia and applied optimization
- What to explore: Close the gap between known upper bounds (e.g., log(T)/T vs 1/T) and new lower bounds in anytime settings; develop practical schedules reflecting the best-possible anytime rates.
- Potential products: New “anytime-optimized” learning-rate schedules for strongly-convex objectives.
- Assumptions/dependencies: Ongoing theoretical work; results will inform next-gen optimizers.
Standardized “anytime” evaluation in benchmarks and audits
- Sector: Policy/standards, regulatory audits
- What to explore: Introduce standardized tests that sample random stop times and report performance distributions, not just end-of-run metrics.
- Potential products: Audit toolkits that simulate preemptions/interruptions.
- Assumptions/dependencies: Community adoption needed; ties into responsible AI documentation.
Hardware/firmware support for streaming averaging and scheduled snapshots
- Sector: Edge devices, robotics, embedded systems
- What to explore: Accelerator firmware APIs that maintain running/tail averages with negligible overhead and trigger snapshots at configured schedules.
- Potential products: On-device learning SDKs with built-in averaging/checkpoint primitives.
- Assumptions/dependencies: Requires vendor support; benefits on-device online learning tasks.
Robust online controllers with averaged parameter policies
- Sector: Robotics, autonomous systems, energy systems
- What to explore: Controllers that maintain and deploy averaged parameters for stability when updates occur at unpredictable times, reducing sensitivity to last-iterate suboptimality.
- Potential products: Middleware that blends fast adaptation with averaged safety baselines.
- Assumptions/dependencies: Must reconcile latency and stability trade-offs; domain validation required.
Policy guidance on transparency of training/stopping protocols
- Sector: Governance, compliance
- What to explore: Guidelines that require disclosure of stopping protocols and output rules (last iterate vs averaged) for online-learning systems in regulated domains.
- Potential products: Model cards and documentation standards incorporating “anytime behavior.”
- Assumptions/dependencies: Coordination with regulators and standards bodies.
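The "anytime" evaluation idea above (sampling random stop times and reporting a distribution rather than a single end-of-run number) could be prototyped along the following lines. The function `anytime_profile`, its summary statistics, and the synthetic error curve are all hypothetical choices made for illustration.

```python
import random

def anytime_profile(errors, num_samples=100, seed=0):
    """Sample stop times uniformly from 1..len(errors) and summarize the error there.

    errors[t - 1] is assumed to hold the recorded error of the iterate after t steps.
    """
    rng = random.Random(seed)
    stops = [rng.randint(1, len(errors)) for _ in range(num_samples)]
    sampled = sorted(errors[t - 1] for t in stops)
    return {
        "mean": sum(sampled) / len(sampled),
        "median": sampled[len(sampled) // 2],
        "worst": sampled[-1],
        "final": errors[-1],
    }

# Hypothetical recorded curve: error 1/sqrt(t), inflated by 1.5x at odd steps.
curve = [(1.5 if t % 2 else 1.0) / t ** 0.5 for t in range(1, 1001)]
print(anytime_profile(curve))
```

Comparing `"final"` with `"mean"` and `"worst"` shows how much an end-of-run metric can understate the error seen at a typical interruption point.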

Notes on Assumptions and Scope

The paper’s lower bound is proved for worst-case convex, Lipschitz (non-smooth) objectives and highlights high-dimensional constructions; many practical tasks may not be worst-case.
Results apply to last-iterate performance without prior knowledge of T; average-iterate guarantees remain optimal with standard schedules.
For non-convex deep learning, the theory doesn’t directly apply; however, the operational lesson—avoid relying on arbitrary last-iterate snapshots—often aligns with practice (e.g., tail averaging/SWA).
Implementations may require tracking running averages, scheduling checkpoints, and modest changes to deployment pipelines.

Glossary

Anytime convergence rate guarantee: A bound that controls the last-iterate error uniformly over all times t, without knowing the horizon in advance. "is an anytime convergence rate guarantee"
Average iterate: The mean of the iterates up to time T, often enjoying better theoretical convergence with standard stepsizes than the final iterate. "the error of the average iterate"
Convex Lipschitz function: A convex function whose value changes at most linearly with distance, i.e., |f(x)−f(y)| ≤ G||x−y||. "a convex Lipschitz function"
Diameter (of a domain): The maximal distance between any two points in the domain. "diameter at most D"
Doubling trick: A scheduling technique that runs time-dependent procedures in phases of doubling lengths to obtain anytime guarantees. "doubling trick"
Empirical risk minimization: Optimization problems structured as minimizing the average loss over a dataset. "objectives with an empirical risk minimization structure"
Gradient descent (GD): A deterministic first-order optimization method that updates using exact (sub)gradients. "gradient descent (GD)"
Harmonic numbers: The partial sums of reciprocals, often appearing in bounds and analyses. "$H_n := \sum_{i=1}^{n} \frac{1}{i}$"
High-dimensionality: The role of large ambient dimension in constructing lower bounds or hardness examples. "high-dimensionality and not knowing when the algorithm should stop."
High probability bounds: Probabilistic guarantees that hold with probability close to one. "tight high probability bounds"
Information-theoretically optimal rate: The best achievable convergence rate dictated by fundamental (statistical) limits. "information-theoretically optimal rate"
Interpolation (near-) or low-noise regime: Settings where training data can be (nearly) fit exactly or stochastic gradient noise is small. "the so called (near-)interpolation or low-noise regime"
Last iterate convergence: The convergence behavior of the final iterate x_T, as opposed to averages or other output rules. "last iterate convergence"
Learning rate: Another name for the stepsize controlling update magnitudes in iterative methods. "also referred to as learning rate"
Max-of-linear functions: Functions defined as the maximum over linear forms, used to craft lower bounds in convex optimization. "max-of-linear functions"
Natural density: The asymptotic proportion of integers in a subset of N, used to formalize frequency of events like “good stopping times.” "positive natural density"
Noiseless case: A setting without stochastic noise, i.e., deterministic gradients/subgradients as in GD. "noiseless case of GD"
Poly-log factor: A multiplicative overhead that is a polynomial in log T. "excess poly-log factor in T"
Projection operator: The mapping that sends a point to its nearest point in a convex set. "the projection operator onto A"
Smooth convex deterministic optimization: Optimization of convex functions with Lipschitz-continuous gradients in a noise-free setting. "smooth convex deterministic optimization"
Stepsize sequence: The schedule of learning rates {η_t} used over iterations. "stepsize sequence"
Stochastic gradient descent (SGD): A first-order method using unbiased stochastic gradients. "stochastic gradient descent (SGD)"
Stopping time: The iteration at which the algorithm is halted; may be unknown a priori in anytime settings. "whenever the stopping time is not carefully chosen in advance."
Strongly-convex objectives: Convex objectives with a uniform curvature lower bound, yielding faster rates. "strongly-convex objectives"
Sub-gradient method: An optimization method for non-differentiable convex functions using subgradients. "the sub-gradient method"
Sub-gradient set: The set of all subgradients of a convex function at a point. "the sub-gradient set"
Zero-density set: A set whose natural density is zero, meaning it occurs with asymptotically vanishing frequency. "zero-density set."
