
Online Gradient Descent (OGD): An Overview

Updated 26 June 2026
  • Online Gradient Descent (OGD) is a sequential optimization method that updates decisions using negative gradient steps and projections onto feasible sets.
  • It achieves optimal regret bounds of O(√T) for convex losses and O(log T) for strongly convex losses when its step sizes are tuned appropriately.
  • Extensions of OGD include pairwise, kernelized, proximal, and variance-reduced variants, making it robust for large-scale, streaming, and non-convex learning tasks.

Online Gradient Descent (OGD) is a foundational algorithmic paradigm for solving sequential optimization problems where loss functions are revealed over time and decisions must be updated adaptively based on current information. In its canonical form, OGD iteratively updates a decision variable by stepping in the direction of the negative (sub)gradient of the most recently observed loss, followed by a projection onto the feasible set if necessary. OGD and its variants are central to online learning, adversarial optimization, large-scale convex and non-convex learning, and robust streaming data analysis. Its performance is characterized by regret bounds—measures of excess loss relative to the best static action in hindsight—often attaining optimal rates for broad classes of problems. Over the past two decades, OGD's theoretical properties, algorithmic refinements, and extensions to complex learning scenarios (pairwise, kernelized, stochastic/inexact, adaptive, and beyond) have been extensively analyzed and applied in contemporary machine learning and online decision-making.

1. Canonical OGD: Framework and Baseline Guarantees

In standard online convex optimization (OCO), the learner selects a point $x_t$ from a convex feasible set $K \subseteq \mathbb{R}^d$ at each round $t = 1, 2, \dots, T$. After making the selection, a convex loss function $f_t: K \to \mathbb{R}$ is revealed and incurred. The OGD update is:

$$x_{t+1} = \Pi_K(x_t - \eta_t \nabla f_t(x_t)),$$

where $\Pi_K$ denotes Euclidean projection onto $K$ and $\eta_t$ is the step size. Regret is defined by

$$R_T = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x).$$
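
The update and the regret definition above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the feasible set is taken to be a Euclidean ball, the losses are linear ($f_t(x) = \langle c_t, x \rangle$) so the best fixed comparator has a closed form, and the names `project_l2_ball`, `ogd`, and `linear_regret` are hypothetical helpers introduced here.

```python
import math
import random


def project_l2_ball(x, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def ogd(grad_fns, x0, radius, step_size):
    """Projected OGD; returns the played points x_1, ..., x_T."""
    x = list(x0)
    played = []
    for t, grad in enumerate(grad_fns, start=1):
        played.append(x)                      # x_t is committed before f_t is seen
        g = grad(x)
        eta = step_size(t)
        x = project_l2_ball([xi - eta * gi for xi, gi in zip(x, g)], radius)
    return played


def linear_regret(costs, played, radius):
    """Regret for f_t(x) = <c_t, x> against the best fixed point in the ball."""
    incurred = sum(sum(ci * xi for ci, xi in zip(c, x))
                   for c, x in zip(costs, played))
    total = [sum(col) for col in zip(*costs)]
    best = -radius * math.sqrt(sum(v * v for v in total))  # attained at -r * total/||total||
    return incurred - best


rng = random.Random(0)
d, T, radius = 5, 2000, 1.0
costs = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(T)]
G, D = math.sqrt(d), 2.0 * radius             # gradient-norm bound and diameter of K
grad_fns = [(lambda x, c=c: c) for c in costs]  # gradient of a linear loss is c_t
played = ogd(grad_fns, [0.0] * d, radius, lambda t: D / (G * math.sqrt(t)))
regret = linear_regret(costs, played, radius)
print(f"regret = {regret:.2f}, 1.5*G*D*sqrt(T) = {1.5 * G * D * math.sqrt(T):.2f}")
```

The step size `D / (G * sqrt(t))` is the textbook choice for convex Lipschitz losses; other schedules can be passed through `step_size` without changing the loop.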

For general convex Lipschitz losses, OGD achieves $R_T = O(\sqrt{T})$ with $\eta_t \propto 1/\sqrt{t}$. For $\mu$-strongly convex losses, setting $\eta_t = 1/(\mu t)$ yields $R_T = O(\log T)$, which is information-theoretically optimal for this regime (Jordan et al., 2023, Garber, 2018).
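
For readers who want the reasoning behind the square-root rate, the standard one-step argument is sketched below. The symbols $D$ (diameter of $K$), $G$ (a bound on $\|\nabla f_t(x)\|$), and $x^\star$ (the best fixed point in hindsight) are notation introduced here for the sketch, and the step size $\eta_t = D/(G\sqrt{t})$ is the textbook choice rather than one stated in the text.

```latex
\begin{align*}
\|x_{t+1} - x^\star\|^2
  &\le \|x_t - \eta_t \nabla f_t(x_t) - x^\star\|^2
     && \text{(projection onto convex $K$ is nonexpansive)} \\
  &= \|x_t - x^\star\|^2
     - 2\eta_t \langle \nabla f_t(x_t),\, x_t - x^\star \rangle
     + \eta_t^2 \|\nabla f_t(x_t)\|^2 . \\[4pt]
f_t(x_t) - f_t(x^\star)
  &\le \langle \nabla f_t(x_t),\, x_t - x^\star \rangle
     && \text{(convexity of $f_t$)} \\
  &\le \frac{\|x_t - x^\star\|^2 - \|x_{t+1} - x^\star\|^2}{2\eta_t}
     + \frac{\eta_t}{2} G^2 . \\[4pt]
R_T
  &\le \frac{D^2}{2\eta_T} + \frac{G^2}{2} \sum_{t=1}^{T} \eta_t
     && \text{(telescoping with nonincreasing $\eta_t$)} \\
  &\le \frac{DG\sqrt{T}}{2} + DG\sqrt{T}
   = \tfrac{3}{2}\, DG\sqrt{T}
     && \text{(using $\textstyle\sum_{t=1}^{T} t^{-1/2} \le 2\sqrt{T}$)} .
\end{align*}
```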

2. Advanced Regimes: Strong Convexity, Exp-Concavity, and Hidden-Convexity

OGD is "doubly optimal" in two senses:

  • For strongly convex costs, OGD attains $O(\log T)$ regret with a $1/t$-decaying step size (Jordan et al., 2023).
  • For multi-agent strongly monotone games, simultaneous OGD by all agents ensures $O(1/T)$ last-iterate convergence to the Nash equilibrium (Jordan et al., 2023).
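
The single-agent case in the first item can be checked numerically. The sketch below is an illustration under assumptions made here: scalar quadratic losses $f_t(x) = \tfrac{\mu}{2}(x - z_t)^2$ on $K = [-1, 1]$, the usual step size $\eta_t = 1/(\mu t)$, and the textbook bound $\tfrac{G^2}{2\mu}(1 + \ln T)$ with $G = 2\mu$ for comparison. The function names are hypothetical.

```python
import math
import random


def clip(x, lo, hi):
    """Euclidean projection onto the interval [lo, hi]."""
    return max(lo, min(hi, x))


def ogd_strongly_convex(targets, mu, lo=-1.0, hi=1.0, x0=0.0):
    """OGD on f_t(x) = (mu/2) * (x - z_t)^2 with eta_t = 1 / (mu * t)."""
    x, played = x0, []
    for t, z in enumerate(targets, start=1):
        played.append(x)
        grad = mu * (x - z)
        x = clip(x - grad / (mu * t), lo, hi)
    return played


def regret(targets, played, mu):
    """Regret against the best fixed point, which is the mean of the targets."""
    loss = lambda x, z: 0.5 * mu * (x - z) ** 2
    best = sum(targets) / len(targets)
    incurred = sum(loss(x, z) for x, z in zip(played, targets))
    return incurred - sum(loss(best, z) for z in targets)


rng = random.Random(1)
mu = 2.0
for T in (100, 1000, 10000):
    targets = [rng.uniform(-1.0, 1.0) for _ in range(T)]
    r = regret(targets, ogd_strongly_convex(targets, mu), mu)
    bound = (2.0 * mu) ** 2 / (2.0 * mu) * (1.0 + math.log(T))  # G^2/(2 mu) * (1 + ln T)
    print(f"T={T:6d}  regret={r:8.3f}  bound={bound:8.3f}")
```

With this loss family the iterate after round $t$ is simply the running mean of $z_1, \dots, z_t$, which makes the logarithmic growth of the regret easy to see as `T` increases by factors of ten.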

When the strong convexity parameter is unknown, adaptive OGD variants such as AdaOGD use randomized step sizes (e.g., drawn from a geometric distribution) to attain $O(\log^2 T)$ regret, matching the optimal $O(\log T)$ up to logarithmic factors (Jordan et al., 2023). Similarly, for exp-concave losses, AdaONS achieves $O(d \log^2 T)$ regret (where $d$ is the dimension).

In "hidden-convex" online learning, losses are non-convex in the decision variable but become convex after a smooth invertible reparameterization. Under geometric "Hessian compatibility" (the existence of a potential whose Hessian matches the pullback metric of the reparameterization), OGD achieves the standard $O(\sqrt{T})$ regret bound despite non-convexity. Failure of this geometric condition provably leads to linear regret, establishing the necessity of metric integrability for optimality (Barakat et al., 25 May 2026).

3. Model Extensions: Pairwise, Kernelized, and Limited-Memory OGD

Online pairwise learning, where losses depend on pairs of examples (such as AUC maximization or metric learning), historically incurred a computational cost that grows with the number of past examples when each update is naively computed against all past pairs. Practical OGD variants address this as follows:

  • Short-buffer OGD: Pair each new example only with the previous example, reducing computation to a single pairwise gradient per update. This achieves optimal statistical rates for both convex and Polyak-Łojasiewicz (PL) objectives while maintaining minimal memory and computation. Recent work provides argument stability, optimization, and privacy guarantees for a buffer of size one (Yang et al., 2021).
  • Efficient kernelization: Applying Random Fourier Features (RFF) to approximate shift-invariant kernels allows OGD-type algorithms to operate in high-dimensional RKHSs with only a fixed number of random feature dimensions, incurring a small uniform kernel approximation error. Stratified or moving-average buffering reduces both computation and gradient variance, ensuring sublinear regret even under non-i.i.d. streams and adversarial ordering (AlQuabeh et al., 2023, AlQuabeh et al., 2024).
  • Buffer and dynamic averaging: Limiting the buffer to a fixed number of representatives (via clustering or moving averages) and combining with random past samples or stratified sampling leads to unbiased, variance-reduced gradient estimators and state-of-the-art tradeoffs between memory, computation, and regret in large-scale streaming applications (AlQuabeh et al., 2023, AlQuabeh et al., 2024).
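
The short-buffer idea from the first item can be sketched as follows. This is an illustration only, under assumptions chosen here rather than taken from the cited work: a linear scorer, a pairwise hinge loss as an AUC surrogate, a $1/\sqrt{t}$ step size, projection onto a Euclidean ball, and synthetic two-class Gaussian data. All function names are hypothetical.

```python
import math
import random


def pairwise_hinge_grad(w, xa, ya, xb, yb):
    """Subgradient of max(0, 1 - s * <w, xa - xb>) with s = sign(ya - yb)."""
    if ya == yb:
        return [0.0] * len(w)              # same-label pairs carry no ranking signal
    s = 1.0 if ya > yb else -1.0
    diff = [a - b for a, b in zip(xa, xb)]
    margin = s * sum(wi * di for wi, di in zip(w, diff))
    if margin >= 1.0:
        return [0.0] * len(w)
    return [-s * di for di in diff]


def short_buffer_pairwise_ogd(stream, dim, eta0=0.5, radius=10.0):
    """Pair each example only with its predecessor: O(dim) memory and work per round."""
    w, prev = [0.0] * dim, None
    for t, (x, y) in enumerate(stream, start=1):
        if prev is not None:
            g = pairwise_hinge_grad(w, x, y, prev[0], prev[1])
            eta = eta0 / math.sqrt(t)
            w = [wi - eta * gi for wi, gi in zip(w, g)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            if norm > radius:              # projection onto the ball of given radius
                w = [wi * radius / norm for wi in w]
        prev = (x, y)                      # the buffer holds exactly one example
    return w


def auc(w, data):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    pos = [score(x) for x, y in data if y == 1]
    neg = [score(x) for x, y in data if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sample(rng, n):
    """Synthetic data: class 1 centered at (+1, +1), class 0 at (-1, -1)."""
    data = []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        m = 1.0 if y == 1 else -1.0
        data.append(([rng.gauss(m, 1.0), rng.gauss(m, 1.0)], y))
    return data


rng = random.Random(2)
w = short_buffer_pairwise_ogd(sample(rng, 2000), dim=2)
test_set = sample(rng, 400)
print("learned w:", [round(wi, 3) for wi in w], " test AUC:", round(auc(w, test_set), 3))
```

Larger buffers or the sampling schemes from the other items would replace the single `prev` slot with a small collection and average the pairwise gradients over it.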

4. Inexact, Proximal, and Variance-Reduced OGD

Many learning problems involve composite objectives or prohibitively expensive full gradient computation:

  • Inexact Proximal OGD: Generalizes OGD to handle composite losses that sum a differentiable (possibly strongly convex) term and a convex but possibly non-smooth term. Only inexact gradients of the differentiable term, with bounded error, are assumed available. The regret is then controlled by the sum of gradient errors and the path length of the best comparator sequence (dynamic regret). For losses with a finite-sum structure, SVRG-style variance reduction techniques can be incorporated, yielding dynamic regret guarantees under mild regularity (Dixit et al., 2018).
  • Stochastic and dependent feedback: For stochastic differential equations, OGD—viewed as stochastic mirror descent—can be rigorously applied even when subgradients are biased and temporally dependent. Uniform regret bounds emerge through the interplay of mirror descent theory, ergodicity of the underlying process, and approximation control in surrogate losses (Nakakita, 2022).
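
The inexact proximal update from the first item can be illustrated with a small sketch. The setup is an assumption made for illustration: the smooth term is a per-round squared error $\tfrac12(\langle a_t, x\rangle - b_t)^2$, the non-smooth term is $\lambda\|x\|_1$ (whose proximal operator is soft-thresholding), and the gradient error is simulated as bounded uniform noise. The names `soft_threshold` and `inexact_prox_ogd` are hypothetical, and no variance reduction is included.

```python
import math
import random


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1, applied coordinate-wise."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]


def inexact_prox_ogd(rounds, dim, eta, lam, grad_err, rng):
    """Gradient step on the smooth term with a perturbed gradient, then the prox step."""
    x = [0.0] * dim
    for a, b in rounds:
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b
        # Exact gradient of 0.5 * (<a, x> - b)^2 plus a bounded perturbation.
        g = [residual * ai + rng.uniform(-grad_err, grad_err) for ai in a]
        x = soft_threshold([xi - eta * gi for xi, gi in zip(x, g)], eta * lam)
    return x


rng = random.Random(4)
x_true = [1.0, 0.0, 0.0, -2.0, 0.0]          # sparse target used to generate b_t
rounds = []
for _ in range(3000):
    a = [rng.gauss(0.0, 1.0) for _ in x_true]
    rounds.append((a, sum(ai * xi for ai, xi in zip(a, x_true))))

x_hat = inexact_prox_ogd(rounds, dim=5, eta=0.05, lam=0.01, grad_err=0.01, rng=rng)
print("estimate:", [round(v, 3) for v in x_hat])
```

The final iterate tracks the sparse target up to a small bias that scales with `lam` and `grad_err`, which mirrors the way the regret bound degrades with the accumulated gradient error.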

5. Regret Beyond Classical Convexity: Curvature and Conditioning

Standard OGD analysis assumes convexity but not curvature. However, polyhedral domains and loss structures with curvature short of strong convexity permit sharper rates. When the loss functions satisfy certain quadratic-growth or low-rank conditions (e.g., via Hoffman's lemma for polytopes), OGD attains $O(\log T)$ regret, matching Online Newton Step (ONS) methods, while retaining OGD's low per-round time and memory cost (Garber, 2018). This confers computational tractability in large-scale settings.

Adaptive preconditioning via per-coordinate learning rates ("online conditioning") further improves regret bounds when gradients are sparse or anisotropic. The diagonal preconditioner (coordinate-wise normalization by accumulated squared gradients) attains regret

$$R_T = O\left(\sum_{i=1}^{d} D_i \sqrt{\sum_{t=1}^{T} g_{t,i}^2}\right),$$

where $g_{t,i}$ is the $i$-th coordinate of $\nabla f_t(x_t)$ and $D_i$ is the width of $K$ along coordinate $i$. This can substantially outperform isotropic OGD, especially in high-dimensional regimes with non-uniform feature frequencies (Streeter et al., 2010).
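
A minimal sketch of the per-coordinate normalization described above is given below. The details are assumptions made for illustration: the feasible set is a box (so projection reduces to clipping each coordinate), the step for coordinate $i$ is `scale / sqrt(sum of squared g_i so far)`, and the stream is a noiseless sparse regression problem in which one feature is rare and one never fires. The function name `diag_ogd_step` is hypothetical.

```python
import math
import random


def diag_ogd_step(x, g, sq_sum, scale, lo, hi):
    """One per-coordinate step with eta_i = scale / sqrt(accumulated g_i^2)."""
    new_x, new_sq = list(x), list(sq_sum)
    for i, gi in enumerate(g):
        if gi == 0.0:
            continue                       # untouched coordinates keep their state
        new_sq[i] += gi * gi
        eta_i = scale / math.sqrt(new_sq[i])
        new_x[i] = max(lo, min(hi, new_x[i] - eta_i * gi))  # box projection = clipping
    return new_x, new_sq


rng = random.Random(3)
w_true = [0.5, -0.3, 0.8, 0.0]
active_prob = [0.9, 0.5, 0.05, 0.0]        # frequent, moderate, rare, never active
x, sq_sum, losses = [0.0] * 4, [0.0] * 4, []
for _ in range(5000):
    a = [1.0 if rng.random() < p else 0.0 for p in active_prob]
    residual = sum(ai * (xi - wi) for ai, xi, wi in zip(a, x, w_true))
    losses.append(0.5 * residual ** 2)     # loss of the point played this round
    x, sq_sum = diag_ogd_step(x, [residual * ai for ai in a], sq_sum, 0.5, -1.0, 1.0)

print("x:", [round(v, 3) for v in x])
print("mean loss, first 200 rounds:", sum(losses[:200]) / 200)
print("mean loss, last 200 rounds: ", sum(losses[-200:]) / 200)
```

Because each coordinate keeps its own accumulator, the rarely active feature retains a large step size until it has actually been observed, which is the behavior that helps with non-uniform feature frequencies.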

6. Computational, Memory, and Practical Aspects

Table: Practical aspects of OGD variants. Columns: variant, per-iteration cost, memory. Rows: standard OGD; short-buffer OGD (pairwise); buffer-based OGD; kernel OGD via RFF; online conditioning (per-coordinate); proximal/variance-reduced OGD (minibatch).

Experimental results across multiple large-scale datasets (e.g., a9a, MNIST, RCV1, Web advertising) consistently show that memory-efficient, adaptive, and kernelized OGD variants outperform or match more computationally intensive baselines—such as full-batch ONS—with significantly lower runtime and resource requirements (AlQuabeh et al., 2023, AlQuabeh et al., 2024, Streeter et al., 2010, Garber, 2018).

7. Limitations, Open Problems, and Extensions

The primary limitations and open questions for OGD and its variants include:

  • For hidden-convex nonconvex optimization, the necessity and sufficiency of Hessian-compatibility for sublinear regret exposes a geometric barrier. Extensions to other structure classes remain open (Barakat et al., 25 May 2026).
  • In the bandit setting, OGD with spherical smoothing matches the regret rate known for convex OCO under the same feedback, but obtaining improved rates in this setting, especially for hidden-convex or nonconvex losses, remains an active area (Barakat et al., 25 May 2026).
  • Decentralized and fully adaptive algorithms for unknown curvature exhibit logarithmic factors in optimality gaps. Tightening these bounds or designing parameter-free methods without such penalties remains of interest (Jordan et al., 2023).
  • In the context of streaming, non-i.i.d., or adversarial data, robustification of variance reduction and dynamic regret minimization requires further investigation, particularly for kernel and high-dimensional models (AlQuabeh et al., 2023, AlQuabeh et al., 2024).
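
To make the bandit item above concrete, the sketch below shows the standard one-point gradient estimator based on spherical smoothing: query the loss at a single randomly perturbed point and rescale. This is the classical construction, offered as an assumption about what "spherical smoothing" refers to; the cited work may use a different variant. The function names and the parameter `delta` (smoothing radius) are hypothetical.

```python
import math
import random


def random_unit_vector(d, rng):
    """Uniform sample from the unit sphere via a normalized Gaussian vector."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 0.0:
            return [vi / norm for vi in v]


def one_point_gradient(f, x, delta, rng):
    """Estimate the gradient of the delta-smoothed f from one function value."""
    d = len(x)
    u = random_unit_vector(d, rng)
    value = f([xi + delta * ui for xi, ui in zip(x, u)])  # the only bandit feedback used
    return [(d / delta) * value * ui for ui in u]


# Sanity check on a linear loss, whose smoothed gradient equals its true gradient.
a = [1.0, -2.0, 0.5]
f = lambda x: sum(ai * xi for ai, xi in zip(a, x))
rng = random.Random(5)
n = 20000
mean_est = [0.0] * len(a)
for _ in range(n):
    g = one_point_gradient(f, [0.0, 0.0, 0.0], delta=0.1, rng=rng)
    mean_est = [m + gi / n for m, gi in zip(mean_est, g)]
print("averaged estimate:", [round(m, 2) for m in mean_est], " true gradient:", a)
```

Plugging `one_point_gradient` into the OGD update in place of the exact gradient gives a bandit variant. The estimator's variance grows as `delta` shrinks, which is the source of the slower rates in this setting.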

A plausible implication is that OGD remains the core meta-algorithm for a wide array of sequential learning problems. Continued advances in adaptive control, preconditioning, memory-bounded and privacy-preserving schemes, as well as precise geometric characterizations, will likely further expand its applicability and practical impact.
