
Online Stochastic Mirror Descent (OSMD)

Updated 27 February 2026
  • Online Stochastic Mirror Descent is a versatile optimization method that uses mirror maps and Bregman divergence to adapt to non-Euclidean domains in online and stochastic settings.
  • It applies iterative updates in both dual and primal forms, recovering multiplicative weights on the simplex and projected SGD in Euclidean spaces.
  • It offers rigorous convergence and regret guarantees under diverse gradient conditions, proving useful in online learning, federated learning, and multi-stage stochastic programming.

Online Stochastic Mirror Descent (OSMD) is a general-purpose first-order method for online and stochastic convex optimization in arbitrary geometries. It exploits problem structure via a mirror map and Bregman divergence, producing updates that adapt to non-Euclidean domains, constraints, and statistical settings. OSMD is foundational in online learning, stochastic approximation, multi-stage stochastic programming, federated learning, and beyond. Its convergence and regret guarantees are now tightly unified across diverse regimes, including noisy/biased/adversarial gradients and complex constraint architectures.

1. Algorithmic Foundations and Update Structure

The OSMD framework operates over a convex set $W \subset \mathbb{R}^d$, possibly equipped with a non-Euclidean norm. A Legendre-type mirror map $\Phi: W \to \mathbb{R}$, assumed to be Fréchet differentiable and $\sigma_\Phi$-strongly convex, induces a Bregman divergence:

$$D_\Phi(u,v) := \Phi(u) - \Phi(v) - \langle \nabla\Phi(v), u - v \rangle \geq \frac{\sigma_\Phi}{2} \|u-v\|^2.$$

Given i.i.d. data $z_1, z_2, \ldots$ and a convex per-sample loss $\ell(w;z)$, OSMD iteratively computes

$$\text{(Dual form):}\quad \nabla\Phi(w_{t+1}) = \nabla\Phi(w_t) - \eta_t \nabla_w \ell(w_t; z_t),$$

or equivalently,

$$\text{(Primal form):}\quad w_{t+1} = \arg\min_{w \in W} \left\{ \langle \nabla_w\ell(w_t;z_t), w - w_t \rangle + \frac{1}{\eta_t} D_\Phi(w, w_t) \right\}.$$

When applied on the simplex with the negative-entropy mirror map, this recovers the multiplicative weights update; in Euclidean geometry, it reduces to projected SGD. The step size $\eta_t$ governs the learning dynamics, with convergence determined by its asymptotics (Lei et al., 2018; Srebro et al., 2011; Gasnikov et al., 2014).
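The two special cases above can be sketched concretely. This is a minimal NumPy illustration, not code from the cited papers; the loss, step size, and function names are illustrative assumptions:

```python
import numpy as np

def osmd_entropy_step(w, grad, eta):
    """One OSMD step on the probability simplex with the negative-entropy
    mirror map Phi(w) = sum_i w_i log w_i.  The dual update
    grad Phi(w_new) = grad Phi(w) - eta * grad, followed by Bregman
    projection onto the simplex, is exactly the multiplicative-weights rule."""
    w_new = w * np.exp(-eta * grad)   # exponentiated-gradient update
    return w_new / w_new.sum()        # normalization = Bregman projection

def osmd_euclidean_step(w, grad, eta):
    """With Phi(w) = ||w||^2 / 2 the same scheme is plain SGD; over all of
    R^d no projection is needed."""
    return w - eta * grad

# Example: a fixed linear loss <c, w> over the simplex.
rng = np.random.default_rng(0)
c = rng.normal(size=5)
w = np.full(5, 1.0 / 5)               # uniform start
for _ in range(200):
    w = osmd_entropy_step(w, c, eta=0.1)
# mass concentrates on the coordinate with the smallest cost
assert np.argmax(w) == np.argmin(c)
```

The normalization step plays the role of the Bregman projection onto the simplex, which for negative entropy has this closed form.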

2. Convergence Theory: Conditions and Rates

Required Step-Size Regimes

  • Positive-variance (noisy gradients): convergence in expectation requires $\lim_{t \to \infty} \eta_t = 0$ and $\sum_{t=1}^\infty \eta_t = \infty$.
  • Zero-variance (oracle/noiseless/empirical risk minimization): the condition reduces to $\sum_{t=1}^{\infty} \eta_t = \infty$; constant step sizes $\eta_t$ are admissible under strong convexity.

Convergence Guarantees

Let $w^*$ minimize the population risk $F(w) = \mathbb{E}[\ell(w;Z)]$, with $\ell(\cdot;z)$ $L$-smooth and $F$, $\Phi$ satisfying a local convexity-control condition. Then:

  • Positive-variance regime: under mild assumptions ($\inf_w \mathbb{E}[\|\nabla \ell(w;Z)\|_*] > 0$), the necessary and sufficient step-size conditions above guarantee

$$\mathbb{E}[D_\Phi(w^*, w_T)] \to 0.$$

In general (without strong convexity), no rate faster than $O(1/T)$ is possible. With strong convexity, selecting $\eta_t = 4 / [(t+1)\sigma_F]$ yields the minimax $O(1/T)$ rate.

  • Zero-variance regime: assumes $\mathbb{E}[\|\nabla \ell(w^*;Z)\|_*] = 0$. A constant step size gives linear convergence up to multiplicative constants:

$$c_1^T D_\Phi(w^*, w_1) \leq \mathbb{E}[D_\Phi(w^*, w_T)] \leq c_2^T D_\Phi(w^*, w_1).$$

  • Almost sure convergence: if $\sum_{t=1}^\infty \eta_t = \infty$, $\sum_{t=1}^{\infty} \eta_t^2 < \infty$, and the problem conditions hold, then $D_\Phi(w^*, w_t) \to 0$ almost surely (Lei et al., 2018).
  • General regret minimization: for regret relative to $w^*$, optimal rates $O(\sqrt{T})$ or $O(1/T)$ are achieved depending on the convexity structure (Srebro et al., 2011; Gasnikov et al., 2014).
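The strongly convex rate above can be checked numerically. The following sketch runs OSMD with the Euclidean mirror map (i.e., SGD) on an illustrative one-dimensional strongly convex risk; the problem instance is an assumption of this example, only the schedule $\eta_t = 4/[(t+1)\sigma_F]$ comes from the text:

```python
import numpy as np

# Toy instance: F(w) = E[(w - Z)^2 / 2] with Z ~ N(w_star, 1), so
# sigma_F = 1 and the minimizer is w_star.  Euclidean mirror map, hence
# D_Phi(w*, w_T) = (w_T - w*)^2 / 2.
rng = np.random.default_rng(1)
w_star, sigma_F = 3.0, 1.0
T = 20000
w = 0.0
for t in range(1, T + 1):
    z = w_star + rng.normal()
    grad = w - z                               # stochastic gradient of (w - z)^2 / 2
    w -= (4.0 / ((t + 1) * sigma_F)) * grad    # eta_t = 4 / ((t+1) sigma_F)
bregman_gap = (w - w_star) ** 2 / 2            # should be roughly O(1/T)
```

With $T = 20000$ the final Bregman gap is typically on the order of $10^{-4}$, consistent with the $O(1/T)$ rate.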

3. Geometry, Mirror Maps, and Domain Adaptation

The essence of OSMD is the selection of a mirror map $\Phi$ capturing the geometry of the feasible set or prior knowledge:

| Mirror map $\Phi$ | Domain | Bregman divergence | Strong convexity |
|---|---|---|---|
| $\tfrac{1}{2}\lVert w\rVert_2^2$ | $\mathbb{R}^d$ | $\tfrac{1}{2}\lVert u-v\rVert_2^2$ | $1$ |
| $\sum_{i=1}^d w_i \log w_i$ | probability simplex | $\sum_{i=1}^d u_i \ln(u_i/v_i)$ (KL) | $1$ (in $\ell_1$-norm) |
| $\tfrac{1}{2}\lVert w\rVert_p^2$, $1<p\leq 2$ | $\mathbb{R}^d$ | — | $\sigma_\Phi \asymp p-1$ |

Strong convexity and smoothness of $\Phi$ control stability and rates. The choice of $\Phi$ allows OSMD to natively handle sparse policies, simplex constraints, or adaptivity to group-norm, trace-norm, or Mahalanobis geometries (Gasnikov et al., 2014; Srebro et al., 2011; Lei et al., 2018).
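The first two rows of the table can be verified directly from the Bregman-divergence definition. A small sketch (helper names are illustrative):

```python
import numpy as np

def bregman(phi, grad_phi, u, v):
    """Generic Bregman divergence
    D_Phi(u, v) = Phi(u) - Phi(v) - <grad Phi(v), u - v>."""
    return phi(u) - phi(v) - np.dot(grad_phi(v), u - v)

# Row 1: Phi(w) = ||w||_2^2 / 2  ->  squared Euclidean distance.
sq = lambda w: 0.5 * np.dot(w, w)
sq_grad = lambda w: w

# Row 2: negative entropy (for strictly positive w)  ->  KL divergence
# on the simplex, where sum(u) = sum(v) = 1.
negent = lambda w: np.sum(w * np.log(w))
negent_grad = lambda w: np.log(w) + 1.0

u = np.array([0.2, 0.3, 0.5])
v = np.array([0.4, 0.4, 0.2])
kl = np.sum(u * np.log(u / v))   # closed-form KL divergence
assert np.isclose(bregman(negent, negent_grad, u, v), kl)
assert np.isclose(bregman(sq, sq_grad, u, v), 0.5 * np.sum((u - v) ** 2))
```

Note that the negative-entropy identity uses $\sum_i u_i = \sum_i v_i = 1$; off the simplex an extra $\sum_i (v_i - u_i)$ term appears.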

4. Algorithmic and Statistical Extensions

Biased and Dependent Gradients

In structured problems (e.g., SDE parameter estimation, Markov data), gradients are typically biased and temporally dependent. The OSMD framework extends with minor modifications:

$$\theta_{i+1} = \arg\min_{\theta \in \Theta} \left\{ \langle K_{i,n}(\theta_i), \theta \rangle + \frac{1}{\eta_i} D_\psi(\theta, \theta_i) \right\},$$

where the $K_{i,n}$ are biased proxies for the true gradients (Nakakita, 2022). Under suitable mixing and moment control,

$$\sup_{(a,b)\in S_\varpi,\, \theta\in\Theta} \mathbb{E}[f^{a,b}(\bar{\theta}_n) - f^{a,b}(\theta)] \leq O\!\left(\frac{\log(n h_n^2)}{\sqrt{n h_n^2}} + h_n^{\beta/2}\right)$$

in SDE drift estimation, with analogous results for diffusion coefficients.

Multi-Stage and Asynchronous Settings

In multi-stage stochastic programming, OSMD operates on a filtered probability space, producing stage-wise feasible decisions by interacting with “stochastic conditional gradient oracles”:

$$X_{t}^{(l+1)}(w) = \arg\min_{x_t \in X_t} \left\langle \gamma_l \tilde{G}_t^{(l)}(w), x_t \right\rangle + D_{v_t}(x_t, X_t^{(l)}(w)),$$

with oracles controlling bias/variance and respecting non-anticipativity constraints. An asynchronous “semi-online” protocol reduces oracle complexity from exponential to linear in the number of stages (Zhang et al., 18 Jun 2025).

Optimistic, Adaptive, and Bandit Variants

  • Optimistic OSMD: incorporates predictions of the next gradients to control regret via the cumulative stochastic variance $\sigma_{1:T}^2$ and adversarial variation $\Sigma_{1:T}^2$. For convex smooth losses:

$$\mathbb{E}[\text{Reg}_T] = O\left(D \sqrt{\sigma_{1:T}^{2}} + D \sqrt{\Sigma_{1:T}^{2}}\right),$$

interpolating between purely stochastic and adversarial regimes (Chen et al., 2023).

  • Bandit/importance-sampling regimes: OSMD is applied for online adaptive variance reduction (e.g., federated client sampling, coordinate descent) where only partial feedback is revealed. Updates are performed on restricted simplices under negative-entropy mirror maps, with regret bounds matching static and dynamic lower bounds (Zhao et al., 2021; Gasnikov et al., 2014).
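The optimistic variant is easiest to see with the Euclidean mirror map, where both proximal steps are closed-form. The following sketch uses the common last-gradient prediction $m_t = g_{t-1}$ on a toy linear loss over a box; the instance, constants, and prediction rule are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, eta = 4, 500, 0.05
c_base = np.array([0.8, -0.5, 1.2, -0.9])  # fixed linear-loss direction (toy)
y = np.zeros(d)                            # secondary (mirror-descent) iterate
m = np.zeros(d)                            # gradient prediction

for t in range(T):
    # Optimistic step: play x_t using the prediction m_t.
    x = np.clip(y - eta * m, -1.0, 1.0)
    # Observe a noisy gradient of f_t(x) = <c_t, x>.
    g = c_base + 0.1 * rng.normal(size=d)
    # Standard mirror-descent step on the realized gradient.
    y = np.clip(y - eta * g, -1.0, 1.0)
    # Last-gradient prediction for the next round.
    m = g

# Best fixed action in hindsight on the box [-1, 1]^d for a linear loss.
comparator = -np.sign(c_base)
```

When the gradient sequence is nearly constant (small $\Sigma_{1:T}^2$), the prediction is accurate and the iterate locks onto the comparator quickly.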

5. Regret, Statistical Risk, and High-Probability Bounds

OSMD guarantees optimal regret rates across a wide range of geometries and feedback models. For convex losses with bounded stochastic gradients:

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T f(x_t) - f(x^*)\right] \leq \frac{R G}{\sqrt{T}},$$

with high-probability versions obtained via martingale techniques in sub-Gaussian or bounded-difference scenarios (Srebro et al., 2011; Gasnikov et al., 2014; Fang et al., 2020). For mirror maps normalized such that $R^2 = \sup_{x \in K} \Phi(x) - \Phi(x_0)$, one obtains dimension-dependent or -independent bounds: on the simplex, $R^2 = \ln d$, so regret scales as $O(\sqrt{\ln d / T})$.
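The simplex normalization $R^2 = \ln d$ can be checked directly: with negative entropy, $\sup_x \Phi(x) = 0$ (attained at a vertex) and $\Phi(x_0) = -\ln d$ at the uniform point. A quick numerical sketch:

```python
import numpy as np

# Check: for Phi(x) = sum_i x_i log x_i on the simplex and x0 uniform,
# R^2 = sup_x Phi(x) - Phi(x0) = ln d.
d = 10
# 0 log 0 = 0 convention via masked log.
phi = lambda x: np.sum(x * np.log(x, out=np.zeros_like(x), where=x > 0))

x0 = np.full(d, 1.0 / d)          # uniform starting point, Phi(x0) = -ln d
vertex = np.zeros(d); vertex[0] = 1.0  # sup of Phi over the simplex is 0
R2 = phi(vertex) - phi(x0)
assert np.isclose(R2, np.log(d))
```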

In stochastic constrained problems, the primal-dual OSMD framework yields both $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violation without requiring the Slater condition, and on the simplex the logarithmic dependence on dimension is retained (Wei et al., 2019).

6. Applications and Impact

OSMD is a central tool in:

  • Large-scale convex and structured optimization: adaptive updates in $\ell_1/\ell_\infty$ settings, robust regression/classification, compressed sensing, and matrix completion (Lei et al., 2018; Gasnikov et al., 2014).
  • Stochastic programming: Single- and multi-stage settings, with linear-time asynchronous algorithms for scenario trees (Zhang et al., 18 Jun 2025).
  • Online learning and regret minimization: Expert advice, adversarial bandits, adaptive portfolio selection (Srebro et al., 2011, Zhao et al., 2021).
  • Federated learning and sampling: Adaptive client sampling, stochastic coordinate/mini-batch selection for distributed optimization (Zhao et al., 2021).
  • SDE parameter inference: Online estimation with non-i.i.d., biased, and dependent gradient structures (Nakakita, 2022).
  • Constraint handling: Primal-dual/variance-reduced extensions for equality and inequality constraints with stochastic or partial feedback (Wei et al., 2019, Fang et al., 2020).

The universality of OSMD is now mathematically formalized: for any learnable online convex setup, there is an appropriate mirror map such that OSMD achieves minimax-optimal rates up to logarithmic factors (Srebro et al., 2011). This covers both stochastic and adversarial, static and dynamic, bandit and full-information, and constrained regimes, subsuming a wide range of variance reduction and adaptive methods.

7. Proof Architecture and Core Lemmatic Structure

The convergence and regret analyses for OSMD are united by a one-step “Bregman progress” lemma:

$$\mathbb{E}_{z_t} [D_\Phi(w^*, w_{t+1})] - D_\Phi(w^*, w_t) = \eta_t \langle w^* - w_t, \nabla F(w_t) \rangle + \mathbb{E}_{z_t}[D_\Phi(w_t, w_{t+1})].$$

This identity supports both necessity and sufficiency arguments: telescoping, then bounding the individual terms via the strong convexity and smoothness of the mirror map and the loss. In unconstrained problems, the analysis reduces to managing the noise/curvature trade-off via the step size. In constrained or dual settings, an additional drift-and-penalty decomposition is employed, with dual-queue techniques to cap the growth of constraint violations (Lei et al., 2018; Fang et al., 2020; Wei et al., 2019).
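In the unconstrained Euclidean case ($\Phi(w) = \tfrac{1}{2}\|w\|_2^2$, update $w_{t+1} = w_t - \eta_t g_t$), the progress identity holds pathwise for each realized gradient $g_t$; taking $\mathbb{E}_{z_t}$ recovers the stated form. A quick numerical check of the pathwise identity:

```python
import numpy as np

rng = np.random.default_rng(3)
D = lambda u, v: 0.5 * np.sum((u - v) ** 2)  # Bregman div. of Phi = ||.||^2/2

w_star = rng.normal(size=5)
w_t = rng.normal(size=5)
g_t = rng.normal(size=5)        # a realized stochastic gradient
eta = 0.3
w_next = w_t - eta * g_t        # unconstrained Euclidean OSMD step

# Pathwise progress identity (g_t in place of grad F(w_t)):
lhs = D(w_star, w_next) - D(w_star, w_t)
rhs = eta * np.dot(w_star - w_t, g_t) + D(w_t, w_next)
assert np.isclose(lhs, rhs)
```

With constraints or a general mirror map, the identity picks up first-order optimality terms from the Bregman projection and holds only in expectation, which is where the telescoping argument operates.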

High-probability and dynamic regret bounds are achieved by coupling this decomposition with martingale concentration, mixing-time, or pathwise variation control, depending on the regime. All variants preserve the mirror descent Bregman-potential structure as the central analytic invariant.


