
Online Stochastic Mirror Descent (OSMD)

Updated 27 February 2026
  • Online Stochastic Mirror Descent is a versatile optimization method that uses mirror maps and Bregman divergence to adapt to non-Euclidean domains in online and stochastic settings.
  • It applies iterative updates in both dual and primal forms, recovering multiplicative weights on the simplex and projected SGD in Euclidean spaces.
  • It offers rigorous convergence and regret guarantees under diverse gradient conditions, proving useful in online learning, federated learning, and multi-stage stochastic programming.

Online Stochastic Mirror Descent (OSMD) is a general-purpose first-order method for online and stochastic convex optimization in arbitrary geometries. It exploits problem structure via a mirror map and Bregman divergence, producing updates that adapt to non-Euclidean domains, constraints, and statistical settings. OSMD is foundational in online learning, stochastic approximation, multi-stage stochastic programming, federated learning, and beyond. Its convergence and regret guarantees are now tightly unified across diverse regimes, including noisy/biased/adversarial gradients and complex constraint architectures.

1. Algorithmic Foundations and Update Structure

The OSMD framework operates over a convex set $W \subset \mathbb{R}^d$, possibly equipped with a non-Euclidean norm. A Legendre-type mirror map $\Phi: W \to \mathbb{R}$, assumed to be Fréchet differentiable and $\sigma_\Phi$-strongly convex, induces a Bregman divergence:

$$D_\Phi(u,v) := \Phi(u) - \Phi(v) - \langle \nabla\Phi(v), u - v \rangle \geq \frac{\sigma_\Phi}{2} \|u-v\|^2.$$

Given i.i.d. data $z_1, z_2, \ldots$ and a convex per-sample loss $\ell(w;z)$, OSMD iteratively computes

$$\text{(Dual form):}\quad \nabla\Phi(w_{t+1}) = \nabla\Phi(w_t) - \eta_t \nabla_w \ell(w_t; z_t),$$

or equivalently,

$$\text{(Primal form):}\quad w_{t+1} = \arg\min_{w \in W} \left\{ \langle \nabla_w\ell(w_t;z_t), w - w_t \rangle + \frac{1}{\eta_t} D_\Phi(w, w_t) \right\}.$$

When applied on the simplex with the negative-entropy mirror map, this recovers the multiplicative weights update; in Euclidean geometry, it reduces to projected SGD. The step size $\eta_t$ governs the learning dynamics, with convergence determined by its asymptotics (Lei et al., 2018; Srebro et al., 2011; Gasnikov et al., 2014).
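The two special cases above can be sketched concretely. This is a minimal NumPy illustration, not code from the cited papers; the loss, step size, and function names are illustrative assumptions:

```python
import numpy as np

def osmd_entropy_step(w, grad, eta):
    """One OSMD step on the probability simplex with the negative-entropy
    mirror map Phi(w) = sum_i w_i log w_i.  The dual update
    grad Phi(w_new) = grad Phi(w) - eta * grad, followed by Bregman
    projection onto the simplex, is exactly the multiplicative-weights rule."""
    w_new = w * np.exp(-eta * grad)   # exponentiated-gradient update
    return w_new / w_new.sum()        # normalization = Bregman projection

def osmd_euclidean_step(w, grad, eta):
    """With Phi(w) = ||w||^2 / 2 the same scheme is plain SGD; over all of
    R^d no projection is needed."""
    return w - eta * grad

# Example: a fixed linear loss <c, w> over the simplex.
rng = np.random.default_rng(0)
c = rng.normal(size=5)
w = np.full(5, 1.0 / 5)               # uniform start
for _ in range(200):
    w = osmd_entropy_step(w, c, eta=0.1)
# mass concentrates on the coordinate with the smallest cost
assert np.argmax(w) == np.argmin(c)
```

The normalization step plays the role of the Bregman projection onto the simplex, which for negative entropy has this closed form.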

2. Convergence Theory: Conditions and Rates

Required Step-Size Regimes

  • Positive-variance (noisy gradients): convergence in expectation requires $\lim_{t \to \infty} \eta_t = 0$ and $\sum_{t=1}^\infty \eta_t = \infty$.
  • Zero-variance (oracle/noiseless/empirical risk minimization): the condition reduces to $\sum_{t=1}^{\infty} \eta_t = \infty$; constant step sizes $\eta_t$ are admissible under strong convexity.

Convergence Guarantees

Let $w^*$ minimize the population risk $F(w) = \mathbb{E}[\ell(w;Z)]$, with $\ell(\cdot;z)$ $L$-smooth and $F$, $\Phi$ satisfying a local convexity-control condition. Then:

  • Positive-variance regime: under mild assumptions ($\inf_w \mathbb{E}[\|\nabla \ell(w;Z)\|_*] > 0$), the necessary and sufficient step-size conditions above guarantee

$$\mathbb{E}[D_\Phi(w^*, w_T)] \to 0.$$

In general (without strong convexity), no rate faster than $O(1/T)$ is possible. With strong convexity, selecting $\eta_t = 4 / [(t+1)\sigma_F]$ yields the minimax $O(1/T)$ rate.

  • Zero-variance regime: assumes $\mathbb{E}[\|\nabla \ell(w^*;Z)\|_*] = 0$. A constant step size gives linear convergence up to multiplicative constants:

$$c_1^T D_\Phi(w^*, w_1) \leq \mathbb{E}[D_\Phi(w^*, w_T)] \leq c_2^T D_\Phi(w^*, w_1).$$

  • Almost sure convergence: if $\sum_{t=1}^\infty \eta_t = \infty$, $\sum_{t=1}^{\infty} \eta_t^2 < \infty$, and the problem conditions hold, then $D_\Phi(w^*, w_t) \to 0$ almost surely (Lei et al., 2018).
  • General regret minimization: for regret relative to $w^*$, optimal rates $O(\sqrt{T})$ or $O(1/T)$ are achieved depending on the convexity structure (Srebro et al., 2011; Gasnikov et al., 2014).
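The strongly convex rate above can be checked numerically. The following sketch runs OSMD with the Euclidean mirror map (i.e., SGD) on an illustrative one-dimensional strongly convex risk; the problem instance is an assumption of this example, only the schedule $\eta_t = 4/[(t+1)\sigma_F]$ comes from the text:

```python
import numpy as np

# Toy instance: F(w) = E[(w - Z)^2 / 2] with Z ~ N(w_star, 1), so
# sigma_F = 1 and the minimizer is w_star.  Euclidean mirror map, hence
# D_Phi(w*, w_T) = (w_T - w*)^2 / 2.
rng = np.random.default_rng(1)
w_star, sigma_F = 3.0, 1.0
T = 20000
w = 0.0
for t in range(1, T + 1):
    z = w_star + rng.normal()
    grad = w - z                               # stochastic gradient of (w - z)^2 / 2
    w -= (4.0 / ((t + 1) * sigma_F)) * grad    # eta_t = 4 / ((t+1) sigma_F)
bregman_gap = (w - w_star) ** 2 / 2            # should be roughly O(1/T)
```

With $T = 20000$ the final Bregman gap is typically on the order of $10^{-4}$, consistent with the $O(1/T)$ rate.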

3. Geometry, Mirror Maps, and Domain Adaptation

The essence of OSMD is the selection of a mirror map $\Phi$ capturing the geometry of the feasible set or prior knowledge:

| Mirror map $\Phi$ | Domain | Bregman divergence | Strong convexity |
|---|---|---|---|
| $\tfrac{1}{2}\lVert w\rVert_2^2$ | $\mathbb{R}^d$ | $\tfrac{1}{2}\lVert u-v\rVert_2^2$ | $1$ |
| $\sum_{i=1}^d w_i \log w_i$ | probability simplex | $\sum_{i=1}^d u_i \ln(u_i/v_i)$ (KL) | $1$ (in $\ell_1$-norm) |
| $\tfrac{1}{2}\lVert w\rVert_p^2$, $1<p\leq 2$ | $\mathbb{R}^d$ | — | $\sigma_\Phi \asymp p-1$ |

Strong convexity and smoothness of $\Phi$ control stability and rates. The choice of $\Phi$ allows OSMD to natively handle sparse policies, simplex constraints, or adaptivity to group-norm, trace-norm, or Mahalanobis geometries (Gasnikov et al., 2014; Srebro et al., 2011; Lei et al., 2018).
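The first two rows of the table can be verified directly from the Bregman-divergence definition. A small sketch (helper names are illustrative):

```python
import numpy as np

def bregman(phi, grad_phi, u, v):
    """Generic Bregman divergence
    D_Phi(u, v) = Phi(u) - Phi(v) - <grad Phi(v), u - v>."""
    return phi(u) - phi(v) - np.dot(grad_phi(v), u - v)

# Row 1: Phi(w) = ||w||_2^2 / 2  ->  squared Euclidean distance.
sq = lambda w: 0.5 * np.dot(w, w)
sq_grad = lambda w: w

# Row 2: negative entropy (for strictly positive w)  ->  KL divergence
# on the simplex, where sum(u) = sum(v) = 1.
negent = lambda w: np.sum(w * np.log(w))
negent_grad = lambda w: np.log(w) + 1.0

u = np.array([0.2, 0.3, 0.5])
v = np.array([0.4, 0.4, 0.2])
kl = np.sum(u * np.log(u / v))   # closed-form KL divergence
assert np.isclose(bregman(negent, negent_grad, u, v), kl)
assert np.isclose(bregman(sq, sq_grad, u, v), 0.5 * np.sum((u - v) ** 2))
```

Note that the negative-entropy identity uses $\sum_i u_i = \sum_i v_i = 1$; off the simplex an extra $\sum_i (v_i - u_i)$ term appears.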

4. Algorithmic and Statistical Extensions

Biased and Dependent Gradients

In structured problems (e.g., SDE parameter estimation, Markov data), gradients are typically biased and temporally dependent. The OSMD framework extends with minor modifications:

$$\theta_{i+1} = \arg\min_{\theta \in \Theta} \left\{ \langle K_{i,n}(\theta_i), \theta \rangle + \frac{1}{\eta_i} D_\psi(\theta, \theta_i) \right\},$$

where the $K_{i,n}$ are biased proxies for the true gradients (Nakakita, 2022). Under suitable mixing and moment control,

$$\sup_{(a,b)\in S_\varpi,\, \theta\in\Theta} \mathbb{E}[f^{a,b}(\bar{\theta}_n) - f^{a,b}(\theta)] \leq O\!\left(\frac{\log(n h_n^2)}{\sqrt{n h_n^2}} + h_n^{\beta/2}\right)$$

in SDE drift estimation, with analogous results for diffusion coefficients.

Multi-Stage and Asynchronous Settings

In multi-stage stochastic programming, OSMD operates on a filtered probability space, producing stage-wise feasible decisions by interacting with “stochastic conditional gradient oracles”:

$$X_{t}^{(l+1)}(w) = \arg\min_{x_t \in X_t} \left\langle \gamma_l \tilde{G}_t^{(l)}(w), x_t \right\rangle + D_{v_t}(x_t, X_t^{(l)}(w)),$$

with oracles controlling bias/variance and respecting non-anticipativity constraints. An asynchronous “semi-online” protocol reduces oracle complexity from exponential to linear in the number of stages (Zhang et al., 18 Jun 2025).

Optimistic, Adaptive, and Bandit Variants

  • Optimistic OSMD: incorporates predictions of the next gradients to control regret via the cumulative stochastic variance $\sigma_{1:T}^2$ and adversarial variation $\Sigma_{1:T}^2$. For convex smooth losses:

$$\mathbb{E}[\text{Reg}_T] = O\left(D \sqrt{\sigma_{1:T}^{2}} + D \sqrt{\Sigma_{1:T}^{2}}\right),$$

interpolating between purely stochastic and adversarial regimes (Chen et al., 2023).

  • Bandit/importance-sampling regimes: OSMD is applied for online adaptive variance reduction (e.g., federated client sampling, coordinate descent) where only partial feedback is revealed. Updates are performed on restricted simplices under negative-entropy mirror maps, with regret bounds matching static and dynamic lower bounds (Zhao et al., 2021; Gasnikov et al., 2014).
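The optimistic variant is easiest to see with the Euclidean mirror map, where both proximal steps are closed-form. The following sketch uses the common last-gradient prediction $m_t = g_{t-1}$ on a toy linear loss over a box; the instance, constants, and prediction rule are illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, eta = 4, 500, 0.05
c_base = np.array([0.8, -0.5, 1.2, -0.9])  # fixed linear-loss direction (toy)
y = np.zeros(d)                            # secondary (mirror-descent) iterate
m = np.zeros(d)                            # gradient prediction

for t in range(T):
    # Optimistic step: play x_t using the prediction m_t.
    x = np.clip(y - eta * m, -1.0, 1.0)
    # Observe a noisy gradient of f_t(x) = <c_t, x>.
    g = c_base + 0.1 * rng.normal(size=d)
    # Standard mirror-descent step on the realized gradient.
    y = np.clip(y - eta * g, -1.0, 1.0)
    # Last-gradient prediction for the next round.
    m = g

# Best fixed action in hindsight on the box [-1, 1]^d for a linear loss.
comparator = -np.sign(c_base)
```

When the gradient sequence is nearly constant (small $\Sigma_{1:T}^2$), the prediction is accurate and the iterate locks onto the comparator quickly.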

5. Regret, Statistical Risk, and High-Probability Bounds

OSMD guarantees optimal regret rates across a wide range of geometries and feedback models. For convex losses with bounded stochastic gradients:

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T f(x_t) - f(x^*)\right] \leq \frac{R G}{\sqrt{T}},$$

with high-probability versions obtained via martingale techniques in sub-Gaussian or bounded-difference scenarios (Srebro et al., 2011; Gasnikov et al., 2014; Fang et al., 2020). For mirror maps normalized such that $R^2 = \sup_{x \in K} \Phi(x) - \Phi(x_0)$, one obtains dimension-dependent or -independent bounds: on the simplex, $R^2 = \ln d$, so regret scales as $O(\sqrt{\ln d / T})$.
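The simplex normalization $R^2 = \ln d$ can be checked directly: with negative entropy, $\sup_x \Phi(x) = 0$ (attained at a vertex) and $\Phi(x_0) = -\ln d$ at the uniform point. A quick numerical sketch:

```python
import numpy as np

# Check: for Phi(x) = sum_i x_i log x_i on the simplex and x0 uniform,
# R^2 = sup_x Phi(x) - Phi(x0) = ln d.
d = 10
# 0 log 0 = 0 convention via masked log.
phi = lambda x: np.sum(x * np.log(x, out=np.zeros_like(x), where=x > 0))

x0 = np.full(d, 1.0 / d)          # uniform starting point, Phi(x0) = -ln d
vertex = np.zeros(d); vertex[0] = 1.0  # sup of Phi over the simplex is 0
R2 = phi(vertex) - phi(x0)
assert np.isclose(R2, np.log(d))
```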

In stochastic constrained problems, the primal-dual OSMD framework yields both $O(\sqrt{T})$ regret and $O(\sqrt{T})$ constraint violation without requiring the Slater condition, and on the simplex the logarithmic dependence on dimension is retained (Wei et al., 2019).

6. Applications and Impact

OSMD is a central tool in:

  • Large-scale convex and structured optimization: adaptive updates in $\ell_1/\ell_\infty$ settings, robust regression/classification, compressed sensing, and matrix completion (Lei et al., 2018; Gasnikov et al., 2014).
  • Stochastic programming: Single- and multi-stage settings, with linear-time asynchronous algorithms for scenario trees (Zhang et al., 18 Jun 2025).
  • Online learning and regret minimization: Expert advice, adversarial bandits, adaptive portfolio selection (Srebro et al., 2011, Zhao et al., 2021).
  • Federated learning and sampling: Adaptive client sampling, stochastic coordinate/mini-batch selection for distributed optimization (Zhao et al., 2021).
  • SDE parameter inference: Online estimation with non-i.i.d., biased, and dependent gradient structures (Nakakita, 2022).
  • Constraint handling: Primal-dual/variance-reduced extensions for equality and inequality constraints with stochastic or partial feedback (Wei et al., 2019, Fang et al., 2020).

The universality of OSMD is now mathematically formalized: for any learnable online convex setup, there is an appropriate mirror map such that OSMD achieves minimax-optimal rates up to logarithmic factors (Srebro et al., 2011). This covers both stochastic and adversarial, static and dynamic, bandit and full-information, and constrained regimes, subsuming a wide range of variance reduction and adaptive methods.

7. Proof Architecture and Core Lemmatic Structure

The convergence and regret analyses for OSMD are united by a one-step “Bregman progress” lemma:

$$\mathbb{E}_{z_t} [D_\Phi(w^*, w_{t+1})] - D_\Phi(w^*, w_t) = \eta_t \langle w^* - w_t, \nabla F(w_t) \rangle + \mathbb{E}_{z_t}[D_\Phi(w_t, w_{t+1})].$$

This identity supports both necessity and sufficiency arguments: telescoping, then bounding the individual terms via the strong convexity and smoothness of the mirror map and the loss. In unconstrained problems, the analysis reduces to managing the noise/curvature trade-off via the step size. In constrained or dual settings, an additional drift-and-penalty decomposition is employed, with dual-queue techniques to cap the growth of constraint violations (Lei et al., 2018; Fang et al., 2020; Wei et al., 2019).
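In the unconstrained Euclidean case ($\Phi(w) = \tfrac{1}{2}\|w\|_2^2$, update $w_{t+1} = w_t - \eta_t g_t$), the progress identity holds pathwise for each realized gradient $g_t$; taking $\mathbb{E}_{z_t}$ recovers the stated form. A quick numerical check of the pathwise identity:

```python
import numpy as np

rng = np.random.default_rng(3)
D = lambda u, v: 0.5 * np.sum((u - v) ** 2)  # Bregman div. of Phi = ||.||^2/2

w_star = rng.normal(size=5)
w_t = rng.normal(size=5)
g_t = rng.normal(size=5)        # a realized stochastic gradient
eta = 0.3
w_next = w_t - eta * g_t        # unconstrained Euclidean OSMD step

# Pathwise progress identity (g_t in place of grad F(w_t)):
lhs = D(w_star, w_next) - D(w_star, w_t)
rhs = eta * np.dot(w_star - w_t, g_t) + D(w_t, w_next)
assert np.isclose(lhs, rhs)
```

With constraints or a general mirror map, the identity picks up first-order optimality terms from the Bregman projection and holds only in expectation, which is where the telescoping argument operates.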

High-probability and dynamic regret bounds are achieved by coupling this decomposition with martingale concentration, mixing-time, or pathwise variation control, depending on the regime. All variants preserve the mirror descent Bregman-potential structure as the central analytic invariant.


