
Zeroth-Order Mirror Descent

Updated 7 February 2026
  • Zeroth-Order Mirror Descent is a zeroth-order optimization approach that estimates gradients from noisy function evaluations via smoothing and finite-difference techniques.
  • It leverages mirror maps and Bregman divergences to iteratively update solutions under non-Euclidean geometries, achieving concrete oracle-based convergence guarantees.
  • The framework applies to high-dimensional, block-structured, and minimax problems, balancing bias-variance trade-offs to enable robust optimization in both convex and nonconvex settings.

Zeroth-Order Mirror Descent Framework

The Zeroth-Order Mirror Descent (ZOMD) framework generalizes classical mirror descent methods to settings where only function value—rather than gradient—information is available from potentially biased and noisy oracles. This paradigm enables efficient optimization over general convex (and, in some extensions, nonconvex or composite) objectives, including those with high-dimensional, block-structured, or distributionally-robust characteristics under non-Euclidean geometries. Zeroth-order mirror descent has attained prominence due to its concrete oracle-based convergence guarantees, flexibility in handling bias and structure, and robustness to the absence of differentiability in the objective or regularizer (Paul et al., 2023, Shao et al., 2022, Wang et al., 2017).

1. Problem Formulation and Oracle Structure

The canonical ZOMD setup involves the minimization of a convex or composite objective over a compact or structured set: $\min_{x \in \mathcal{X}} f(x)$, where $f:\mathbb{R}^n\to\mathbb{R}$ is typically convex or satisfies smoothness/regularity assumptions on $\mathcal{X}\subset\mathbb{R}^n$ (Paul et al., 2023, Shao et al., 2022, Gu et al., 2024, Yu et al., 2019).

The oracle provides only noisy, possibly biased function values: $$\hat{f}(x) = f(x) + e(x,\omega), \qquad \mathbb{E}[e(x,\omega)] = b(x), \quad \|b(x)\|_* \leq B,$$ with noise variance $\mathbb{E}[e(x,\omega)^2] \leq V^2$. This structure applies in both convex and nonconvex (including minimax and block-structured) settings, facilitating general-purpose optimization in black-box, high-dimensional, or adversarial contexts (Paul et al., 2023, Gu et al., 2024, Yu et al., 2019).
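To make the oracle model concrete, here is a minimal sketch of a biased, noisy function-value oracle; the `bias` and `noise_std` arguments are illustrative stand-ins for the bias bound $B$ and variance bound $V^2$ above, and the quadratic objective is chosen only for the example:

```python
import numpy as np

def noisy_oracle(f, x, rng, bias=0.0, noise_std=0.1):
    """Return a biased, noisy function value: f(x) + b(x) + e.

    `bias` plays the role of b(x) (with |b(x)| <= B) and `noise_std`
    controls the zero-mean noise whose variance is bounded by V^2.
    """
    return f(x) + bias + noise_std * rng.standard_normal()

# Example: a quadratic objective queried only through the oracle.
rng = np.random.default_rng(0)
f = lambda x: float(np.dot(x, x))
x = np.array([1.0, -2.0])
val = noisy_oracle(f, x, rng, bias=0.01, noise_std=0.05)
```

The algorithm never sees `f` or its gradient directly, only values like `val`.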

2. Gradient Surrogates via Smoothing and Estimation

Smoothing a non-differentiable or noisy objective is achieved by convolution with a stochastic kernel, yielding a differentiable approximation: $$f_\mu(x) = \mathbb{E}_{u}[f(x+\mu u)],$$ with $u$ drawn from a Gaussian ($N(0,I)$), Rademacher ($\{\pm1\}^d$), or spherical uniform distribution, depending on the geometry (Paul et al., 2023, Shao et al., 2022, Gu et al., 2024, Wang et al., 2017).

Gradient surrogates are computed by querying finite-difference values along random directions:

  • Gaussian/Uniform smoothing (two-point):

$$\tilde{g}_t = \frac{\hat{f}(x_t+\mu u_t) - \hat{f}(x_t)}{\mu}\, u_t.$$

  • Rademacher smoothing (two-point, mini-batched):

$$g_t = \frac{1}{m\nu} \sum_{j=1}^m \left[f(x_t+\nu u_{t,j};\xi_{t,j}) - f(x_t;\xi_{t,j})\right] u_{t,j},$$

with $u_{t,j}\in\{\pm1\}^d$ and independent samples $\xi_{t,j}$ (Shao et al., 2022).

Under suitable bias and variance control, these estimators satisfy $\mathbb{E}[\tilde{g}_t \mid x_t] = \nabla f_\mu(x_t) + B(t)$, with $B(t)$ determined by the oracle bias and the smoothing parameter (Paul et al., 2023). In high dimensions, or under structural assumptions such as sparsity, Lasso-based de-biased estimators maintain favorable dimension scaling (Wang et al., 2017).
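A minimal sketch of the two-point Gaussian-smoothing surrogate above; averaging many independent estimates approximates $\nabla f_\mu(x)$, which for the illustrative quadratic $f(x)=\|x\|^2$ is close to the true gradient $2x$ when $\mu$ is small:

```python
import numpy as np

def two_point_gradient(f_hat, x, mu, rng):
    """Two-point zeroth-order gradient surrogate with Gaussian smoothing:
    g = (f_hat(x + mu*u) - f_hat(x)) / mu * u,  with u ~ N(0, I).
    `f_hat` can be any (possibly noisy) function-value oracle."""
    u = rng.standard_normal(x.shape)
    return (f_hat(x + mu * u) - f_hat(x)) / mu * u

# Sanity check: the mean of many surrogates approximates grad f(x) = 2x.
rng = np.random.default_rng(0)
f = lambda x: float(np.dot(x, x))
x = np.array([1.0, 0.5])
g_avg = np.mean(
    [two_point_gradient(f, x, 1e-3, rng) for _ in range(20000)], axis=0
)
```

A single surrogate is an unbiased (up to the smoothing term) but high-variance estimate; in the algorithms above the averaging is done implicitly across iterations by the decaying step sizes.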

3. Mirror Map, Bregman Geometry, and Update Rule

Mirror descent is parameterized by a strongly convex "mirror map" $R$ (or $\Psi$), with corresponding Bregman divergence: $$D_R(x,y) = R(x) - R(y) - \langle \nabla R(y), x - y \rangle.$$ The ZOMD update with (possibly stochastic) step size $\alpha_t$ is: $$x_{t+1} = \operatorname{argmin}_{x\in\mathcal{X}}\left\{\langle \tilde{g}_t, x - x_t \rangle + \frac{1}{\alpha_t} D_R(x,x_t)\right\}.$$ In structured or block-coordinate variants, one defines block-separable distance generators and applies the (possibly blockwise) proximal mapping accordingly (Yu et al., 2019). For non-Euclidean domains, entropy-like potentials and $\ell_\infty$ geometry are employed to exploit intrinsic dimensionality (Shao et al., 2022, Wang et al., 2017).
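For the classical case of the probability simplex with the negative-entropy mirror map $R(x) = \sum_i x_i \log x_i$, the Bregman proximal problem above has a closed-form multiplicative-weights (exponentiated-gradient) solution. A minimal sketch of one such step:

```python
import numpy as np

def entropy_mirror_step(x, g, alpha):
    """One mirror descent step on the probability simplex using the
    negative-entropy mirror map: the argmin reduces to a multiplicative
    update followed by renormalization (exponentiated gradient)."""
    w = x * np.exp(-alpha * g)
    return w / w.sum()

x = np.full(3, 1 / 3)              # uniform starting point on the simplex
g = np.array([1.0, 0.0, -1.0])     # a (surrogate) gradient
x_next = entropy_mirror_step(x, g, alpha=0.5)
```

Note that the update stays on the simplex by construction and shifts mass toward coordinates with smaller gradient values, which is exactly the behavior the non-Euclidean geometry is chosen to deliver.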

In minimax or composite problems, primal and dual variables each receive tailored potentials, e.g., an entropy potential on the probability simplex and a Euclidean or non-Euclidean map on the hypothesis space (Gu et al., 2024).

4. Convergence Principles and Finite-Time Guarantees

A central result is almost-sure convergence to a neighborhood of optimality for convex $f$: $$\limsup_{t\to\infty}\, f(z_t) - f^* \leq \delta + B_1 D,$$ with $\delta$ (smoothing bias) and $B_1$ (oracle-bias contribution) explicitly tied to $\mu$, the noise strength, and the dimension (Paul et al., 2023). Finite-time concentration inequalities give probabilistic bounds on deviations from this neighborhood after $t$ iterations, controlled by the variance and the step-size schedule (Paul et al., 2023, Gu et al., 2024).

In nonconvex settings, expected stationarity or generalized gradient-mapping norms serve as the convergence measure, with complexity scaling as $O((\ln d)/\epsilon^4)$ or $O(bn/\epsilon^2)$, reflecting the use of mini-batches, block sampling, or variance reduction (Shao et al., 2022, Yu et al., 2019).

For structured problems (e.g., block coordinate or sparse high-dimensional), the use of random feature selection and Lasso-based debiasing under sparsity achieves convergence rates with only logarithmic dependence on the ambient dimension (Wang et al., 2017).

5. Deterministic and Advanced Variants

Recent research incorporates deterministic, vector-field-driven mirror descent, replacing the stochastic surrogate with central finite-difference schemes. The update is governed by: $$x_{j+1} = (\nabla \Phi)^{-1}\big(\nabla \Phi(x_j) - \eta_j\, \Omega(x_j)\big),$$ where $\Omega$ is constructed deterministically from $2d+1$ function values per iteration. Trajectory-wise a posteriori certification provides verifiable last-iterate guarantees under relative-smoothness-type inequalities and punctured-neighborhood generalized star-convexity conditions (Hayashi, 31 Jan 2026). The error floor is explicitly resolution-dependent, and backtracking can be used to certify monotonic descent (Hayashi, 31 Jan 2026).
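A minimal sketch of the deterministic surrogate: the coordinate-wise central differences below consume $2d$ evaluations, and $f(x_j)$ itself (the remaining evaluation of the $2d+1$) can additionally be used for descent certification; the resolution `h` sets the error floor. This is an illustrative construction, not the exact vector field of the cited paper:

```python
import numpy as np

def central_difference_field(f, x, h):
    """Deterministic gradient surrogate Omega(x) via coordinate-wise
    central differences: Omega_i = (f(x + h e_i) - f(x - h e_i)) / (2h)."""
    d = x.size
    omega = np.empty(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        omega[i] = (f(x + e) - f(x - e)) / (2 * h)
    return omega

# For the quadratic f(x) = ||x||^2, central differences recover 2x
# up to floating-point error.
f = lambda x: float(np.dot(x, x))
omega = central_difference_field(f, np.array([1.0, -0.5]), h=1e-5)
```

Because the construction is deterministic, the same query points can be re-evaluated during backtracking to certify that the step actually decreased the objective.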

6. Specialized Extensions: Block, Composite, and Minimax Problems

The ZOMD framework supports block coordinate approaches and composite settings:

  • Block-coordinate and composite objectives: The domain is partitioned and updates are made selectively per block, enabling scalable optimization of high-dimensional, separably regularized objectives. Complexity reaches $O(bn/\epsilon^2)$ for $(\epsilon,\Lambda)$-stationarity, with a two-phase approach yielding high-probability bounds (Yu et al., 2019).
  • Distributionally robust and minimax programs: ZO-SMD is adapted to minimax excess risk optimization, updating model and dual variables via separate mirror maps, with optimal $O(1/\sqrt{t})$ convergence of both the excess-risk estimate and the minimax error in smooth and nonsmooth regimes (Gu et al., 2024).

7. Parameter Tuning, Bias–Variance Trade-off, and Practical Considerations

Parameter choices for the step sizes ($\alpha_t$, $\eta_t$), the smoothing parameter ($\mu$), and the batch size are critical for balancing the explicit bias-variance trade-off. Key principles:

  • The smoothing parameter $\mu$ controls the bias (smoothing error $O(\mu)$ or $O(\mu^2)$) versus the variance (which inflates as $O(1/\mu^2)$ as $\mu\to0$).
  • For unbiased oracles, one may schedule $\mu\to0$ slowly to shrink the convergence neighborhood to zero; under nonzero bias, the limiting error is minimized at a balancing value of $\mu$ (Paul et al., 2023).
  • Step-size schedules must satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$; $\alpha_t = 1/t$ is a common effective choice (Paul et al., 2023).
  • In nonconvex and high-dimensional settings, mini-batching and feature selection reduce variance and computational cost without degrading rates (Shao et al., 2022, Wang et al., 2017).
  • Adaptive step sizes can obviate the need for knowledge of Lipschitz constants (Shao et al., 2022).
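The bias-variance trade-off in $\mu$ can be observed empirically. The sketch below measures the mean squared error of the two-point surrogate against the true gradient of an illustrative noisy quadratic: a very small $\mu$ lets the oracle noise dominate through the $O(1/\mu^2)$ variance inflation, while a moderate $\mu$ sits near the balance point (all constants here are chosen for illustration only):

```python
import numpy as np

def estimator_mse(mu, noise_std=0.01, n=40000, seed=0):
    """Empirical MSE of the two-point Gaussian-smoothing surrogate
    against grad f(x) = 2x for f(x) = ||x||^2, with additive oracle
    noise of standard deviation `noise_std` on every query."""
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 0.5])
    grad = 2 * x
    f = lambda z: float(np.dot(z, z))
    err = 0.0
    for _ in range(n):
        u = rng.standard_normal(2)
        f_plus = f(x + mu * u) + noise_std * rng.standard_normal()
        f_zero = f(x) + noise_std * rng.standard_normal()
        g = (f_plus - f_zero) / mu * u
        err += float(np.sum((g - grad) ** 2))
    return err / n

mse_small = estimator_mse(1e-4)  # variance-dominated: noise/mu blows up
mse_mid = estimator_mse(0.05)    # closer to the bias-variance balance
```

Running the two calls shows the effect directly: the tiny-$\mu$ estimator has orders-of-magnitude larger MSE than the moderately smoothed one, even though its smoothing bias is smaller.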

Practical implementation requires only a function-value oracle and is robust to noise and to structural heterogeneity in the objective or constraint geometry.
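Putting the pieces together, here is a minimal end-to-end sketch of ZOMD on the probability simplex, combining the two-point Gaussian-smoothing surrogate, the negative-entropy mirror map, and the $\alpha_t = 1/t$ schedule; the objective, horizon, and smoothing parameter are illustrative choices, not tuned recommendations:

```python
import numpy as np

def zomd_simplex(f, d, T=2000, mu=1e-3, rng=None):
    """Minimal zeroth-order mirror descent on the probability simplex:
    two-point gradient surrogates + entropy mirror map, alpha_t = 1/t.
    All parameter choices are illustrative."""
    rng = rng or np.random.default_rng(0)
    x = np.full(d, 1 / d)                       # uniform initialization
    for t in range(1, T + 1):
        u = rng.standard_normal(d)
        g = (f(x + mu * u) - f(x)) / mu * u     # zeroth-order surrogate
        w = x * np.exp(-(1 / t) * g)            # entropy mirror step
        x = w / w.sum()
    return x

# f(x) = ||x - e_1||^2 is minimized over the simplex at e_1 = (1, 0, 0),
# so the iterates should concentrate mass on the first coordinate.
target = np.array([1.0, 0.0, 0.0])
f = lambda x: float(np.sum((x - target) ** 2))
x_final = zomd_simplex(f, d=3)
```

Only function values of `f` are ever queried, and every iterate remains a valid probability vector, illustrating how the mirror geometry handles the constraint set without projections.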


References:

  • "Robust Analysis of Almost Sure Convergence of Zeroth-Order Mirror Descent Algorithm" (Paul et al., 2023)
  • "Adaptive Zeroth-Order Optimisation of Nonconvex Composite Objectives" (Shao et al., 2022)
  • "Stochastic Zeroth-order Optimization in High Dimensions" (Wang et al., 2017)
  • "Zeroth-Order Stochastic Mirror Descent Algorithms for Minimax Excess Risk Optimization" (Gu et al., 2024)
  • "Deterministic Zeroth-Order Mirror Descent via Vector Fields with A Posteriori Certification" (Hayashi, 31 Jan 2026)
  • "Zeroth-Order Stochastic Block Coordinate Type Methods for Nonconvex Optimization" (Yu et al., 2019)
