Zeroth-Order Optimization (ZO) in High Dimensions

Last updated: June 12, 2025

This article surveys zeroth-order optimization in high dimensions, strictly grounded in the contents of (Wang et al., 2017).


Zeroth-Order Optimization in High Dimensions: Algorithms, Guarantees, and Practical Applications

Zeroth-order optimization (ZO) techniques address optimization problems where only function evaluations, not gradients, are accessible. This paradigm is vital in areas such as black-box machine learning hyperparameter tuning, experiment design, neural stimulation, and stochastic optimization over large search spaces where analytic gradients are unavailable or unreliable.

Optimizing in high dimensions poses fundamental challenges: the computational cost of naive ZO approaches scales linearly with the number of variables $d$, leading to the so-called "curse of dimensionality." However, many practical problems are inherently sparse: only a small number $s \ll d$ of variables are truly influential. By exploiting such sparsity, it is possible to design ZO algorithms whose convergence rates depend only logarithmically on $d$, making high-dimensional black-box optimization practically viable.

This article synthesizes the methodology, theory, and application of dimension-robust stochastic zeroth-order algorithms under sparsity assumptions, as proposed by Wang et al. in "Stochastic Zeroth-order Optimization in High Dimensions."


1. Algorithmic Frameworks

Two main algorithms are introduced to leverage sparsity in high-dimensional convex ZO problems:

A. Successive Component/Feature Selection

This approach identifies and focuses on a sparse set of active variables for optimization, using the following workflow:

  1. Sparse Gradient Estimation via Lasso: At each iteration, collect a batch of function values at random sign (Rademacher) perturbations of the current point. Use the Lasso (ℓ1-regularized regression) to fit a sparse gradient estimate:

$$(\widehat{g}_t, \widehat{\mu}_t) = \arg\min_{g, \mu} \frac{1}{n} \sum_{i=1}^n \left(\tilde{y}_i - g^\top z_i - \mu\right)^2 + \lambda \left(\|g\|_1 + |\mu|\right)$$

where $\tilde{y}_i = [f(x_t + \delta z_i) + \xi_i]/\delta$.

  2. Support Selection: Choose variables with $|[\widehat{g}_t]_i| \geq \eta$ for some threshold $\eta$ to construct a candidate active set $\widehat{S}$.
  3. Low-Dimensional ZO Optimization: Restrict further ZO search and function evaluations to $\widehat{S}$ using any classical method (e.g., finite differences).

Pseudocode:

# Successive component selection (helper names are illustrative)
S_hat = set()                                                    # running estimate of the active support
for t in range(s):                                               # up to the maximum allowed sparsity s
    g_hat = lasso_gradient_estimate(f, x, T_prime)               # sparse Lasso gradient fit from T_prime evaluations
    S_hat |= indices_with_large_gradient(g_hat, eta)             # keep coordinates with |g_hat[i]| >= eta
    x = zeroth_order_optimize_on_support(f, x, S_hat, T_prime)   # classical ZO restricted to S_hat
return x
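
For concreteness, here is a minimal sketch of the Lasso gradient estimation and support-selection helpers used above. This is an illustrative implementation (not the authors' code), assuming NumPy, scikit-learn's `Lasso`, and a caller-supplied black-box objective `f`; note that scikit-learn does not penalize the intercept, a minor departure from the objective in Step 1.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_gradient_estimate(f, x, n, delta=0.01, lam=0.1, rng=None):
    """Estimate the gradient of a black-box f at x from n noisy evaluations
    along Rademacher (random sign) perturbation directions."""
    rng = rng if rng is not None else np.random.default_rng()
    d = x.shape[0]
    Z = rng.choice([-1.0, 1.0], size=(n, d))              # Rademacher perturbations z_i
    y = np.array([f(x + delta * z) for z in Z]) / delta   # scaled (noisy) function values
    model = Lasso(alpha=lam, fit_intercept=True)          # l1-regularized regression of y on Z
    model.fit(Z, y)
    return model.coef_                                    # sparse gradient estimate g_hat

def indices_with_large_gradient(g_hat, eta):
    """Support selection: coordinates whose estimated gradient magnitude is at least eta."""
    return set(np.flatnonzero(np.abs(g_hat) >= eta))
```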

B. Noisy Mirror Descent with Lasso (De-biased) Gradient Estimates

This approach generalizes the idea to stochastic mirror descent, a versatile method for convex optimization:

  • At each iteration, use Lasso to estimate the gradient as before.
  • De-bias the gradient estimate to improve its accuracy:

$$\tilde{g}_t = \widehat{g}_t + \frac{1}{n} Z_t^\top\left(\tilde{Y}_t - Z_t \widehat{g}_t - \widehat{\mu}_t \cdot \mathbf{1}_n\right)$$

  • Then take a mirror descent step with the de-biased estimate:

$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \left\{ \eta\, \tilde{g}_t^\top (x - x_t) + \Delta_\psi(x, x_t) \right\}$$

Pseudocode:

# Noisy mirror descent with de-biased Lasso gradients (helper names are illustrative)
for t in range(T):
    g_tilde = debiased_lasso_gradient_estimate(f, x, n)   # Lasso fit from n evaluations, then de-biased
    x = mirror_descent_update(x, g_tilde, eta, psi)       # proximal step w.r.t. the Bregman divergence of psi
return x
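
A minimal sketch of the de-biasing step and the mirror descent update follows. This is my illustration under the assumption $\psi(x) = \tfrac{1}{2}\|x\|_2^2$, for which $\Delta_\psi(x, x_t) = \tfrac{1}{2}\|x - x_t\|_2^2$ and the update reduces to a gradient step projected onto the feasible set, taken here to be a Euclidean ball of radius $B$ (the `psi` argument from the pseudocode is specialized away accordingly).

```python
import numpy as np

def debias(Z, y, g_hat, mu_hat):
    """De-biased Lasso gradient: g_tilde = g_hat + (1/n) Z^T (y - Z g_hat - mu_hat * 1_n)."""
    n = Z.shape[0]
    residual = y - Z @ g_hat - mu_hat
    return g_hat + (Z.T @ residual) / n

def mirror_descent_update(x, g_tilde, eta, radius):
    """Mirror descent step for psi(x) = 0.5*||x||^2: a gradient step followed by
    projection onto the Euclidean ball {x : ||x||_2 <= radius} standing in for X."""
    x_new = x - eta * g_tilde
    norm = np.linalg.norm(x_new)
    return x_new if norm <= radius else x_new * (radius / norm)
```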


2. Convergence Rates and Dimensional Dependency

These algorithms achieve convergence rates that are logarithmic in the ambient dimension $d$ thanks to the sparsity-exploiting gradient estimation.

Component Selection Algorithm:

  • Regret bound:

$$R^S_{\mathcal{A}}(T) \lesssim B \left( \frac{\sigma^2 H^2 s \log d}{T} \right)^{1/4} + \widetilde{O}(T^{-1/3})$$

  • $B$: bound on the solution norm
  • $s$: sparsity level
  • $H$: $\ell_1$-norm bound on the gradient
  • $\sigma$: noise level

Noisy Mirror Descent (De-biased Lasso):

  • Cumulative regret:

$$R^C_{\mathcal{A}}(T) \lesssim \xi_{\sigma, s}\, B \sqrt{\log d} \left(\frac{(1+H)^2 s}{T}\right)^{1/4} + \widetilde{O}(T^{-1/2})$$

  • Under Hessian smoothness, the rate further improves by incorporating higher-order bias corrections.

Takeaway: The $\log d$ factor replaces the typical $d$ in mean-squared-error and regret bounds, making these algorithms practical in very high-dimensional settings so long as $s$ is modest.
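
As a rough back-of-the-envelope illustration (my numbers, not the paper's), suppose $d = 10^5$ and $s = 10$: the dimension-dependent quantity entering the bounds is then $s \log d \approx 10 \times 11.5 \approx 115$ rather than a polynomial factor in $d = 10^5$,

$$\frac{d}{s \log d} = \frac{10^5}{10 \cdot \log(10^5)} \approx \frac{10^5}{115} \approx 870,$$

so the effective dimensionality entering the rate shrinks by roughly three orders of magnitude whenever the sparsity assumption holds.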


3. Sparsity Assumptions

These gains rest on explicit sparsity structures:

  • Gradient Sparsity (A3): At every $x$, the gradient satisfies $\|\nabla f(x)\|_0 \leq s$ and $\|\nabla f(x)\|_1 \leq H$.
  • Function Sparsity (A5): The function depends only on a subset $S$ of variables with $|S| \leq s$ (a stronger assumption).
  • Weak Hessian Sparsity (A4): The Hessian is sparse in the $\ell_1$ norm.

Under these assumptions, Lasso regression can accurately recover the support of the gradient (i.e., the truly "active" components), enabling the efficiency gains above.


4. Empirical Validation

Synthetic experiments demonstrate substantial practical gains:

  • Settings: Quadratic and fourth-degree sparse polynomials in $d = 100$ dimensions with sparsity $s = 10, 20$.
  • Baselines & Comparisons:
    • (1) Baseline ZO method (Flaxman et al. 2005)
    • (2) Lasso-GD (component selection)
    • (3) Mirror Descent with de-biased Lasso (MD)
  • Findings:
    • Both Lasso-GD and MD outperform standard ZO in cumulative regret, especially as $d$ increases.
    • MD is the most robust and the least sensitive to parameter tuning.
    • Empirical regret curves confirm the theoretical logarithmic dependence on $d$.
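
A minimal sketch of a synthetic objective in the spirit of these experiments (my construction, not the paper's exact test functions): a noisy sparse quadratic in $d = 100$ dimensions whose value depends on only $s = 10$ coordinates, satisfying the function-sparsity assumption (A5).

```python
import numpy as np

def make_sparse_quadratic(d=100, s=10, noise_sigma=0.1, seed=0):
    """Return a noisy black-box objective f(x) = ||x_S - x*_S||^2 + noise that
    depends only on a random subset S of s coordinates (function sparsity, A5)."""
    rng = np.random.default_rng(seed)
    support = rng.choice(d, size=s, replace=False)   # the s truly influential coordinates
    x_star = rng.normal(size=s)                      # hidden minimizer on the support

    def f(x):
        diff = x[support] - x_star
        return float(diff @ diff) + noise_sigma * rng.normal()

    return f, support
```

Running either algorithm above on such an objective and checking how much of `support` the thresholded Lasso estimates recover is a quick way to reproduce the qualitative behavior reported here.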

5. Practical Applications and Implications

Applications:

  • Black-box hyperparameter tuning: When only a handful of settings impact model accuracy.
  • Experimental and resource allocation design: Where relevant control variables are few.
  • Neuroscience/biomedical stimulus optimization: Where only select features trigger a response.
  • General black-box simulation optimization: If a sparse parameter subset truly matters.

Implications for Practice:

  • Exploiting sparsity structure allows high-dimensional ZO optimization with costs comparable to low-dimensional problems.
  • Can be an enabling technology in settings where high costs or access restrictions make traditional gradient-based optimization infeasible.

Limitations/Open Questions:

  • Theory currently assumes exact or strong sparsity; extension to approximate ($\ell_1$-bounded) sparsity is open.
  • Extension beyond convexity (e.g., to non-convex settings) remains an active research frontier.

Summary Table: Key Contributions

| Algorithm | Key Idea | Convergence Rate | Dimension in Rate | Sparsity Used | Empirical Benefit |
|---|---|---|---|---|---|
| Successive Component Selection | Identify & optimize a sparse active set | $O(\sqrt{\log d}\, T^{-1/4})$ | $\log d$ | Strong (A5) | Outperforms baseline |
| Mirror Descent + De-biased Lasso | Sparse gradient estimate + mirror descent | $O(\sqrt{\log d}\, s^{1/4} T^{-1/4})$; $T^{-1/3}$ with A6 | $\log d$ | Moderate to strong | Most robust, fastest |

Conclusion

This framework delivers the first dimension-robust, sparsity-exploiting convergence guarantees for zeroth-order optimization, using Lasso-based gradient estimation, thresholding, and mirror descent. The result is a set of deployable algorithms for practitioners facing high-dimensional black-box optimization, with strong theoretical, empirical, and practical support under realistic sparsity assumptions. This advances the frontier of gradient-free learning and optimization in modern machine learning and scientific experimentation.


References:

  • Wang et al. (2017). "Stochastic Zeroth-order Optimization in High Dimensions."

For detailed algorithms, parameter selection, and proofs, consult the full text and appendices of the original paper.