Zeroth-Order Optimization (ZO) in High Dimensions
Last updated: June 12, 2025
Zeroth-Order Optimization in High Dimensions: Algorithms, Guarantees, and Practical Applications
Zeroth-order optimization (ZO) techniques address optimization problems where only function evaluations—not gradients—are accessible. This paradigm is vital in areas such as black-box machine learning hyperparameter tuning, experiment design, neural stimulation, and stochastic optimization over large search spaces where analytic gradients are unavailable or unreliable.
Optimizing in high dimensions poses fundamental challenges: the computational cost of naive ZO approaches scales linearly with the number of variables $d$, leading to the so-called "curse of dimensionality." However, many practical problems are inherently sparse: only a small number of variables are truly influential. By exploiting such sparsity, it is possible to design ZO algorithms whose convergence rates depend only logarithmically on $d$—making high-dimensional black-box optimization practically viable.
This article synthesizes the methodology, theory, and application of dimension-robust stochastic zeroth-order algorithms under sparsity assumptions, as proposed by Wang et al. (2017) in "Stochastic Zeroth-order Optimization in High Dimensions."
1. Algorithmic Frameworks
Two main algorithms are introduced to leverage sparsity in high-dimensional convex ZO problems:
A. Successive Component/Feature Selection
This approach identifies and focuses on a sparse set of active variables for optimization, using the following workflow:
- Sparse Gradient Estimation via Lasso: At each iteration, collect a batch of $n$ function values at random sign (Rademacher) perturbations of the current point $x_t$, and use the Lasso ($\ell_1$-regularized regression) to fit a sparse gradient estimate (a minimal code sketch is given after the pseudocode below):

$$(\hat{\mu}, \hat{g}_t) \in \arg\min_{\mu \in \mathbb{R},\, g \in \mathbb{R}^d} \; \frac{1}{2n} \sum_{i=1}^{n} \big( y_i - \mu - \delta \langle z_i, g \rangle \big)^2 + \lambda \lVert g \rVert_1,$$

where $y_i = f(x_t + \delta z_i) + \xi_i$ is a noisy function evaluation, $z_i \in \{-1,+1\}^d$ are i.i.d. Rademacher directions, $\delta > 0$ is the smoothing radius, and $\lambda > 0$ is the Lasso regularization parameter.
- Support Selection: Choose the coordinates $j$ with $|\hat{g}_{t,j}| \ge \eta$ for some threshold $\eta > 0$ and add them to a candidate active set $\hat{S}$.
- Low-Dimensional ZO Optimization: Restrict further ZO search and function evaluations to the coordinates in $\hat{S}$, using any classical method (e.g., coordinate-wise finite differences).
Pseudocode:
```python
def successive_component_selection(f, x, s, T_prime, eta):
    S_hat = set()                                                  # candidate active set
    for t in range(s):                                             # up to maximum allowed sparsity s
        g_hat = lasso_gradient_estimate(f, x, T_prime)             # sparse gradient estimate via Lasso
        S_hat |= indices_with_large_gradient(g_hat, eta)           # add coordinates with |g_hat[j]| >= eta
        x = zeroth_order_optimize_on_S_hat(f, x, S_hat, T_prime)   # classical ZO restricted to S_hat
    return x
```
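The helpers in this pseudocode are left abstract above. The block below is a minimal, illustrative sketch of `lasso_gradient_estimate` and `indices_with_large_gradient` following the construction described in the bullet list (Rademacher perturbations plus an $\ell_1$-penalized fit), using scikit-learn's `Lasso`. The signatures, default smoothing radius `delta`, and regularization strength `lam` are assumptions for illustration, not the paper's prescribed choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_gradient_estimate(f, x, n, delta=0.05, lam=0.01, seed=None):
    """Sparse gradient estimate at x from n noisy zeroth-order queries.

    Fits f(x + delta * z) ~ mu + delta * <z, g> by l1-penalized least squares,
    where z are i.i.d. Rademacher (+/-1) directions; the intercept absorbs
    the unknown value f(x).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    Z = rng.choice([-1.0, 1.0], size=(n, d))        # Rademacher perturbation directions
    y = np.array([f(x + delta * z) for z in Z])     # noisy zeroth-order evaluations
    model = Lasso(alpha=lam, fit_intercept=True).fit(delta * Z, y)
    return model.coef_                              # sparse estimate of the gradient

def indices_with_large_gradient(g_hat, eta):
    """Coordinates whose estimated partial derivative clears the threshold eta."""
    return set(np.flatnonzero(np.abs(g_hat) >= eta).tolist())
```

In the pseudocode, `T_prime` plays the role of the per-round query budget `n`; the restricted low-dimensional solver `zeroth_order_optimize_on_S_hat` remains abstract here.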
B. Noisy Mirror Descent with Lasso (De-biased) Gradient Estimates
This approach generalizes the idea to stochastic mirror descent—a versatile method for convex optimization:
- At each iteration, use Lasso to estimate the gradient as before.
- De-bias the Lasso estimate to reduce shrinkage bias, in the spirit of the de-biased Lasso (Javanmard & Montanari, 2014); for the isotropic Rademacher design the correction is a rescaled average of the fit residuals (see the sketch after the pseudocode):

$$\tilde{g}_t = \hat{g}_t + \frac{1}{n\delta} \sum_{i=1}^{n} z_i \big( y_i - \hat{\mu} - \delta \langle z_i, \hat{g}_t \rangle \big).$$
- Update the iterate by a mirror descent step with step size $\eta_t$, using the Bregman divergence $\Delta_\psi$ induced by the mirror map $\psi$:

$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \big\{ \eta_t \langle \tilde{g}_t, x \rangle + \Delta_\psi(x, x_t) \big\}, \qquad \Delta_\psi(x, y) = \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle.$$
- Optionally, further de-bias by combining gradient estimates at different smoothing scales for higher accuracy if the Hessian is smooth.
Pseudocode:
```python
def zo_mirror_descent(f, x, T, n, eta, psi):
    for t in range(T):
        g_hat = debiased_lasso_gradient_estimate(f, x, n)    # Lasso gradient estimate + de-biasing
        x = mirror_descent_update(x, g_hat, eta, psi)        # mirror step with step size eta, mirror map psi
    return x
```
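Again, the two helpers are not specified above; the sketch below is one way to realize them under simplifying assumptions: the de-biasing matrix is taken to be the identity (natural for the isotropic Rademacher design), and the mirror map $\psi$ is the squared Euclidean norm over an $\ell_2$ ball, so `mirror_descent_update` reduces to a projected gradient step and the `psi` argument of the pseudocode is replaced by a ball radius. All names and default parameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso_gradient_estimate(f, x, n, delta=0.05, lam=0.01, seed=None):
    """Lasso gradient estimate at x followed by a de-biasing correction.

    With an isotropic Rademacher design the de-biasing matrix is (approximately)
    the identity, so the correction is a rescaled average of the fit residuals,
    in the spirit of the de-biased Lasso of Javanmard & Montanari (2014).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    Z = rng.choice([-1.0, 1.0], size=(n, d))                 # Rademacher directions
    y = np.array([f(x + delta * z) for z in Z])              # noisy zeroth-order evaluations
    model = Lasso(alpha=lam, fit_intercept=True).fit(delta * Z, y)
    g_hat = model.coef_
    residuals = y - model.intercept_ - delta * (Z @ g_hat)
    return g_hat + (Z.T @ residuals) / (n * delta)           # de-biased gradient estimate

def mirror_descent_update(x, g, eta, radius):
    """Mirror step for the Euclidean mirror map: gradient step followed by
    projection onto the l2 ball of the given radius (standing in for psi)."""
    x_new = x - eta * g
    norm = np.linalg.norm(x_new)
    if norm > radius:
        x_new *= radius / norm
    return x_new
```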
2. Convergence Rates and Dimensional Dependency
These algorithms achieve convergence rates whose dependence on the ambient dimension $d$ is only logarithmic, thanks to the sparsity-exploiting gradient estimation.
Component Selection Algorithm:
- Regret bound: the cumulative regret after $T$ function evaluations is sublinear in $T$, depends only logarithmically on the ambient dimension $d$, and depends polynomially on the following problem parameters:
- $R$: bound on the solution norm
- $s$: sparsity level
- $B$: $\ell_1$-norm bound on the gradient
- $\sigma$: noise level of the function evaluations
Noisy Mirror Descent (De-biased Lasso):
- Cumulative regret: likewise sublinear in $T$ with only a $\log d$ dependence on the dimension, under the weaker gradient- and Hessian-sparsity assumptions (A3/A4).
- Under Hessian smoothness, the rate further improves by incorporating higher-order bias corrections.
Takeaway: A $\log d$ factor replaces the usual polynomial dependence on $d$ in mean-squared-error and regret bounds, making these algorithms practical in very high-dimensional settings so long as the sparsity level $s$ is modest.
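The logarithmic dependence enters through the sparse regression step rather than the optimization itself. Schematically—this is the standard high-dimensional Lasso rate under restricted-eigenvalue-type conditions, not the paper's exact statement—a Lasso fit from $n$ Rademacher queries with regularization $\lambda \asymp \sigma\sqrt{\log d / n}$ estimates the gradient with error

$$\lVert \hat{g}_t - \nabla f(x_t) \rVert_2 \;\lesssim\; \sigma \sqrt{\frac{s \log d}{n}} \;+\; \text{bias}(\delta),$$

where the second term accounts for the smoothing bias of the finite perturbation radius $\delta$. The ambient dimension thus appears only inside the logarithm, and this is the factor that propagates into the regret bounds above.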
3. Sparsity Assumptions
These gains rest on explicit sparsity structures:
- Gradient Sparsity (A3): At every $x$, the gradient satisfies $\lVert \nabla f(x) \rVert_0 \le s$ and $\lVert \nabla f(x) \rVert_1 \le B$.
- Function Sparsity (A5): The function depends only on a subset $S$ of variables with $|S| \le s$ (a stronger assumption).
- Weak Hessian Sparsity (A4): The Hessian $\nabla^2 f(x)$ is sparse as measured in an entrywise $\ell_1$-type matrix norm.
Under these assumptions, Lasso regression can accurately recover the support of the gradient—i.e., the truly "active" components—enabling the efficiency gains above.
4. Empirical Validation
Synthetic experiments demonstrate substantial practical gains:
- Settings: Sparse quadratic and fourth-degree polynomial objectives in a high ambient dimension $d$, with only a small number $s \ll d$ of active variables (an illustrative test objective of this kind is sketched after this list).
- Baselines & Comparisons:
- (1) Baseline ZO method (Flaxman et al. 2005)
- (2) Lasso-GD (Component selection)
- (3) Mirror Descent with de-biased Lasso (MD)
- Findings:
- Both Lasso-GD and MD outperform standard ZO in cumulative regret, especially as the ambient dimension $d$ increases.
- MD is the most robust, less sensitive to parameter tuning.
- Empirical regret curves confirm the theoretical logarithmic dependence on $d$.
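The synthetic setup is straightforward to reproduce in spirit. Below is the illustrative sketch referenced in the Settings bullet: a sparse quadratic test objective exposed only through a noisy zeroth-order oracle. The dimension, sparsity, and noise level are placeholder values, not the ones used in the paper's experiments.

```python
import numpy as np

def make_sparse_quadratic(d=1000, s=10, noise_sigma=0.1, seed=0):
    """Noisy zeroth-order oracle for f(x) = sum_{j in S} (x_j - b_j)^2,
    a convex quadratic that depends on only s of the d coordinates."""
    rng = np.random.default_rng(seed)
    support = rng.choice(d, size=s, replace=False)   # hidden active coordinates
    b = rng.normal(size=s)                           # hidden optimum on the support

    def oracle(x):
        value = np.sum((x[support] - b) ** 2)
        return value + noise_sigma * rng.normal()    # only noisy values are observable

    return oracle, support

# A ZO method sees only noisy evaluations of f; the support and b stay hidden.
f, true_support = make_sparse_quadratic()
print(f(np.zeros(1000)))
```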
5. Practical Applications and Implications
Applications:
- Black-box hyperparameter tuning: When only a handful of settings impact model accuracy.
- Experimental and resource allocation design: Where relevant control variables are few.
- Neuroscience/biomedical stimulus optimization: Where only select features trigger a response.
- General black-box simulation optimization: If a sparse parameter subset truly matters.
Implications for Practice:
- Exploiting sparsity structure allows high-dimensional ZO with query and computation costs comparable to low-dimensional problems.
- It can be an enabling technology in settings where high costs or access restrictions make traditional gradient-based optimization infeasible.
Limitations/Open Questions:
- Theory currently assumes exact or strong sparsity; extension to approximate (norm-bounded) sparsity is open.
- Extension beyond convexity (e.g., to non-convex settings) remains an active research frontier.
Summary Table: Key Contributions
| Algorithm | Key Idea | Convergence Rate | Dimension in Rate | Sparsity Used | Empirical Benefit |
|---|---|---|---|---|---|
| Successive Component Selection | Identify & optimize a sparse active set | Sublinear in $T$ | $\log d$ | Strong (A5, function sparsity) | Outperforms baseline |
| Mirror Descent + De-biased Lasso | Sparse gradient estimates + mirror descent | Sublinear in $T$; improved with A6 (Hessian smoothness) | $\log d$ | Moderate to strong (A3/A4) | Most robust, fastest |
Conclusion
This framework delivers the first dimension-robust, sparsity-exploiting convergence guarantees for zeroth-order optimization, using Lasso-based gradient estimation, thresholding, and mirror descent. The result is a set of deployable algorithms for practitioners facing high-dimensional black-box optimization—with strong theoretical, empirical, and practical support under realistic sparsity assumptions. This advances the frontier of gradient-free learning and optimization in modern machine learning and scientific experimentation.
References:
- Flaxman et al. (2005)
- Javanmard & Montanari (2014)
- Nemirovski et al. (2009)
- Lan (2012)
- Wang et al. (2017), "Stochastic Zeroth-order Optimization in High Dimensions"
For detailed algorithms, parameter selection, and proofs, consult the full text and appendices of the original paper.