
Zeroth-Order Optimization Kernels

Updated 8 March 2026
  • Zeroth-order optimization kernels are mathematical functions that enable gradient estimation using only function-value queries, without direct access to derivatives.
  • Tailored moment conditions on the kernel balance estimator bias against variance, which improves gradient-estimate accuracy in high-dimensional spaces.
  • Kernel designs, including high-order polynomial kernels and neural tangent kernels, support near-optimal sample complexity and query efficiency across a range of optimization problems.

Zeroth-order optimization (ZOO) kernels are mathematical objects used within black-box optimization algorithms that rely solely on function-value queries, rather than gradient information, to perform search and optimization in high-dimensional spaces. These kernels serve several interlocking roles: providing the mechanism for gradient estimation via randomized smoothing, encoding structural priors (e.g., neural network geometry via neural tangent kernels), and controlling bias-variance tradeoffs in the learning process. ZOO kernels are a fundamental bridge between higher-order smoothness assumptions and the algorithmic design of derivative-free methods, enabling nearly optimal sample complexity and query efficiency in a range of convex and nonconvex discrete and continuous settings (Hu et al., 2024, Lobanov et al., 2023, Akhavan et al., 2020).

1. Foundations of Zeroth-Order Optimization and Kernels

Standard ZOO problems are formulated as seeking the minimizer $x^* \in Q \subset \mathbb{R}^d$ of a function $f : \mathbb{R}^d \rightarrow \mathbb{R}$, given only a noisy oracle for function values $\tilde{f}(x) = f(x) + \xi$, where $\xi$ is a zero-mean noise term with finite variance. No direct access to $\nabla f$ is provided.

To overcome the lack of gradients, ZOO algorithms construct randomized finite-difference or smoothing-based estimators from function values alone. Kernels are specifically designed functions (often polynomials or neural-inspired mappings) used in these estimators to annihilate specific bias terms arising from the Taylor expansion of $f$ and, through their moment properties, to extract unbiased (or low-bias) gradient estimates for use in stochastic optimization schemes (Lobanov et al., 2023, Akhavan et al., 2020).

By leveraging higher-order Hölder smoothness assumptions (i.e., $f \in \mathcal{F}_{\beta}(L)$ for $\beta \geq 2$), kernels are tailored to exploit the available structure to control both estimator bias and variance, thereby accelerating convergence relative to basic random search approaches.

2. Kernel Construction and Theoretical Properties

The central requirements for a ZOO kernel $K : [-1,1] \rightarrow \mathbb{R}$ are encoded as moment conditions:

  • $\mathbb{E}_r[K(r)] = 0$
  • $\mathbb{E}_r[r K(r)] = 1$
  • $\mathbb{E}_r[r^j K(r)] = 0$ for $j = 2, \ldots, l$, where $l = \lfloor\beta\rfloor - 1$
  • Finite moments: $\mathbb{E}[|r|^\beta |K(r)|] < \infty$ and $\mathbb{E}[K(r)^2] = \kappa < \infty$
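A one-dimensional sketch shows how these conditions act on the Taylor expansion. It assumes $f$ is smooth enough for the expansion to hold, ignores oracle noise, and writes the Hölder remainder as $R(h, r)$ with an unspecified constant $C$; the multivariate estimators in the cited papers additionally average over a random direction.

```latex
\begin{align*}
\frac{f(x+hr)-f(x-hr)}{2h}
  &= \sum_{\substack{1 \le j \le l \\ j \text{ odd}}}
     \frac{f^{(j)}(x)}{j!}\, h^{j-1} r^{j} + R(h, r),
  \qquad |R(h, r)| \le C\, h^{\beta-1} |r|^{\beta}, \\
\mathbb{E}_r\!\left[\frac{f(x+hr)-f(x-hr)}{2h}\, K(r)\right]
  &= f'(x)\, \underbrace{\mathbb{E}_r[r K(r)]}_{=1}
   + \sum_{\substack{3 \le j \le l \\ j \text{ odd}}}
     \frac{f^{(j)}(x)}{j!}\, h^{j-1}\,
     \underbrace{\mathbb{E}_r[r^{j} K(r)]}_{=0}
   + \mathbb{E}_r[R(h, r) K(r)] \\
  &= f'(x) + O(h^{\beta-1}).
\end{align*}
```

The even-order Taylor terms cancel in the symmetric difference, the vanishing moments remove the odd-order terms above the first, and the finite-moment condition $\mathbb{E}[|r|^\beta |K(r)|] < \infty$ is what keeps the remainder term bounded.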

This construction makes the lower-order Taylor bias terms vanish, so that the bias of the gradient estimator decays as $O(h^{\beta-1})$ for smoothing parameter $h > 0$. A suitable $K$ can be constructed explicitly as a weighted combination of Legendre polynomials; e.g., for $\beta = 5, 6$,

$$K(r) = \frac{105}{64} \, r \, (99r^4 - 126r^2 + 35)$$

where the leading constant is fixed by the normalization $\mathbb{E}_r[r K(r)] = 1$ for $r \sim \mathrm{Uniform}[-1,1]$ (Lobanov et al., 2023, Akhavan et al., 2020).
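The moment conditions can be checked numerically. The sketch below assumes $r \sim \mathrm{Uniform}[-1,1]$ and takes the degree-5 polynomial shape $r(99r^4 - 126r^2 + 35)$ from the text; the helper names (`uniform_expectation`, `kernel_shape`) are illustrative, and the leading constant is recomputed from the normalization $\mathbb{E}_r[rK(r)] = 1$ rather than hard-coded.

```python
import numpy as np

def uniform_expectation(func, num_nodes=32):
    """E[func(r)] for r ~ Uniform[-1, 1], via Gauss-Legendre quadrature."""
    nodes, weights = np.polynomial.legendre.leggauss(num_nodes)
    # The quadrature integrates over [-1, 1]; the uniform density there is 1/2.
    return 0.5 * np.sum(weights * func(nodes))

def kernel_shape(r):
    """Unnormalized odd polynomial of degree 5 (the shape quoted in the text)."""
    return r * (99 * r**4 - 126 * r**2 + 35)

# Choose the leading constant so that E[r K(r)] = 1.
scale = 1.0 / uniform_expectation(lambda r: r * kernel_shape(r))

def K(r):
    return scale * kernel_shape(r)

# Moments E[r^j K(r)] for j = 0..5: only the j = 1 moment should be nonzero.
moments = [uniform_expectation(lambda r, j=j: r**j * K(r)) for j in range(6)]

# Second moment kappa = E[K(r)^2], which enters the variance bound.
kappa = uniform_expectation(lambda r: K(r) ** 2)
```

Because the quadrature rule is exact for polynomials of this degree, `moments` reflects the kernel itself rather than sampling error; the same check applies to any other candidate kernel by swapping `kernel_shape`.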

Alternatively, if the black-box has a known or assumed neural-network structure, the neural tangent kernel (NTK) is constructed as

$$k(z, z') = \nabla_{\theta} \phi(\theta_0, z)^\top \nabla_{\theta} \phi(\theta_0, z')$$

where $\phi$ is a neural network used to approximate the response function and $\theta_0$ is its initialization. In the infinite-width regime, $\phi$ becomes a Gaussian process with kernel $k$, yielding a prior directly aligned with the local geometry of neural models (Hu et al., 2024).
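As a minimal illustration of the NTK formula, the sketch below computes the finite-width (empirical) kernel for a small two-layer tanh network with hand-derived parameter gradients. The architecture, the $1/\sqrt{\text{width}}$ scaling, and all function names are assumptions made for the example, not the configuration used in (Hu et al., 2024).

```python
import numpy as np

def init_params(input_dim, width, seed=0):
    """Random initialization theta_0 = (W, v) of a two-layer network."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((width, input_dim))
    v = rng.standard_normal(width)
    return W, v

def network(params, z):
    """phi(theta, z) = v^T tanh(W z) / sqrt(width)."""
    W, v = params
    return v @ np.tanh(W @ z) / np.sqrt(v.shape[0])

def param_gradient(params, z):
    """Gradient of phi with respect to all parameters, flattened as [W, v]."""
    W, v = params
    width = v.shape[0]
    hidden = np.tanh(W @ z)
    grad_v = hidden / np.sqrt(width)
    # d phi / d W[i, j] = v[i] * (1 - tanh^2(W[i] . z)) * z[j] / sqrt(width)
    grad_W = np.outer(v * (1.0 - hidden**2), z) / np.sqrt(width)
    return np.concatenate([grad_W.ravel(), grad_v])

def empirical_ntk(params, Z1, Z2):
    """Gram matrix k(z, z') = grad phi(theta_0, z)^T grad phi(theta_0, z')."""
    J1 = np.stack([param_gradient(params, z) for z in Z1])
    J2 = np.stack([param_gradient(params, z) for z in Z2])
    return J1 @ J2.T
```

Since the Gram matrix is a product of Jacobians, it is symmetric positive semidefinite by construction and can be passed to a GP in place of an RBF or Matérn kernel matrix.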

3. Practical ZOO Algorithms Using Kernels

Smoothing-Kernel Gradient Estimation

Both (Lobanov et al., 2023) and (Akhavan et al., 2020) implement a two-point randomized estimator: at iterate $x_t$, generate $r_t \sim \mathrm{Uniform}[-1,1]$ and a direction $e_t$ drawn uniformly from the unit sphere $S^{d-1}$, and use

$$g(x_t; r_t, e_t) = \frac{d}{2h_t} \left[\tilde{f}(x_t + h_t r_t e_t) - \tilde{f}(x_t - h_t r_t e_t)\right] K(r_t) e_t$$

This estimator satisfies

  • Bias: $\| \mathbb{E}[g(x_t; r_t, e_t)] - \nabla f(x_t) \| \leq O(L_\beta h_t^{\beta-1})$
  • Variance: $\mathbb{E}[ \| g(x_t; r_t, e_t) \|^2 ] \leq 4d\kappa \| \nabla f(x_t) \|^2 + O(h_t^2) + O(d^2 \kappa \Delta^2/h_t^2)$

Gradient estimators with these properties are then integrated into projected (accelerated) SGD updates or bandit-style projected gradient schemes (Lobanov et al., 2023, Akhavan et al., 2020).
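The two-point estimator and a projected SGD loop can be sketched as follows. This is an illustration under stated assumptions, not the algorithm of either paper: it uses the simplest kernel satisfying the first two moment conditions, $K(r) = 3r$, a Euclidean-ball constraint set, and step-size and smoothing schedules chosen only for the example; `make_noisy_oracle`, `project_ball`, and `zo_projected_sgd` are hypothetical names.

```python
import numpy as np

def kernel(r):
    # Simplest choice with E[K(r)] = 0 and E[r K(r)] = 1 for r ~ Uniform[-1, 1].
    return 3.0 * r

def make_noisy_oracle(f, sigma, rng):
    """Function-value oracle f(x) + xi with zero-mean Gaussian noise xi."""
    return lambda x: f(x) + sigma * rng.standard_normal()

def sample_unit_sphere(rng, d):
    e = rng.standard_normal(d)
    return e / np.linalg.norm(e)

def two_point_estimate(oracle, x, h, rng):
    """g = d / (2h) * [f~(x + h r e) - f~(x - h r e)] * K(r) * e."""
    d = x.shape[0]
    r = rng.uniform(-1.0, 1.0)
    e = sample_unit_sphere(rng, d)
    diff = oracle(x + h * r * e) - oracle(x - h * r * e)
    return (d / (2.0 * h)) * diff * kernel(r) * e

def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius (assumed domain Q)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def zo_projected_sgd(oracle, x0, steps, rng, radius=5.0, alpha=1.0):
    x = x0.copy()
    for t in range(1, steps + 1):
        h = t ** -0.25             # illustrative smoothing schedule
        eta = 1.0 / (alpha * t)    # illustrative step size for strong convexity alpha
        g = two_point_estimate(oracle, x, h, rng)
        x = project_ball(x - eta * g, radius)
    return x
```

For a quadratic objective the symmetric difference contains only the first-order Taylor term, so the estimator is exactly unbiased there; for general smooth objectives the bias bound above applies and the schedule for `h` trades it off against the noise term in the variance.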

Gaussian Process Surrogates with NTK

For neural-inspired black-boxes, ZOPO combines a local GP surrogate with the NTK, fitting it on a small local neighborhood of query points to minimize cubic complexity. The mean of this surrogate delivers an estimator closely matching the true gradient, and its predictive uncertainty is used to trigger additional local exploration when needed. The next iterate $z_{t+1}$ is selected via a projected gradient update

$$z_{t+1} = \mathrm{Proj}_{\mathcal{Z}}(z_t + \eta_t \mu_t(z_t))$$

where $\mu_t(z_t)$ is the predicted mean gradient, and the projection enforces domain constraints. Local exploration is triggered adaptively whenever predictive uncertainty $\Sigma_t(z_t)$ is consistently large (Hu et al., 2024).
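The derived-gradient GP idea can be sketched with standard GP algebra. The example below is not ZOPO itself: it substitutes an RBF kernel for the NTK so that the kernel derivatives have a simple closed form, assumes a box-shaped domain for the projection, and centers the observations before fitting; all function names and default values are illustrative.

```python
import numpy as np

def rbf(A, B, lengthscale):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def gp_gradient_posterior(z, Z, y, lengthscale=1.0, noise=1e-6):
    """Posterior mean and covariance of grad f(z) given observations (Z, y)."""
    K = rbf(Z, Z, lengthscale) + noise * np.eye(len(Z))
    k_z = rbf(z[None, :], Z, lengthscale)[0]                    # shape (n,)
    # d/dz k(z, z_i) = -(z - z_i) / l^2 * k(z, z_i); J has shape (d, n).
    J = (-(z[None, :] - Z) / lengthscale**2 * k_z[:, None]).T
    # Centering y handles the constant offset, which carries no gradient information.
    mean = J @ np.linalg.solve(K, y - y.mean())
    # Prior covariance of the gradient under an RBF kernel is I / l^2.
    prior_cov = np.eye(len(z)) / lengthscale**2
    cov = prior_cov - J @ np.linalg.solve(K, J.T)
    return mean, cov

def projected_ascent_step(z, grad_mean, step, lower, upper):
    """z + step * mu(z), projected onto the box [lower, upper]^d."""
    return np.clip(z + step * grad_mean, lower, upper)

def needs_exploration(cov, threshold):
    """Flag large predictive uncertainty (assumed criterion: trace above threshold)."""
    return np.trace(cov) > threshold
```

In this sketch `needs_exploration` uses the trace of the gradient covariance as a scalar uncertainty summary; the source only states that exploration is triggered when $\Sigma_t(z_t)$ is consistently large, so the exact criterion here is an assumption.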

4. Performance Analysis and Empirical Results

Kernels that match the function class (e.g., NTK for neural black-boxes, high-order polynomial kernels for analytic smoothness) deliver substantially improved performance:

  • Smoothing-kernel ZOO methods achieve minimax-optimal rates up to a factor of $d$, where $\alpha$ denotes the strong-convexity parameter:
    • Optimization error: $O((d^2/\alpha) T^{-(\beta-1)/\beta})$
    • Sample complexity: $T = O\left( (d^2/(\alpha \epsilon))^{\beta/(\beta-1)} \right)$ to reach $\epsilon$-accuracy (Akhavan et al., 2020).
  • ZO-AccSGD (Lobanov et al., 2023) uses kernel-based estimators with Nesterov acceleration, achieving iteration complexity $N = O(\epsilon^{-1/2})$ (with sufficiently large batch size), and oracle query complexity $T = O(d\sqrt{1/\epsilon})$ when noise and smoothness are favorable.
  • In high-dimensional prompt optimization tasks, replacing RBF/Matérn kernels with NTK in GP surrogates allowed covering 17/20 best tasks (versus 4/20 with generic kernels), with within-5-point optimum-hit rate increasing from $0.35$ to $1.0$ (Hu et al., 2024).

These empirical results affirm that the kernel function is critical in high-dimensional ZOO, dictating both estimator fidelity and overall efficiency.
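To get a feel for how the smoothness order $\beta$ enters the sample-complexity expression $(d^2/(\alpha\epsilon))^{\beta/(\beta-1)}$, the short sketch below evaluates it for a few values of $\beta$. The constants hidden by the $O(\cdot)$ notation are dropped, so the numbers indicate scaling only, and the chosen $d$, $\alpha$, and $\epsilon$ are arbitrary illustrative values.

```python
def queries_to_accuracy(d, alpha, eps, beta):
    """Rate expression (d^2 / (alpha * eps))^(beta / (beta - 1)), constants dropped."""
    if beta <= 1:
        raise ValueError("beta must exceed 1 for the rate expression to be defined")
    return (d**2 / (alpha * eps)) ** (beta / (beta - 1.0))

# Same problem size and target accuracy, increasing smoothness order.
budgets = {
    beta: queries_to_accuracy(d=10, alpha=1.0, eps=0.1, beta=beta)
    for beta in (2, 3, 6)
}
```

With these inputs the base $d^2/(\alpha\epsilon)$ equals 1000, so the expression falls from $1000^2$ at $\beta = 2$ toward $1000$ as $\beta$ grows, which is the sense in which higher-order kernels reduce the query budget.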

5. Algorithmic and Computational Considerations

Kernel choice fundamentally affects computational complexity:

| Kernel Type | Cost/Form | Structure/Efficiency Strategy |
|---|---|---|
| Polynomial kernel | $O(d)$ per estimator | (Batch) stochastic parallelization |
| NTK (ZOPO) | $O(n^2 p d)$ for kernel matrix | Restrict GP to $n \approx 20$ neighbors |
| | $O(n^3)$ for inversion | Use small neural networks, regularize |

ZOPO restricts GP fitting to a neighborhood of at most $n$ points, leverages a two-layer MLP to moderate parameter count, and regularizes the GP with a $\sigma^2 I$ term for numerical stability (Hu et al., 2024). High-order polynomial kernels remain computationally attractive so long as their moment constraints match the function smoothness.
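The two cost-control measures, a bounded neighborhood and a $\sigma^2 I$ regularizer, can be sketched generically. The helpers below are hypothetical and kernel-agnostic: `nearest_neighbors` selects the local fitting set from a query history, and `regularized_solve` performs the cubic-cost solve on the resulting small Gram matrix; the default values are illustrative.

```python
import numpy as np

def nearest_neighbors(history, z, n=20):
    """Indices of the (at most) n previously queried points closest to z."""
    dists = np.linalg.norm(history - z, axis=1)
    n = min(n, len(history))
    return np.argsort(dists)[:n]

def regularized_solve(K, y, noise=1e-4):
    """Solve (K + noise * I) a = y by Cholesky factorization.

    The factorization costs O(n^3) in the neighborhood size n, which is why
    n is kept small; the noise term keeps the system positive definite even
    when K is rank-deficient.
    """
    L = np.linalg.cholesky(K + noise * np.eye(len(K)))
    # Two solves against the Cholesky factor: L w = y, then L^T a = w.
    return np.linalg.solve(L.T, np.linalg.solve(L, y))
```

A typical use would build the Gram matrix only over `history[nearest_neighbors(history, z)]`, so the solve stays at a fixed small size regardless of how many queries have been issued.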

6. Kernel Selection Guidelines and Current Limitations

Kernel selection is guided by the following principles:

  • Match the function class: For neural black-boxes, NTK or neural-inspired kernels provide the relevant geometric inductive bias; for function classes with explicit Hölder/analytic smoothness, high-order moment-vanishing polynomial kernels are preferred (Hu et al., 2024, Lobanov et al., 2023, Akhavan et al., 2020).
  • Ensure kernels are at least twice differentiable for compatibility with derived-gradient GPs.
  • Prefer local surrogates to control $O(n^3)$ matrix costs in GP-based schemes.
  • Regularize GPs (e.g., with $\sigma^2 I$) and project iterates back to the trusted domain.
  • For black-boxes with unknown structure, composite kernels (e.g., NTK+RBF) may capture both global smoothness and local function geometry.

A plausible implication is that inappropriate kernel choice leads directly to a collapse in optimizer performance—e.g., generic smooth kernels for neural models yield dramatic losses in task coverage and convergence quality (Hu et al., 2024).

7. Minimax Optimality and Theory-Driven Design

Analyses in (Akhavan et al., 2020) and (Lobanov et al., 2023) establish that smoothing-kernel ZOO methods are, up to dimension-dependent constants, minimax-optimal for both optimization error and sample complexity in the presence of strong convexity and high-order smoothness. Lower bounds show that no algorithm accessing the function via sequential queries can improve significantly on these rates without further structural assumptions.

Furthermore, the use of high-order moment-vanishing kernels and bandit-style two-point estimators enables sharp bias-variance tradeoffs, providing $O(d/\sqrt{T})$ error rates for quadratic smoothness ($\beta = 2$) and interpolating smoothly to $T \sim O(d^2/(\alpha \epsilon))$ for analytic cases.

In Gaussian-process surrogate frameworks (e.g., ZOPO), leveraging a function-appropriate kernel (NTK) in the GP surrogate provides a nearly unbiased gradient surrogate, with uncertainty estimates guiding exploration and resulting in state-of-the-art query efficiency.


Zeroth-order optimization kernels are thus foundational to modern black-box and derivative-free learning, functioning as the principal controls for algorithmic bias, variance, computational tractability, sample complexity, and, ultimately, empirical success across domains that resist direct gradient computation. Their development and analysis exemplify the synthesis of optimization theory, machine learning, and practical algorithm design (Hu et al., 2024, Lobanov et al., 2023, Akhavan et al., 2020).
