RESZO: Regression-Based Single-Point ZO

Updated 27 November 2025
  • Regression-Based Single-Point Zeroth-Order Optimization (RESZO) is a derivative-free method that uses regression on historical function evaluations to construct surrogate models and estimate gradients with reduced variance.
  • It employs both linear and quadratic surrogate models to capture gradient and curvature information, achieving convergence rates similar to two-point methods while requiring only one function query per iteration.
  • RESZO is particularly effective in online, black-box, and simulation-driven scenarios where obtaining multiple function evaluations is impractical or costly.

Regression-Based Single-Point Zeroth-Order Optimization (RESZO) is a class of derivative-free optimization algorithms designed for settings where only a single function evaluation is feasible at each iteration, such as online, black-box, and simulation-driven optimization. The key innovation of RESZO is the use of regression over multiple historical function evaluations to construct local surrogate models, whose gradients serve as low-variance descent directions. This approach achieves convergence rates and query complexities comparable to two-point zeroth-order methods while maintaining the practical and statistical efficiency of single-point evaluations (Chen et al., 6 Jul 2025).

1. Core Principles and Algorithmic Framework

Traditional single-point zeroth-order (SZO) methods estimate gradients from a single sample, e.g., $g_t = \frac{d}{\delta} f(x_t + \delta u_t)\, u_t$ for $u_t$ drawn from a sphere or normal distribution, discarding all previous information. This produces high-variance estimates and slow convergence, requiring $O(d^{3/2}/\varepsilon^{3/2})$ queries to reach stationarity for smooth nonconvex objectives. In contrast, RESZO reuses the most recent $m$ function evaluations to fit a local surrogate model by least-squares regression, then takes the surrogate's gradient as the descent direction. By aggregating historical information, both variance and bias are controlled, accelerating convergence while requiring only one new function evaluation per step.
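
For contrast, here is a minimal NumPy sketch of the classic one-point estimator described above (the objective `f`, point `x`, and radius `delta` are illustrative placeholders, not artifacts of the paper):

```python
import numpy as np

def one_point_zo_gradient(f, x, delta, rng):
    """Classic single-point ZO estimate: g = (d / delta) * f(x + delta * u) * u."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                     # u ~ Unif(S^{d-1})
    return (d / delta) * f(x + delta * u) * u

# Example usage on a toy quadratic (illustrative only):
# rng = np.random.default_rng(0)
# g = one_point_zo_gradient(lambda z: z @ z, np.ones(10), 1e-2, rng)
```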

There are two principal RESZO variants:

  • Linear RESZO (L-RESZO): Fits a local linear surrogate around the current perturbed point using the $m$ most recent samples.
  • Quadratic RESZO (Q-RESZO): Fits a local quadratic surrogate with a diagonal Hessian to capture basic curvature information.

At each iteration $t$:

  1. Sample $u_t \sim \text{Unif}(S_{d-1})$ or $\mathcal{N}(0, I)$ and set $\hat x_t = x_t + \delta u_t$.
  2. Query $f(\hat x_t)$.
  3. Fit a surrogate function $f_t^s(x)$ to $\{(\hat x_{t-i}, f(\hat x_{t-i}))\}_{i=0}^{m-1}$ via least-squares regression.
  4. Update $x_{t+1} = x_t - \eta \nabla f_t^s(x_t)$.

This regression strategy allows RESZO to leverage the information content of multiple, costly function queries for each update, closing the gap to multi-query (two-point) methods (Chen et al., 6 Jul 2025).

2. Surrogate Model Construction and Algorithmic Implementation

The surrogate at time $t$ is built from the $m$ most recent perturbed points and corresponding function values.

  • Linear surrogate:

$$f_t^s(x) = f(\hat x_t) + g_t^\top(x - \hat x_t) + c_t$$

The coefficient $g_t$ (gradient estimate) and offset $c_t$ are given by the least-squares solution:

$$X_t = \begin{bmatrix} \Delta x_{t,1}^\top & 1 \\ \vdots & \vdots \\ \Delta x_{t,m}^\top & 1 \end{bmatrix} \in \mathbb{R}^{m \times (d+1)}, \qquad y_t = \begin{bmatrix} \Delta f_{t,1} \\ \vdots \\ \Delta f_{t,m} \end{bmatrix}, \qquad \begin{bmatrix} g_t \\ c_t \end{bmatrix} = (X_t^\top X_t)^\dagger X_t^\top y_t$$

where $\Delta x_{t,i} = \hat x_{t-i+1} - \hat x_t$ and $\Delta f_{t,i} = f(\hat x_{t-i+1}) - f(\hat x_t)$.
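
A minimal NumPy sketch of this least-squares fit (illustrative, not the authors' code; `pts` holds the $m$ most recent perturbed points with `pts[0]` being $\hat x_t$, and `vals` the corresponding function values):

```python
import numpy as np

def fit_linear_surrogate(pts, vals):
    """Fit f_t^s(x) = f(x_hat_t) + g^T (x - x_hat_t) + c by least squares."""
    pts, vals = np.asarray(pts, dtype=float), np.asarray(vals, dtype=float)
    dX = pts - pts[0]                                  # rows: Delta x_{t,i}
    df = vals - vals[0]                                # entries: Delta f_{t,i}
    X = np.hstack([dX, np.ones((len(pts), 1))])        # m x (d+1) design matrix
    coef, *_ = np.linalg.lstsq(X, df, rcond=None)      # pseudo-inverse solution
    g, c = coef[:-1], coef[-1]                         # gradient estimate and offset
    return g, c
```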

  • Quadratic surrogate:

Fits a diagonal-Hessian quadratic form,

$$f_t^s(x) = f(\hat x_t) + g_t^\top(x - \hat x_t) + \frac{1}{2} (x-\hat x_t)^\top \operatorname{diag}(h_t)\,(x-\hat x_t) + c_t,$$

with regression matrices extended accordingly.
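
Analogously, a hedged sketch of the quadratic fit with a diagonal-Hessian term (the column layout is illustrative; the paper denotes the augmented design matrix $Z_t$):

```python
import numpy as np

def fit_quadratic_surrogate(pts, vals):
    """Fit f(x_hat_t) + g^T dx + 0.5 * dx^T diag(h) dx + c, with dx = x - x_hat_t."""
    pts, vals = np.asarray(pts, dtype=float), np.asarray(vals, dtype=float)
    dX = pts - pts[0]
    df = vals - vals[0]
    d = dX.shape[1]
    Z = np.hstack([dX, 0.5 * dX**2, np.ones((len(pts), 1))])   # m x (2d+1) design matrix
    coef, *_ = np.linalg.lstsq(Z, df, rcond=None)
    g, h, c = coef[:d], coef[d:2 * d], coef[-1]                # gradient, diag Hessian, offset
    return g, h, c
```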

Pseudocode for L-RESZO:

Input: initial x₀, smoothing radius δ, stepsize η, window m, total iterations T
For t = 0 to m−1:
    Run a standard one-point or residual-feedback SZO update   (warm-up phase)
For t = m to T−1:
    Sample u_t; set hat_x_t = x_t + δ·u_t; query f(hat_x_t)
    Collect the past m points (hat_x_{t−i}, f(hat_x_{t−i})), i = 0, ..., m−1
    Construct the regression matrix X_t and vector y_t as above
    Solve [g_t; c_t] = (X_tᵀ X_t)† X_tᵀ y_t
    Update x_{t+1} = x_t − η g_t

Q-RESZO follows the same structure, with the regression matrix $Z_t$ augmented by squared terms to capture diagonal curvature.
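
Putting the pieces together, a hedged end-to-end NumPy sketch of L-RESZO (not the authors' reference implementation; the warm-up phase here simply uses the classic one-point update, and `f` is any black-box objective):

```python
import numpy as np

def l_reszo(f, x0, delta=1e-2, eta=1e-3, m=None, T=1000, seed=0):
    """Linear RESZO: one query per iteration, gradient from a sliding-window LS fit."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    d = x.shape[0]
    m = m if m is not None else d + 10            # window size m >= d (see Section 4)
    pts, vals = [], []                            # buffers of perturbed points / values
    for t in range(T):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        x_hat = x + delta * u
        y = f(x_hat)                              # the single function query of this step
        pts.append(x_hat); vals.append(y)
        if t < m:                                 # warm-up: classic one-point estimate
            g = (d / delta) * y * u
        else:                                     # regression over the last m samples
            P, V = np.asarray(pts[-m:]), np.asarray(vals[-m:])
            X = np.hstack([P - x_hat, np.ones((m, 1))])
            coef, *_ = np.linalg.lstsq(X, V - y, rcond=None)
            g = coef[:-1]
        x = x - eta * g
    return x
```

The warm-up could instead use the residual-feedback SZO update mentioned in the pseudocode, and $\delta$ can be adapted as described in Section 4.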

3. Theoretical Guarantees and Convergence Analysis

Under standard smoothness assumptions, the regression-based gradient $g_t$ approximates $\nabla f(\hat x_t)$ with error controlled by the window size $m$, the step-size $\eta$, and the geometry of the perturbations. Theoretical results for L-RESZO include:

  • Gradient-Error Control:

Under $L$-smoothness, there exists $C_d>0$ such that for all $t \ge m$,

$$\|g_t - \nabla f(\hat x_t)\| \leq C_d \frac{L}{2} \max_{i=1,\ldots,m-1} \|\hat x_{t-i} - \hat x_t\|$$

for a dimension- and schedule-dependent $C_d$.

  • Smooth Nonconvex Case:

For $\eta = \Theta(1/(d C_d L))$ and $T>m$,

$$\frac{1}{T} \sum_{t=m}^{T-1} \|\nabla f(x_t)\|^2 \leq O\!\left(d\, C_d L\, \frac{f(x_m) - f^*}{T}\right).$$

  • Strongly Convex Case:

For smooth $\mu$-strongly convex objectives,

$$f(x_T) - f^* \leq \left(1 - \Theta\!\left(\mu/(d C_d L)\right)\right)^{T-m} \left(f(x_m) - f^*\right).$$

  • Query Complexity:

| Setting | Two-point ZO | L-RESZO |
|---|---|---|
| Smooth nonconvex | $O(d/\varepsilon)$ | $O(C_d\, d/\varepsilon)$ |
| Smooth $\mu$-strongly convex | $O((d/\mu)\log(1/\varepsilon))$ | $O((C_d\, d/\mu)\log(1/\varepsilon))$ |

Empirically, $C_d$ behaves as $O(\sqrt{d})$. This suggests that in high dimensions, L-RESZO achieves query complexity comparable (up to a moderate factor) to two-point ZO methods, outperforming standard SZO by a significant margin (Chen et al., 6 Jul 2025).
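
For completeness, a brief sketch of how the stated rates yield the query counts in the table above (treating $L$ and the initial gap $f(x_m) - f^*$ as constants, and recalling that each iteration issues exactly one query):

$$
\begin{aligned}
\text{nonconvex:} \quad & \frac{1}{T}\sum_{t=m}^{T-1}\|\nabla f(x_t)\|^2 \le \varepsilon
\quad \text{once} \quad T = O\!\left(d\, C_d L\, \frac{f(x_m)-f^*}{\varepsilon}\right) = O\!\left(\frac{C_d\, d}{\varepsilon}\right), \\
\text{strongly convex:} \quad & f(x_T) - f^* \le \varepsilon
\quad \text{once} \quad T - m = O\!\left(\frac{d\, C_d L}{\mu}\log\frac{f(x_m)-f^*}{\varepsilon}\right) = O\!\left(\frac{C_d\, d}{\mu}\log\frac{1}{\varepsilon}\right).
\end{aligned}
$$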

4. Empirical Performance and Practical Considerations

Comprehensive experiments on noiseless ridge regression, logistic regression, Rosenbrock, and neural network training with $d=100$–$200$ confirm that both L-RESZO and Q-RESZO converge at essentially the same per-iteration rate as two-point ZO, while using only one query per step. Thus, in terms of function query complexity, RESZO is approximately twice as efficient. Both RESZO variants also substantially outperform residual-feedback SZO. Q-RESZO converges slightly faster than L-RESZO due to its access to basic curvature information.

Stability and precision are sensitive to the perturbation radius $\delta$, and several practical choices matter:

  • $\delta=0$ causes oscillations or divergence.
  • A small, positive $\delta$ increases precision but can hurt stability if chosen too small.
  • Adapting $\delta_t = \eta\|g_{t-1}\|$ balances stability and optimality.
  • Window size: $m \geq d$ is necessary for a full-rank surrogate fit; $m \approx d+10$ is used in practice.
  • Overhead: each iteration requires an $m \times (d+1)$ or $m \times (2d+1)$ least-squares regression, which can be updated efficiently via rank-one matrix updates (see the sketch below).
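
Regarding the last point, one way such updates could be realized (a sketch, not the paper's implementation): if the affine model is fit to the raw pairs $(\hat x, f(\hat x))$ with an intercept, which yields the same gradient as the difference form whenever the design matrix has full column rank, then sliding the window adds one row and removes one row of the Gram matrix, and its inverse can be refreshed with two Sherman-Morrison steps:

```python
import numpy as np

def sherman_morrison(A_inv, v, sign=1.0):
    """Inverse of (A + sign * v v^T), given A_inv; assumes the update keeps A invertible."""
    Av = A_inv @ v
    return A_inv - sign * np.outer(Av, Av) / (1.0 + sign * (v @ Av))

def slide_window_gram_inverse(A_inv, row_new, row_old):
    """Update (Z^T Z)^{-1} when row_new enters the regression window and row_old leaves."""
    A_inv = sherman_morrison(A_inv, row_new, +1.0)   # rank-one update for the new sample
    A_inv = sherman_morrison(A_inv, row_old, -1.0)   # rank-one downdate for the oldest sample
    return A_inv
```

Here each row is the regressor vector of one stored sample (e.g., $[\hat x^\top\ 1]$), and the regression coefficients follow from the updated inverse applied to $Z^\top y$.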

5. Advantages, Limitations, and Comparison

Advantages

  • Single function query per step, with far better variance and convergence properties than classic one-point estimators.
  • Systematic reuse of historical data: all available function evaluations contribute to each gradient estimate.
  • Rates matching two-point ZO up to a moderate, empirically mild factor ($C_d=O(\sqrt{d})$).

Limitations

  • Assumption A2 dependence: the full theoretical guarantee requires that the regression error, as captured by $C_d$, remains bounded, a property observed empirically but without sharp theoretical bounds for large $d$.
  • Noiseless analysis: current convergence results apply only to deterministic objective evaluations.
  • Storage and batch size: a buffer of at least $d$ past queries must be maintained for the surrogate regression.

Comparison with other ZO methods

| Method | Queries per step | Uses history | Query complexity (nonconvex) |
|---|---|---|---|
| Classic SZO | 1 | No | $O(d^{3/2}/\varepsilon^{3/2})$ |
| Two-point ZO | 2 | Not required | $O(d/\varepsilon)$ |
| Residual-feedback SZO | 1 | Previous evaluation only | Improved, but not regression-based |
| RESZO (proposed) | 1 | Yes (window $m$) | $O(C_d\, d/\varepsilon)$ |

6. Applications and Extensions

RESZO is particularly advantageous in settings where only single function queries are feasible at each iteration:

  • Online and dynamic optimization, where the objective may change over time and repeated querying is impossible.
  • Bandit settings, expensive simulation, and hyperparameter tuning.
  • Reinforcement learning and power systems control, where function evaluation is costly or resource-limited.
  • Safety-critical control systems, where repeated, identical actions are not permissible.

A plausible implication is the applicability of RESZO to reinforcement learning and simulation-based policy optimization under severe query limitations.

7. Open Problems and Future Directions

Although RESZO marks a substantial advance for single-point ZO, several technical challenges remain:

  • Extending theory to noisy evaluations (stochastic objectives).
  • Developing high-probability regret/convergence bounds.
  • Rigorous bounding of the regression constant $C_d$ in high-dimensional regimes.
  • Improving adaptive strategies for window size and perturbation radius.
  • Incorporating variance reduction and acceleration mechanisms.

Potential extensions may include mirror-descent variants, non-Euclidean sampling schemes, or combination with control-oriented feedback designs (Chen et al., 6 Jul 2025).


For the definitive introduction, formal algorithmic details, theoretical analysis, and empirical comparisons, see "Regression-Based Single-Point Zeroth-Order Optimization" (Chen et al., 6 Jul 2025).
