RESZO: Regression-Based Single-Point ZO

Updated 27 November 2025
  • Regression-Based Single-Point Zeroth-Order Optimization (RESZO) is a derivative-free method that uses regression on historical function evaluations to construct surrogate models and estimate gradients with reduced variance.
  • It employs both linear and quadratic surrogate models to capture gradient and curvature information, achieving convergence rates similar to two-point methods while requiring only one function query per iteration.
  • RESZO is particularly effective in online, black-box, and simulation-driven scenarios where obtaining multiple function evaluations is impractical or costly.

Regression-Based Single-Point Zeroth-Order Optimization (RESZO) is a class of derivative-free optimization algorithms designed for settings where only a single function evaluation is feasible at each iteration, such as online, black-box, and simulation-driven optimization. The key innovation of RESZO is the use of regression over multiple historical function evaluations to construct local surrogate models, whose gradients serve as low-variance descent directions. This approach achieves convergence rates and query complexities comparable to two-point zeroth-order methods while maintaining the practical and statistical efficiency of single-point evaluations (Chen et al., 6 Jul 2025).

1. Core Principles and Algorithmic Framework

Traditional single-point zeroth-order (SZO) methods estimate gradients from a single sample, e.g., $g_t = \frac{d}{\delta} f(x_t + \delta u_t)\, u_t$ for $u_t$ drawn from a sphere or normal distribution, discarding all previous information. This produces high-variance estimates and slow convergence, requiring $O(d^{3/2}/\varepsilon^{3/2})$ queries to reach stationarity for smooth nonconvex objectives. In contrast, RESZO reuses the most recent $m$ function evaluations to fit a local surrogate model by least-squares regression, then takes the surrogate's gradient as the descent direction. By aggregating historical information, both variance and bias are controlled, accelerating convergence while requiring only one new function evaluation per step.
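
For contrast, here is a minimal NumPy sketch of the classic one-point estimator described above (the objective `f`, point `x`, and radius `delta` are illustrative placeholders, not artifacts of the paper):

```python
import numpy as np

def one_point_zo_gradient(f, x, delta, rng):
    """Classic single-point ZO estimate: g = (d / delta) * f(x + delta * u) * u."""
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                     # u ~ Unif(S^{d-1})
    return (d / delta) * f(x + delta * u) * u

# Example usage on a toy quadratic (illustrative only):
# rng = np.random.default_rng(0)
# g = one_point_zo_gradient(lambda z: z @ z, np.ones(10), 1e-2, rng)
```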

There are two principal RESZO variants:

  • Linear RESZO (L-RESZO): Fits a local linear surrogate around the current perturbed point using the $m$ most recent samples.
  • Quadratic RESZO (Q-RESZO): Fits a local quadratic surrogate with a diagonal Hessian to capture basic curvature information.

At each iteration $t$:

  1. Sample $u_t \sim \text{Unif}(S_{d-1})$ or $\mathcal{N}(0, I)$ and set $\hat x_t = x_t + \delta u_t$.
  2. Query $f(\hat x_t)$.
  3. Fit a surrogate function $f_t^s(x)$ to $\{(\hat x_{t-i}, f(\hat x_{t-i}))\}_{i=0}^{m-1}$ via least-squares regression.
  4. Update $x_{t+1} = x_t - \eta \nabla f_t^s(x_t)$.

This regression strategy allows RESZO to leverage the information content of multiple, costly function queries for each update, closing the gap to multi-query (two-point) methods (Chen et al., 6 Jul 2025).

2. Surrogate Model Construction and Algorithmic Implementation

The surrogate at time $t$ is built from the $m$ most recent perturbed points and corresponding function values.

  • Linear surrogate:

$$f_t^s(x) = f(\hat x_t) + g_t^\top(x - \hat x_t) + c_t$$

The coefficient $g_t$ (gradient estimate) and offset $c_t$ are given by the least-squares solution:

$$X_t = \begin{bmatrix} \Delta x_{t,1}^\top & 1 \\ \vdots & \vdots \\ \Delta x_{t,m}^\top & 1 \end{bmatrix} \in \mathbb{R}^{m \times (d+1)}, \qquad y_t = \begin{bmatrix} \Delta f_{t,1} \\ \vdots \\ \Delta f_{t,m} \end{bmatrix}, \qquad \begin{bmatrix} g_t \\ c_t \end{bmatrix} = (X_t^\top X_t)^\dagger X_t^\top y_t$$

where $\Delta x_{t,i} = \hat x_{t-i+1} - \hat x_t$ and $\Delta f_{t,i} = f(\hat x_{t-i+1}) - f(\hat x_t)$.
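
A minimal NumPy sketch of this least-squares fit (illustrative, not the authors' code; `pts` holds the $m$ most recent perturbed points with `pts[0]` being $\hat x_t$, and `vals` the corresponding function values):

```python
import numpy as np

def fit_linear_surrogate(pts, vals):
    """Fit f_t^s(x) = f(x_hat_t) + g^T (x - x_hat_t) + c by least squares."""
    pts, vals = np.asarray(pts, dtype=float), np.asarray(vals, dtype=float)
    dX = pts - pts[0]                                  # rows: Delta x_{t,i}
    df = vals - vals[0]                                # entries: Delta f_{t,i}
    X = np.hstack([dX, np.ones((len(pts), 1))])        # m x (d+1) design matrix
    coef, *_ = np.linalg.lstsq(X, df, rcond=None)      # pseudo-inverse solution
    g, c = coef[:-1], coef[-1]                         # gradient estimate and offset
    return g, c
```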

  • Quadratic surrogate:

Fits a diagonal-Hessian quadratic form,

$$f_t^s(x) = f(\hat x_t) + g_t^\top(x - \hat x_t) + \frac{1}{2} (x-\hat x_t)^\top \operatorname{diag}(h_t)\,(x-\hat x_t) + c_t,$$

with regression matrices extended accordingly.
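
Analogously, a hedged sketch of the quadratic fit with a diagonal-Hessian term (the column layout is illustrative; the paper denotes the augmented design matrix $Z_t$):

```python
import numpy as np

def fit_quadratic_surrogate(pts, vals):
    """Fit f(x_hat_t) + g^T dx + 0.5 * dx^T diag(h) dx + c, with dx = x - x_hat_t."""
    pts, vals = np.asarray(pts, dtype=float), np.asarray(vals, dtype=float)
    dX = pts - pts[0]
    df = vals - vals[0]
    d = dX.shape[1]
    Z = np.hstack([dX, 0.5 * dX**2, np.ones((len(pts), 1))])   # m x (2d+1) design matrix
    coef, *_ = np.linalg.lstsq(Z, df, rcond=None)
    g, h, c = coef[:d], coef[d:2 * d], coef[-1]                # gradient, diag Hessian, offset
    return g, h, c
```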

Pseudocode for L-RESZO:

Input: initial x₀, smoothing radius δ, stepsize η, window m, total iterations T
For t = 0 to m−1:
    Run a standard one-point or residual-feedback SZO update   (warm-up phase)
For t = m to T−1:
    Sample u_t; set hat_x_t = x_t + δ·u_t; query f(hat_x_t)
    Collect the past m points (hat_x_{t−i}, f(hat_x_{t−i})), i = 0, ..., m−1
    Construct the regression matrix X_t and vector y_t as above
    Solve [g_t; c_t] = (X_tᵀ X_t)† X_tᵀ y_t
    Update x_{t+1} = x_t − η g_t

Q-RESZO follows the same structure, with the regression matrix $Z_t$ augmented by squared terms to capture diagonal curvature.
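
Putting the pieces together, a hedged end-to-end NumPy sketch of L-RESZO (not the authors' reference implementation; the warm-up phase here simply uses the classic one-point update, and `f` is any black-box objective):

```python
import numpy as np

def l_reszo(f, x0, delta=1e-2, eta=1e-3, m=None, T=1000, seed=0):
    """Linear RESZO: one query per iteration, gradient from a sliding-window LS fit."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    d = x.shape[0]
    m = m if m is not None else d + 10            # window size m >= d (see Section 4)
    pts, vals = [], []                            # buffers of perturbed points / values
    for t in range(T):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        x_hat = x + delta * u
        y = f(x_hat)                              # the single function query of this step
        pts.append(x_hat); vals.append(y)
        if t < m:                                 # warm-up: classic one-point estimate
            g = (d / delta) * y * u
        else:                                     # regression over the last m samples
            P, V = np.asarray(pts[-m:]), np.asarray(vals[-m:])
            X = np.hstack([P - x_hat, np.ones((m, 1))])
            coef, *_ = np.linalg.lstsq(X, V - y, rcond=None)
            g = coef[:-1]
        x = x - eta * g
    return x
```

The warm-up could instead use the residual-feedback SZO update mentioned in the pseudocode, and $\delta$ can be adapted as described in Section 4.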

3. Theoretical Guarantees and Convergence Analysis

Under standard smoothness assumptions, the regression-based gradient $g_t$ approximates $\nabla f(\hat x_t)$ with error controlled by the window size $m$, the step-size $\eta$, and the geometry of the perturbations. Theoretical results for L-RESZO include:

  • Gradient-Error Control:

Under $L$-smoothness, there exists $C_d>0$ such that for all $t \ge m$,

$$\|g_t - \nabla f(\hat x_t)\| \leq C_d \frac{L}{2} \max_{i=1,\ldots,m-1} \|\hat x_{t-i} - \hat x_t\|$$

for a dimension- and schedule-dependent $C_d$.

  • Smooth Nonconvex Case:

For $\eta = \Theta(1/(d C_d L))$ and $T>m$,

$$\frac{1}{T} \sum_{t=m}^{T-1} \|\nabla f(x_t)\|^2 \leq O\!\left(d\, C_d L\, \frac{f(x_m) - f^*}{T}\right).$$

  • Strongly Convex Case:

For smooth $\mu$-strongly convex objectives,

$$f(x_T) - f^* \leq \left(1 - \Theta\!\left(\mu/(d C_d L)\right)\right)^{T-m} \left(f(x_m) - f^*\right).$$

  • Query Complexity:

| Setting | Two-point ZO | L-RESZO |
|---|---|---|
| Smooth nonconvex | $O(d/\varepsilon)$ | $O(C_d\, d/\varepsilon)$ |
| Smooth $\mu$-strongly convex | $O((d/\mu)\log(1/\varepsilon))$ | $O((C_d\, d/\mu)\log(1/\varepsilon))$ |

Empirically, $C_d$ behaves as $O(\sqrt{d})$. This suggests that in high dimensions, L-RESZO achieves query complexity comparable (up to a moderate factor) to two-point ZO methods, outperforming standard SZO by a significant margin (Chen et al., 6 Jul 2025).
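
For completeness, a brief sketch of how the stated rates yield the query counts in the table above (treating $L$ and the initial gap $f(x_m) - f^*$ as constants, and recalling that each iteration issues exactly one query):

$$
\begin{aligned}
\text{nonconvex:} \quad & \frac{1}{T}\sum_{t=m}^{T-1}\|\nabla f(x_t)\|^2 \le \varepsilon
\quad \text{once} \quad T = O\!\left(d\, C_d L\, \frac{f(x_m)-f^*}{\varepsilon}\right) = O\!\left(\frac{C_d\, d}{\varepsilon}\right), \\
\text{strongly convex:} \quad & f(x_T) - f^* \le \varepsilon
\quad \text{once} \quad T - m = O\!\left(\frac{d\, C_d L}{\mu}\log\frac{f(x_m)-f^*}{\varepsilon}\right) = O\!\left(\frac{C_d\, d}{\mu}\log\frac{1}{\varepsilon}\right).
\end{aligned}
$$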

4. Empirical Performance and Practical Considerations

Comprehensive experiments on noiseless ridge regression, logistic regression, Rosenbrock, and neural network training with $d=100$–$200$ confirm that both L-RESZO and Q-RESZO converge at essentially the same per-iteration rate as two-point ZO, while using only one query per step. Thus, in terms of function query complexity, RESZO is approximately twice as efficient. Both RESZO variants also substantially outperform residual-feedback SZO. Q-RESZO converges slightly faster than L-RESZO due to its access to basic curvature information.

Stability and precision are sensitive to the perturbation radius $\delta$, and several practical choices matter:

  • $\delta=0$ causes oscillations or divergence.
  • A small, positive $\delta$ increases precision but can hurt stability if chosen too small.
  • Adapting $\delta_t = \eta\|g_{t-1}\|$ balances stability and optimality.
  • Window size: $m \geq d$ is necessary for a full-rank surrogate fit; $m \approx d+10$ is used in practice.
  • Overhead: each iteration requires an $m \times (d+1)$ or $m \times (2d+1)$ least-squares regression, which can be updated efficiently via rank-one matrix updates (see the sketch below).
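
Regarding the last point, one way such updates could be realized (a sketch, not the paper's implementation): if the affine model is fit to the raw pairs $(\hat x, f(\hat x))$ with an intercept, which yields the same gradient as the difference form whenever the design matrix has full column rank, then sliding the window adds one row and removes one row of the Gram matrix, and its inverse can be refreshed with two Sherman-Morrison steps:

```python
import numpy as np

def sherman_morrison(A_inv, v, sign=1.0):
    """Inverse of (A + sign * v v^T), given A_inv; assumes the update keeps A invertible."""
    Av = A_inv @ v
    return A_inv - sign * np.outer(Av, Av) / (1.0 + sign * (v @ Av))

def slide_window_gram_inverse(A_inv, row_new, row_old):
    """Update (Z^T Z)^{-1} when row_new enters the regression window and row_old leaves."""
    A_inv = sherman_morrison(A_inv, row_new, +1.0)   # rank-one update for the new sample
    A_inv = sherman_morrison(A_inv, row_old, -1.0)   # rank-one downdate for the oldest sample
    return A_inv
```

Here each row is the regressor vector of one stored sample (e.g., $[\hat x^\top\ 1]$), and the regression coefficients follow from the updated inverse applied to $Z^\top y$.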

5. Advantages, Limitations, and Comparison

Advantages

  • Single function query per step, with far better variance and convergence properties than classic one-point estimators.
  • Systematic reuse of historical data: all available function evaluations contribute to each gradient estimate.
  • Rates matching two-point ZO up to a moderate, empirically mild factor ($C_d=O(\sqrt{d})$).

Limitations

  • Assumption A2 dependence: the full theoretical guarantee requires that the regression error, as captured by $C_d$, remains bounded, a property observed empirically but without sharp theoretical bounds for large $d$.
  • Noiseless analysis: current convergence results apply only to deterministic objective evaluations.
  • Storage and batch size: a buffer of at least $d$ past queries must be maintained for the surrogate regression.

Comparison with other ZO methods

| Method | Queries per step | Uses history | Query complexity (nonconvex) |
|---|---|---|---|
| Classic SZO | 1 | No | $O(d^{3/2}/\varepsilon^{3/2})$ |
| Two-point ZO | 2 | Not required | $O(d/\varepsilon)$ |
| Residual-feedback SZO | 1 | Previous evaluation only | Improved, but not regression-based |
| RESZO (proposed) | 1 | Yes (window $m$) | $O(C_d\, d/\varepsilon)$ |

6. Applications and Extensions

RESZO is particularly advantageous in settings where only single function queries are feasible at each iteration:

  • Online and dynamic optimization, where the objective may change over time and repeated querying is impossible.
  • Bandit settings, expensive simulation, and hyperparameter tuning.
  • Reinforcement learning and power systems control, where function evaluation is costly or resource-limited.
  • Safety-critical control systems, where repeated, identical actions are not permissible.

A plausible implication is the applicability of RESZO to reinforcement learning and simulation-based policy optimization under severe query limitations.

7. Open Problems and Future Directions

Although RESZO marks a substantial advance for single-point ZO, several technical challenges remain:

  • Extending theory to noisy evaluations (stochastic objectives).
  • Developing high-probability regret/convergence bounds.
  • Rigorous bounding of the regression constant $C_d$ in high-dimensional regimes.
  • Improving adaptive strategies for window size and perturbation radius.
  • Incorporating variance reduction and acceleration mechanisms.

Potential extensions may include mirror-descent variants, non-Euclidean sampling schemes, or combination with control-oriented feedback designs (Chen et al., 6 Jul 2025).


For the definitive introduction, formal algorithmic details, theoretical analysis, and empirical comparisons, see "Regression-Based Single-Point Zeroth-Order Optimization" (Chen et al., 6 Jul 2025).
