Two-Point Random Gradient Estimator

Updated 8 February 2026

The two-point random gradient estimator is a zeroth-order method that approximates gradients using two noisy function evaluations at random perturbations.
It smooths non-smooth functions and reduces bias, proving crucial in black-box optimization, feedback control, and high-dimensional online learning.
Its performance hinges on the choice of randomization scheme and geometry, which influence convergence rates, sample complexity, and robustness to noise.

A two-point random gradient estimator is a zeroth-order optimization tool that constructs an unbiased or asymptotically unbiased estimate of a function's gradient using only two noisy function evaluations at random perturbations of the current point. It is foundational in model-free optimization, black-box feedback control, and high-dimensional online learning, where gradient information is unavailable or infeasible to compute. The estimator's properties—variance, bias, minimality, adaptivity—are governed by the choice of randomization scheme and the geometry of the problem, with substantial impacts on convergence rates, sample complexity, and robustness to noise.

1. Mathematical Formulations and Fundamental Properties

Given a smooth function $f:\mathbb{R}^d\to\mathbb{R}$ and a random direction $u$ drawn from a prescribed distribution, the canonical two-point estimator is given by

$g_{\delta}(x;u) = \frac{f(x+\delta u)-f(x-\delta u)}{2\delta}\, u, \quad \delta>0$

(Ma et al., 22 Oct 2025). For fixed-perturbation estimators (e.g., forward difference), a closely related form is

$\tilde{g}_\delta(x; v) = \frac{v}{\delta}\left[f(x+\delta v) - f(x)\right]$

(Mehrnoosh et al., 15 Sep 2025). The estimator is asymptotically unbiased as $\delta\to 0$ , and under mild regularity on $f$ ( $C^2$ smoothness),

$\mathbb{E}_u[g_\delta(x; u)] = \nabla f(x) + O(\delta^2)$

(Ma et al., 22 Oct 2025). The key requirement is the so-called $I$ -unbiasedness condition: $\mathbb{E}[u u^\top]=I_d$ , ensuring correct mean scaling.

The estimator’s variance, critical for optimization efficiency, is strongly dictated by the distribution of $u$ . For small $\delta$ and $a=\nabla f(x)$ ,

$\mathbb{E}_u \left\| (u^\top a) u - a \right\|^2 = a^\top \left( \mathbb{E}[ (u u^\top)^2 ] - I_d \right) a$

Thus, finding randomizations that minimize $\mathbb{E}[ (u u^\top)^2 ]$ under the unbiasedness constraint is essential (Ma et al., 22 Oct 2025). This analysis underpins recent developments in minimum-variance and geometry-adapted randomization schemes.

2. Bias, Variance, and Smoothing Effects

Two-point estimators smooth a possibly nonsmooth function by convolution with the perturbation distribution's measure, yielding a differentiable surrogate $f_\delta$ . By linearity: $f_\delta(x) := \mathbb{E}_{u}[f(x + \delta u)]$ with $\nabla f_\delta(x) = \mathbb{E}_u[g_\delta(x; u)]$ . The bias between $f_\delta$ and $f$ is bounded and scales as $O(\delta^2)$ (under $L$ -smoothness or $L$ -Lipschitz assumptions) (Mehrnoosh et al., 15 Sep 2025, Akhavan et al., 2022). For specific randomizations, the smoothing error is further characterized: $|f_h(x) - f(x)| \leq L h\, b_q(d)$ with $b_q(d)$ dependent on the geometry (e.g., $\ell_1$ or $\ell_2$ ) and dimension (Akhavan et al., 2022).

Variance bounds depend crucially on the distribution of $u$ . For Gaussian or uniform-sphere $u$ , the variance of the estimator is typically $O(d \| \nabla f_\delta \|^2)$ , while for the one-point estimator it is $O(d^2)$ (Mehrnoosh et al., 15 Sep 2025). For $\ell_1$ -randomized estimators, dimension-dependent constants and a weighted Poincaré inequality provide precise variance scaling (Akhavan et al., 2022).

3. Optimal Randomization: Minimum-Variance and Directional Schemes

Minimizing estimator variance under unbiasedness leads to a constrained optimization over the space of distributions for $u$ : $\min_{V: \mathbb{E}_{u\sim V}[uu^\top]=I_d} a^\top \mathbb{E}_{u\sim V}[(uu^\top)^2] a$ (Ma et al., 22 Oct 2025). The optimal solutions are split into two analytic families:

Fixed-length randomization: $u$ is supported on $\{u : \|u\|^2 = d\}$ with $\mathbb{E}[uu^\top]=I_d$ (e.g., uniform sphere, Rademacher, random basis).
Directionally Aligned Perturbations (DAP): $u$ satisfies $(u^\top a)^2 = \|a\|^2$ , i.e., exactly aligned or antialigned with the gradient, with $u$ further distributed to keep $\mathbb{E}[uu^\top]=I_d$ .

DAPs can be implemented practically by projecting random samples onto hyperplanes aligned with an estimate of $\nabla f$ (Ma et al., 22 Oct 2025). In settings where the underlying geometry is non-Euclidean, randomization on the $\ell_1$ -sphere (as in $g(x; \zeta)$ for $\zeta\sim\mathrm{Unif}(\mathbb{S}_1^d)$ ) becomes theoretically and empirically advantageous (Akhavan et al., 2022).

Method	Randomization	Key Scaling
Uniform sphere	$\\|u\\|^2=d$	$O(d)$ variance
Gaussian	$u\sim N(0, I)$	$O(d^2)$ variance
$\ell_1$ -sphere	$\\|\zeta\\|_1=1$	$O(d\log d)$ regret
DAP	$(u^\top a)^2=\\|a\\|^2$	Optimal variance $O(d)$

4. Embedding in Optimization Algorithms

Two-point random gradient estimators are embedded in various algorithmic frameworks for black-box optimization, online learning, and feedback control.

Feedback Optimization for Plants: The estimator $g_k^\delta$ , formulated using two consecutive real-time plant evaluations under random perturbations, drives a gradient-free feedback update law for steady-state input selection. Convergence to $\epsilon$ -stationary points for smooth, nonconvex costs is provable at rate $O(\epsilon^{-1})$ , outperforming one-point methods (Mehrnoosh et al., 15 Sep 2025).
Online Dual Averaging: In online convex optimization settings, the estimator is used to drive mirror descent or dual-averaging iterates, with stepsizes and smoothing radius possibly chosen adaptively. For $\ell_1$ -sphere randomization, regret bounds match or improve prior work: $O(L\sqrt{dT})$ for Euclidean balls, $O(L\sqrt{dT \log d})$ for the simplex (Akhavan et al., 2022).
Zeroth-Order SGD: Fixed-length or DAP randomizations generate direction perturbations for each iterate, producing unbiased stochastic gradients. For functions with $L$ -smoothness and bounded fourth perturbation moment, step sizes $\eta\sim T^{-1/2}$ achieve $O(\sqrt{1/T})$ convergence in the mean-squared norm of the gradient, with optimal $O(d/\epsilon^{2})$ sample complexity (Ma et al., 22 Oct 2025).

5. Convergence Analysis and Parameter Selection

Theoretical convergence rates are determined by balancing step size $\eta$ , smoothing parameter $\delta$ , and noise/error levels. For the feedback optimization setting (Mehrnoosh et al., 15 Sep 2025), the key parameters satisfy:

$\eta < 1 / [8L(p+4)]$
$\delta^2 \leq 2\epsilon_\Phi/(Lp)$
Optimal $\delta^2 \propto \sqrt{\mu}$ for plant error $\mu$

Under these settings, after $T$ iterations,

$\frac{1}{T} \sum_{k=0}^{T-1} \mathbb{E}\|\nabla \tilde{\Phi}(u_k)\|^2 \leq \epsilon$

with overall complexity $O(\epsilon^{-1})$ . In online convex optimization, regret bounds also reflect dimension and geometry, with parameter-free variants achieving optimal rates adaptively (Akhavan et al., 2022). For stochastic zeroth-order SGD, the convergence rate in the nonconvex case is

$\min_{1\leq t\leq T} \mathbb{E}\|\nabla f(x_t)\|^2 \leq \frac{f(x_1) - f^*}{\eta T} + O(\eta L^2 d)$

with $T=O(d/\epsilon^2)$ iterations for precision $\epsilon$ (Ma et al., 22 Oct 2025).

6. Practical Implementations: Randomization Schemes and Robustness

Implementation details are shaped by problem structure:

Gaussian and uniform-sphere samplers suit settings with isotropic geometry.
$\ell_1$ -sphere randomization is advantageous for simplex-structured or sparse problems (Akhavan et al., 2022).
DAPs, requiring an ongoing estimate of the gradient for alignment, reduce estimator variance in “effective” directions—empirically yielding significantly smaller MSE and faster convergence in high variance coordinates or “needle-in-haystack” settings (Ma et al., 22 Oct 2025).

Adaptive step-size and smoothing-radius schedules, often of the form $\eta_t \propto 1/\sqrt{\sum_{k=1}^{t-1} \|g_k\|^2}$ and $h_t \propto 1/\sqrt{t}$ , drive parameter-free operation ensuring convergence without prior knowledge of Lipschitz or noise parameters (Akhavan et al., 2022).

Simulator studies (e.g., 10-state, 5-input nonlinear plant, quadratic costs) confirm theoretical predictions: two-point estimators achieve $O(\epsilon^{-1})$ rates, matching model-based or ideal estimators and outperforming one-point schemes, with their performance robust (within limits) to choice of $\delta$ and $\eta$ (Mehrnoosh et al., 15 Sep 2025).

7. Extensions and Connections to Prior Work

Two-point estimators form the basis for a large class of black-box and feedback optimization algorithms, generalizing and often improving upon one-point (finite difference) or randomly perturbed function evaluation schemes. Notably,

Duchi et al. and Shamir achieved optimal rates for $\ell_2/\ell_\infty$ cases using sphere/axis randomization; recent $\ell_1$ -based schemes close logarithmic gaps in dimensional dependence (Akhavan et al., 2022).
The concept of minimum-variance randomizations unifies prior work, with DAPs and fixed-length randomizations yielding the best-possible rates for general smooth objectives and uniform-sphere or simplex geometries (Ma et al., 22 Oct 2025).

This family of estimators continues to evolve, with new randomization schemes and variance reduction strategies targeting increasingly high-dimensional and high-noise applications. Directionally aligned perturbations and geometry-matched randomizations delineate current directions for minimizing sample complexity and improving real-time or adversarial robustness.