
Zeroth-Order Frank-Wolfe (0-FW) Algorithm

Updated 13 October 2025
  • Zeroth-Order Frank-Wolfe (0-FW) is a family of projection-free optimization methods that use function evaluations to approximate gradients and solve constrained problems.
  • These methods incorporate finite-difference schemes, variance reduction, and momentum acceleration to achieve convergence rates comparable to first-order methods, up to known dimension-dependent factors.
  • 0-FW algorithms have been empirically validated in large-scale machine learning tasks such as sparse regression, adversarial attacks, and semidefinite programming.

The Zeroth-Order Frank-Wolfe (0-FW) algorithm family extends projection-free optimization to settings where explicit gradient information is unavailable and the objective can only be queried through function evaluations. These methods address constrained stochastic and finite-sum optimization problems common in large-scale machine learning and black-box tasks. By integrating finite-difference gradient estimation, variance reduction, and momentum mechanisms, 0-FW achieves practical and theoretically principled convergence rates, often matching first-order analogs up to known dimension-dependent factors.

1. Zeroth-Order Gradient Approximation Schemes

Central to all 0-FW algorithms is the gradient approximation step, which replaces $\nabla f(x)$ with a surrogate constructed solely from function evaluations. Several gradient estimation strategies have been developed:

  • Kiefer-Wolfowitz Stochastic Approximation (KWSA): Approximates each coordinate using

$$g(x; y) = \sum_{i=1}^{d} \frac{F(x + c_t e_i; y) - F(x; y)}{c_t}\, e_i$$

requiring $d$ function queries per iteration. KWSA eliminates dimension dependence in the convergence rate.

  • Random Directions Stochastic Approximation (RDSA): Samples a single random direction $z_t$ (e.g., from $\mathcal{N}(0, I)$), yielding

$$g(x_t; y_t, z_t) = \frac{F(x_t + c_t z_t; y_t) - F(x_t; y_t)}{c_t}\, z_t$$

achieving lower query cost but introducing a dimension-dependent bias.

  • Improvised RDSA (I-RDSA): Samples $m < d$ random directions and averages the finite-difference estimates:

$$g_m\bigl(x_t; y_t, \{z_{i,t}\}_{i=1}^{m}\bigr) = \frac{1}{m} \sum_{i=1}^{m} \frac{F(x_t + c_t z_{i,t}; y_t) - F(x_t; y_t)}{c_t}\, z_{i,t}$$

allowing a tunable trade-off between sample efficiency and bias/variance.

Additionally, advanced estimators are used, such as the two-point estimator

$$\hat{\nabla} F(x) = \frac{d}{2\delta} \bigl(F(x + \delta u) - F(x - \delta u)\bigr) u, \qquad u \sim \mathcal{S}^{d-1},$$

which yields unbiased estimates under smoothness assumptions (Zhang, 2021).
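
As a concrete illustration, here is a minimal NumPy sketch of the three estimator families above. The function signatures and the treatment of the stochastic argument $y$ (absorbed into the callable `f`) are illustrative assumptions, not the exact interfaces of the cited papers.

```python
import numpy as np

def kwsa_grad(f, x, c):
    """KWSA: one forward difference per coordinate (d + 1 function queries)."""
    d = x.shape[0]
    fx = f(x)
    g = np.zeros(d)
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0
        g[i] = (f(x + c * e_i) - fx) / c
    return g

def irdsa_grad(f, x, c, m=1, rng=None):
    """I-RDSA: average of m Gaussian-direction forward differences (m + 1 queries).
    With m = 1 this reduces to plain RDSA."""
    rng = rng or np.random.default_rng()
    d = x.shape[0]
    fx = f(x)
    g = np.zeros(d)
    for _ in range(m):
        z = rng.standard_normal(d)
        g += (f(x + c * z) - fx) / c * z
    return g / m

def two_point_grad(f, x, delta, rng=None):
    """Two-point estimator with a uniform direction on the unit sphere (2 queries)."""
    rng = rng or np.random.default_rng()
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return d / (2.0 * delta) * (f(x + delta * u) - f(x - delta * u)) * u
```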

2. Frank-Wolfe Updates and Projection-Free Optimization

After gradient estimation, the update mechanism retains the classic Frank-Wolfe sequence:

  • Linear Minimization Oracle (LMO):

$$v_t = \arg\min_{v \in C}\, \langle d_t, v \rangle$$

substituting the surrogate $d_t$ for the unavailable $\nabla f(x_t)$.

  • Convex Combination Update:

$$x_{t+1} = (1 - \gamma_t)\, x_t + \gamma_t v_t$$

with $\gamma_t$ typically a diminishing step-size sequence.

The absence of explicit projections yields substantial computational savings, especially for feasible sets where projection is more expensive than solving linear subproblems (e.g., PSD cones, $\ell_1$-balls).
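
For concreteness, here is a minimal sketch of the LMO over an $\ell_1$-ball, one of the feasible sets mentioned above; the function name and radius parameter are illustrative.

```python
import numpy as np

def lmo_l1_ball(g, radius=1.0):
    """Linear minimization oracle over {v : ||v||_1 <= radius}.
    The minimizer of <g, v> is a signed, scaled coordinate vector,
    so the oracle costs O(d) and no projection is needed."""
    i = int(np.argmax(np.abs(g)))          # coordinate with largest |g_i|
    v = np.zeros_like(g)
    v[i] = -radius * np.sign(g[i])         # move against the sign of g_i
    return v
```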

Momentum-based gradient tracking and averaging schemes further refine $d_t$, for example:

$$d_t = (1 - \rho_t)\, d_{t-1} + \rho_t\, g(x_t; y_t, \cdot)$$

with appropriately decaying $\rho_t$ to smooth noisy estimates and control tracking error (Sahu et al., 2018, Akhtar et al., 2021).
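
Putting the pieces together, the following is a minimal sketch of a zeroth-order Frank-Wolfe loop with momentum-style gradient tracking, assuming the estimator and LMO sketches above; the step-size, tracking, and smoothing schedules are illustrative rather than the specific rates prescribed in the cited papers.

```python
import numpy as np

def zeroth_order_fw(f, x0, lmo, grad_est, T=1000):
    """Zeroth-order Frank-Wolfe with momentum-style gradient tracking.
    f: black-box objective, lmo: linear minimization oracle over C,
    grad_est(f, x, c): any zeroth-order gradient estimator."""
    x = x0.copy()
    d_t = np.zeros_like(x0)                   # tracked gradient surrogate
    for t in range(1, T + 1):
        c_t = 1.0 / np.sqrt(t)                # finite-difference smoothing radius (illustrative)
        rho_t = 2.0 / (t + 3) ** (2.0 / 3.0)  # gradient-tracking weight (illustrative)
        gamma_t = 2.0 / (t + 2)               # Frank-Wolfe step size
        g = grad_est(f, x, c_t)               # function-value-only gradient estimate
        d_t = (1 - rho_t) * d_t + rho_t * g   # momentum / averaging update
        v = lmo(d_t)                          # linear minimization step
        x = (1 - gamma_t) * x + gamma_t * v   # convex combination update
    return x
```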

3. Variance Reduction and Acceleration Mechanisms

Recent advances focus on reducing the variance introduced by gradient approximation and by stochastic sampling of the objective, which is critical for practical efficiency:

  • SPIDER/SpiderBoost Variance Reduction: Recursively corrects gradient estimators using mini-batch differences (Huang et al., 2020), as sketched in code after this list:

$$v_t = \frac{1}{|B|} \sum_{j \in B} \bigl(\hat{\nabla} f_j(z_t) - \hat{\nabla} f_j(z_{t-1})\bigr) + v_{t-1}$$

  • Double Variance Reduction Framework: Simultaneously mitigates variance from both zeroth-order gradient noise and finite-sum sampling (Ye et al., 13 Jan 2025). The refined estimator update,

$$g_{t+1} = g_t + \frac{b}{d+b+1}\, \hat{\nabla} f(x_{t+1}, U_{t+1}, \mu_{t+1}) - \frac{U_{t+1} U_{t+1}^\top}{d+b+1}\, g_t$$

and probabilistic PAGE-inspired updates together achieve query-efficient convergence in high dimensions.

  • Momentum Acceleration: Sequences $(x_t, y_t, z_t)$ updated via weighted averaging impart acceleration and improved trade-offs analogous to Nesterov or Katyusha schemes (without relying on first-order gradients) (Huang et al., 2020).
  • Trimmed Variants: To reduce the number of LMO calls, schemes reuse previous solutions when the surrogate gradient updates are sufficiently small, decreasing total computational overhead while maintaining convergence (Akhtar et al., 2021).
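
To make the SPIDER-style recursive correction from the first bullet above concrete, here is a hedged sketch that reuses one of the zeroth-order estimators above; the interface and batch handling are illustrative assumptions.

```python
import numpy as np

def spider_zo_update(f_batch, z_t, z_prev, v_prev, zo_grad, c):
    """SPIDER-style recursive correction of a zeroth-order gradient estimate.
    f_batch: list of sampled component functions f_j,
    zo_grad(f, x, c): any finite-difference estimator (e.g. irdsa_grad)."""
    diff = np.zeros_like(z_t)
    for f_j in f_batch:
        # difference of estimates at the new and previous iterates
        diff += zo_grad(f_j, z_t, c) - zo_grad(f_j, z_prev, c)
    return diff / len(f_batch) + v_prev
```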

4. Theoretical Guarantees and Convergence Rates

Zeroth-order Frank-Wolfe variants offer rigorous convergence bounds:

| Gradient Estimator | Rate / Query Complexity (Convex) | Rate / Query Complexity (Nonconvex) | Dimension Factor |
|---|---|---|---|
| KWSA (coordinate-wise) | $O(T^{-1/3})$, no dimension dependence | $O(T^{-1/4})$, no dimension dependence | None |
| RDSA ($m=1$) | $O(d^{1/3}/T^{1/3})$ | $O(d^{1/3}/T^{1/4})$ | $d^{1/3}$ |
| I-RDSA ($m<d$) | $O((d/m)^{1/3}/T^{1/3})$ | $O((d/m)^{1/3}/T^{1/4})$ | $(d/m)^{1/3}$ |
| Two-point (sphere) | $O(d/\epsilon^3)$ | $O(d/\epsilon^3)$ | $d$ |
| Double Variance Reduction | $O(d\sqrt{n}/\epsilon)$ | $O(d^{3/2}\sqrt{n}/\epsilon^2)$ | $d$, $d^{3/2}$ |

  • For convex functions, the primal sub-optimality gap is $O((d/m)^{1/3}/T^{1/3})$, the best-known dimension dependence for one directional derivative per iteration (Sahu et al., 2018).
  • Nonconvex functions are analyzed via the Frank-Wolfe duality gap $G(x) = \max_{v \in C} \langle \nabla f(x), x-v \rangle$, yielding $O((d/m)^{1/3}/T^{1/4})$ (Sahu et al., 2018).
  • Recent frameworks attain $O(d\sqrt{n}/\epsilon)$ (convex) and $O(d^{3/2}\sqrt{n}/\epsilon^2)$ (nonconvex) query complexities for finite-sum objectives (Ye et al., 13 Jan 2025).

Dimension dependence in query complexity, such as $O(d^{1/3})$ or $O(d)$, is inherent in finite-difference estimators using random or coordinate directions; the cited bounds are established as optimal under these oracle constraints.
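
As a back-of-the-envelope check (a hypothetical calculation assuming $m+1$ function queries per I-RDSA iteration and ignoring constants), the convex rate in the table translates into a total query count as follows: requiring the gap bound to fall below a target accuracy $\epsilon$ gives

$$\left(\frac{d/m}{T}\right)^{1/3} \le \epsilon \;\Longrightarrow\; T \gtrsim \frac{d/m}{\epsilon^3}, \qquad \text{total queries} \approx (m+1)\,T = O\!\left(\frac{d}{\epsilon^3}\right),$$

consistent with the $d$ factor reported for the two-point estimator.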

5. Empirical Performance in High-Dimensional Machine Learning

Zeroth-order Frank-Wolfe algorithms have demonstrated empirical efficacy on benchmark and real-world optimization problems:

  • Sparse Regression (Lasso, Logistic): Algorithms achieve practical performance comparable to first-order Frank-Wolfe in terms of oracle calls, while scaling to hundreds or thousands of dimensions (Sahu et al., 2018, Ye et al., 13 Jan 2025); a toy usage sketch appears at the end of this section.
  • Survival Analysis (Cox Regression): In high dimensions ($d \approx 9376$), zeroth-order methods remain competitive despite dimension-dependent gaps relative to first-order analogs (Sahu et al., 2018).
  • Adversarial Attacks/Robust Classification: Accelerated variants attain faster decrease of the objective and improved success rates against deep networks in black-box settings and under $\ell_1$ or $\ell_2$ constraints (Huang et al., 2020, Ye et al., 13 Jan 2025).
  • Semidefinite Programming: Efficiently solves problems arising in sparse matrix estimation, clustering via semidefinite relaxation, and uniform sparsest cut, all requiring challenging projection steps that are avoided via FW updates (Akhtar et al., 2021).
  • Monotone Submodular Maximization: Black-box continuous greedy algorithms extend 0-FW to settings where submodularity and monotonicity are present, establishing $(1-1/e)-\epsilon$ approximation guarantees (Zhang, 2021).

Improvements in query complexity and convergence, especially for double variance reduction and momentum-accelerated schemes, are validated across applications and datasets.
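
As a usage illustration tying the earlier sketches to the sparse-regression setting, the following synthetic example runs the zeroth-order FW loop on a least-squares objective over an $\ell_1$-ball; the data, radius, and estimator settings are arbitrary choices for demonstration, and the snippet assumes the `zeroth_order_fw`, `lmo_l1_ball`, and `irdsa_grad` sketches defined above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 500
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:5] = 1.0                                   # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(n)

# Black-box objective: the optimizer only sees function values.
loss = lambda x: 0.5 * np.sum((A @ x - b) ** 2) / n

x_hat = zeroth_order_fw(
    loss,
    x0=np.zeros(d),
    lmo=lambda g: lmo_l1_ball(g, radius=5.0),        # constraint ||x||_1 <= 5
    grad_est=lambda f, x, c: irdsa_grad(f, x, c, m=10),
    T=2000,
)
print("objective:", loss(x_hat), "l1 norm:", np.abs(x_hat).sum())
```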

6. Algorithmic Trade-offs, Limitations, and Relations to Other Methods

Zeroth-order Frank-Wolfe algorithms present known trade-offs:

  • Dimension Dependence: Dimension factors are unavoidable when using one (or a few) directional derivatives per iteration; the optimal exponent in the convergence rate is established at $1/3$ (Sahu et al., 2018).
  • Query Efficiency: KWSA removes dimension factors but at the expense of $d$ queries per iteration; RDSA/I-RDSA trade query cost for accuracy. Double variance reduction minimizes required queries for each iteration (Ye et al., 13 Jan 2025).
  • Momentum and Trimming: Acceleration via momentum and reduction of LMO calls through trimming improve computational efficiency for large problems (Huang et al., 2020, Akhtar et al., 2021).
  • Comparison to 1-SFW/QFW: When gradients are available and projections affordable, first-order and quantized FW variants achieve superior rates. Zeroth-order methods are indispensable for black-box tasks or computationally expensive projections (Zhang, 2021).

7. Significance for Large-Scale and Black-Box Optimization

The 0-FW family delivers a theoretically grounded suite of projection- and gradient-free algorithms for modern machine learning, simulation, and statistical applications. By leveraging only function values, these methods broaden the class of solvable problems to those with inaccessible gradients and computationally costly projections. The convergence, empirical performance, and scalability in high dimensions position 0-FW algorithms as foundational tools in derivative-free and projection-free optimization.

Empirical and theoretical advances continue to consolidate zeroth-order Frank-Wolfe's role in practical black-box machine learning, with current research achieving the lowest known query complexities for these oracle models (Ye et al., 13 Jan 2025).
