Zero-Order Optimization Techniques

Updated 1 July 2025

Zero-order optimization is a derivative-free approach that minimizes functions using only function evaluations.
It employs finite-difference and kernel-based estimators to approximate gradients efficiently in high-dimensional and noisy environments.
Applications span machine learning, control, and simulation, making it vital for problems where gradient information is inaccessible.

Zero-order optimization, also known as derivative-free optimization, addresses the problem of minimizing an objective function $f(x)$ when access to its gradient $\nabla f(x)$ is unavailable or computationally prohibitive. Instead, algorithms rely solely on querying function values $f(x)$ at selected points. This scenario arises frequently in real-world applications where the function is a black box, a complex simulation, or part of a system where direct differentiation is not possible. The core challenge in zero-order optimization lies in efficiently estimating gradient information from function values to guide the search process, often in high-dimensional spaces and in the presence of noise.

Gradient Estimation and Core Algorithms

The foundational element of zero-order optimization algorithms is the estimation of gradient information using function evaluations. A common approach involves finite-difference approximations along random directions.

The standard two-point symmetric difference estimator for a function $f(x)$ at point $x$ with smoothing parameter $\mu > 0$ and random direction $u$ (e.g., sampled from a spherical or Gaussian distribution) is given by:

$$\hat{g}(x; \mu, u) = \frac{f(x + \mu u) - f(x - \mu u)}{2\mu} u$$

This estimator provides a nearly unbiased estimate of the gradient of a smoothed version of the function, $f_\mu(x) = \mathbb{E}_{u \sim \mathcal{D}}[f(x + \mu u)]$, where $\mathcal{D}$ is the distribution of $u$. The variance of this estimator typically scales with the dimension $d$.
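The two-point estimator can be illustrated with a minimal pure-Python sketch. It assumes directions drawn uniformly from the unit sphere and rescales by $d$, a common normalization for that choice; the helper names (`random_unit_vector`, `two_point_estimate`, `averaged_estimate`) and the toy objective are hypothetical and not taken from any cited paper.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def two_point_estimate(f, x, mu, rng):
    """One sample of d * (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u.

    The factor d is an assumption of this sketch: it is the usual scaling
    for directions on the unit sphere and is not needed for Gaussian u.
    """
    d = len(x)
    u = random_unit_vector(d, rng)
    x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
    x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
    slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
    return [d * slope * ui for ui in u]


def averaged_estimate(f, x, mu, num_samples, rng):
    """Average independent samples to reduce the variance of the estimate."""
    total = [0.0] * len(x)
    for _ in range(num_samples):
        g = two_point_estimate(f, x, mu, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / num_samples for t in total]


def sum_of_squares(x):
    """Toy objective with gradient 2x."""
    return sum(xi * xi for xi in x)
```

Each sample costs two function queries regardless of $d$; averaging many samples trades extra queries for lower variance.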

Alternatively, coordinate-wise finite differences estimate partial derivatives along standard basis vectors $e_j$:

$$\hat{g}_j(x; \mu, e_j) = \frac{f(x + \mu e_j) - f(x)}{\mu} e_j$$

A full coordinate-wise gradient estimate requires $d$ such queries in addition to the shared evaluation of $f(x)$ (or $2d$ queries for symmetric differences), which can be computationally expensive in high dimensions.
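A minimal sketch of the forward-difference coordinate estimator follows. The function names are hypothetical; the base value $f(x)$ is evaluated once and reused for every coordinate.

```python
def coordinate_estimate(f, x, mu):
    """Forward-difference gradient: one base query plus one query per coordinate."""
    base = f(x)
    grad = []
    for j in range(len(x)):
        x_step = list(x)
        x_step[j] += mu  # perturb only coordinate j
        grad.append((f(x_step) - base) / mu)
    return grad


def sum_of_squares(x):
    """Toy objective with gradient 2x."""
    return sum(xi * xi for xi in x)
```

Unlike a random-direction sample, this estimate is deterministic, but its query cost grows linearly with the dimension.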

Hybrid estimators combine the benefits of random-direction and coordinate-wise methods. The "Zeroth-Order Hybrid Gradient Descent (ZO-HGD)" method (2012.11518) proposes a convex combination of a random gradient estimator (RGE) and a coordinate-wise gradient estimator (CGE) with importance sampling. The hybrid gradient estimator (HGE) is $\hat{\nabla}_{\mathrm{HGE}} F(\mathbf{x}) = \alpha \ \hat{\nabla}_{\mathrm{RGE}} F(\mathbf{x}) + (1-\alpha) \ \hat{\nabla}_{\mathrm{CGE}} F_{\mathcal{I}(\mathbf{x}; \mathbf{p})}$, where $\alpha \in [0, 1]$ balances the two, and $\mathcal{I}$ is a subset of coordinates sampled with probabilities $\mathbf{p}$ based on importance. This allows tuning the trade-off between query efficiency (RGE) and variance reduction (CGE).
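The convex combination can be sketched as below. This is an illustration under stated assumptions, not the ZO-HGD implementation: the RGE uses a single Gaussian direction with a forward difference, and the coordinate subset is formed by keeping coordinate $j$ independently with probability `probs[j]` and reweighting by `1 / probs[j]`. All function names are hypothetical.

```python
import random


def rge(f, x, mu, rng):
    """Random gradient estimate along one Gaussian direction (forward difference)."""
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_step = [xi + mu * ui for xi, ui in zip(x, u)]
    slope = (f(x_step) - f(x)) / mu
    return [slope * ui for ui in u]


def cge_subset(f, x, mu, probs, rng):
    """Coordinate estimate on a random subset of coordinates.

    Assumed sampling scheme: coordinate j is kept with probability probs[j]
    and its finite difference is reweighted by 1 / probs[j].
    """
    base = f(x)
    grad = [0.0] * len(x)
    for j, p in enumerate(probs):
        if rng.random() < p:
            x_step = list(x)
            x_step[j] += mu
            grad[j] = (f(x_step) - base) / (mu * p)
    return grad


def hybrid_estimate(f, x, mu, alpha, probs, rng):
    """Convex combination alpha * RGE + (1 - alpha) * CGE."""
    g_r = rge(f, x, mu, rng)
    g_c = cge_subset(f, x, mu, probs, rng)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g_r, g_c)]


def sum_of_squares(x):
    """Toy objective with gradient 2x."""
    return sum(xi * xi for xi in x)
```

Setting `alpha=1.0` recovers the pure random-direction estimate, while `alpha=0.0` with all probabilities equal to one recovers the full coordinate-wise estimate.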

For non-smooth or highly smooth functions, kernel-based gradient approximations can be employed. These methods use a smoothing kernel function $K(r)$ along with random directions to construct estimators that exploit higher-order smoothness properties, achieving better bias-variance trade-offs as the smoothness order $\beta$ increases (2006.07862, 2305.15828, 2310.02371). For instance, an estimator might take the form $\frac{d}{2\gamma} \Big[ f(x + \gamma r \mathbf{e}) - f(x - \gamma r \mathbf{e}) \Big] K(r) \mathbf{e}$, where $\gamma > 0$ is the smoothing parameter, $\mathbf{e}$ is a random unit direction, and $r$ is a random scalar (for example, uniform on $[-1, 1]$).
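A kernel-weighted estimator of this form can be sketched as follows. The choices here are assumptions made for illustration: $r$ is uniform on $[-1, 1]$, $\mathbf{e}$ is uniform on the unit sphere, and the kernel is $K(r) = 3r$, which satisfies $\mathbb{E}[r K(r)] = 1$ for that distribution of $r$. Kernels for higher smoothness orders are more involved and are not shown.

```python
import math
import random


def kernel(r):
    """Assumed kernel K(r) = 3r, so that E[r * K(r)] = 1 for r uniform on [-1, 1]."""
    return 3.0 * r


def kernel_estimate(f, x, gamma, rng):
    """One sample of d / (2*gamma) * [f(x + gamma*r*e) - f(x - gamma*r*e)] * K(r) * e."""
    d = len(x)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    e = [v / norm for v in g]          # uniform direction on the unit sphere
    r = rng.uniform(-1.0, 1.0)         # random scalar radius
    x_plus = [xi + gamma * r * ei for xi, ei in zip(x, e)]
    x_minus = [xi - gamma * r * ei for xi, ei in zip(x, e)]
    scale = d / (2.0 * gamma) * (f(x_plus) - f(x_minus)) * kernel(r)
    return [scale * ei for ei in e]


def linear(x):
    """Toy objective with constant gradient (1, -2, 0.5)."""
    return 1.0 * x[0] - 2.0 * x[1] + 0.5 * x[2]
```

Averaging many samples of `kernel_estimate` on the linear toy objective recovers its gradient, which is a quick sanity check of the normalization.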

Zero-order algorithms typically follow an iterative scheme akin to gradient descent, replacing the true gradient with its zero-order estimate. For instance, a basic update could be $x_{k+1} = x_k - \eta_k \hat{g}_k$, where $\eta_k$ is the step size. More advanced methods adapt techniques from first-order optimization, such as accelerated gradient descent or mirror descent, by incorporating zero-order gradient estimates (1312.2139, 2310.02371).
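The basic update can be sketched as a short loop. This version assumes a constant step size and a two-point estimate along a Gaussian direction; the function names and the toy objective are hypothetical.

```python
import random


def zo_gradient_descent(f, x0, step_size, mu, num_iters, rng):
    """Iterate x_{k+1} = x_k - eta * g_k with a two-point Gaussian estimate g_k."""
    x = list(x0)
    for _ in range(num_iters):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        # g_k = slope * u, so the step moves along the sampled direction only
        x = [xi - step_size * slope * ui for xi, ui in zip(x, u)]
    return x


def shifted_quadratic(x):
    """Toy objective with minimizer at (1, -2, 3)."""
    target = (1.0, -2.0, 3.0)
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))
```

A call such as `zo_gradient_descent(shifted_quadratic, [0.0, 0.0, 0.0], 0.05, 1e-4, 500, random.Random(0))` uses two function queries per iteration and no gradient information.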

Convergence Theory and Dimensionality Dependence

A central focus in zero-order optimization research is establishing convergence rates and understanding their dependence on the problem dimension $d$. For convex optimization, the cost of not having gradients translates into a performance penalty.

Early analysis showed that single-point query methods for convex optimization can suffer convergence rates quadratic in $d$. However, the availability of paired function values significantly improves the situation. For convex, smooth, stochastic optimization, two-point methods using random direction gradient estimates achieve a convergence rate for the expected optimality gap of $O\left(\frac{RL\sqrt{d}}{\sqrt{T}}\right)$ after $T$ iterations (1312.2139). Here, $R$ is the radius of the feasible set and $L$ bounds the norm of the (stochastic) gradients, that is, it acts as a Lipschitz constant of $f$ rather than of $\nabla f$. This rate is only a factor of $\sqrt{d}$ worse than the $O\left(\frac{RL}{\sqrt{T}}\right)$ rate achievable by stochastic gradient methods when gradients are available. This result was shown to be minimax optimal, matching information-theoretic lower bounds up to logarithmic factors, establishing $\sqrt{d}$ as an unavoidable penalty for black-box convex optimization using two function evaluations per iteration (1312.2139). For non-smooth convex functions, a similar $\sqrt{d \log d}$ penalty is incurred.
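The practical meaning of the $\sqrt{d}$ factor is easiest to see in iteration counts: solving $RL\sqrt{d}/\sqrt{T} \le \varepsilon$ for $T$ gives $T \ge d\,(RL/\varepsilon)^2$, which is $d$ times the first-order count. The sketch below does this arithmetic; the hidden constant in the $O(\cdot)$ bound is unknown here and is set to 1 as an assumption, and the function names are hypothetical.

```python
import math


def two_point_iterations(radius, lipschitz, dim, target_gap, constant=1.0):
    """Smallest T with constant * R * L * sqrt(d) / sqrt(T) <= target_gap."""
    return math.ceil(dim * (constant * radius * lipschitz / target_gap) ** 2)


def first_order_iterations(radius, lipschitz, target_gap, constant=1.0):
    """Smallest T with constant * R * L / sqrt(T) <= target_gap."""
    return math.ceil((constant * radius * lipschitz / target_gap) ** 2)
```

For example, with $R = 2$, $L = 1$, $\varepsilon = 0.5$ and $d = 100$, the first-order bound gives 16 iterations and the two-point bound gives 1600.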

For unconstrained online convex optimization, a one-point Gaussian smoothing method achieves an $O(n^{2/3}T^{2/3})$ regret bound, while a two-point adaptation matches the optimal $O(\sqrt{nT})$ regret, confirming the power of paired evaluations in online settings (1806.05069).

In non-convex optimization, finding approximate stationary points is a standard goal. Recent work has addressed the dimension dependence for nonsmooth non-convex stochastic optimization. Prior state-of-the-art algorithms for finding $(\delta, \epsilon)$-stationary points suffered from a dimension dependence of $\Omega(d^{3/2})$. A more recent algorithm shows that this dependence is not optimal, achieving a complexity of $O(d\delta^{-1}\epsilon^{-3})$, which is optimal with respect to $d$ and also optimal with respect to the accuracy parameters $\delta, \epsilon$. This is achieved by leveraging a connection to the Goldstein subdifferential and utilizing first-order methods whose complexity depends on the Lipschitz constant rather than the smoothness constant, effectively showing that nonsmooth optimization is as easy as smooth optimization in this setting (2307.04504).

For smooth non-convex problems, the standard $O(d/\epsilon^2)$ complexity for finding $\epsilon$-stationary points using random gradient estimates is well-established.

Handling Constraints: Safe and Projected Methods

Constrained zero-order optimization presents unique challenges, particularly when the feasible set or constraint functions are unknown or black-box. Methods for handling constraints in the presence of only function value access include projections onto known sets and techniques for approximating constraints.

For constrained problems where the feasible set is a known convex set, projected gradient methods can be adapted by projecting the iterates after taking a step in the direction of the zero-order gradient estimate.
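A projected variant can be sketched for the simplest case of a Euclidean ball, where the projection has a closed form. The ball constraint, the constant step size, and all function names are assumptions made for illustration. Note that in this sketch only the iterates are projected; the query points $x \pm \mu u$ may fall slightly outside the feasible set.

```python
import math
import random


def project_onto_ball(x, radius):
    """Euclidean projection onto the ball of the given radius centered at the origin."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [radius * v / norm for v in x]


def projected_zo_descent(f, x0, radius, step_size, mu, num_iters, rng):
    """Take a zero-order step, then project the iterate back onto the ball."""
    x = project_onto_ball(x0, radius)
    for _ in range(num_iters):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        step = [xi - step_size * slope * ui for xi, ui in zip(x, u)]
        x = project_onto_ball(step, radius)
    return x


def distance_to_outside_point(x):
    """Toy objective: squared distance to (2, 0), which lies outside the unit ball."""
    return (x[0] - 2.0) ** 2 + x[1] ** 2
```

On this toy problem the constrained minimizer is $(1, 0)$; with a constant step size the iterates hover near it rather than converging exactly, because the gradient does not vanish at a boundary solution.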

For problems with black-box functional constraints $f_i(x) \leq 0$, safe zero-order optimization methods aim to guarantee that all iterates and sampled points remain feasible. One approach constructs local quadratic approximations of the constraint functions around a strictly feasible point and optimizes over the intersection of the corresponding local feasible sets (2303.16659). At each iteration, a quadratically constrained quadratic program (QCQP) subproblem is solved. This method guarantees feasibility of all iterates and convergence to an approximate KKT pair under mild assumptions, requiring $O(d^2/\eta^2)$ samples to find an $\eta$-KKT pair.

An alternative safe method for black-box constraints employs linear programming (LP) subproblems (2304.01797). At each step, gradients of the objective and near-active constraints are estimated using finite differences. An LP is solved to find a descent direction that also moves away from constraint boundaries. A safe step size is determined using local quadratic approximations. This SZO-LP method guarantees feasibility and convergence to a KKT point, demonstrating computational and sample efficiency advantages over previous safe zero-order methods, especially for problems with many constraints, as shown on an Optimal Power Flow problem.

For composite optimization problems with a smooth black-box part and a known non-smooth part, augmented Lagrangian methods (ALM) can be adapted. A "Zeroth-Order Inexact Augmented Lagrangian Method (ZO-iALM)" has been proposed to solve problems with composite objectives and general functional constraints (possibly nonconvex) (2112.11420). This method uses nested loops, including a zero-order inexact proximal point method and a zero-order accelerated proximal coordinate update method as subsolvers, employing multi-point coordinate gradient estimators for accuracy. This approach achieves query complexity results matching the best-known first-order rates up to a factor of $d$, being the first ZO method to do so for this class of general constrained problems.

Robustness to Noise and Uncertainty

Zero-order methods often operate in environments where function evaluations are inherently noisy. Robustness to noise is a critical aspect of their design and analysis. Noise sources can be stochastic (e.g., measurement error, subsampling in machine learning) or adversarial (bounded but arbitrary).

The two-point gradient estimator based on random directions is naturally robust to unbiased stochastic noise, as the expectation removes the noise term, leaving only the function value difference (1312.2139). However, noise can inflate the variance of the estimator, impacting convergence rates and achievable accuracy.

More general noise models, including adversarial noise, can be handled. For distributed zero-order optimization of strongly convex functions, algorithms have been developed that tolerate a general noise model (not requiring zero mean or independence) by carefully designing the gradient estimator using kernel functions (2102.01121). The analysis provides convergence bounds highlighting the interplay between noise level, dimension, strong convexity, smoothness, and network connectivity.

In the context of model predictive control (MPC) for systems with uncertainty, zero-order approaches can enhance robustness. The "zero-order robust optimization (zoRO)" scheme addresses uncertainty in MPC by propagating uncertainty (e.g., disturbance ellipsoids) outside the main optimization loop, using the propagated uncertainty to tighten constraints on the nominal trajectory. This method effectively approximates the Jacobian of uncertainty propagation terms in the optimization, decoupling them from the nominal state/control optimization and significantly reducing computational complexity while maintaining feasibility guarantees (2306.17445, 2311.04557). This tailored Jacobian approximation renders the optimization over uncertainty zero-order with respect to nominal trajectory variables. It allows real-time implementation on resource-constrained hardware and integration with high-performance MPC frameworks like acados.

The convergence analysis of zero-order methods often explicitly quantifies the impact of noise variance on the achievable accuracy or error floor, demonstrating how factors like batch size or smoothing parameters can be tuned to mitigate noise effects (2305.15828, 2310.02371).

Scalability and Advanced Techniques

Scaling zero-order optimization to high-dimensional problems is crucial for many applications. Beyond basic gradient estimation, several techniques address the challenges of high dimensionality and specific problem structures.

Distributed Zero-Order Optimization: For problems where the objective is a sum of functions distributed across a network of agents who can only communicate with neighbors and access local function values, distributed zero-order methods are necessary. Algorithms like "Zeroth-Order Feedback Optimization (ZFO)" enable model-free distributed optimization by having agents estimate their local partial gradients using exchanged local cost differences (2011.09728). The ZFO algorithm provides complexity bounds for convex and nonconvex settings under noisy observations and shows improved efficiency when agents know the local function dependence structure. Distributed zero-order methods under adversarial noise have also been developed, providing convergence guarantees that depend on network connectivity (2102.01121).

Huge-Scale and Sparse Problems: When the dimension $d$ is extremely large (e.g., $d > 10^6$) and standard vector operations are infeasible, specialized "huge-scale" zero-order methods are needed. If the gradient is known or assumed to be sparse, compressed sensing techniques can be integrated into zero-order optimization. The "Zeroth-Order Block Coordinate Descent (ZO-BCD)" algorithm partitions variables into blocks and estimates sparse block gradients using function queries and sparse recovery techniques (like CoSaMP) (2102.10707). Using randomized block assignments and special measurement matrices like partial circulant matrices further reduces per-iteration computational cost and memory footprint, enabling optimization in dimensions exceeding 1.7 million, as demonstrated for adversarial attacks in the wavelet domain.

Exploiting Structure: Beyond sparsity, other structural properties can be leveraged. Higher-order smoothness (i.e., differentiability beyond standard Lipschitz continuous gradients) allows for faster convergence rates in first-order methods. Zero-order methods using kernel-based estimators can exploit such smoothness to improve query complexity and noise robustness (2006.07862, 2305.15828, 2310.02371). For strongly convex problems with higher-order smoothness, kernel approximation methods achieve near-minimax optimal rates matching those of gradient-based methods up to a factor of $d$. For problems satisfying the Polyak-Łojasiewicz (PL) condition, which implies linear convergence for first-order methods, kernel-based zero-order methods also achieve improved error floors and noise tolerance by effectively exploiting higher smoothness (2305.15828, 2310.02371).

Quantized Optimization: In resource-constrained environments with limited memory and computational capacity, optimizing directly on quantized parameters is beneficial. "Zero-Order Quantized Optimization (ZOQO)" proposes using zero-order gradient sign approximations and quantized updates to train models with quantized parameters without needing full-precision gradients (2501.06736). This approach leverages quantized arithmetic throughout the process, enabling memory-efficient adaptation for large models, such as fine-tuning LLMs or generating black-box adversarial attacks, on low-resource hardware.

Practical Applications

Zero-order optimization techniques are applicable across a wide spectrum of fields, particularly where gradient information is inaccessible.

In Machine Learning, ZO methods are essential for:

Black-box adversarial attacks: Generating small input perturbations that cause misclassification of models where gradients are not exposed (2012.11518, 2102.10707).
Neural Architecture Search (NAS): Optimizing architecture parameters based on validation performance, which may involve non-differentiable metrics or models. The ZARTS framework leverages ZO for robust and stable architecture search, avoiding pitfalls of gradient approximations in differentiable NAS methods (2110.04743).
Hyperparameter Tuning: Optimizing performance metrics of complex models over discrete or continuous hyperparameter spaces where gradient-based methods are not applicable.
LLM Fine-Tuning: Adapting massive pre-trained LLMs on limited hardware where full backpropagation is infeasible, using only loss evaluations (2501.06736, 2506.05454).

In Control and Robotics, ZO is used for:

Model Predictive Control (MPC): Optimizing control inputs based on system dynamics and constraints. GP-based MPC incorporates uncertainty, and zero-order methods can approximate Jacobian terms related to uncertainty propagation to reduce computational cost, enabling real-time application (2211.15522, 2306.17445, 2311.04557).
Multi-Agent Systems: Distributed optimization and control in networked systems where agents have limited information and communication (2011.09728).
Adaptive Control: Tuning controller parameters based on system performance metrics that may not be differentiable with respect to parameters.

In Engineering and General Black-Box Problems:

Simulation-based Optimization: Optimizing parameters for systems modeled by complex simulations where derivatives are unavailable.
Resource Allocation: Solving optimization problems with combinatorial or black-box constraints in areas like sensor networks (2112.11420).
Power Systems Operations: Optimizing parameters like generation costs in power grids subject to complex, often black-box, operational constraints (2304.01797).

Recent research also highlights an intriguing form of Implicit Regularization in zero-order optimization. ZO methods, particularly those using the two-point random direction estimator, implicitly minimize a smoothed version of the objective function which corresponds to the original function plus a term proportional to the trace of the Hessian. This suggests that zero-order optimization may inherently favor "flat minima" (minimizers with small Hessian trace) over "sharp minima," a property empirically linked to better generalization in machine learning (2506.05454). This finding provides a new perspective on the effectiveness of ZO methods in practical settings like LLM fine-tuning, beyond simply being a workaround for inaccessible gradients.
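The Hessian-trace term can be checked numerically in a case where it is exact. For Gaussian directions $u \sim \mathcal{N}(0, I)$ and a quadratic $f(x) = \tfrac{1}{2} \sum_i a_i x_i^2$, the smoothed objective is $f_\mu(x) = f(x) + \tfrac{\mu^2}{2} \sum_i a_i$, where $\sum_i a_i$ is the trace of the Hessian. The sketch below compares a Monte Carlo average with this prediction; the Gaussian assumption, the curvature values, and the function names are all illustrative choices.

```python
import random

# Diagonal Hessian entries of the toy quadratic (hypothetical values).
CURVATURES = (1.0, 4.0, 10.0)


def quadratic(x):
    """f(x) = 0.5 * sum_i a_i * x_i^2, whose Hessian trace is sum(CURVATURES)."""
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURVATURES, x))


def smoothed_value(f, x, mu, num_samples, rng):
    """Monte Carlo estimate of E[f(x + mu * u)] for Gaussian u."""
    total = 0.0
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        total += f([xi + mu * ui for xi, ui in zip(x, u)])
    return total / num_samples


def predicted_smoothed_value(x, mu):
    """f(x) plus (mu^2 / 2) times the Hessian trace; exact for this quadratic."""
    return quadratic(x) + 0.5 * mu ** 2 * sum(CURVATURES)
```

Because the added term grows with the trace of the Hessian, two points with equal $f$ but different curvature receive different smoothed values, which is the mechanism behind the preference for flat minima described above.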

\documentclass{article}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{geometry}
\geometry{a4paper, margin=1in}
\usepackage{hyperref}

\begin{document}

Zero-order optimization, also known as derivative-free optimization, addresses the problem of minimizing an objective function $f(x)$ when access to its gradient $\nabla f(x)$ is unavailable or computationally prohibitive. Instead, algorithms rely solely on querying function values $f(x)$ at selected points. This scenario arises frequently in real-world applications where the function is a black box, a complex simulation, or part of a system where direct differentiation is not possible. The core challenge in zero-order optimization lies in efficiently estimating gradient information from function values to guide the search process, often in high-dimensional spaces and in the presence of noise.

\section{Gradient Estimation and Core Algorithms}

The foundational element of zero-order optimization algorithms is the estimation of gradient information using function evaluations. A common approach involves finite-difference approximations along random directions.

The standard two-point symmetric difference estimator for a function $f(x)$ at point $x$ with smoothing parameter $\mu > 0$ and random direction $u$ (e.g., sampled from a spherical or Gaussian distribution) is given by:
\[
\hat{g}(x; \mu, u) = \frac{f(x + \mu u) - f(x - \mu u)}{2\mu} u
\]
This estimator provides a nearly unbiased estimate of the gradient of a smoothed version of the function, $f_\mu(x) = \mathbb{E}_{u \sim \mathcal{D}}[f(x + \mu u)]$, where $\mathcal{D}$ is the distribution of $u$. The variance of this estimator typically scales with the dimension $d$.

Alternatively, coordinate-wise finite differences estimate partial derivatives along standard basis vectors $e_j$:
\[
\hat{g}_j(x; \mu, e_j) = \frac{f(x + \mu e_j) - f(x)}{\mu} e_j
\]
A full coordinate-wise gradient estimate requires $d$ such queries in addition to the shared evaluation of $f(x)$ (or $2d$ queries for symmetric differences), which can be computationally expensive in high dimensions.

Hybrid estimators combine the benefits of random-direction and coordinate-wise methods. The "Zeroth-Order Hybrid Gradient Descent (ZO-HGD)" method \cite{2012.11518} proposes a convex combination of a random gradient estimator (RGE) and a coordinate-wise gradient estimator (CGE) with importance sampling. The hybrid gradient estimator (HGE) is $\hat{\nabla}_{\mathrm{HGE}} F(\mathbf{x}) = \alpha \ \hat{\nabla}_{\mathrm{RGE}} F(\mathbf{x}) + (1-\alpha) \ \hat{\nabla}_{\mathrm{CGE}} F_{\mathcal{I}(\mathbf{x}; \mathbf{p})}$, where $\alpha \in [0, 1]$ balances the two, and $\mathcal{I}$ is a subset of coordinates sampled with probabilities $\mathbf{p}$ based on importance. This allows tuning the trade-off between query efficiency (RGE) and variance reduction (CGE).

For non-smooth or highly smooth functions, kernel-based gradient approximations can be employed. These methods use a smoothing kernel function $K(r)$ along with random directions to construct estimators that exploit higher-order smoothness properties, achieving better bias-variance trade-offs as the smoothness order $\beta$ increases \cite{2006.07862, 2305.15828, 2310.02371}. For instance, an estimator might take the form $\frac{d}{2\gamma} \Big[ f(x + \gamma r \mathbf{e}) - f(x - \gamma r \mathbf{e}) \Big] K(r) \mathbf{e}$, where $\gamma > 0$ is the smoothing parameter, $\mathbf{e}$ is a random unit direction, and $r$ is a random scalar (for example, uniform on $[-1, 1]$).

Zero-order algorithms typically follow an iterative scheme akin to gradient descent, replacing the true gradient with its zero-order estimate. For instance, a basic update could be $x_{k+1} = x_k - \eta_k \hat{g}_k$, where $\eta_k$ is the step size. More advanced methods adapt techniques from first-order optimization, such as accelerated gradient descent or mirror descent, by incorporating zero-order gradient estimates \cite{1312.2139, 2310.02371}.

\section{Convergence Theory and Dimensionality Dependence}

A central focus in zero-order optimization research is establishing convergence rates and understanding their dependence on the problem dimension $d$. For convex optimization, the cost of not having gradients translates into a performance penalty.

Early analysis showed that single-point query methods for convex optimization can suffer convergence rates quadratic in $d$. However, the availability of paired function values significantly improves the situation. For convex, smooth, stochastic optimization, two-point methods using random direction gradient estimates achieve a convergence rate for the expected optimality gap of $O\left(\frac{RL\sqrt{d}}{\sqrt{T}}\right)$ after $T$ iterations \cite{1312.2139}. Here, $R$ is the radius of the feasible set and $L$ bounds the norm of the (stochastic) gradients, that is, it acts as a Lipschitz constant of $f$ rather than of $\nabla f$. This rate is only a factor of $\sqrt{d}$ worse than the $O\left(\frac{RL}{\sqrt{T}}\right)$ rate achievable by stochastic gradient methods when gradients are available. This result was shown to be minimax optimal, matching information-theoretic lower bounds up to logarithmic factors, establishing $\sqrt{d}$ as an unavoidable penalty for black-box convex optimization using two function evaluations per iteration \cite{1312.2139}. For non-smooth convex functions, a similar $\sqrt{d \log d}$ penalty is incurred.

For unconstrained online convex optimization, a one-point Gaussian smoothing method achieves an $O(n^{2/3}T^{2/3})$ regret bound, while a two-point adaptation matches the optimal $O(\sqrt{nT})$ regret, confirming the power of paired evaluations in online settings \cite{1806.05069}.

In non-convex optimization, finding approximate stationary points is a standard goal. Recent work has addressed the dimension dependence for nonsmooth non-convex stochastic optimization. Prior state-of-the-art algorithms for finding $(\delta, \epsilon)$-stationary points suffered from a dimension dependence of $\Omega(d^{3/2})$. A more recent algorithm shows that this dependence is not optimal, achieving a complexity of $O(d\delta^{-1}\epsilon^{-3})$, which is optimal with respect to $d$ and also optimal with respect to the accuracy parameters $\delta, \epsilon$. This is achieved by leveraging a connection to the Goldstein subdifferential and utilizing first-order methods whose complexity depends on the Lipschitz constant rather than the smoothness constant, effectively showing that nonsmooth optimization is as easy as smooth optimization in this setting \cite{2307.04504}.

For smooth non-convex problems, the standard $O(d/\epsilon^2)$ complexity for finding $\epsilon$-stationary points using random gradient estimates is well-established.

\section{Handling Constraints: Safe and Projected Methods}

Constrained zero-order optimization presents unique challenges, particularly when the feasible set or constraint functions are unknown or black-box. Methods for handling constraints in the presence of only function value access include projections onto known sets and techniques for approximating constraints.

For constrained problems where the feasible set is a known convex set, projected gradient methods can be adapted by projecting the iterates after taking a step in the direction of the zero-order gradient estimate.

For problems with black-box functional constraints $f_i(x) \leq 0$, safe zero-order optimization methods aim to guarantee that all iterates and sampled points remain feasible. One approach constructs local quadratic approximations of the constraint functions around a strictly feasible point and optimizes over the intersection of the corresponding local feasible sets \cite{2303.16659}. At each iteration, a quadratically constrained quadratic program (QCQP) subproblem is solved. This method guarantees feasibility of all iterates and convergence to an approximate KKT pair under mild assumptions, requiring $O(d^2/\eta^2)$ samples to find an $\eta$-KKT pair.

An alternative safe method for black-box constraints employs linear programming (LP) subproblems \cite{2304.01797}. At each step, gradients of the objective and near-active constraints are estimated using finite differences. An LP is solved to find a descent direction that also moves away from constraint boundaries. A safe step size is determined using local quadratic approximations. This SZO-LP method guarantees feasibility and convergence to a KKT point, demonstrating computational and sample efficiency advantages over previous safe zero-order methods, especially for problems with many constraints, as shown on an Optimal Power Flow problem.

For composite optimization problems with a smooth black-box part and a known non-smooth part, augmented Lagrangian methods (ALM) can be adapted. A "Zeroth-Order Inexact Augmented Lagrangian Method (ZO-iALM)" has been proposed to solve problems with composite objectives and general functional constraints (possibly nonconvex) \cite{2112.11420}. This method uses nested loops, including a zero-order inexact proximal point method and a zero-order accelerated proximal coordinate update method as subsolvers, employing multi-point coordinate gradient estimators for accuracy. This approach achieves query complexity results matching the best-known first-order rates up to a factor of $d$, being the first ZO method to do so for this class of general constrained problems.

\section{Robustness to Noise and Uncertainty}

Zero-order methods often operate in environments where function evaluations are inherently noisy. Robustness to noise is a critical aspect of their design and analysis. Noise sources can be stochastic (e.g., measurement error, subsampling in machine learning) or adversarial (bounded but arbitrary).

The two-point gradient estimator based on random directions is naturally robust to unbiased stochastic noise, as the expectation removes the noise term, leaving only the function value difference \cite{1312.2139}. However, noise can inflate the variance of the estimator, impacting convergence rates and achievable accuracy.

More general noise models, including adversarial noise, can be handled. For distributed zero-order optimization of strongly convex functions, algorithms have been developed that tolerate a general noise model (not requiring zero mean or independence) by carefully designing the gradient estimator using kernel functions \cite{2102.01121}. The analysis provides convergence bounds highlighting the interplay between noise level, dimension, strong convexity, smoothness, and network connectivity.

In the context of model predictive control (MPC) for systems with uncertainty, zero-order approaches can enhance robustness. The "zero-order robust optimization (zoRO)" scheme addresses uncertainty in MPC by propagating uncertainty (e.g., disturbance ellipsoids) outside the main optimization loop, using the propagated uncertainty to tighten constraints on the nominal trajectory. This method effectively approximates the Jacobian of uncertainty propagation terms in the optimization, decoupling them from the nominal state/control optimization and significantly reducing computational complexity while maintaining feasibility guarantees \cite{2306.17445, 2311.04557}. This tailored Jacobian approximation renders the optimization over uncertainty zero-order with respect to nominal trajectory variables. It allows real-time implementation on resource-constrained hardware and integration with high-performance MPC frameworks like acados.

The convergence analysis of zero-order methods often explicitly quantifies the impact of noise variance on the achievable accuracy or error floor, demonstrating how factors like batch size or smoothing parameters can be tuned to mitigate noise effects \cite{2305.15828, 2310.02371}.

\section{Scalability and Advanced Techniques}

Scaling zero-order optimization to high-dimensional problems is crucial for many applications. Beyond basic gradient estimation, several techniques address the challenges of high dimensionality and specific problem structures.

\textbf{Distributed Zero-Order Optimization:} For problems where the objective is a sum of functions distributed across a network of agents who can only communicate with neighbors and access local function values, distributed zero-order methods are necessary. Algorithms like "Zeroth-Order Feedback Optimization (ZFO)" enable model-free distributed optimization by having agents estimate their local partial gradients using exchanged local cost differences \cite{2011.09728}. The ZFO algorithm provides complexity bounds for convex and nonconvex settings under noisy observations and shows improved efficiency when agents know the local function dependence structure. Distributed zero-order methods under adversarial noise have also been developed, providing convergence guarantees that depend on network connectivity \cite{2102.01121}.

\textbf{Huge-Scale and Sparse Problems:} When the dimension $d$ is extremely large (e.g., $d > 10^6$) and standard vector operations are infeasible, specialized "huge-scale" zero-order methods are needed. If the gradient is known or assumed to be sparse, compressed sensing techniques can be integrated into zero-order optimization. The "Zeroth-Order Block Coordinate Descent (ZO-BCD)" algorithm partitions variables into blocks and estimates sparse block gradients using function queries and sparse recovery techniques (like CoSaMP) \cite{2102.10707}. Using randomized block assignments and special measurement matrices like partial circulant matrices further reduces per-iteration computational cost and memory footprint, enabling optimization in dimensions exceeding 1.7 million, as demonstrated for adversarial attacks in the wavelet domain.

\textbf{Exploiting Structure:} Beyond sparsity, other structural properties can be leveraged. Higher-order smoothness (i.e., differentiability beyond standard Lipschitz continuous gradients) allows for faster convergence rates in first-order methods. Zero-order methods using kernel-based estimators can exploit such smoothness to improve query complexity and noise robustness \cite{2006.07862, 2305.15828, 2310.02371}. For strongly convex problems with higher-order smoothness, kernel approximation methods achieve near-minimax optimal rates matching those of gradient-based methods up to a factor of $d$. For problems satisfying the Polyak-Łojasiewicz (PL) condition, which implies linear convergence for first-order methods, kernel-based zero-order methods also achieve improved error floors and noise tolerance by effectively exploiting higher smoothness \cite{2305.15828, 2310.02371}.

\textbf{Quantized Optimization:} In resource-constrained environments with limited memory and computational capacity, optimizing directly on quantized parameters is beneficial. ``Zero-Order Quantized Optimization (ZOQO)'' proposes using zero-order gradient sign approximations and quantized updates to train models with quantized parameters without needing full-precision gradients \cite{2501.06736}. This approach uses quantized arithmetic throughout the process, enabling memory-efficient adaptation of large models on low-resource hardware, for tasks such as fine-tuning LLMs or generating black-box adversarial attacks.
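
A minimal sketch of how a sign-based update can keep the iterates quantized (our illustration; the exact ZOQO update rule may differ): let the parameters live on a grid with step $\Delta > 0$, that is $x_k \in \Delta \mathbb{Z}^d$, and let $\hat{g}_k$ be any zero-order gradient estimate at $x_k$. The grid step $\Delta$ is a symbol introduced here for illustration. The update
\[
x_{k+1} = x_k - \Delta\, \mathrm{sign}(\hat{g}_k),
\]
with the sign taken coordinate-wise, moves each coordinate by at most one grid step, so $x_{k+1} \in \Delta \mathbb{Z}^d$ without any rounding. Because only the sign of $\hat{g}_k$ is used, its magnitude never has to be stored at full precision. If the probe points are also chosen on the grid, for example $x_k \pm \Delta u$ with $u \in \{-1, +1\}^d$, then every function query is evaluated at quantized parameters as well.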

\section{Practical Applications}

Zero-order optimization techniques are applicable across a wide spectrum of fields, particularly where gradient information is inaccessible.

In \textbf{Machine Learning}, ZO methods are essential for:
\begin{itemize}
    \item \textbf{Black-box adversarial attacks:} Generating small input perturbations that cause misclassification of models where gradients are not exposed \cite{2012.11518, 2102.10707}.
    \item \textbf{Neural Architecture Search (NAS):} Optimizing architecture parameters based on validation performance, which may involve non-differentiable metrics or models. The ZARTS framework leverages ZO for robust and stable architecture search, avoiding pitfalls of gradient approximations in differentiable NAS methods \cite{2110.04743}.
    \item \textbf{Hyperparameter Tuning:} Optimizing performance metrics of complex models over discrete or continuous hyperparameter spaces where gradient-based methods are not applicable.
    \item \textbf{LLM Fine-Tuning:} Adapting massive pre-trained LLMs on limited hardware where full backpropagation is infeasible, using only loss evaluations \cite{2501.06736, 2506.05454}.
\end{itemize}

In \textbf{Control and Robotics}, ZO is used for:
\begin{itemize}
    \item \textbf{Model Predictive Control (MPC):} Optimizing control inputs based on system dynamics and constraints. GP-based MPC incorporates uncertainty, and zero-order methods can approximate Jacobian terms related to uncertainty propagation to reduce computational cost, enabling real-time application \cite{2211.15522, 2306.17445, 2311.04557}.
    \item \textbf{Multi-Agent Systems:} Distributed optimization and control in networked systems where agents have limited information and communication \cite{2011.09728}.
    \item \textbf{Adaptive Control:} Tuning controller parameters based on system performance metrics that may not be differentiable with respect to parameters.
\end{itemize}

In \textbf{Engineering and General Black-Box Problems}:
\begin{itemize}
    \item \textbf{Simulation-based Optimization:} Optimizing parameters for systems modeled by complex simulations where derivatives are unavailable.
    \item \textbf{Resource Allocation:} Solving optimization problems with combinatorial or black-box constraints in areas like sensor networks \cite{2112.11420}.
    \item \textbf{Power Systems Operations:} Optimizing parameters like generation costs in power grids subject to complex, often black-box, operational constraints \cite{2304.01797}.
\end{itemize}

Recent research also highlights an intriguing form of \textbf{Implicit Regularization} in zero-order optimization. ZO methods, particularly those using the two-point random direction estimator, implicitly minimize a smoothed version of the objective function which, to leading order in the smoothing parameter $\mu$, corresponds to the original function plus a term proportional to the trace of the Hessian. This suggests that zero-order optimization may inherently favor ``flat minima'' (minimizers with small Hessian trace) over ``sharp minima,'' a property empirically linked to better generalization in machine learning \cite{2506.05454}. This finding provides a new perspective on the effectiveness of ZO methods in practical settings like LLM fine-tuning, beyond simply being a workaround for inaccessible gradients.
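
The Hessian-trace term can be seen from a second-order Taylor expansion. As a sketch, assume that $f$ is sufficiently smooth with suitably bounded derivatives and that $u \sim \mathcal{N}(0, I_d)$, so that $\mathbb{E}[u] = 0$ and $\mathbb{E}[u u^\top] = I_d$. Then
\[
f_\mu(x) = \mathbb{E}_u[f(x + \mu u)]
= f(x) + \mu\, \nabla f(x)^\top \mathbb{E}[u] + \frac{\mu^2}{2}\, \mathbb{E}\bigl[ u^\top \nabla^2 f(x)\, u \bigr] + o(\mu^2)
= f(x) + \frac{\mu^2}{2}\, \mathrm{tr}\bigl( \nabla^2 f(x) \bigr) + o(\mu^2).
\]
Minimizing $f_\mu$ therefore penalizes a large Hessian trace with weight $\mu^2 / 2$. For $u$ drawn uniformly from the unit sphere, $\mathbb{E}[u u^\top] = I_d / d$ and the weight becomes $\mu^2 / (2d)$.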

\bibliographystyle{plainnat}
\bibliography{references}

\end{document}