Derivative-Free Optimization Algorithms

Updated 2 January 2026
  • Derivative-free optimization algorithms are methods that solve optimization problems without gradient or Hessian information by employing surrogate models, pattern searches, and metaheuristics.
  • These techniques are categorized into model-based, finite-difference, direct-search, population-based, and Bayesian methods, each tailored for specific challenges like non-smooth or expensive evaluations.
  • Recent advances include enhanced trust-region surrogates, adaptive finite-difference estimators, and learning-based approaches that improve convergence rates and scalability in complex settings.

Derivative-free optimization (DFO) algorithms are a broad class of numerical schemes that solve optimization problems where derivative information of the objective is unavailable, unreliable, or impractical to obtain. DFO is central to applications involving black-box, expensive, or non-smooth objective functions, with widespread relevance in simulation-based engineering, machine learning, chemical process modeling, quantum chemistry, and industrial design.

1. Algorithmic Foundations and Taxonomy

DFO methods can be categorized by the mechanism used to explore and exploit the search space without explicit gradient or Hessian computation. Principal categories include:

  • Model-based and interpolation strategies: Build local surrogate models (e.g., polynomials, radial basis functions) by interpolating sampled points, and use these models to direct search (e.g., trust-region and interpolation-based optimization) (Roberts, 6 Oct 2025).
  • Finite-difference and direct search: Employ systematic perturbations or pattern searches to estimate gradients or directions, including Kiefer-Wolfowitz (KW), simultaneous perturbation stochastic approximation (SPSA), coordinate pattern search, and simplex algorithms (Nelder-Mead) (Du-Yi et al., 2024, Bagci, 30 Dec 2025).
  • Population-based metaheuristics: Maintain and evolve a population of candidate solutions via stochastic recombination, mutation, selection; includes genetic algorithms, differential evolution (DE), particle swarm optimization (PSO), evolution strategies (ES), and CMA-ES (Zhang, 2019, Auger et al., 2010).
  • Estimation of distribution algorithms (EDAs): Replace evolution operators with probabilistic modeling and sampling, such as EDAs based on Gaussian or heavy-tailed Student’s t distributions (Liu et al., 2016).
  • Bayesian and Lipschitzian global optimization: Place probabilistic priors (Gaussian processes) or exploit Lipschitz continuity for region elimination (DIRECT, LIPO, MCS, Shubert-Piyavskii) (Zhang, 2019, Phan et al., 2022).
  • Classification-based and active learning: Use classifiers to iteratively identify promising sublevel sets, updating the sampling distribution accordingly (Hashimoto et al., 2018, Han et al., 2023).
  • Dimension-reduction and subspace methods: Identify or construct low-dimensional subspaces for search, leveraging approximate gradients or output-based dimension reduction (Zhang, 8 Jan 2025, Gross et al., 2020).
  • Noncommutative map and Lie-bracket schemes: Discrete-time analogues of Lie-bracket extremum seeking, approximating gradients via cycles of deterministic functional perturbations (Feiling et al., 2018).

These core strategies are often hybridized or modified for constraints, noise, multi-fidelity settings, or parallel computing contexts.

2. Model-Based and Interpolation Trust-Region Algorithms

Model-based DFO, particularly interpolation-based trust-region (TR) methods, has emerged as a rigorously analyzable and high-performing class for local optimization. TR DFO proceeds via:

  • Construction of a local surrogate model (most commonly quadratic) that interpolates or approximates the objective at a set of sample points within a trust region.
  • Solution of a trust-region subproblem to generate a candidate step by minimizing the surrogate within a prescribed trust region.
  • Acceptance/rejection of the candidate based on the fidelity of the predicted versus actual reduction, with dynamic adjustment of the trust-region radius (Roberts, 6 Oct 2025, Auger et al., 2010).
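The three steps above can be sketched in a few dozen lines. The following is a minimal illustration, not any published solver: it fits a linear surrogate by least squares over random samples (real methods maintain well-poised quadratic interpolation sets), takes a Cauchy-like step, and applies the standard accept/reject and radius-update logic.

```python
import numpy as np

def tr_dfo(f, x0, delta=1.0, max_iter=100, eta=0.1, seed=0):
    """Sketch of a model-based trust-region DFO loop.
    A linear surrogate is fit by least squares to sampled points;
    production solvers use well-poised quadratic interpolation instead."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float)
    n = len(x)
    fx = f(x)
    for _ in range(max_iter):
        # Sample 2n points in the trust region and fit a linear model.
        S = delta * rng.uniform(-1, 1, size=(2 * n, n))
        F = np.array([f(x + s) for s in S])
        g, *_ = np.linalg.lstsq(S, F - fx, rcond=None)  # surrogate gradient
        if np.linalg.norm(g) < 1e-10:
            break
        step = -delta * g / np.linalg.norm(g)           # Cauchy-like step
        pred = -g @ step                                # predicted decrease
        fnew = f(x + step)
        rho = (fx - fnew) / pred if pred > 0 else -1.0
        if rho > eta:                                   # accept and expand
            x, fx = x + step, fnew
            delta = min(2 * delta, 10.0)
        else:                                           # reject and shrink
            delta *= 0.5
        if delta < 1e-8:
            break
    return x, fx
```

The ratio `rho` of actual to predicted reduction is exactly the fidelity test described above: near 1 it signals a trustworthy model (expand the region), near or below 0 an unreliable one (shrink).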

Polynomial models may be determined by underdetermined interpolation, regularized by minimal norm conditions (e.g., least Frobenius norm, Sobolev $H^2$ norm). The $H^2$-norm updating approach, for example, minimizes the average error in function value, gradient, and Hessian over the entire trust region, yielding provably superior model fidelity, particularly for expensive or highly nonlinear objectives. Algorithmic implementation involves maintenance of well-poised interpolation sets and efficient KKT-system updates, with complexity scaling polynomially in the number of points and variables (Xie et al., 2023).

Model-based methods also admit extensions for bound-constrained and composite objectives. Approaches exist that reuse surrogate models constructed on previous related problem instances, yielding significant savings in function evaluations when sequences of related complex problems are solved (Curtis et al., 2022).

3. Finite Difference, Direct Search, and Pattern-based Algorithms

Finite-difference (FD) approaches, including KW and SPSA, generate gradient surrogates using systematic perturbations, requiring only function values. Classical KW uses coordinate-wise perturbations, while SPSA perturbs all coordinates simultaneously, leading to only two function evaluations per gradient estimate but increased estimator variance (Du-Yi et al., 2024).
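The SPSA estimator can be written in a few lines. This sketch (names are illustrative) uses a Rademacher perturbation vector, so the per-coordinate division by the perturbation is well defined:

```python
import numpy as np

def spsa_gradient(f, x, c=1e-3, rng=None):
    """SPSA gradient estimate: two function evaluations regardless of
    dimension. All coordinates are perturbed simultaneously by a random
    +/-1 (Rademacher) vector."""
    rng = rng or np.random.default_rng()
    delta = rng.choice([-1.0, 1.0], size=len(x))
    df = f(x + c * delta) - f(x - c * delta)
    return df / (2 * c * delta)   # element-wise: delta_i in {-1, +1}
```

A single estimate is unbiased (to second order) but noisy; in stochastic approximation it is averaged implicitly across iterations rather than per step.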

Experimental results demonstrate that batch-based FD estimators (using multiple randomly chosen perturbation directions and batching for variance reduction) can substantially outperform traditional KW/SPSA, especially when moderate numbers of parallel evaluations are affordable (Liang et al., 28 Feb 2025). Adaptive control of perturbation size and batch size ensures low estimator bias/variance and robust convergence. In practical settings, batch-based FD methods with line search strategies deliver better efficiency and solution accuracy than minimal-sample KW/SPSA, at the cost of larger per-iteration sample complexity.
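A batch-based FD estimator of the kind described can be sketched as follows (a generic random-direction scheme, not the cited authors' exact algorithm): central differences along several random unit directions are averaged, and the factor of the dimension restores unbiasedness because the outer product of a uniform unit direction averages to $I/n$.

```python
import numpy as np

def batch_fd_gradient(f, x, batch=32, h=1e-4, rng=None):
    """Batch finite-difference gradient estimate: average central
    differences along `batch` random unit directions. Estimator variance
    shrinks roughly like 1/batch, and the directions can be evaluated in
    parallel."""
    rng = rng or np.random.default_rng()
    n = len(x)
    U = rng.standard_normal((batch, n))
    U /= np.linalg.norm(U, axis=1, keepdims=True)       # unit directions
    d = np.array([(f(x + h * u) - f(x - h * u)) / (2 * h) for u in U])
    # E[u u^T] = I/n for uniform unit u, hence the factor n.
    return n * (d[:, None] * U).mean(axis=0)
```

This makes the trade-off in the text concrete: `2 * batch` evaluations per estimate instead of SPSA's two, in exchange for much lower variance.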

Direct search and pattern-based methods (e.g., coordinate pattern search, Nelder-Mead simplex) explore the parameter space using deterministic or stochastic geometrical patterns and are especially robust on non-smooth or noisy black-box problems (Bagci, 30 Dec 2025). They guarantee convergence to Clarke-stationary points under mild assumptions, but may suffer from poor scaling in high dimensions.
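A coordinate pattern search illustrates the poll-and-shrink structure of these methods (a textbook sketch, not a specific cited implementation): poll each coordinate direction, accept the first improving point, and halve the step when no poll succeeds.

```python
import numpy as np

def pattern_search(f, x0, step=1.0, tol=1e-6, max_iter=10_000):
    """Coordinate pattern search: poll +/- each coordinate direction,
    accept the first improving point (opportunistic polling), and halve
    the step size when a full poll cycle fails."""
    x = np.asarray(x0, float)
    fx = f(x)
    it = 0
    while step > tol and it < max_iter:
        improved = False
        for i in range(len(x)):
            for s in (step, -step):
                cand = x.copy()
                cand[i] += s
                fc = f(cand)
                if fc < fx:
                    x, fx, improved = cand, fc, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5
        it += 1
    return x, fx
```

The 2n poll points per cycle are the source of the poor dimensional scaling noted above, while the absence of any derivative estimate is what makes the method robust to non-smoothness and noise.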

4. Population-Based and Estimation-of-Distribution Algorithms

Population-based algorithms operate by evolving a diverse set of candidate solutions via selection, recombination, mutation, and replacement. Classic representatives include genetic algorithms, DE, PSO, ES, CMA-ES, SCE, and their many variants (Zhang, 2019, Auger et al., 2010). For continuous, moderately high-dimensional problems (up to several hundred variables), DE and CMA-ES are particularly effective; CMA-ES demonstrates strong robustness to ill-conditioning, non-separability, and moderate noise, with favorable scaling up to n1000n \approx 1000 (Auger et al., 2010).
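The classic DE/rand/1/bin variant makes the select-recombine-mutate cycle concrete; this is a minimal textbook sketch with conventional parameter values, not a tuned implementation:

```python
import numpy as np

def differential_evolution(f, bounds, pop=20, F=0.8, CR=0.9, gens=100, seed=0):
    """Minimal DE/rand/1/bin: mutate with a scaled difference of two
    population members, apply binomial crossover, keep the trial point
    only if it is no worse (greedy selection)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds, float).T
    n = len(lo)
    X = rng.uniform(lo, hi, size=(pop, n))
    fit = np.array([f(x) for x in X])
    for _ in range(gens):
        for i in range(pop):
            a, b, c = rng.choice([j for j in range(pop) if j != i], 3, replace=False)
            mutant = np.clip(X[a] + F * (X[b] - X[c]), lo, hi)
            cross = rng.random(n) < CR
            cross[rng.integers(n)] = True        # keep at least one mutant gene
            trial = np.where(cross, mutant, X[i])
            ft = f(trial)
            if ft <= fit[i]:                     # greedy selection
                X[i], fit[i] = trial, ft
    best = int(fit.argmin())
    return X[best], fit[best]
```

The difference vector `X[b] - X[c]` is what adapts the step distribution to the landscape's scaling, a property DE shares in spirit with CMA-ES's covariance adaptation.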

Advanced EDAs replace explicit variation operators with probabilistic modeling of promising samples. Gaussian-based EDAs are traditional, but replacing Gaussian distributions with Student’s t (or mixtures) enables heavier-tailed proposals, improving exploration and reducing premature convergence in highly multimodal or deceptive landscapes. Mixtures of Student’s t distributions refined with EM and component pruning (EMSTDA) further improve robustness across a wide range of test problems, consistently outperforming Gaussian-based EDA counterparts (Liu et al., 2016).
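The fit-then-sample loop of an EDA with a Student's-t search distribution can be sketched as below. This is a simplified diagonal-scale illustration (the cited EMSTDA fits full mixtures refined by EM); the multivariate t draw uses the standard construction of a Gaussian scaled by an inverse-chi factor, and letting the degrees of freedom grow recovers the Gaussian EDA.

```python
import numpy as np

def eda_student_t(f, x0, sigma=1.0, df=5.0, pop=50, elite=10, gens=60, seed=0):
    """EDA sketch with a heavy-tailed Student's-t proposal: fit mean and
    per-coordinate scale to the elite (lowest-objective) samples, then
    resample with t-distributed perturbations. The initial sigma should be
    large enough to cover the optimum's basin; too small a scale risks
    premature convergence, which the heavy tails help mitigate."""
    rng = np.random.default_rng(seed)
    mean = np.asarray(x0, float)
    scale = np.full(len(mean), sigma)
    for _ in range(gens):
        # Multivariate t sample: Gaussian times sqrt(df / chi^2_df).
        z = rng.standard_normal((pop, len(mean)))
        w = np.sqrt(df / rng.chisquare(df, size=(pop, 1)))
        X = mean + scale * z * w
        F = np.array([f(x) for x in X])
        E = X[np.argsort(F)[:elite]]             # elite set of promising samples
        mean, scale = E.mean(axis=0), E.std(axis=0) + 1e-12
    return mean, f(mean)
```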

5. Global Optimization via Bayesian and Lipschitzian Strategies

Bayesian optimization (BO) applies GP surrogates with acquisition functions (UCB, Expected Improvement) to systematically choose evaluation points that trade off exploration against exploitation. BO is most sample-efficient when evaluation budgets are small and objective evaluations are costly. The computational bottleneck is kernel matrix inversion and acquisition maximization, limiting canonical BO to problems of dimension $\lesssim 20$ (Zhang, 2019).
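The GP-posterior-plus-acquisition loop can be made concrete in one dimension. This is a toy sketch under simplifying assumptions (zero-mean GP, unit prior variance, fixed RBF length scale, acquisition maximized by grid search), not a production BO implementation:

```python
import math
import numpy as np

def rbf(a, b, ls=0.3):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def expected_improvement(mu, sd, best):
    """EI for minimization under a Gaussian posterior."""
    z = (best - mu) / sd
    Phi = 0.5 * (1 + np.array([math.erf(v / math.sqrt(2)) for v in z]))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (best - mu) * Phi + sd * phi

def bayes_opt_1d(f, lo, hi, n_init=4, iters=15, jitter=1e-6, seed=0):
    """Toy 1-D Bayesian optimization with a zero-mean GP surrogate."""
    rng = np.random.default_rng(seed)
    X = list(rng.uniform(lo, hi, n_init))
    Y = [f(x) for x in X]
    grid = np.linspace(lo, hi, 400)
    for _ in range(iters):
        Xa, Ya = np.array(X), np.array(Y)
        K = rbf(Xa, Xa) + jitter * np.eye(len(Xa))   # O(m^3) solve: the bottleneck
        Ks = rbf(grid, Xa)
        mu = Ks @ np.linalg.solve(K, Ya)
        var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
        sd = np.sqrt(np.clip(var, 1e-12, None))
        x_next = grid[int(np.argmax(expected_improvement(mu, sd, min(Y))))]
        X.append(x_next)
        Y.append(f(x_next))
    i = int(np.argmin(Y))
    return X[i], Y[i]
```

The cubic-in-samples linear solve inside the loop is exactly the kernel-matrix bottleneck mentioned above, and replacing the 1-D grid by a global maximization of EI is what becomes hard in higher dimensions.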

Lipschitzian methods (e.g., DIRECT, LIPO, MCS) operate by partitioning the domain and employing analytically computable bounds derived from the Lipschitz constant to eliminate regions that cannot contain the global optimum. DIRECT generalizes to moderate dimensions ($d \lesssim 10$) without explicit knowledge of the Lipschitz constant, using a potential-optimality criterion. Extensions for non-smooth, stepwise objectives (StepDIRECT) incorporate local variability and stochastic local search, outperforming classic DIRECT and metaheuristics in random-forest minimization and hyperparameter tuning tasks (Phan et al., 2022).

6. Learning-Based, Classification, and Subspace Approaches

Classification-based DFO algorithms (e.g., SRACOS, RACE-CARS) repeatedly classify sampled points into positive (low-objective) and negative sets, training classifiers to carve away high-objective regions of the search space and concentrate subsequent sampling. Rigorous query complexity results relate the convergence rate to the "hypothesis–target shattering rate," and region-shrinking schemes (RACE-CARS) provably accelerate the approach, yielding strong empirical performance in high-dimensional nonconvex optimization and black-box LLM tuning (Han et al., 2023).
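The sample-classify-shrink loop can be caricatured with the simplest possible "classifier," an axis-aligned box around the positive set. This is a toy illustration of the mechanism only; SRACOS and RACE-CARS learn far richer hypotheses:

```python
import numpy as np

def classify_and_shrink(f, lo, hi, pop=100, frac=0.2, gens=25, seed=0):
    """Toy classification-style DFO: label the lowest-objective fraction
    of each batch positive and shrink the search box to their (slightly
    padded) bounding box before resampling."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(lo, float), np.asarray(hi, float)
    best_x, best_f = None, np.inf
    for _ in range(gens):
        X = rng.uniform(lo, hi, size=(pop, len(lo)))
        F = np.array([f(x) for x in X])
        k = max(2, int(frac * pop))
        pos = X[np.argsort(F)[:k]]           # "positive" low-objective set
        if F.min() < best_f:
            best_f, best_x = float(F.min()), X[int(F.argmin())]
        pad = 0.1 * (pos.max(0) - pos.min(0) + 1e-12)
        lo, hi = pos.min(0) - pad, pos.max(0) + pad
    return best_x, best_f
```

Replacing the bounding box with a learned classifier (random forest, linear separator) and sampling from its positive region recovers the structure of the published methods.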

Relatedly, repeated classification with multiplicative weights and sublevel-set learning directly exploits active learning theory, achieving linear convergence rates under VC-dimension and disagreement coefficient assumptions. Practical implementations with random forests and parametric bootstrapping match or exceed evolutionary and Bayesian techniques on a range of synthetic and real-world tasks (Hashimoto et al., 2018).

Low-dimensional subspace strategies exploit the empirical observation that many objectives are optimizable with few effective degrees of freedom. Output-based or gradient-based dimension reduction identifies and updates a low-dimensional search subspace per iteration, solving reduced-size subproblems for efficiency and scalability. Model-based subspace trust-region frameworks (NEWUOAs, OMoRF) have demonstrated global convergence and scalability to dimension $n = 10^4$ while maintaining robustness against function inaccuracy (Zhang, 8 Jan 2025, Gross et al., 2020).
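A random-subspace variant of the idea can be sketched as follows (a polling caricature under the assumption of randomly drawn subspaces; the cited frameworks instead build and minimize surrogate models within carefully chosen subspaces):

```python
import numpy as np

def subspace_minimize(f, x0, dim=2, delta=0.5, samples=8, iters=400, seed=0):
    """Random-subspace DFO sketch: each iteration draws an orthonormal
    basis Q of a dim-dimensional subspace and polls candidates x + Q z
    inside it, keeping the best improving point. Real subspace methods
    solve a model-based trust-region subproblem in place of polling."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, float)
    fx = f(x)
    n = len(x)
    for _ in range(iters):
        Q, _ = np.linalg.qr(rng.standard_normal((n, dim)))  # n x dim basis
        Z = delta * rng.uniform(-1, 1, size=(samples, dim))
        cands = x + Z @ Q.T
        F = np.array([f(c) for c in cands])
        i = int(F.argmin())
        if F[i] < fx:
            x, fx = cands[i].copy(), float(F[i])
    return x, fx
```

Each subproblem costs only `samples` evaluations regardless of the ambient dimension, which is the source of the scalability claimed for subspace methods.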

7. Practical Considerations and Comparative Performance

Empirical studies provide practical guidance and comparative rankings among major DFO families:

  • For smooth, well-conditioned, and approximately quadratic objectives, model-based (e.g., NEWUOA) and quasi-Newton methods (with accurate gradient surrogates) achieve rapid convergence with low evaluation counts (Auger et al., 2010).
  • CMA-ES consistently outperforms other general-purpose DFO algorithms in high-dimensional, ill-conditioned, or nonseparable continuous landscapes, with high robustness and moderate parameter sensitivity (Auger et al., 2010, Zhang, 2019).
  • DE and PSO are effective in continuous, real-valued settings but are more parameter sensitive and show poorer scaling.
  • Bayesian and Lipschitzian strategies dominate for expensive, moderate-dimensional global optimization, particularly when high-quality global minima are essential (Zhang, 2019, Phan et al., 2022).
  • Classification- and learning-based methods are particularly effective in batched, parallel, or constrained-evaluation settings, and scale efficiently to very high dimensions given suitable classifiers (Hashimoto et al., 2018, Han et al., 2023).
  • Pattern search and simplex methods provide robust local optimization in the absence of smoothness but scale poorly with dimension (Bagci, 30 Dec 2025).

Per-iteration complexity, memory usage, suitability to noise, and possibility for parallelization are critical in method selection. Hybridization—combining global search with local model-based or gradient-based refinement—is commonly practiced to exploit the complementary strengths of different DFO classes (Zhang, 2019).

8. Advances, Limitations, and Theoretical Guarantees

Recent advances include surrogate-aided trust-region updates with least $H^2$-norm projection for improved average-region model fidelity, explicit characterization of optimality-preserving transformations in model-based DFO (Xie et al., 2023), accelerated batch-based finite-difference estimators (Liang et al., 28 Feb 2025), and scalable dimension-reduction frameworks (Zhang, 8 Jan 2025, Gross et al., 2020).

Convergence guarantees have been established in various regimes, including stationarity of accumulation points, global convergence under the Kurdyka-Łojasiewicz (KL) property (for FD methods), provable rates for POISE-based trust-region IBO ($O(n\epsilon^{-2})$ evaluations for first-order criticality), and probabilistic convergence under learning-theoretic and active-learning assumptions for classifier-based DFO.

Despite significant progress, challenges remain in scaling DFO to ultra-high dimensions, handling highly non-smooth or noisy objectives, and efficiently integrating multi-fidelity or constrained settings. Algorithmic choice must be tailored to the problem structure, smoothness, evaluation budget, and available parallelism. Ongoing research continues to improve theoretical guarantees, implementation robustness, and practical efficiency across the DFO landscape.
