Derivative-free Neural Architecture Search
- Derivative-free NAS is a set of methods that automate neural network design without using gradient information, leveraging black-box optimization strategies such as Bayesian optimization and evolutionary algorithms.
- These techniques effectively navigate complex, discrete search spaces, avoiding pitfalls of continuous relaxation and overfitting inherent in differentiable methods.
- They offer practical benefits in sample efficiency, transferability, and scalability, enabling rapid prototyping of high-performance architectures.
Derivative-free Neural Architecture Search (NAS) encompasses a class of methods for automating the discovery of neural network topologies without reliance on gradient-based optimization of architecture parameters. Instead, these approaches typically treat the architecture search space as a black box, leveraging strategies such as Bayesian optimization, reinforcement learning, evolutionary algorithms, zero-order optimization, or learning-based surrogates that do not require derivatives with respect to architectural choices. Derivative-free NAS has gained substantial attention due to its robustness to the pitfalls of continuous relaxation and overfitting present in differentiable NAS, and its capacity to handle broad, complex, or discrete search spaces in a statistically principled manner.
1. Problem Formalization and Scope
Neural architecture search consists of optimizing a network architecture a, represented as a directed acyclic graph (DAG) over nodes (computational units) and edges (connections/operations), to maximize predictive performance under given resource constraints. Derivative-free NAS reframes architecture discovery as a black-box optimization or search problem over the discrete or continuous parameters α specifying a, bypassing the need for gradients of the validation loss with respect to these parameters. Formally, given architecture parameters α (possibly encoding wiring, operation distribution, topology, or generator hyperparameters) and an objective function f(α), often comprising validation accuracy, resource cost, or a composite multi-objective criterion, derivative-free NAS proceeds by iterative acquisition and evaluation of architecture candidates via strategies that do not depend on ∇α f(α).
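To make the black-box view concrete, the following minimal Python sketch illustrates the generic derivative-free loop: propose candidates, query the objective, keep the best, without ever touching ∇α f(α). The `sample`, `mutate`, and `evaluate` callables are hypothetical stand-ins for a specific search space and training pipeline.

```python
import random

def derivative_free_nas(sample, evaluate, mutate, budget=100):
    """Generic derivative-free NAS loop over a black-box objective.

    sample()        -> a random architecture from the search space (hypothetical)
    mutate(arch)    -> a local perturbation of an architecture (hypothetical)
    evaluate(arch)  -> validation accuracy; the expensive black-box query
    """
    best_arch, best_score = None, float("-inf")
    history = []
    for _ in range(budget):
        # propose either from the prior or by perturbing the incumbent
        if best_arch is not None and random.random() < 0.5:
            arch = mutate(best_arch)
        else:
            arch = sample()
        score = evaluate(arch)          # no gradients w.r.t. the architecture
        history.append((arch, score))
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score, history
```

The individual methods below differ mainly in how the proposal step and the bookkeeping over `history` are made more sample-efficient.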
2. Methodologies in Derivative-free NAS
Derivative-free NAS encompasses several algorithmic paradigms, each characterized by its approach to exploring and exploiting the architecture search space:
2.1 Bayesian Optimization over Generator Hyperparameters
Neural Architecture Generator Optimization (NAGO) casts NAS as low-dimensional continuous optimization by defining a hierarchical stochastic network generator parameterized by continuous hyperparameters θ (topology, operation mix, resolution/channel allocation, and merge strategies). The search proceeds via Bayesian optimization (BO) over θ, employing a heteroscedastic Bayesian neural network surrogate to model f(θ), a tuple of accuracy and resource usage. Acquisition functions such as expected hypervolume improvement facilitate exploration-exploitation trade-offs. Both single-objective (BOHB) and multi-objective (MOBO) BO variants are used, supported by local penalization for batch selection. This methodology reduces the architecture search to an 8–15-dimensional continuous hypercube, permitting efficient discovery of Pareto-optimal trade-offs between accuracy and model size/latency (Ru et al., 2020).
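A single-objective sketch of this workflow is given below. It substitutes a scikit-learn Gaussian-process surrogate and an expected-improvement acquisition for NAGO's heteroscedastic Bayesian neural network and hypervolume-based acquisition, and assumes a hypothetical `evaluate(theta)` that trains an architecture sampled from the generator and returns validation accuracy.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def bo_generator_search(evaluate, bounds, n_init=8, n_iter=20, seed=0):
    """Bayesian-optimization sketch over continuous generator hyperparameters theta.

    evaluate(theta) -> validation accuracy of an architecture sampled from the
    generator parameterized by theta (the expensive black-box query).
    bounds: array of shape (dim, 2) with per-dimension [low, high] limits.
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    sample = lambda n: rng.uniform(bounds[:, 0], bounds[:, 1], size=(n, len(bounds)))

    X = sample(n_init)
    y = np.array([evaluate(theta) for theta in X])
    gp = GaussianProcessRegressor(normalize_y=True)

    for _ in range(n_iter):
        gp.fit(X, y)
        cand = sample(512)                        # cheap random candidate pool
        mu, sd = gp.predict(cand, return_std=True)
        z = (mu - y.max()) / np.maximum(sd, 1e-9)
        ei = (mu - y.max()) * norm.cdf(z) + sd * norm.pdf(z)   # expected improvement
        theta = cand[np.argmax(ei)]
        X = np.vstack([X, theta])
        y = np.append(y, evaluate(theta))

    return X[np.argmax(y)], y.max()
```

The essential property is that only function values f(θ) are required; the surrogate and acquisition function handle the exploration-exploitation trade-off.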
2.2 Zero-order Optimization in Continuous Relaxations
ZARTS eschews the gradient approximations inherent to DARTS, replacing them with genuine zeroth-order optimization. It operates directly on the continuous architecture weighting vector α, minimizing the validation loss over α via zero-order estimators:
- Random Search (RS): Randomly samples directions u, approximating directional derivatives via symmetric finite differencing (a minimal sketch of this estimator appears below).
- Model-based Gaussian Smoothing (MGS): Defines an importance-weighted distribution over candidate steps based on their functional improvement, estimating step direction through importance sampling.
- GradientLess Descent (GLD): Directly selects among a finite set of candidate perturbations via function value comparisons.
This approach yields theoretical guarantees that, in the small step-size regime, these methods recover the SGD dynamics for α. ZARTS demonstrates high robustness against differentiability violations and search space pathologies (Wang et al., 2021).
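As an illustration, the RS-style estimator can be sketched as follows, assuming the validation loss is exposed only as a black-box callable `f(alpha)` (with network weights trained or reused inside the call):

```python
import numpy as np

def rs_zero_order_grad(f, alpha, n_dirs=8, mu=1e-2, seed=None):
    """Estimate the gradient of a black-box objective f at architecture vector
    alpha by symmetric finite differences along random unit directions."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(alpha, dtype=float)
    for _ in range(n_dirs):
        u = rng.standard_normal(alpha.shape)
        u /= np.linalg.norm(u)                        # random unit direction
        delta = (f(alpha + mu * u) - f(alpha - mu * u)) / (2.0 * mu)
        grad += delta * u                             # directional-derivative estimate
    return grad / n_dirs

# usage sketch: alpha = alpha - lr * rs_zero_order_grad(val_loss, alpha)
```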
2.3 Reinforcement Learning in a Continuous Action Space
L²NAS frames NAS as continuous-action reinforcement learning, with the architecture parameter α treated as the agent's action. The actor, an MLP, outputs α based on a statistical state summary of recent high-performing architectures. The critic is optimized toward a high conditional quantile (rather than mean regression), so exploration and policy updates focus on the high-reward tail. This setup ensures derivative-free learning, as the policy is never updated via architectural validation gradients; exploration is maintained through ε-greedy action selection and additive uniform noise (Mills et al., 2021).
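The critic's quantile objective can be illustrated with a standard pinball loss; the PyTorch sketch below (with τ the target quantile) is an assumption-level stand-in rather than L²NAS's exact critic loss.

```python
import torch

def pinball_loss(pred, target, tau=0.95):
    """Quantile-regression (pinball) loss: minimizing it drives pred toward the
    tau-quantile of target's conditional distribution instead of its mean."""
    diff = target - pred
    return torch.mean(torch.maximum(tau * diff, (tau - 1.0) * diff))
```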
2.4 Bandit, Evolutionary, and Surrogate-based Methods
- Combinatorial Multi-Armed Bandit (CMAB-NAS): The cell search space is decomposed into local bandit problems per node, with rewards approximated as additive. Nested Monte-Carlo Search (NMCS) is used for efficient exploration, guided locally by upper confidence bound (UCB) strategies and globally by top-k selection (a minimal UCB selection sketch appears after this list). CMAB-NAS achieves state-of-the-art accuracy at an order of magnitude lower search cost compared to tree search and differentiable methods (Huang et al., 2021).
- Predictor-guided Evolution (NPENAS): Evolutionary search is accelerated by a graph neural network surrogate, either a Bayesian uncertainty estimator used with Thompson sampling (NPENAS-BO) or a direct MSE-trained predictor (NPENAS-NP), to rank large batches of mutated offspring. Uniform hash-based candidate generation over the enumerated search space ensures diversity and an unbiased proposal distribution (Wei et al., 2020).
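For the node-level bandits in CMAB-NAS, a minimal UCB selection rule might look as follows (with hypothetical bookkeeping of per-operation reward sums and pull counts):

```python
import math

def ucb_select(stats, c=1.4):
    """Choose an operation for one node's local bandit.

    stats: dict mapping operation -> (reward_sum, pull_count).
    Unpulled operations are tried first; otherwise the score
    mean + c * sqrt(log(total_pulls) / pulls) is maximized.
    """
    total = sum(pulls for _, pulls in stats.values())
    def score(op):
        reward_sum, pulls = stats[op]
        if pulls == 0:
            return float("inf")
        return reward_sum / pulls + c * math.sqrt(math.log(total) / pulls)
    return max(stats, key=score)
```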
2.5 Zero-cost Proxies and Search-free Synthesis
- NAS Without Training (NASWOT): Architecture candidates are scored via the log-determinant of the ReLU activation-overlap kernel at random initialization, which is strongly indicative of trained accuracy. This enables near-instantaneous sample-and-select search or integration into regularized evolution, providing competitive performance at sub-second runtime (a scoring sketch appears after this list) (Mellor et al., 2020).
- Search-free GNN-based Synthesis: Architecture generation is cast as probabilistic DAG edge selection using a graph neural network link predictor, trained on positives (edges from high-performing architectures) and negatives. New architectures are sampled sequentially by maximizing predicted link probabilities under DAG and resource constraints, eliminating the need for search or objective queries altogether. This method yields competitive architectures and demonstrates high transferability across search spaces (e.g., NAS-Bench-101 to DARTS) (Liang et al., 2022).
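A minimal PyTorch sketch of a NASWOT-style score, assuming an untrained model whose ReLU modules can be forward-hooked and a single input minibatch `x`, is:

```python
import torch

def naswot_score(model, x):
    """Log-determinant of the binary ReLU activation-pattern kernel on one
    minibatch, evaluated at random initialization (no training)."""
    codes, hooks = [], []

    def hook(_module, _inp, out):
        codes.append((out.detach().flatten(1) > 0).float())   # binary code per input

    for m in model.modules():
        if isinstance(m, torch.nn.ReLU):
            hooks.append(m.register_forward_hook(hook))

    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()

    c = torch.cat(codes, dim=1)                 # N x (total ReLU units)
    k = c @ c.t() + (1 - c) @ (1 - c).t()       # agreements = units - Hamming distance
    return torch.slogdet(k).logabsdet.item()    # higher suggests better trained accuracy
```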
3. Search Space Design and Representational Abstractions
Derivative-free NAS methodologies vary greatly in how they encode and navigate the search space:
- Generator-based Hierarchies: NAGO's generator unifies a broad class of architecture motifs via Watts–Strogatz and Erdős–Rényi random graphs at the top, middle, and bottom levels, controlled via a small continuous parameterization θ that covers a vast space of distinct DAGs within a tractable optimization domain (Ru et al., 2020).
- Cell-level Decomposition: CMAB-NAS and other cell-based approaches represent network topologies as graphs over submodules (cells), with nodes encoding input-source/operation pairs and edges reflecting functional composition (Huang et al., 2021).
- Encoding for Predictors: NPENAS and link-prediction approaches employ DAG encodings augmented with one-hot operation features, supporting prediction with graph neural networks or attention modules (a minimal encoding sketch appears after this list) (Wei et al., 2020, Liang et al., 2022).
- Continuous Relaxations: In ZARTS, DARTS, and L²NAS, architecture parameters are relaxed to continuous vectors amenable to zero-order or reinforcement-based search, then discretized for final network instantiation (Wang et al., 2021, Mills et al., 2021).
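As an illustration of the predictor-oriented encodings above, the sketch below (with a hypothetical operation vocabulary) flattens a cell DAG into an adjacency-plus-one-hot feature matrix of the kind consumed by GNN surrogates:

```python
import numpy as np

OPS = ["conv3x3", "conv1x1", "maxpool3x3", "skip"]   # hypothetical operation set

def encode_cell(adjacency, ops):
    """Encode a cell DAG as [upper-triangular adjacency | one-hot node operations].

    Assumes nodes are listed in topological order, so keeping the strict upper
    triangle of the adjacency matrix enforces forward (acyclic) edges only.
    """
    adj = np.triu(np.asarray(adjacency, dtype=float), k=1)
    onehot = np.eye(len(OPS))[[OPS.index(op) for op in ops]]   # one label per node
    return np.concatenate([adj, onehot], axis=1)               # n_nodes x (n_nodes + |OPS|)
```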
4. Empirical Benchmarks, Efficiency, and Practical Implications
Derivative-free NAS methods have been extensively validated on canonical NAS benchmarks (CIFAR-10/100, NAS-Bench-101/201, ImageNet, NATS-Bench, DARTS spaces):
- Sample efficiency: NASWOT and link-predictor-based synthesis match or approach SOTA accuracy using 0.0002–0.02 GPU-days; NPENAS and CMAB-NAS deliver results with a fraction of the queries and compute required by differentiable or reinforcement-based NAS.
- Accuracy and transferability: On DARTS spaces, ZARTS-MGS achieves 97.54% CIFAR-10 and 75.7% ImageNet top-1; NPENAS-NP obtains 2.44% best CIFAR-10 error. Generator-based approaches such as NAGO deliver lightweight, high-accuracy models (e.g., 96.6% CIFAR-10 at 17 MB). Search-free link prediction methods transfer effectively across datasets and cell design paradigms (Wang et al., 2021, Wei et al., 2020, Liang et al., 2022, Ru et al., 2020).
| Method | Arch. Param. | Principle | GPU-days | Benchmarks | Best Reported Result |
|---|---|---|---|---|---|
| NAGO | Generator θ | Bayesian Opt. | 12.8 | CIFAR, ImageNet | 96.6% (C10) |
| ZARTS (MGS) | α (continuous) | Zero-order Opt. | 1.0 | CIFAR, ImageNet | 97.54% (C10) |
| L²NAS | α (continuous) | RL (Actor–Critic) | 0.1–3 | DARTS, NAS201 | 97.51% (C10) |
| CMAB-NAS | Cell (DAG) | Bandit (NMCS) | 0.58 | CIFAR, ImageNet | 2.58% err (C10) |
| NPENAS-NP | DAG | EA + predictor | 1.8 | NB201, DARTS | 2.44% err (C10) |
| NASWOT | Any | Zero-cost proxy | <0.01 | NB101, NB201, NATS | 93.84% (NB201 C10) |
| Link-predictor (Liang et al., 2022) | DAG | GNN, no search | 0.0002 | NB101, DARTS | 97.82% (C10) |
All values as reported in the referenced works; C10 = CIFAR-10, err = test error.
5. Advantages, Limitations, and Robustness Analysis
Derivative-free NAS approaches consistently demonstrate:
- Robustness to differentiability violations and overfitting pathologies present in gradient-based methods; for instance, ZARTS avoids collapse in adversarial search spaces where DARTS fails.
- Versatile search space coverage: Hierarchical generator and link prediction methods generalize across highly diverse architecture families, not limited by hand-crafted cell templates.
- Statistical efficiency: Low-dimensional continuous searches (NAGO, L²NAS) and fast proxy-based methods (NASWOT, AREA) prune the architecture space orders of magnitude faster than exhaustive or fully-trained search, with minimal drop in accuracy.
- Transferability: Surrogate- and link-predictor-based methods learn generic motifs, supporting architecture generation across tasks and datasets with little to no retraining.
Limitations include:
- Computational cost overhead: Zeroth-order methods such as ZARTS can incur increased cost due to repeated inner-loop weight training for function evaluations (although parallelization and surrogate-integration can mitigate this).
- Scaling to unbounded or unstructured architecture spaces: Link predictor and surrogate-based techniques may require re-training or experience buffering to capture new macro-structures absent from the initial data.
- Lack of global property control: Search-free generation methods (e.g., link prediction) may not enforce constraints on network depth, diameter, or specific resource budgets unless explicitly encoded.
6. Future Directions and Open Problems
Key areas for further research in derivative-free NAS include:
- Integration of multi-objective constraints: Beyond accuracy and parameter count, incorporating latency, energy, and deployment-aware constraints in the acquisition/selection strategies.
- More expressive surrogates: Combining graph-based predictors with zero-cost proxies or uncertainty quantification to balance exploitation and exploration in large or open-ended spaces.
- Joint end-to-end predictors: Unifying architecture scoring with link or motif prediction to directly optimize for trained performance in a closed loop.
- Scalability and generalization: Scaling search-free generative approaches and robust zero-order methods to ResNet/DenseNet-sized macro-architectures, potentially requiring reinforcement or meta-learning extensions.
Derivative-free NAS continues to achieve state-of-the-art results by harnessing black-box optimization, bandit, proxy, and learning-based design synthesis strategies, providing resilience to the pitfalls of differentiable search and enabling rapid, robust, and transferable neural architecture design (Ru et al., 2020, Mills et al., 2021, Wang et al., 2021, Mellor et al., 2020, Huang et al., 2021, Wei et al., 2020, Liang et al., 2022).