
NAS-Bench-201: Tabulated NAS Benchmark

Updated 1 March 2026
  • NAS-Bench-201 is a fully tabulated neural architecture search benchmark that comprehensively evaluates 15,625 convolutional cell architectures on multiple image classification datasets.
  • It employs a compact, cell-based search space modeled as a directed acyclic graph, enabling fair comparisons across various NAS methods including zero-shot and hardware-aware techniques.
  • The benchmark offers detailed per-epoch training logs, resource metrics, and fine-grained performance data to support research in transferability, robustness, and algorithm efficiency.

NAS-Bench-201 is a fully tabulated neural architecture search (NAS) benchmark that defines a compact, cell-based search space and provides exhaustive training, validation, and test results for all architectures on multiple image classification datasets. By providing unified, reproducible evaluations of 15,625 convolutional architectures, along with per-architecture resource metrics and fine-grained training logs, NAS-Bench-201 has become a principal platform for benchmarking algorithmic advances, sample efficiency, search dynamics, and transferability in the NAS literature. Its adoption extends to robust NAS, hardware-aware NAS, zero- and one-shot NAS, train-free methods, and transfer learning studies.

1. Search Space Definition and Encoding

NAS-Bench-201 models a micro-cell as a directed acyclic graph (DAG) of four nodes, labeled {0, 1, 2, 3}, which correspond to the input, two intermediate computation nodes, and the cell output, respectively. Every possible directed edge (i → j with i < j) among these nodes is present, yielding six edges per cell. Each edge is assigned one operation from the set:

O = {zeroize (none), skip-connect, 1×1 convolution, 3×3 convolution, 3×3 average pooling}

Consequently, the space contains 5^6 = 15,625 unique cell architectures (Dong et al., 2020, Lopes et al., 2023, Peng et al., 2022, Mills et al., 2021, Dudziak et al., 2020, Zhang et al., 2021, Dong et al., 2023, Wu et al., 2024).

Cells are encoded either as:

  • A six-tuple of operation indices in the canonical edge order (0→1), (0→2), (1→2), (0→3), (1→3), (2→3).
  • An upper-triangular 4×4 operation-adjacency matrix M, where M_ij gives the operation on edge (i→j) for i < j and is empty for i ≥ j.
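The six-tuple encoding maps one-to-one onto the architecture strings consumed by the official API. Below is a minimal sketch of that mapping; the operation names follow the spelling used in the official repository, while the helper `tuple_to_arch_str` itself is illustrative and not part of the API:

```python
from itertools import product

# Operation names as spelled in the official NAS-Bench-201 repository.
OPS = ['none', 'skip_connect', 'nor_conv_1x1', 'nor_conv_3x3', 'avg_pool_3x3']
# Canonical edge order: (0->1), (0->2), (1->2), (0->3), (1->3), (2->3).
EDGES = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]

def tuple_to_arch_str(op_indices):
    """Convert a six-tuple of op indices into an architecture string of the
    form '|op~0|+|op~0|op~1|+|op~0|op~1|op~2|'."""
    ops = {edge: OPS[i] for edge, i in zip(EDGES, op_indices)}
    nodes = []
    for j in (1, 2, 3):  # each non-input node lists its incoming edges
        nodes.append(''.join(f'|{ops[(i, j)]}~{i}' for i in range(j)) + '|')
    return '+'.join(nodes)

# Exhaustive enumeration: 5 ops on 6 edges -> 5**6 = 15,625 cells.
all_cells = [tuple_to_arch_str(t) for t in product(range(5), repeat=6)]
assert len(all_cells) == 15625
```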

Networks are built by stacking a fixed number of these cells within a hard-coded macro-architecture: a 3×3 stem convolution, three stages of five repeated cells (stages separated by residual downsampling blocks), global average pooling, and a fully connected classifier.

2. Datasets, Training Protocols, and Metrics

Every architecture in NAS-Bench-201 is trained from scratch on three datasets under identical regimes:

| Dataset | Train / Valid / Test splits | Input |
| --- | --- | --- |
| CIFAR-10 | 25,000 / 25,000 / 10,000 | 32×32 RGB |
| CIFAR-100 | 50,000 / 5,000 / 5,000 | 32×32 RGB |
| ImageNet-16-120 | 151,700 / 3,000 / 3,000 | 16×16 RGB |

All architectures are trained for 200 epochs using SGD (Nesterov momentum 0.9), cosine learning-rate decay from 0.1 down to 0, batch size 256, weight decay 5×10⁻⁴, and fixed data augmentations (random crop with padding, horizontal flip). Each run is repeated with multiple random seeds.
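The cosine learning-rate schedule used by this recipe can be reproduced in a few lines. This is a sketch of the standard cosine-annealing formula under the hyperparameters stated above, not code taken from the benchmark's training pipeline:

```python
import math

def cosine_lr(epoch, total_epochs=200, lr_start=0.1, lr_end=0.0):
    """Cosine-annealed learning rate: lr_start at epoch 0, lr_end at the
    final epoch, following 0.5 * (1 + cos(pi * t / T)) in between."""
    frac = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return lr_end + (lr_start - lr_end) * frac
```

For example, `cosine_lr(100)` gives 0.05, the halfway point of the decay.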

Metrics recorded for every architecture include:

  • Train/valid/test accuracy and cross-entropy loss (per-epoch and final).
  • FLOPs, parameter count, and batch-inference latency measured on an Nvidia GTX 1080 Ti.
  • Fine-grained logs for learning curves, supporting predictors and early stopping.
  • For robust NAS (Wu et al., 2024), clean and adversarial (PGD/FGSM/AutoAttack) accuracies.

The standardized training, evaluation, and logging protocol enables direct, fair comparison between NAS algorithms.
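Because per-epoch curves are tabulated, budget-adaptive methods can be simulated entirely offline. Below is a hypothetical successive-halving sketch over such curves; the `curves` data structure and the budget schedule are illustrative assumptions, not part of the benchmark:

```python
def successive_halving(curves, budgets=(12, 50, 200), keep_frac=0.5):
    """Successive halving over tabulated learning curves.

    curves[i] is the per-epoch validation-accuracy list for architecture i.
    At each intermediate budget, survivors are re-ranked by their accuracy
    at that epoch and the bottom fraction is discarded; the index of the
    best survivor at the final budget is returned.
    """
    survivors = list(range(len(curves)))
    for budget in budgets[:-1]:
        survivors.sort(key=lambda i: curves[i][budget - 1], reverse=True)
        survivors = survivors[:max(1, int(len(survivors) * keep_frac))]
    final = budgets[-1]
    return max(survivors, key=lambda i: curves[i][final - 1])
```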

3. Benchmark API and Workflow

The dataset is distributed as a single binary file for use with the NAS-Bench-201 Python API:

```python
from nas_201_api import NASBench201API

api = NASBench201API('path/to/NAS-Bench-201-v1_1_2019_05_28.pth')
# Final (200-epoch) statistics; is_random=False averages over the available seeds.
info = api.get_more_info(1234, 'cifar10-valid', hp='200', is_random=False)
test_acc = info['test-accuracy']
valid_acc = info['valid-accuracy']
```

This API provides instant lookup of all relevant metrics for any architecture, enabling NAS algorithms to focus solely on the search logic without requiring retraining. It also supports querying per-epoch (short-run) results for fast bandit and learning curve extrapolation methods.

Suggested best practices include:

  • Always compare results to random search and across all three datasets.
  • Report detailed operation usage to avoid rediscovering trivial operation-dominated modes.
  • Use multiple search seeds for robustness.
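The random-search baseline recommended above reduces to a few lines against a tabular benchmark. In the sketch below, `api_query` is a hypothetical stand-in for the lookup (e.g., a thin wrapper around `get_more_info`), returning a (validation, test) accuracy pair for an architecture index:

```python
import random

def random_search(api_query, num_samples=100, space_size=15625, seed=0):
    """Sample architectures uniformly, select the best by validation
    accuracy, and report the test accuracy of that selection."""
    rng = random.Random(seed)
    best_valid, best_test = float('-inf'), None
    for _ in range(num_samples):
        idx = rng.randrange(space_size)
        valid_acc, test_acc = api_query(idx)
        if valid_acc > best_valid:
            best_valid, best_test = valid_acc, test_acc
    return best_test
```

Selecting on validation accuracy but reporting test accuracy matters: reporting the best *test* accuracy seen during search would leak the test set into the search.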

4. Search Space Structure, Operation Effects, and Distributional Properties

The space is deliberately compact, allowing exhaustive evaluation and facilitating detailed statistical analysis (Lopes et al., 2023). Key findings highlight the strong structural bias:

  • Performance skewness: Validation/test accuracies are heavily negatively skewed; a large fraction of architectures cluster close to the upper-bound, with a long tail of poor performers.
  • Operation-conditioned performance: Architectures with more 3×3 convolutions yield higher mean accuracy; mean accuracy increases monotonically with the per-cell count of 3×3 convolutions. Other operations (avg-pool, 'none') consistently correlate with reduced accuracy and strongly depress the median.
  • Edge-wise and pairwise effects: All edge positions exhibit this dominance, but skip-connections are particularly beneficial when positioned directly into the output node.
  • Cross-dataset generalizability: Kendall's τ between CIFAR-10 and CIFAR-100 accuracy rankings is ≈0.75, and between CIFAR-10 and ImageNet-16-120 it is ≈0.60. Top architectures do not always transfer fully, but the overall correlation is substantial.
  • Design considerations: Most top-performing cells are heavily dominated by 3×3 and 1×1 convolutions, with skip-connections providing secondary benefits. Overuse of trivial or pooling operations almost always leads to sub-optimal models.
| μ_o (mean accuracy when op o is present) | 3×3 Conv | 1×1 Conv | Skip | AvgPool | None |
| --- | --- | --- | --- | --- | --- |
| CIFAR-10 | 87.6% | 86.6% | 83.8% | 81.3% | 79.0% |
| CIFAR-100 | 66.4% | 65.0% | 60.5% | 57.8% | 56.8% |
| ImageNet-16-120 | 38.3% | 37.3% | 33.8% | 30.1% | 29.8% |
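The cross-dataset rank correlations quoted above are Kendall's τ. A minimal tie-free implementation over small score lists is shown below for reference; in practice SciPy's `scipy.stats.kendalltau` (which also handles ties) is the sensible choice:

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length score lists.

    Counts concordant minus discordant pairs over n*(n-1)/2 total pairs.
    Ties are ignored in the numerator, so this is only exact for tie-free
    data (the τ-a variant).
    """
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```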

5. Impact on NAS Algorithm Development and Comprehensive Benchmarking

NAS-Bench-201 enables rigorously reproducible comparison of NAS algorithms across paradigms:

  • Non-weight-sharing methods (e.g., REA, RS, REINFORCE) outperform differentiable/weight-sharing ones on final test accuracy, due to the strong correlation between short and full training runs.
  • Differentiable NAS (e.g., DARTS, GDAS): Prone to collapse (e.g., all-skip cells), heavily sensitive to batch norm/statistics. Many such methods underperform random search unless adapted.
  • Zero-/one-shot and predictor-based NAS: GCN-based predictors, ranking distillation schemes (e.g., RD-NAS), and zero-cost proxies have been explicitly benchmarked, with RD-NAS showing notable improvements in ranking correlation of predicted versus true performance, especially when distillation from proxies is employed (Dong et al., 2023).
  • Hardware-aware and latency-constrained NAS: GCN-based latency predictors trained on LatBench (sub-benchmark for device runtime measurement) improve over analytic proxies (FLOPs, etc.) (Dudziak et al., 2020).
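REA (regularized, or aging, evolution), the strongest classic baseline in the results below, admits a compact sketch when run against a tabular benchmark. Here `api_query`, `mutate`, and `random_arch` are user-supplied stand-ins over the six-tuple encoding, not benchmark API functions:

```python
import collections
import random

def regularized_evolution(api_query, mutate, random_arch,
                          cycles=200, population_size=20,
                          sample_size=5, seed=0):
    """Aging-evolution search: tournament selection over a random sample,
    mutate the winner, and always retire the oldest population member."""
    rng = random.Random(seed)
    population = collections.deque()
    history = []
    for _ in range(population_size):  # seed the population randomly
        arch = random_arch(rng)
        entry = (arch, api_query(arch))
        population.append(entry)
        history.append(entry)
    for _ in range(cycles):
        candidates = [rng.choice(population) for _ in range(sample_size)]
        parent = max(candidates, key=lambda e: e[1])
        child = mutate(parent[0], rng)
        entry = (child, api_query(child))
        population.append(entry)
        population.popleft()  # "aging": drop the oldest, not the worst
        history.append(entry)
    return max(history, key=lambda e: e[1])
```

The aging rule (discard the oldest rather than the worst) is what distinguishes REA from plain tournament evolution and discourages premature convergence.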

Detailed result tables support head-to-head comparison. For example, test accuracy performance on CIFAR-10 (mean ± std):

| Method | Test Accuracy (%) |
| --- | --- |
| REA | 93.92 ± 0.30 |
| RS | 93.70 ± 0.36 |
| BOHB | 93.61 ± 0.52 |
| BaLeNAS-TF | 94.33 ± 0.03 |
| L²NAS-1k | 94.28 ± 0.08 |
| PRE-NAS | 94.04 ± 0.34 |
| Optimal | 94.37 |

6. Extensions: Robustness, Device Adaptation, and Future Directions

  • Robust NAS: NAS-RobBench-201 (Wu et al., 2024) extends the original with adversarially trained results (clean and robust accuracies under PGD/FGSM/AutoAttack) for all non-isomorphic architectures, supporting systematic exploration of robust neural design. It enables fast, reproducible robust-NAS evaluation and supports the development/training of robust NTK-based NAS proxies.
  • Hardware-aware NAS: LatBench (Dudziak et al., 2020) augments each architecture with latency measurements across six diverse hardware targets (desktop, embedded, mobile), exposing weak correlation between FLOPs and real inference time and facilitating device-specific architecture optimization.
  • Open questions and limitations: The fixed cell DAG and limited operator set restrict innovation in macro-architecture or novel operator classes. The transferability of discovered architectures between datasets remains significant but imperfect, especially for top-k models. Generalization theory for robust NAS under adversarial training is now grounded via NTK analysis (Wu et al., 2024), but feature-learning regimes and novel model classes (e.g., transformers, RNNs) are not yet covered.

Areas for extension include: macro-architecture search, joint NAS+HPO, tabular semantic segmentation/detection/LTP, explicit robust transfer search, and benchmarking with extended operator sets.

7. Usage Recommendations and Best Practices

Efficient and fair benchmarking with NAS-Bench-201 should follow these guidelines:

  • Benchmark search performance independently on all supported datasets, not just CIFAR-10.
  • Always report random search baselines to contextualize gains.
  • Quantify the operation composition of discovered architectures to detect overfitting to trivial subspaces (e.g., "C₃-everywhere").
  • For robust NAS, validate on strong held-out adversaries, not just PGD.
  • Exploit per-epoch logs for learning curve modeling and proxy-based early termination.
  • For latency- or hardware-constrained NAS, train or use device-specific predictors rather than relying solely on FLOPs.

By adhering to these standards, researchers can ensure rigor, reproducibility, and actionable comparison across the full spectrum of NAS algorithms and objectives (Dong et al., 2020, Lopes et al., 2023, Peng et al., 2022, Mills et al., 2021, Zhang et al., 2021, Dudziak et al., 2020, Dong et al., 2023, Wu et al., 2024).
