
NAS Benchmarks Overview

Updated 1 February 2026
  • NAS Benchmarks are standardized resources that provide precomputed accuracy, learning curves, and hardware metrics for efficient neural architecture search evaluation.
  • They encompass diverse formats—tabular, surrogate, and multi-domain suites—enabling rigorous comparisons across vision, NLP, graph, and energy-aware domains.
  • NAS Benchmarks support multi-objective analysis by integrating trade-offs between accuracy, energy consumption, and latency, fostering sustainable and reproducible research.

Neural Architecture Search (NAS) benchmarks are standardized resources that enable reproducible, fair, and efficient evaluation of NAS algorithms by providing lookup access to the accuracies, learning curves, and often hardware/energy metrics of neural network architectures across fixed search spaces and datasets. These benchmarks have fundamentally transformed NAS research by removing the prohibitive cost of repeated retraining and by supporting rigorous comparisons across search strategies, surrogate models, and performance objectives in vision, language, graph, and hardware-aware domains.

1. Foundations and Motivation

The primary motivation for NAS benchmarks is to overcome the computational barrier inherent in neural architecture search, where evaluating each architecture by training to convergence is expensive or infeasible in large search spaces. Early tabular benchmarks such as NAS-Bench-101 and NAS-Bench-201 exhaustively trained every architecture in small, cell-based search spaces (e.g., CIFAR-10 for vision) under fixed hyperparameters, while NAS-Bench-NLP did the same for a large sample of RNN cells on Penn Treebank; all record per-architecture performance so that metrics can be queried in constant time (Mehta et al., 2022, Klyuchnikov et al., 2020, Dong et al., 2020). This promoted methodological rigor by standardizing dataset splits, training protocols, and metrics, supporting reproducibility and fair head-to-head comparison.

However, exhaustive tabularization does not scale to larger or more diverse spaces (e.g., DARTS ≈10¹⁸, FBNet ≈10²¹, RNN DAGs ≈10⁵³ architectures), to domains beyond classification, or to settings requiring resource/energy metrics (Zela et al., 2020, Lopes et al., 2023). These limitations motivated surrogate benchmarks, multi-domain suites, and new protocols to capture hardware, efficiency, and transferability.
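
The surrogate idea can be illustrated with a toy sketch: fit a cheap predictor on a trained subset of a search space, then use it as a queryable oracle. Everything below is invented for illustration (the operation set, the synthetic "ground-truth" accuracies, and the per-edge mean-effect predictor); real surrogate benchmarks fit GIN, LightGBM, or XGBoost models over far larger spaces.

```python
import random
from collections import defaultdict

# Toy search space: a cell is a tuple of operations on a fixed number of edges.
OPS = ["conv3x3", "conv1x1", "skip", "avg_pool", "zero"]
NUM_EDGES = 6

def sample_arch():
    return tuple(random.choice(OPS) for _ in range(NUM_EDGES))

def true_accuracy(arch):
    # Synthetic ground truth standing in for an expensive training run.
    bonus = {"conv3x3": 2.0, "conv1x1": 1.2, "skip": 0.8,
             "avg_pool": 0.3, "zero": -1.0}
    return 85.0 + sum(bonus[op] for op in arch) + random.gauss(0.0, 0.2)

class MeanEffectSurrogate:
    """Predicts accuracy from per-(edge, op) mean effects estimated on a
    trained subset -- a crude stand-in for learned surrogate predictors."""
    def fit(self, archs, accs):
        self.mean = sum(accs) / len(accs)
        sums, counts = defaultdict(float), defaultdict(int)
        for arch, acc in zip(archs, accs):
            for edge, op in enumerate(arch):
                sums[(edge, op)] += acc - self.mean
                counts[(edge, op)] += 1
        self.effect = {k: sums[k] / counts[k] for k in sums}
        return self

    def predict(self, arch):
        # A query costs a dictionary lookup, not a training run.
        return self.mean + sum(self.effect.get((edge, op), 0.0)
                               for edge, op in enumerate(arch))

random.seed(0)
train = [sample_arch() for _ in range(500)]
surrogate = MeanEffectSurrogate().fit(train, [true_accuracy(a) for a in train])
good = ("conv3x3",) * NUM_EDGES
bad = ("zero",) * NUM_EDGES
```

The fitted surrogate preserves the ranking that matters for search (it scores `good` above `bad`), which is the property real surrogate benchmarks are validated on.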

2. Types of NAS Benchmarks

NAS benchmarks now span a spectrum of formats and domains:

  • Tabular Benchmarks: Exhaustively trained datasets where each architecture’s metric(s) are precomputed and directly queryable. Examples:
    • NAS-Bench-101 and NAS-Bench-201: Convolutional cell DAG spaces; NAS-Bench-101 covers ≈4.2×10⁵ models on CIFAR-10, and NAS-Bench-201 covers 15,625 models on CIFAR-10/100/ImageNet16-120 (Mehta et al., 2022, Dong et al., 2020).
    • NATS-Bench: Adds macro-level variation (width, depth) and supports larger search spaces and more datasets (Dong et al., 2020).
    • NAS-Bench-NLP: RNN cell DAGs for language modeling—a search space of ≈10¹⁶–10⁵³ architectures, of which ≈14,000 were actually trained (Klyuchnikov et al., 2020).
    • NAS-Bench-Graph: Node-classification on 9 datasets with 26,206 GNN architectures (Qin et al., 2022).
  • Surrogate Benchmarks: Machine-learned predictors (e.g., GIN, LightGBM, XGBoost) trained on a large but tractable subset and released as fast, queryable oracles—enabling search spaces up to ≈10¹⁸ (DARTS), 10²¹ (FBNet) (Zela et al., 2020, Yan et al., 2021). Surrogates estimate full learning curves, not just endpoint accuracies.
  • Multi-Domain Suites: Suites such as NAS-Bench-Suite aggregate tabular and surrogate APIs for >25 domains covering vision, NLP, speech, object detection, segmentation, and hardware (Mehta et al., 2022). NAS-Bench-360 extends this concept to highly diverse tasks from genomics to audio to PDEs, capturing cross-domain robustness and transferability (Tu et al., 2021).
  • Hardware- and Energy-Aware Benchmarks: Recent benchmarks such as EC-NAS, Accel-NASBench, AnalogNAS-Bench pair accuracy metrics with direct or surrogate-based estimates of energy consumption, device latency, or analog non-idealities (Bakhtiarifard et al., 2022, Ahmad et al., 2024, Bessalah et al., 23 Jun 2025). These support truly multi-objective NAS (Pareto search over accuracy, energy, latency), exposing trade-offs directly relevant to sustainable and hardware-optimized model design.
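
The core contract shared by the tabular entries above, constant-time metric lookup in place of training, can be sketched in a few lines. The class name, architecture encodings, and metric keys here are illustrative, not the API of NAS-Bench-101/201 or any other real benchmark.

```python
# Miniature in-memory stand-in for a tabular NAS benchmark.
class TabularBenchmark:
    def __init__(self, table):
        self._table = table  # architecture encoding -> precomputed metrics

    def query(self, arch, metric="test_accuracy"):
        # Constant-time lookup in place of a full training run.
        return self._table[arch][metric]

    def learning_curve(self, arch):
        # Per-epoch logs enable multi-fidelity methods (see below).
        return self._table[arch]["val_acc_per_epoch"]

    def best(self, metric="test_accuracy"):
        return max(self._table, key=lambda a: self._table[a][metric])

bench = TabularBenchmark({
    "skip|conv3x3|conv1x1": {"test_accuracy": 93.1, "energy_kwh": 0.41,
                             "val_acc_per_epoch": [62.0, 81.5, 90.2]},
    "zero|skip|avg_pool":   {"test_accuracy": 74.8, "energy_kwh": 0.17,
                             "val_acc_per_epoch": [41.0, 60.3, 71.9]},
})
acc = bench.query("skip|conv3x3|conv1x1")  # returns the stored 93.1
```

Real benchmarks key their tables by architecture hashes or indices and store seeds, training times, and parameter counts alongside accuracy, but the query pattern is the same.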

3. Key Methodological Principles

Benchmark construction and use are guided by strict principles to guarantee utility and scientific rigor:

  • Standardized Protocols: All benchmarked architectures are trained with fixed (public) data splits, hyperparameters, and seeds to eliminate confounding variability (Dong et al., 2020, Mehta et al., 2022).
  • Coverage and Diversity: Well-designed benchmarks span a wide spectrum of model sizes, structures, and (for energy-aware variants) hardware usage profiles, avoiding bias to specific regions of the space (Kocher et al., 21 May 2025).
  • Multi-fidelity Support: Learning-curve and per-epoch metrics (not just final accuracy) enable development of multi-fidelity NAS methods—e.g., early stopping, learning curve extrapolation, freeze-thaw acquisition (Yan et al., 2021).
  • Energy/Hardware Fidelity: For energy-aware or hardware-aware NAS, accuracy of per-epoch power measurements, capture of full-device energy (not just GPU), and coverage across device utilization levels are essential (Kocher et al., 21 May 2025, Bakhtiarifard et al., 2022, Ahmad et al., 2024).
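
As a concrete use of the per-epoch logs named in the multi-fidelity principle, here is a minimal successive-halving sketch: rank all candidates on cheap low-epoch accuracy, keep the top fraction, and re-rank survivors at larger budgets. The curve data and names are synthetic.

```python
import random

def successive_halving(curves, budgets=(5, 10, 20), keep=0.5):
    """curves: arch name -> list of per-epoch validation accuracies,
    as exposed by benchmarks that log full learning curves."""
    survivors = list(curves)
    for epochs in budgets:
        # Rank on the accuracy reached after `epochs` epochs only.
        survivors.sort(key=lambda a: curves[a][epochs - 1], reverse=True)
        survivors = survivors[:max(1, int(len(survivors) * keep))]
    return survivors[0]

# Synthetic learning curves: each architecture rises toward a random peak.
random.seed(1)
curves = {
    f"arch{i}": [peak * (1 - 0.5 ** (epoch / 4 + 1)) for epoch in range(20)]
    for i, peak in ((j, random.uniform(70.0, 95.0)) for j in range(16))
}
winner = successive_halving(curves)
```

Against a benchmark, each `curves[a][e]` access is a table or surrogate query, so the whole multi-fidelity run costs seconds rather than GPU-days.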

4. Benchmark Usage and Empirical Findings

Benchmarks provide not only ready-made ground-truth for algorithm evaluation but also reveal intrinsic structure in architectural spaces:

  • API and Integration: Most benchmarks provide Python APIs for querying single architectures by hash/index, sampling from the space, or extracting Pareto frontiers for multi-objective evaluation. Surrogates can be embedded into NAS workflows to replace live training-and-evaluate cycles (Bakhtiarifard et al., 2022, Yan et al., 2021, Mehta et al., 2022).
  • Algorithm Evaluation: Side-by-side evaluation of NAS algorithms (random search, regularized evolution, local search, BOHB, one-shot methods, multi-fidelity methods) on unified benchmarks reveals that claims from a single domain often do not generalize—hyperparameter robustness and optimizer ranking are highly search-space dependent (Mehta et al., 2022, Loya et al., 2023).
  • Architectural Insights: Statistical analyses (e.g., operation importance, rank correlation, performance skewness) uncover which operations and structures yield the best results and which benchmarks offer discriminative value—for instance, benchmarks with highly skewed performance distributions or strongly correlated operations are less informative for comparing algorithms (Lopes et al., 2023).
  • Transfer and Meta-Learning: Recent studies show that meta-learned predictors can transfer across benchmarks, encoding inductive biases; however, cross-domain adaptation depends critically on benchmark diversity and correlation (Loya et al., 2023).
  • Multi-objective Optimization: Energy-aware and hardware-targeted benchmarks enable empirical Pareto-front extraction, revealing potential for models with minimal energy or latency loss relative to accuracy, and supporting sustainable AutoML (Bakhtiarifard et al., 2022, Ahmad et al., 2024, Kocher et al., 21 May 2025).
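
The Pareto-front extraction mentioned in the last bullet reduces to a non-domination filter over queried (accuracy, energy) pairs. A minimal sketch with invented numbers (real energy-aware benchmarks report richer metrics such as per-epoch power and device latency):

```python
def pareto_front(points):
    """Non-dominated (accuracy, energy) pairs: maximize accuracy,
    minimize energy. points: list of (accuracy, energy_kwh) tuples,
    e.g. queried from an energy-aware benchmark."""
    front = []
    for acc, energy in points:
        dominated = any(
            (a2 >= acc and e2 <= energy) and (a2 > acc or e2 < energy)
            for a2, e2 in points
        )
        if not dominated:
            front.append((acc, energy))
    return front

# Four hypothetical models; the fourth is dominated (less accurate AND
# more energy-hungry than the second) and drops off the front.
models = [(93.1, 0.41), (94.0, 0.88), (90.5, 0.12), (92.8, 0.95)]
front = pareto_front(models)
```

The surviving points expose exactly the accuracy-vs-energy trade-off a practitioner chooses from; this quadratic filter is fine at benchmark scale, though large spaces typically use sorted sweeps instead.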

5. Advances and Limitations

Recent advances have extended NAS benchmarks in several directions:

  • GraphNAS and Analog-Aware NAS: The introduction of benchmarks for GNN search (NAS-Bench-Graph) and analog in-memory computing (AnalogNAS-Bench) has broadened applicability to new domains and hardware, exposing failure modes of conventional proxies (e.g., quantization) for hardware-robustness assessment (Qin et al., 2022, Bessalah et al., 23 Jun 2025).
  • Scaling to Larger Spaces: Surrogate benchmarks (e.g., NAS-Bench-301, NAS-Bench-x11) and proxy-based training schemes (Accel-NASBench) address tractability in macro search spaces of >10¹⁸ architectures, preserving ranking fidelity while reducing data collection cost by 5-6× (Zela et al., 2020, Ahmad et al., 2024, Yan et al., 2021).
  • Energy and Environmental Reporting: Guidelines for reliable device-level measurement, coverage across GPU utilization levels, and holistic (whole-node) reporting have emerged as standards for new energy-aware resources (Kocher et al., 21 May 2025).

Limitations persist: Search spaces remain much smaller than real networks, cross-benchmark rank transfer remains imperfect (task/domain correlation is variable), and energy/latency surrogates depend on accurate hardware profiling and calibration protocols (Kocher et al., 21 May 2025).

6. Impact and Practical Guidelines

NAS benchmarks have become indispensable tools for rigorous NAS research. Best practices derived from recent benchmarks include:

  • Always evaluate NAS algorithms on multiple distinct benchmarks and task domains to ensure claims generalize; avoid hyperparameter transfer without per-benchmark tuning (Mehta et al., 2022, Loya et al., 2023).
  • Prefer benchmarks with standardized protocols, rich per-architecture logs (including learning curves), and, for hardware-aware use-cases, device-validated measurement pipelines (Yan et al., 2021, Kocher et al., 21 May 2025).
  • For energy-aware research, report both GPU and holistic node-level energy, together with calibration runs, to ensure measurement traceability and enable meaningful Pareto-front extraction (Kocher et al., 21 May 2025, Bakhtiarifard et al., 2022).
  • Leverage multi-fidelity and multi-objective benchmarks to accelerate algorithmic innovation, support environmental sustainability, and ground empirical NAS research in realistic deployment settings (Ahmad et al., 2024, Bakhtiarifard et al., 2022).
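
Whether optimizer rankings transfer across benchmarks, as the first guideline demands, can be quantified with Kendall's tau rank correlation. A stdlib-only sketch; the optimizer names and rank positions below are hypothetical:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two rankings of the same items
    (dicts: item -> rank position). +1 = identical order, -1 = reversed."""
    pairs = list(combinations(rank_a, 2))
    concordant = sum(1 for x, y in pairs
                     if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0)
    discordant = sum(1 for x, y in pairs
                     if (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0)
    return (concordant - discordant) / len(pairs)

# Hypothetical rankings of the same optimizers on two benchmarks:
# only evolution and local search swap places between them.
bench_a = {"random": 4, "evolution": 1, "local_search": 2, "bohb": 3}
bench_b = {"random": 4, "evolution": 2, "local_search": 1, "bohb": 3}
tau = kendall_tau(bench_a, bench_b)  # 4 concordant, 1 discordant of 6 pairs
```

A tau near 1 across benchmarks supports generalization claims; low or variable tau is exactly the search-space dependence reported above.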

The proliferation of well-designed NAS benchmarks has catalyzed a methodological shift, enabling robust, generalizable, and computationally sustainable progress in neural architecture search.
