HW-NAS-Bench: Hardware-Aware NAS Benchmark
- HW-NAS-Bench is a standardized benchmark suite providing device-measured latency, energy, and accuracy metrics to enable fair hardware-aware neural architecture search.
- The benchmark consolidates multiple NAS search spaces and profiles architectures on six real hardware platforms, exposing the results through standardized Python APIs for instant metric lookup.
- Empirical findings reveal that conventional proxies like FLOPs correlate poorly with actual device costs, emphasizing the need for measured performance metrics.
HW-NAS-Bench is a standardized public benchmark suite for hardware-aware neural architecture search (HW-NAS) that provides measured or reliably estimated hardware performance (latency, energy) for a comprehensive set of neural network architectures evaluated across multiple real platforms. The benchmark was introduced to enable reproducible, fair, and device-aware NAS by eliminating the need for custom cost modeling or hardware profiling and by consolidating multiple widely-used search spaces and a diverse set of hardware back-ends. HW-NAS-Bench has catalyzed algorithmic innovation, facilitated LLM-powered NAS pipelines, guided zero-cost proxy research, and established itself as a reference standard for both neural and hardware co-design methods (Li et al., 2021, Capuano et al., 1 Apr 2025, Zhu et al., 1 Oct 2025, Ahmad et al., 2024).
1. Motivation and Design Scope
HW-NAS-Bench was conceived to address persistent obstacles in HW-NAS methodology: the prohibitive expertise and infrastructure needed to collect device-specific cost tables or measurement protocols, and the lack of unified, reproducible baselines due to variation in adopted search spaces and hardware targets. It offers:
- Device-measured or accurately simulated hardware-costs (latency, energy) for all architectures in major NAS search spaces (e.g., NAS-Bench-201, FBNet, and MnasNet derivatives), spanning six commercial and academic platforms: NVIDIA Jetson TX2, Raspberry Pi 4, Google Edge TPU, Pixel 3, Xilinx FPGA, and Eyeriss ASIC.
- Standardized Python APIs enabling researchers to instantly retrieve accuracy, latency, and energy metrics for search spaces ranging from roughly 10^4 up to ~10^21 architectures, supporting hardware-constrained or bi-objective NAS on real device profiles with no extra measurement burden.
- Explicit analysis showing that theoretical proxies (FLOPs, parameter counts) correlate poorly with true device costs (low observed Kendall's τ on key platforms), necessitating reliance on measured values (Li et al., 2021, Ahmad et al., 2024).
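A minimal sketch of how such a correlation check can be run; the arrays below are hypothetical stand-ins for values queried from the benchmark for a sample of architectures on one device:

```python
# Sketch: quantify how poorly FLOPs track measured device latency.
# The data here is synthetic; in practice both arrays come from
# HW-NAS-Bench queries for a sample of architectures on one device.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
flops = rng.uniform(10, 600, size=500)  # MFLOPs per architecture (hypothetical)
# Latency only loosely follows FLOPs on real devices
# (memory traffic, parallelism, and compiler effects dominate).
measured_latency = 0.01 * flops + rng.gamma(2.0, 2.0, size=500)

tau, p_value = kendalltau(flops, measured_latency)
print(f"Kendall's tau (FLOPs vs. latency): {tau:.3f} (p={p_value:.2g})")
```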
The benchmark is structured to democratize HW-NAS research, support non-hardware experts, and enable apples-to-apples comparisons across methods, search spaces, and hardware devices.
2. Search Spaces and Architectural Coverage
HW-NAS-Bench aggregates measured performance data for rich, well-characterized search spaces:
NAS-Bench-201 Cell Space:
- 15,625 unique architectures constructed by stacking identical “cells,” each a 4-node DAG with six searchable edges, each edge’s operation selected from {none (zeroize), skip connect, 1×1 conv, 3×3 conv, 3×3 avg-pool}; evaluated on CIFAR-10, CIFAR-100, and ImageNet16-120 (46,875 architecture–dataset pairs in total). Includes full training logs and measured latency/energy on all devices (Li et al., 2021).
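For concreteness, the cell space can be enumerated directly; a minimal sketch, assuming operation names and the architecture-string encoding follow the NAS-Bench-201 convention:

```python
# Sketch: enumerate the NAS-Bench-201 cell space (5 ops on 6 DAG edges = 5^6 = 15,625).
from itertools import product

OPS = ["none", "skip_connect", "nor_conv_1x1", "nor_conv_3x3", "avg_pool_3x3"]

def arch_string(edge_ops):
    """Encode 6 edge operations as a NAS-Bench-201 architecture string.
    Node i (i = 1..3) receives one edge from every earlier node j < i."""
    o = edge_ops
    return (f"|{o[0]}~0|+"
            f"|{o[1]}~0|{o[2]}~1|+"
            f"|{o[3]}~0|{o[4]}~1|{o[5]}~2|")

all_archs = [arch_string(combo) for combo in product(OPS, repeat=6)]
print(len(all_archs))   # 15625
print(all_archs[0])     # |none~0|+|none~0|none~1|+|none~0|none~1|none~2|
```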
FBNet Layer-wise Space:
- Fixed macro-architecture of 22 searchable blocks, each selectable among 9 operator variants (kernel/expansion/width); the full design space comprises 9^22 ≈ 10^21 networks, evaluated with blockwise cost composition (Li et al., 2021).
MnasNet (Accel-NASBench/Extended):
- Hierarchical, block-based parameterization over 7 sequential stages, each stage parameterized by expansion factor, depth, kernel size, and squeeze-and-excitation; designed to cover realistic, hardware-friendly CNN regimes for large-scale benchmarks (ImageNet2012) (Ahmad et al., 2024).
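As a rough illustration of the size of such a space, a sketch assuming the per-stage choice sets listed in the summary table below and independently chosen stages (a simplification of Accel-NASBench’s actual space):

```python
# Sketch of a hierarchical MnasNet-style block parameterization.
# Choice sets are taken from the summary table below; the real
# Accel-NASBench space may differ in detail.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class StageConfig:
    expansion: int       # e in {1, 4, 6}
    depth: int           # L in {1, 2, 3} repeated blocks
    kernel: int          # k in {3, 5}
    squeeze_excite: bool # SE on/off

CHOICES = list(product((1, 4, 6), (1, 2, 3), (3, 5), (False, True)))
per_stage = len(CHOICES)        # 3 * 3 * 2 * 2 = 36 options per stage
num_stages = 7
print(per_stage ** num_stages)  # ~7.8e10 candidates under these assumptions
```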
A summary of these spaces and operators:
| Search Space | Operators | # Architectures | Datasets |
|---|---|---|---|
| NAS-Bench-201 | none, skip, 1×1 conv, 3×3 conv, 3×3 avg-pool | 15,625 | CIFAR-10/100, ImageNet16-120 |
| FBNet | 9 block types (k, s, e, w) | 9^22 ≈ 10^21 | ImageNet |
| NATS-Bench (PEL-NAS) | nor_conv_3×3/1×1, avg_pool, skip | 15,625 | CIFAR/ImageNet16-120 |
| MnasNet (Accel-NASBench) | e (1,4,6), L (1,2,3), k (3,5), SE | — | ImageNet2012 |
Device latencies and/or energy are provided at per-architecture or per-block granularity, with summation schemes empirically validated to reliably approximate end-to-end inference cost (block-wise sums track measured end-to-end latency with high Pearson and Kendall correlations in the FBNet space) (Li et al., 2021).
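A sketch of how this additivity check can be reproduced, with a hypothetical lookup table and synthetic “measured” latencies in place of benchmark data:

```python
# Sketch: validate blockwise cost additivity, as reported for the FBNet space.
# block_lut[i][op] would hold the measured latency of operator `op` at layer i;
# here both the LUT and the end-to-end measurements are hypothetical.
import numpy as np
from scipy.stats import kendalltau, pearsonr

rng = np.random.default_rng(1)
n_layers, n_ops, n_archs = 22, 9, 200
block_lut = rng.uniform(0.1, 2.0, size=(n_layers, n_ops))     # ms per (layer, op)
archs = rng.integers(0, n_ops, size=(n_archs, n_layers))      # sampled op choices

block_sum = block_lut[np.arange(n_layers), archs].sum(axis=1) # additive estimate
end_to_end = block_sum + rng.normal(0.0, 0.2, size=n_archs)   # "measured" with noise

print(f"Pearson:  {pearsonr(block_sum, end_to_end)[0]:.3f}")
print(f"Kendall:  {kendalltau(block_sum, end_to_end)[0]:.3f}")
```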
3. Hardware Platforms and Performance Metrics
HW-NAS-Bench explicitly profiles each architecture (or block) on six heterogeneous hardware platforms, chosen to span the spectrum of practical deployment targets and research back-ends:
- Edge GPU: NVIDIA Jetson TX2, profiled with TensorRT.
- Raspberry Pi 4: ARM Cortex-A72, profiled with TFLite.
- Edge TPU: Google’s ML ASIC, profiled post-compile via TFLite/Edge TPU stack.
- Pixel 3: Mobile CPU + NN hardware, profiled with TFLite benchmark.
- FPGA: Xilinx ZC706/ZCU102/VCK190, measured on real board with Vivado HLS or Vitis AI stack.
- ASIC: Eyeriss accelerator, both simulated (Accelergy+Timeloop) and analytically predicted (DNN-Chip Predictor); average result forms final estimate (Li et al., 2021, Ahmad et al., 2024).
Collected metrics include:
- Latency (ms): direct on-device measurement or post-compilation prediction, averaged over multiple runs or inference passes.
- Energy per inference (mJ): via power-rail instrumentation where possible, otherwise simulation.
- Throughput (images/sec): for MnasNet/Accel-NASBench-derived spaces (Ahmad et al., 2024).
Device-specific cost-estimation methods include LUT-based lookup, blockwise additive synthesis, and simulation/analytical hybrid averaging for the ASIC. This guarantees query-time access to all hardware costs, enabling rapid iteration (Li et al., 2021).
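A compact sketch of these three access paths, with hypothetical values and stand-in callables for the simulator and analytical predictor:

```python
# Sketch of the cost-estimation schemes named above (values are hypothetical):
# direct LUT lookup for measured devices, simulation/analytical averaging for the ASIC.
measured_lut = {                    # per-(device, architecture) measured latency (ms)
    ("edge_gpu", 12345): 4.2,
    ("pixel3", 12345): 11.7,
}

def eyeriss_latency(arch_id, simulate, predict):
    """Hybrid ASIC estimate: average of a cycle-level simulation
    (e.g. Timeloop/Accelergy) and an analytical prediction
    (e.g. DNN-Chip Predictor)."""
    return 0.5 * (simulate(arch_id) + predict(arch_id))

def device_cost(device, arch_id):
    if device == "eyeriss":
        # stand-in callables; real ones wrap the simulator / predictor toolchains
        return eyeriss_latency(arch_id, simulate=lambda a: 6.0, predict=lambda a: 6.4)
    return measured_lut[(device, arch_id)]

print(device_cost("edge_gpu", 12345))  # 4.2
print(device_cost("eyeriss", 12345))   # 6.2
```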
4. Benchmark Construction, Data Access, and Methodological Impact
The HW-NAS-Bench protocol involves exhaustive or quasi-exhaustive sampling of the search space, full training of each candidate (for accuracy), and measurement or simulation of inference latency/energy on each device. In the Accel-NASBench variant, large-scale ImageNet2012 coverage is achieved via optimized proxy training schemes, yielding high Kendall's τ rank correlation with full training at 5.6× reduced cost (Ahmad et al., 2024).
The dataset is distributed with Python APIs, providing instant access to per-candidate metric tuples (accuracy, latency, energy). Typical workflows integrate HW-NAS-Bench lookups with search algorithms (e.g., evolutionary NAS, ProxylessNAS, LLM-driven mutation), allowing a hardware-constrained objective of the form

$$\max_{a \in \mathcal{A}} \ \mathrm{Acc}(a) \quad \text{s.t.} \quad \mathrm{Lat}_d(a) \le T_d \;\; \big(\text{and/or } \mathrm{Energy}_d(a) \le E_d\big)$$

to be optimized via direct querying (Li et al., 2021).
Pseudocode usage (abbreviated):
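A sketch following the released HW-NAS-Bench Python API (module, class, query, and data-file names as in the public repository):

```python
# Query pre-measured hardware metrics by table lookup; no training or profiling needed.
from hw_nas_bench_api import HWNASBenchAPI as HWAPI

hw_api = HWAPI("HW-NAS-Bench-v1_0.pickle", search_space="nasbench201")

# Retrieve all hardware metrics (latency/energy per device) for one architecture index.
for dataset in ["cifar10", "cifar100", "ImageNet16-120"]:
    hw_metrics = hw_api.query_by_index(0, dataset)
    print(dataset, hw_metrics)
```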
No architecture training or on-device measurement is necessary at search time, making device-specific HW-NAS fully accessible to non-experts (Li et al., 2021, Capuano et al., 1 Apr 2025, Zhu et al., 1 Oct 2025).
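Because every query is a table lookup, a hardware-constrained search collapses to a filter-and-select loop. A minimal random-search sketch of the objective above, where `get_accuracy` and `get_latency` are hypothetical placeholders wrapping the paired accuracy records and an HW-NAS-Bench device query:

```python
# Sketch: hardware-constrained random search by pure lookup
# (no training or measurement at search time).
import random

def constrained_random_search(num_archs, get_accuracy, get_latency,
                              latency_budget_ms, num_samples=1000, seed=0):
    rng = random.Random(seed)
    best_idx, best_acc = None, float("-inf")
    for _ in range(num_samples):
        idx = rng.randrange(num_archs)
        if get_latency(idx) > latency_budget_ms:  # reject over-budget candidates
            continue
        acc = get_accuracy(idx)
        if acc > best_acc:
            best_idx, best_acc = idx, acc
    return best_idx, best_acc
```

Swapping `get_latency` for a different device’s lookup reproduces the device-specific optima discussed in the next section.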
5. Analytical Insights and Empirical Findings
HW-NAS-Bench enables several critical empirical conclusions:
- FLOPs/#Params as Poor Hardware Proxies: Observed Kendall τ correlations between FLOPs/#params and measured cost are consistently weak (<0.5) or near-zero across major devices, establishing that real device profiles are required for valid HW-NAS (Li et al., 2021, Ahmad et al., 2024).
- Device-Specificity of Pareto Fronts: The set of Pareto-optimal architectures in the accuracy–latency tradeoff is device-dependent; an architecture optimal on Edge GPU may be strictly suboptimal on Edge TPU or ASIC, with cross-device Kendall τ as low as 0.0 or negative (Li et al., 2021, Ahmad et al., 2024).
- Blockwise Cost Additivity: For the FBNet and MnasNet spaces, block-sum latency closely matches full-device measurements (high Pearson correlation) (Li et al., 2021).
- Rapid and Fair Comparison: Search methods can be benchmarked in seconds, ensuring parity with respect to search space, hardware targets, and cost metrics.
Example benchmark results (Edge GPU, FBNet space): searching for accuracy–latency tradeoffs via ProxylessNAS yields distinct architectures optimal for each hardware target, with all candidate costs obtained by direct HW-NAS-Bench queries (Li et al., 2021).
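A sketch of how a device-specific Pareto front can be extracted from queried (latency, accuracy) pairs; the data points here are hypothetical:

```python
# Sketch: extract the accuracy-latency Pareto front for one device from queried pairs.
def pareto_front(points):
    """points: list of (latency_ms, accuracy). Keeps a point only if it beats
    the best accuracy among all faster candidates (i.e. it is non-dominated)."""
    front, best_acc = [], float("-inf")
    for lat, acc in sorted(points):  # sweep in ascending latency
        if acc > best_acc:
            front.append((lat, acc))
            best_acc = acc
    return front

device_points = [(4.2, 71.1), (3.8, 70.4), (6.0, 71.0), (5.1, 72.3), (3.5, 69.0)]
print(pareto_front(device_points))
# [(3.5, 69.0), (3.8, 70.4), (4.2, 71.1), (5.1, 72.3)]
```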
6. Downstream Adoption and Innovations Enabled
HW-NAS-Bench forms the principal evaluation bed for novel NAS and LLM-driven hardware search paradigms:
- Meta-training on Synthetic Devices: In "Sim-is-More," RL controllers are meta-trained on synthetic devices sampled from Gaussian fits of HW-NAS-Bench's operator-wise latency statistics, producing policies that adapt to novel devices with a handful of real latency queries and outperform single-device RL and random baselines in both accuracy (+2.9%) and latency (−19.5%) (Capuano et al., 1 Apr 2025); see the sketch after this list.
- LLM-Driven Co-Evolutionary NAS: PEL-NAS partitions the search space by architectural complexity, applies an LLM-powered, knowledge-base updating evolutionary loop, and efficiently navigates the HW-NAS-Bench space using zero-cost predictors and lookup tables, surpassing prior supernet and LLM-driven baselines in both Pareto front hypervolume and search efficiency (up to 53.6% lower IGD, 80.6% higher HV vs. LLMatic) (Zhu et al., 1 Oct 2025).
- Large-Scale, Bi-Objective Accelerator Benchmarks: Accel-NASBench (termed HW-NAS-Bench in some contexts) offers surrogate models for both ImageNet2012 top-1 accuracy and device throughput/latency, with XGBoost regressors achieving surrogate correlations above 0.98 (see table below) across held-out models and platforms (Ahmad et al., 2024).
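As referenced in the first item above, a minimal sketch of synthetic-device sampling; the per-operator statistics here are hypothetical placeholders rather than fitted HW-NAS-Bench values:

```python
# Sketch: sample a synthetic device by drawing per-operator latencies from
# Gaussian fits of operator statistics (values here are hypothetical).
import numpy as np

OP_STATS = {  # op -> (mean_ms, std_ms), e.g. fitted from HW-NAS-Bench lookup tables
    "nor_conv_3x3": (1.8, 0.6),
    "nor_conv_1x1": (0.7, 0.2),
    "avg_pool_3x3": (0.3, 0.1),
    "skip_connect": (0.05, 0.02),
}

def sample_synthetic_device(rng):
    """One 'device' = one draw of per-operator latency costs (clipped positive)."""
    return {op: max(1e-3, rng.normal(mu, sigma)) for op, (mu, sigma) in OP_STATS.items()}

rng = np.random.default_rng(42)
devices = [sample_synthetic_device(rng) for _ in range(3)]  # meta-training pool
for d in devices:
    print({k: round(v, 3) for k, v in d.items()})
```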
A representative table of cross-benchmark characteristics:
| Benchmark | Search Space | Devices | Metrics Provided | Surrogate Corr. |
|---|---|---|---|---|
| HW-NAS-Bench | NAS-Bench-201, FBNet | 6 | Acc., Latency, Energy | N/A (direct) |
| Accel-NASBench | MnasNet | 7 (GPU/TPU/FPGA) | Acc., Throughput, Latency | >0.98 |
7. Impact and Future Directions
HW-NAS-Bench has established standardized empirical baselines and protocols for HW-NAS, enabling methodologically sound, reproducible studies. The dataset's device spectrum, search space diversity, and multi-metric coverage have catalyzed advances in meta-learning, zero-cost proxy research, and LLM-driven design. The platform demonstrates the necessity of per-device optimization, undercuts the validity of theoretical proxies as cost estimates, and encourages further inclusion of emerging hardware, scalable training proxies, and real-time hardware–software co-design loops. Recent work points to opportunities in auto-discovery of partitioning schemes, adaptive profiling integration, and domain-general LLM agents leveraging HW-NAS-Bench as a real-time oracle (Li et al., 2021, Capuano et al., 1 Apr 2025, Zhu et al., 1 Oct 2025, Ahmad et al., 2024).