
HW-NAS-Bench: Hardware-Aware NAS Benchmark

Updated 24 April 2026
  • HW-NAS-Bench is a standardized benchmark suite providing device-measured latency, energy, and accuracy metrics to enable fair hardware-aware neural architecture search.
  • The benchmark consolidates multiple NAS search spaces and profiles architectures on six real hardware platforms using standardized Python APIs for instant metric lookup.
  • Empirical findings reveal that conventional proxies like FLOPs correlate poorly with actual device costs, emphasizing the need for measured performance metrics.

HW-NAS-Bench is a standardized public benchmark suite for hardware-aware neural architecture search (HW-NAS) that provides measured or reliably estimated hardware performance (latency, energy) for a comprehensive set of neural network architectures evaluated across multiple real platforms. The benchmark was introduced to enable reproducible, fair, and device-aware NAS by eliminating the need for custom cost modeling or hardware profiling and by consolidating multiple widely-used search spaces and a diverse set of hardware back-ends. HW-NAS-Bench has catalyzed algorithmic innovation, facilitated LLM-powered NAS pipelines, guided zero-cost proxy research, and established itself as a reference standard for both neural and hardware co-design methods (Li et al., 2021, Capuano et al., 1 Apr 2025, Zhu et al., 1 Oct 2025, Ahmad et al., 2024).

1. Motivation and Design Scope

HW-NAS-Bench was conceived to address persistent obstacles in HW-NAS methodology: the prohibitive expertise and infrastructure needed to collect device-specific cost tables or measurement protocols, and the lack of unified, reproducible baselines due to variation in adopted search spaces and hardware targets. It offers:

  • Device-measured or accurately simulated hardware-costs (latency, energy) for all architectures in major NAS search spaces (e.g., NAS-Bench-201, FBNet, and MnasNet derivatives), spanning six commercial and academic platforms: NVIDIA Jetson TX2, Raspberry Pi 4, Google Edge TPU, Pixel 3, Xilinx FPGA, and Eyeriss ASIC.
  • Standardized Python APIs enabling researchers to instantly retrieve accuracy, latency, and energy metrics for search spaces ranging from thousands up to ~10^21 architectures, supporting hardware-constrained or bi-objective NAS on real device profiles with no extra measurement burden.
  • Explicit analysis and evidence that theoretical proxies (FLOPs, parameter counts) correlate poorly with true device costs (with observed Kendall’s τ < 0.5 on key platforms), necessitating reliance on measured values (Li et al., 2021, Ahmad et al., 2024).
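The weak-proxy finding above is easy to illustrate numerically. The sketch below computes Kendall's τ in pure Python between a FLOPs ranking and a measured-latency ranking; the FLOPs and latency values are entirely made up for demonstration and are not taken from the benchmark:

```python
# Why FLOPs can be a poor latency proxy: Kendall's tau between a FLOPs
# ordering and a (hypothetical) measured device-latency ordering.

def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a) for two equal-length sequences."""
    assert len(xs) == len(ys)
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical architectures: FLOPs (M) vs. measured latency (ms);
# the latency does not track FLOPs monotonically.
flops = [120, 250, 310, 400, 520, 610]
latency = [3.1, 2.4, 5.0, 2.9, 4.2, 4.0]

tau = kendall_tau(flops, latency)
print(f"Kendall tau(FLOPs, latency) = {tau:.2f}")  # 0.20, far from 1.0
```

A τ this low means that ranking candidates by FLOPs would frequently mis-order them relative to their true on-device cost, which is the benchmark's argument for measured metrics.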

The benchmark is structured to democratize HW-NAS research, support non-hardware experts, and enable apples-to-apples comparisons across methods, search spaces, and hardware devices.

2. Search Spaces and Architectural Coverage

HW-NAS-Bench aggregates measured performance data for rich, well-characterized search spaces:

NAS-Bench-201 Cell Space:

  • 15,625 unique architectures constructed by stacking identical “cells,” each a 4-node DAG with the operation on each of its six edges selected from {none, skip connect, 1×1 conv, 3×3 conv, 3×3 avg-pool}; evaluated on CIFAR-10, CIFAR-100, and ImageNet16-120 (46,875 architecture–dataset entries in total). Includes full training logs and measured latency/energy on all devices (Li et al., 2021).
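The cell count follows directly from the space's structure. A minimal enumeration, assuming the standard five-operation NAS-Bench-201 operator set over the six cell edges:

```python
from itertools import product

# NAS-Bench-201 cell: a 4-node DAG with 6 edges, each edge choosing one of
# 5 candidate operations, giving 5**6 = 15,625 unique cells.
OPS = ["none", "skip_connect", "nor_conv_1x1", "nor_conv_3x3", "avg_pool_3x3"]
EDGES = 6  # edges of the 4-node DAG: 0->1, 0->2, 1->2, 0->3, 1->3, 2->3

all_cells = list(product(OPS, repeat=EDGES))
print(len(all_cells))  # 15625
```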

FBNet Layer-wise Space:

  • Fixed macro-architecture of 22 searchable blocks, each selectable among 9 operator variants (kernel/expansion/width); the full design space is 9^22 ≈ 10^21 networks, evaluated with blockwise cost composition (Li et al., 2021).

MnasNet (Accel-NASBench/Extended):

  • Hierarchical, block-based parameterization with ~10^11 architectures over 7 sequential stages, each block parameterized by expansion factor, depth, kernel size, and squeeze-and-excitation; designed to cover realistic, hardware-friendly CNN regimes for large-scale benchmarks (ImageNet2012) (Ahmad et al., 2024).

A summary of these spaces and operators:

# Architectures and datasets per space:

| Search Space | Operators | # Architectures | Datasets |
|---|---|---|---|
| NAS-Bench-201 | 1×1/3×3 conv, 3×3 avg-pool, skip, none | 46,875 (15,625 × 3 datasets) | CIFAR-10/100, ImageNet16-120 |
| FBNet | 9 block types (kernel, stride, expansion, width) | 9^22 ≈ 10^21 | ImageNet |
| NATS-Bench (PEL-NAS) | nor_conv_3×3/1×1, avg_pool, skip | 15,625 | CIFAR-10/100, ImageNet16-120 |
| MnasNet (Accel-NASBench) | e ∈ {1,4,6}, L ∈ {1,2,3}, k ∈ {3,5}, SE | ~10^11 | ImageNet2012 |

Device latencies and/or energy are provided at per-architecture or per-block granularity, with summation schemes empirically validated to reliably approximate end-to-end inference cost (block-wise sum, with Pearson ρ ≥ 0.86 and Kendall τ ≥ 0.63 in FBNet) (Li et al., 2021).
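The block-wise summation scheme can be sketched as a lookup-table sum. All per-block latencies and operator names below are hypothetical placeholders, not values from the released dataset:

```python
# Blockwise-additive cost model in the style validated for the FBNet space:
# per-(block, operator) latencies come from a device LUT, and the end-to-end
# estimate is their sum over the searchable blocks.

# Hypothetical per-(block index, operator) latency LUT in ms for one device.
LATENCY_LUT = {
    (0, "k3_e1"): 0.41, (0, "k3_e6"): 0.95, (0, "skip"): 0.02,
    (1, "k3_e1"): 0.38, (1, "k3_e6"): 0.90, (1, "skip"): 0.02,
    (2, "k3_e1"): 0.35, (2, "k3_e6"): 0.88, (2, "skip"): 0.02,
}

def estimate_latency(arch):
    """arch: mapping of block index -> chosen operator name."""
    return sum(LATENCY_LUT[(block, op)] for block, op in arch.items())

arch = {0: "k3_e6", 1: "skip", 2: "k3_e1"}
print(f"{estimate_latency(arch):.2f} ms")  # 0.95 + 0.02 + 0.35 -> 1.32 ms
```

Because the estimate is a single dictionary sum, candidate costs can be composed for any of the 9^22 FBNet configurations without profiling each one.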

3. Hardware Platforms and Performance Metrics

HW-NAS-Bench explicitly profiles each architecture (or block) on six heterogeneous hardware platforms, chosen to span the spectrum of practical deployment targets and research back-ends:

  • Edge GPU: NVIDIA Jetson TX2, profiled with TensorRT.
  • Raspberry Pi 4: ARM Cortex-A72, profiled with TFLite.
  • Edge TPU: Google’s ML ASIC, profiled post-compile via TFLite/Edge TPU stack.
  • Pixel 3: Mobile CPU + NN hardware, profiled with TFLite benchmark.
  • FPGA: Xilinx ZC706/ZCU102/VCK190, measured on real board with Vivado HLS or Vitis AI stack.
  • ASIC: Eyeriss accelerator, both simulated (Accelergy+Timeloop) and analytically predicted (DNN-Chip Predictor); average result forms final estimate (Li et al., 2021, Ahmad et al., 2024).

Collected metrics include:

  • Latency (ms): direct on-device measurement or post-compilation prediction, averaged over multiple runs or inference passes.
  • Energy per inference (mJ): via power-rail instrumentation where possible, otherwise simulation.
  • Throughput (images/sec): for MnasNet/Accel-NASBench-derived spaces (Ahmad et al., 2024).

Device-specific methods for cost estimation include LUT-based lookup, blockwise additive synthesis, and simulation-analytical hybrid averaging for ASIC. This ensures O(1) access to all hardware costs, enabling rapid iteration (Li et al., 2021).
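Two of these estimation schemes can be sketched in a few lines: constant-time table lookup of pre-measured metrics, and the Eyeriss ASIC estimate formed by averaging a simulated and an analytically predicted latency. Class and method names plus all numeric values here are hypothetical, not the published interface:

```python
# O(1) metric access plus the simulation/analytical hybrid ASIC estimate.

class MetricTable:
    def __init__(self, records):
        # records: arch_index -> {device: (latency_ms, energy_mJ)}
        self._records = records

    def query(self, arch_index, device):
        """Constant-time dictionary lookup of (latency, energy)."""
        return self._records[arch_index][device]

def eyeriss_estimate(simulated_ms, predicted_ms):
    """ASIC cost as the average of an Accelergy+Timeloop simulation and a
    DNN-Chip Predictor analytical estimate, as described for Eyeriss."""
    return (simulated_ms + predicted_ms) / 2

table = MetricTable({7: {"edge_gpu": (4.2, 11.0), "pixel3": (9.8, 6.5)}})
print(table.query(7, "edge_gpu"))   # (4.2, 11.0)
print(eyeriss_estimate(3.0, 3.4))   # 3.2
```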

4. Benchmark Construction, Data Access, and Methodological Impact

The HW-NAS-Bench protocol involves exhaustive or quasi-exhaustive sampling of the search space, full training of each candidate (for accuracy), and measurement or simulation of inference latency/energy on each device. In the Accel-NASBench variant, large-scale ImageNet2012 coverage is achieved via optimized proxy training schemes, yielding high Kendall’s τ ≈ 0.94 agreement with full training at ~5.6× reduced cost (Ahmad et al., 2024).

The dataset is distributed with Python APIs and REST-like interfaces, providing instant access to tuple metrics (accuracy, latency, energy) per candidate. Typical workflows integrate HW-NAS-Bench lookups with search algorithms (e.g., evolutionary NAS, ProxylessNAS, LLM-driven mutation), allowing a constrained objective of the form

max_α Acc(α)  s.t.  Cost_device(α) ≤ budget

to be optimized via direct querying (Li et al., 2021).
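With pre-tabulated metrics, that constrained objective reduces to filtering benchmark lookups. A toy sketch, with made-up architecture names and metric values:

```python
# Toy constrained NAS by direct benchmark querying: maximize accuracy
# subject to a per-device latency budget. All entries are hypothetical.

BENCH = {
    "arch_a": {"acc": 71.2, "latency_ms": 5.1},
    "arch_b": {"acc": 73.0, "latency_ms": 9.7},
    "arch_c": {"acc": 72.4, "latency_ms": 6.0},
    "arch_d": {"acc": 70.1, "latency_ms": 3.2},
}

def best_under_budget(bench, budget_ms):
    """Return the most accurate architecture whose latency fits the budget."""
    feasible = {a: m for a, m in bench.items() if m["latency_ms"] <= budget_ms}
    return max(feasible, key=lambda a: feasible[a]["acc"]) if feasible else None

print(best_under_budget(BENCH, 7.0))  # arch_c
print(best_under_budget(BENCH, 4.0))  # arch_d
```

Because each lookup is O(1), the same pattern scales from exhaustive filtering on NAS-Bench-201 to sampled or evolutionary search on the larger spaces.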

Pseudocode usage (abbreviated; a sketch following the published repository's Python API, whose exact names may differ by release):

```python
from hw_nas_bench_api import HWNASBenchAPI as HWAPI

# Load the pickled benchmark for the NAS-Bench-201 search space
hw_api = HWAPI("HW-NAS-Bench-v1_0.pickle", search_space="nasbench201")

# O(1) lookup: measured latency/energy across devices for one architecture
metrics = hw_api.query_by_index(42, "cifar10")
```

No architecture training or measurement is necessary at search-time, making device-specific HW-NAS fully accessible to non-experts (Li et al., 2021, Capuano et al., 1 Apr 2025, Zhu et al., 1 Oct 2025).

5. Analytical Insights and Empirical Findings

HW-NAS-Bench enables several critical empirical conclusions:

  1. FLOPs/#Params as Poor Hardware Proxies: Observed Kendall’s τ correlations between FLOPs/#params and measured cost are consistently weak (< 0.5) or near-zero across major devices, establishing that real device profiles are required for valid HW-NAS (Li et al., 2021, Ahmad et al., 2024).
  2. Device-Specificity of Pareto Fronts: The set of Pareto-optimal architectures in the accuracy–latency tradeoff is device-dependent; an architecture optimal on Edge GPU may be strictly suboptimal on Edge TPU or ASIC, with cross-device Kendall’s τ as low as 0.0 or negative (Li et al., 2021, Ahmad et al., 2024).
  3. Blockwise Cost Additivity: For the FBNet and MnasNet spaces, block-sum latency closely matches full-device measurements (Pearson ρ ≥ 0.86) (Li et al., 2021).
  4. Rapid and Fair Comparison: Search methods can be benchmarked in seconds, ensuring parity with respect to search space, hardware targets, and cost metrics.
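Finding 2 can be made concrete with a small Pareto-front computation. The per-device (accuracy, latency) points below are invented for illustration; the point is that the non-dominated set differs between devices:

```python
# Device-specific accuracy-latency Pareto fronts (hypothetical entries).

def pareto_front(points):
    """Non-dominated set: maximize accuracy, minimize latency.
    points: name -> (accuracy_pct, latency_ms)."""
    front = []
    for name, (acc, lat) in points.items():
        dominated = any(
            a2 >= acc and l2 <= lat and (a2 > acc or l2 < lat)
            for n2, (a2, l2) in points.items() if n2 != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)

# Same three architectures, different measured latencies per device.
edge_gpu = {"A": (72.0, 4.0), "B": (73.5, 6.0), "C": (70.0, 8.0)}
edge_tpu = {"A": (72.0, 9.0), "B": (73.5, 5.0), "C": (70.0, 2.0)}

print(pareto_front(edge_gpu))  # ['A', 'B'] -- C is dominated by B
print(pareto_front(edge_tpu))  # ['B', 'C'] -- A is dominated by B
```

Architecture A sits on the Edge GPU front but is strictly dominated on the Edge TPU, mirroring the cross-device rank inversions the benchmark reports.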

Example benchmark results (Edge GPU, FBNet space): searching for accuracy–latency tradeoffs via ProxylessNAS delivers distinct architectures optimal for each hardware target, directly queried via HW-NAS-Bench (Li et al., 2021).

6. Downstream Adoption and Innovations Enabled

HW-NAS-Bench forms the principal evaluation bed for novel NAS and LLM-driven hardware search paradigms:

  • Meta-training on Synthetic Devices: In "Sim-is-More," RL controllers are trained on random Gaussianizations of HW-NAS-Bench’s operator-wise latency statistics, producing policies that efficiently adapt to novel devices with a handful of real latency queries, outperforming single-device RL and random baselines in both accuracy (+2.9%) and latency (−19.5%) (Capuano et al., 1 Apr 2025).
  • LLM-Driven Co-Evolutionary NAS: PEL-NAS partitions the search space by architectural complexity, applies an LLM-powered, knowledge-base updating evolutionary loop, and efficiently navigates the HW-NAS-Bench space using zero-cost predictors and lookup tables, surpassing prior supernet and LLM-driven baselines in both Pareto front hypervolume and search efficiency (up to 53.6% lower IGD, 80.6% higher HV vs. LLMatic) (Zhu et al., 1 Oct 2025).
  • Large-Scale, Bi-Objective Accelerator Benchmarks: Accel-NASBench (termed HW-NAS-Bench in some contexts) offers surrogate models for both ImageNet2012 top-1 accuracy and device throughput/latency, with XGBoost regressors achieving surrogate correlations above 0.98 across held-out models and platforms (Ahmad et al., 2024).

A representative table of cross-benchmark characteristics:

| Benchmark | Search Space | Devices | Metrics Provided | Surrogate Fidelity |
|---|---|---|---|---|
| HW-NAS-Bench | NAS-Bench-201, FBNet | 6 | Acc., Latency, Energy | N/A (direct measurement) |
| Accel-NASBench | MnasNet | 7 (GPU/TPU/FPGA) | Acc., Throughput, Latency | > 0.98 |

7. Impact and Future Directions

HW-NAS-Bench has established standardized empirical baselines and protocols for HW-NAS, enabling methodologically sound, reproducible studies. The dataset’s device spectrum, search space diversity, and multi-metric coverage have catalyzed advances in meta-learning, zero-cost proxy research, and LLM-driven design. The platform demonstrates the necessity of per-device optimization and the unreliability of theoretical proxies, and encourages further inclusion of emerging hardware, scalable training proxies, and real-time hardware–software co-design loops. Recent work points to opportunities in auto-discovery of partitioning schemes, adaptive profiling integration, and domain-general LLM agents leveraging HW-NAS-Bench as a real-time oracle (Li et al., 2021, Capuano et al., 1 Apr 2025, Zhu et al., 1 Oct 2025, Ahmad et al., 2024).
