
Hardware-Aware Neural Architecture Search

Updated 17 January 2026
  • Hardware-aware NAS is a method that integrates hardware constraints into neural architecture search to optimize inference latency, energy, and memory usage.
  • It employs multi-objective optimization, surrogate cost models, and staged search strategies to rapidly adapt network architectures for platforms like mobile, FPGA, and ASIC.
  • This approach has demonstrated superior accuracy and efficiency on benchmarks such as ImageNet and CIFAR while significantly reducing training and profiling costs.

Hardware-Aware Neural Architecture Search (NAS) refers to the class of automated neural architecture synthesis techniques that explicitly optimize deep network structures for deployment on particular hardware platforms, subject to resource constraints such as inference latency, energy usage, memory footprint, and device-specific efficiency targets. Recent research has established hardware-aware NAS as a critical methodology for discovering high-performance, resource-constrained models tailored to mobile, edge, embedded, FPGA, ASIC, and heterogeneous accelerator deployments. The evolution of this field incorporates multi-objective optimization, hardware-specialized search spaces, predictor-based cost modeling, and transfer mechanisms for rapid adaptation across hardware diversity (Zhang et al., 2019, Tu et al., 10 Oct 2025, Benmeziane et al., 2021).

1. Formal Objectives and Core Problem

Hardware-aware NAS reframes conventional architecture search by introducing explicit hardware-centric constraints or optimization targets. The general constrained optimization is formulated as:

\max_{a\in\mathcal{A}}\ \mathrm{ACC}_{\mathrm{val}}(a) \quad \text{s.t.} \quad \tau(a,h) \le \tau_c^{(h)}

where a is a candidate network from the search space 𝒜, ACC_val(a) is its validation accuracy, τ(a, h) is the measured or predicted latency of a on hardware h, and τ_c^(h) is the platform-specific latency bound (Zhang et al., 2019).

More generally, hardware-aware NAS is cast as multi-objective optimization:

\min_{\alpha\in\mathcal{S}}\ \bigl[-A(\alpha),\ L(\alpha),\ E(\alpha),\ M(\alpha)\bigr]

where A(α) denotes accuracy and L(α), E(α), and M(α) model latency, energy, and memory, respectively (Benmeziane et al., 2021). Scalarizations or Pareto-based approaches are employed:

\max\ h\bigl(A(\alpha), L(\alpha)\bigr) = w\,A(\alpha) - (1-w)\,L(\alpha)

Pareto fronts characterize the trade-off space for optimal deployment candidates.
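The scalarization and Pareto-front ideas above can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: the candidate (accuracy, latency) tuples and the weight w are made up, and in practice latency would be normalized to the same scale as accuracy before scalarizing.

```python
# Hedged sketch: scalarization and Pareto filtering over (accuracy, latency)
# pairs. All numbers below are illustrative placeholders.

def scalarize(acc, latency, w=0.7):
    """Weighted scalarization h = w*A(alpha) - (1-w)*L(alpha).
    In practice, latency is normalized before combining with accuracy."""
    return w * acc - (1 - w) * latency

def pareto_front(candidates):
    """Keep candidates not dominated under (maximize acc, minimize latency)."""
    front = []
    for i, (a_i, l_i) in enumerate(candidates):
        dominated = any(
            (a_j >= a_i and l_j <= l_i) and (a_j > a_i or l_j < l_i)
            for j, (a_j, l_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            front.append((a_i, l_i))
    return front

# (accuracy, latency in ms) for four hypothetical architectures
cands = [(0.76, 16.5), (0.74, 12.0), (0.73, 14.0), (0.77, 30.0)]
print(pareto_front(cands))  # (0.73, 14.0) is dominated by (0.74, 12.0)
```

The dominance check is quadratic in the population size, which is adequate for the small populations typical of evolutionary NAS loops.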

2. Search Spaces: Architectural and Hardware-Specific Encoding

Hardware-aware NAS explores expanded architectural and hardware parameter spaces, going beyond conventional cell or block-based choices:

  • Operator Pools and Block Choices: Modern frameworks include a rich set of convolutional blocks (MobileNetV2 MBConv, ShuffleNetV2, depthwise and separable conv variants, squeeze-and-excitation, etc.) tuned for hardware features (vector widths, group sizes) (Zhang et al., 2019, Srinivas et al., 2019).
  • Parameterization of HW-Critical Knobs: Search spaces are further extended to cover kernel sizes, channel counts, stride, quantization bit-widths, and energy-dominant choices, enabling fine-grained exploration of energy/latency criticality (Tu et al., 10 Oct 2025, Jiang et al., 2019).
  • Joint Architecture/Quantization Spaces: Some frameworks integrate per-layer quantization policy (e.g., bit-width selection per block), thereby directly incorporating mixed-precision design as part of NAS (Chen et al., 2022).

A representative pool may contain up to 32 operator choices per layer, pruned by hardware-aware scores; for 20 layers this yields search spaces of up to 2×10^13 candidates (Zhang et al., 2019).
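The combinatorics above can be made concrete with a toy sketch. The 32-operator pool, the stand-in random "hardware scores", and the choice of retaining 5 operators per layer are all illustrative assumptions; real frameworks prune with profiled latency/energy scores.

```python
# Hedged sketch: pruning a per-layer operator pool by (stand-in) hardware
# scores and counting the resulting search space.
import random

FULL_POOL = [f"op{i}" for i in range(32)]  # hypothetical operator names

def prune_pool(pool, scores, keep):
    """Keep the `keep` highest-scoring operators for one layer."""
    return sorted(pool, key=lambda op: scores[op], reverse=True)[:keep]

def space_size(per_layer_pools):
    size = 1
    for pool in per_layer_pools:
        size *= len(pool)
    return size

random.seed(0)
scores = {op: random.random() for op in FULL_POOL}  # stand-in HW scores
layer_pools = [prune_pool(FULL_POOL, scores, keep=5) for _ in range(20)]
arch = [random.choice(pool) for pool in layer_pools]  # one sampled candidate
print(space_size(layer_pools))  # 5**20 ≈ 9.5e13, same order as in the text
```

Keeping roughly 4-5 operators per layer after pruning gives a space on the order of 10^13, consistent with the figure quoted above.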

Example of sampled hardware-aware blocks:

| Operator Block | Kernel Sizes | Expansion Ratios | Hardware Features |
|---|---|---|---|
| MBConv | 3, 5, 7 | 1, 3, 6 | Mobile vector width, SE |
| SEP (DW SepConv) | 3, 5, 7 | N/A | SIMD, memory alignment |
| Choice (Shuffle) | 3, 5, 7 | N/A | Channel grouping, shuffle |
| SE variant | As above | As above | Squeeze-and-excitation |

3. Hardware Cost Modeling and Predictors

Early frameworks relied on explicit hardware profiling via vendor SDKs or on-device measurements (latency, energy). State-of-the-art practice incorporates learned surrogate models for inference cost prediction:

  • Additive Layer-wise Models: Latency is modeled as layer-wise sum of operator costs:

\tau(a,h) \approx \sum_{i=1}^{n} \tau_{i,\mathrm{op}_i}(h)

Profiles are collected per-operator and per-layer-context for each hardware target, yielding sub-5% error (Zhang et al., 2019).

  • Kernel-Level Energy Predictors: Compact MLPs are trained to regress per-kernel energy across diverse hardware, requiring only tens of samples to transfer (e.g., calibration on Android devices to reach ACC@20≈84.4%) (Tu et al., 10 Oct 2025).
  • Analytical Abstraction for FPGAs/ASICs: Models use tiling parameters, analytic formulas, and roofline-based abstractions to predict cost without full RTL/HLS synthesis (Jiang et al., 2019).
  • Surrogate Regressors and LUTs: Extensive use of LUTs, regression, and XGBoost/MLP surrogates for rapid cost estimation, achieving high accuracy on GPU/CPU/edge devices (Li et al., 2021, Nasir et al., 2 Aug 2025).
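The additive layer-wise latency model above reduces to a lookup-table sum, sketched below. The per-(layer, operator) entries are made-up placeholders standing in for on-device profiles; real LUTs would also key on layer context (input resolution, channel counts).

```python
# Hedged sketch of the additive layer-wise latency model: total latency is
# the sum of per-(layer, operator) LUT entries profiled on a given device.

def predict_latency(arch, lut):
    """arch: list of (layer_index, op_name); lut: {(layer, op): ms}."""
    return sum(lut[(i, op)] for i, op in arch)

def feasible(arch, lut, budget_ms):
    """Constraint check tau(a, h) <= tau_c from the formulation above."""
    return predict_latency(arch, lut) <= budget_ms

# Fake per-layer profiles for a 3-layer toy network on one target device
lut = {
    (0, "mbconv_k3"): 1.2, (0, "mbconv_k5"): 1.9,
    (1, "mbconv_k3"): 2.1, (1, "shuffle_k3"): 1.4,
    (2, "sep_k3"): 0.8,   (2, "sep_k5"): 1.1,
}
arch = [(0, "mbconv_k3"), (1, "shuffle_k3"), (2, "sep_k5")]
print(predict_latency(arch, lut))        # ≈ 3.7 ms
print(feasible(arch, lut, budget_ms=4.0))
```

Learned surrogates (MLPs, XGBoost) replace the raw table with a regressor but expose the same predict/feasibility interface to the search loop.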

4. Search Algorithms and Multi-Stage Optimization

The search strategy in hardware-aware NAS is distinguished by constrained, staged, and multi-objective optimization loops:

  • Two-Stage Layered Search: Deep layers are searched first, since their operator choices affect accuracy more than those of early layers while contributing less to latency. After caching high-scoring choices for the deep layers, the search focuses on the shallow layers (Zhang et al., 2019). This two-phase scheme reduces supernet training time by more than 50%.
  • Constrained One-Shot NAS Supernets: Weight-sharing supernets spanning the pruned, hardware-specialized search space enable fast validation for sampled architectures, using uniform path sampling as in SPOS (Zhang et al., 2019).
  • Pareto-Based Evolutionary Algorithms: NSGA-II and similar heuristics are employed to balance accuracy, energy, and latency. Cost-diversity objectives are integrated to enforce exploration of the full hardware spectrum (Sinha et al., 2024). Gradient-guided candidate selection and multi-objective rewards refine selection (Tu et al., 10 Oct 2025).
  • Accelerated Selection and Early Pruning: Architectures violating hardware budgets are rapidly rejected before expensive training (often eliminating 50–90% of candidates up front) (Jiang et al., 2019).
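The early-pruning step in the last bullet can be sketched as a cheap feasibility filter applied before any training. The per-op cost table, budget, and sampling scheme below are illustrative assumptions, not taken from any cited framework.

```python
# Hedged sketch of early pruning: candidates whose predicted cost violates
# the hardware budget are rejected before any (expensive) training.
import random

def cheap_cost(arch):
    """Stand-in surrogate: predicted latency from a per-op cost table."""
    cost_table = {"small": 1.0, "medium": 2.5, "large": 6.0}
    return sum(cost_table[op] for op in arch)

def sample_feasible(n_layers, budget, n_samples, seed=0):
    rng = random.Random(seed)
    survivors = []
    for _ in range(n_samples):
        arch = [rng.choice(["small", "medium", "large"]) for _ in range(n_layers)]
        if cheap_cost(arch) <= budget:   # reject before expensive training
            survivors.append(arch)
    return survivors

kept = sample_feasible(n_layers=8, budget=18.0, n_samples=1000)
print(f"{len(kept)}/1000 candidates pass the budget check")
```

With these toy costs, most random candidates exceed the budget, mirroring the 50-90% up-front rejection rates reported above.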

Summary pseudocode of two-stage search (Zhang et al., 2019):

Input: search_space, latency_bound, layer_split
1. Initialize a_win with the top-scoring operator per layer.
2. For iter = 1, 2:
      if iter == 1:
         Active = deeper layers; Fixed = shallow layers
      else:
         Active = shallow layers; Fixed = deeper layers
      Search in Active with the supernet + evolutionary algorithm, constrained by latency_bound.
3. Retrain the final architecture from scratch.
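The pseudocode above can be rendered as a toy, runnable Python sketch. Accuracy and latency are simulated with stand-in functions (a real system would query a trained supernet and a device profiler), and the operator set, layer split, costs, and budget are all illustrative assumptions.

```python
# Hedged, toy rendering of the two-stage search: mutate only the active
# layers, keep the rest fixed, and accept only latency-feasible candidates.
import random

OPS = ["mbconv_k3", "mbconv_k5", "mbconv_k7"]
N_LAYERS = 8
SPLIT = 4                    # layers [SPLIT:] are "deep", [:SPLIT] "shallow"
LAT = {"mbconv_k3": 1.0, "mbconv_k5": 1.6, "mbconv_k7": 2.3}  # fake ms

def latency(arch):
    return sum(LAT[op] for op in arch)

def proxy_accuracy(arch, rng):
    # Stand-in for supernet validation: bigger ops score slightly higher.
    return sum(LAT[op] for op in arch) * 0.01 + rng.random() * 0.001

def stage_search(arch, active, budget, rng, n_trials=200):
    best, best_acc = list(arch), -1.0
    for _ in range(n_trials):
        cand = list(arch)
        for i in active:                 # mutate only the active layers
            cand[i] = rng.choice(OPS)
        if latency(cand) <= budget:      # latency constraint
            acc = proxy_accuracy(cand, rng)
            if acc > best_acc:
                best, best_acc = cand, acc
    return best

rng = random.Random(0)
arch = ["mbconv_k3"] * N_LAYERS          # start from the cheapest operators
deep, shallow = range(SPLIT, N_LAYERS), range(0, SPLIT)
arch = stage_search(arch, deep, budget=12.0, rng=rng)     # stage 1
arch = stage_search(arch, shallow, budget=12.0, rng=rng)  # stage 2
print(arch, round(latency(arch), 2))
```

Stage 1 fixes cheap shallow layers and spends the latency budget where it matters most for accuracy; stage 2 then refines the shallow layers under whatever budget remains.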

5. Experimental Evidence and Hardware Adaptation

Extensive empirical results demonstrate the impact of hardware-aware NAS on ImageNet, CIFAR, and mobile benchmarks across DSP, CPU, VPU, GPU, FPGA, and ASIC platforms:

  • Consistent Pareto Superiority: HURRICANE achieves 76.67% top-1 accuracy on ImageNet with 16.5 ms DSP latency, outperforming FBNet-iPhoneX by +3.47% accuracy and 6.35× inference speedup; similar gains (+1.63% accuracy at comparable latency) on mobile CPUs. VPU results show +1.83% accuracy and 1.49× speedup over Proxyless-Mobile (Zhang et al., 2019).
  • Energy-Accuracy Trade-Offs: PlatformX uncovers models on Pixel 8 Pro with 0.88 accuracy at 0.21 mJ/inference (47× less energy than MobileNet-V2 at equal accuracy) and up to 0.94 accuracy at energy levels competitive with baseline hardware-efficient architectures (Tu et al., 10 Oct 2025).
  • Search Cost Reduction: Two-stage and cost-diversity driven searches yield 30–160× reductions in training iterations and wall-clock time relative to prior NAS approaches (Zhang et al., 2019, Sinha et al., 2024).
  • Cross-Hardware Transfer: Kernel-level predictors and proxy device adaptation enable rapid transfer to new devices with minimal calibration samples (Tu et al., 10 Oct 2025, Lu et al., 2021).
| Hardware Target | Accuracy Δ vs. Baseline | Latency Speedup | Energy Reduction |
|---|---|---|---|
| DSP (Hexagon 685) | +3.47% (vs. FBNet) | 6.35× | N/A |
| CPU (Snapdragon) | +1.63% (vs. FBNet) | Comparable | N/A |
| VPU (Myriad X) | +0.53% (vs. Proxyless) | 1.49× | N/A |
| Pixel 8 Pro (CPU) | +6% (vs. MobileNet) | up to 47× | up to 47× |

6. Design Analysis and Future Research Directions

Hardware-aware NAS frameworks highlight critical insights:

  • Hardware-Specialized Pruning and Scoring: Operator selection is dominated by real hardware latency, not FLOPs. Direct profiling guides search-space reduction for each device, enabling NAS to focus on pragmatically efficient solutions (Zhang et al., 2019, Li et al., 2021).
  • Dynamic and Energy-Aware Expansion: Enriching the parameter space with HW-critical options and integrating seamless cost predictors exposes models that traditional accuracy-focused search would overlook (Tu et al., 10 Oct 2025, Srinivas et al., 2019).
  • Multi-Objective Evolution: Simple scalarization fails to fully exploit hardware-accuracy trade-offs. Pareto and diversity-driven strategies yield fuller frontiers and better exploration (Tu et al., 10 Oct 2025, Sinha et al., 2024).
  • Scalability and Practicality: Learning-based predictors, staged and population-diversity strategies, and proxy adaptation methods sharply reduce real-device profiling costs, supporting rapid deployment across device heterogeneity (Lu et al., 2021, Benmeziane et al., 2021).

Identified limitations and directions:

  • Current practice is primarily latency-centric, relying on a single latency or energy metric; jointly incorporating energy, memory, and multi-tenant constraints as multi-objective targets remains a nascent but critical area (Sinha et al., 2024, Tu et al., 10 Oct 2025).
  • Adaptive approaches for staged search and per-layer splitting are required for automated scalability across architectures and tasks (Zhang et al., 2019).
  • Extending to hardware accelerators (GPU, FPGA, NPU) and broadening benchmarks (HW-NAS-Bench, NAS-Bench-201/301) are necessary for generalization and reproducibility (Li et al., 2021).
  • Democratized toolchains supporting non-hardware experts remain an open deployment frontier.

7. Representative Frameworks and Benchmarks

Key frameworks and public benchmarks have catalyzed the field:

  • HURRICANE: Two-stage constrained search, hardware-specialized operator pools, latency-driven supernet optimization across DSP/CPU/VPU (Zhang et al., 2019).
  • PlatformX: Energy-driven kernel-level predictor, automated profiling, Pareto-gradient sampling, and fully transferable search, validated on modern smartphones (Tu et al., 10 Oct 2025).
  • HW-NAS-Bench: First public dataset for reproducible hardware-aware NAS evaluation, covering NAS-Bench-201 and FBNet across six hardware targets (Li et al., 2021).
  • FBNet, ProxylessNAS, MNASNet, NSGA-Net, HQNAS: Prototype techniques using differentiable NAS, RL, evolutionary multi-objective search, and joint quantization-in-the-loop (Srinivas et al., 2019, Chen et al., 2022, Jiang et al., 2019, Benmeziane et al., 2021).
  • SONATA, MO-HDNAS: Self-adaptive evolutionary methods leveraging surrogate predictors, reinforcement learning, and objective diversity on real edge devices (Bouzidi et al., 2024, Sinha et al., 2024).
