HW-NAS: Hardware-Aware Neural Architecture Search
- HW-NAS is an automated neural architecture search method that incorporates hardware metrics like latency, energy, and memory to optimize both accuracy and efficiency.
- It employs various hardware cost estimation strategies such as lookup tables, analytical models, and surrogate predictors to ensure precise evaluation of deployment constraints.
- HW-NAS leverages diverse search algorithms—including differentiable, evolutionary, and reinforcement learning approaches—to achieve a balanced, Pareto-optimal trade-off between performance and hardware efficiency.
Hardware-Aware Neural Architecture Search (HW-NAS) is a paradigm in automated neural network design that explicitly incorporates hardware metrics such as latency, energy, and memory into the architecture search process. It seeks neural networks that both maximize task performance and optimally exploit the performance–efficiency trade-offs of specific deployment hardware, including edge devices, accelerators, microcontrollers, and custom ASICs. This approach addresses the critical challenge of bridging the algorithm–hardware gap for deep learning deployment at scale and under constraints.
1. Problem Definition and Optimization Objectives
HW-NAS formalizes neural network architecture search as a multi-objective optimization over both accuracy and hardware cost metrics. Let α denote a candidate architecture selected from search space 𝒜, and w its weights. A hardware-aware search process is characterized by objectives such as task loss L(α, w) and hardware-specific cost C_hw(α) (e.g., latency, energy, memory), typically captured by one of the following formulations (Bachiri et al., 2024):
- Bi-objective Pareto optimization: min_{α ∈ 𝒜} ( L(α, w*(α)), C_hw(α) ), where w*(α) = argmin_w L(α, w), yielding a Pareto front of non-dominated architectures.
- Weighted single-objective: min_{α ∈ 𝒜} L(α, w) + λ · C_hw(α).
- Constrained single-objective: min_{α ∈ 𝒜} L(α, w) subject to C_hw(α) ≤ C_max.
Here, λ is a tunable regularization parameter, and C_max defines a hard hardware budget, such as a maximum allowable inference latency.
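The three formulations above can be sketched directly. In this minimal illustration, `task_loss` and `hw_cost` are hypothetical scalar stand-ins for L(α, w) and C_hw(α); no specific HW-NAS implementation is implied:

```python
# Sketch of the three HW-NAS objective formulations.
# `task_loss` and `hw_cost` stand in for L(α, w) and C_hw(α).

def weighted_objective(task_loss: float, hw_cost: float, lam: float) -> float:
    """Weighted single-objective: L(α, w) + λ · C_hw(α)."""
    return task_loss + lam * hw_cost

def constrained_objective(task_loss: float, hw_cost: float, c_max: float) -> float:
    """Constrained single-objective: minimize L(α, w) s.t. C_hw(α) ≤ C_max.
    Infeasible candidates are rejected with an infinite score."""
    return task_loss if hw_cost <= c_max else float("inf")

def dominates(a: tuple, b: tuple) -> bool:
    """Pareto dominance for bi-objective (loss, cost) pairs: a dominates b
    if it is no worse in both objectives and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))
```

In a differentiable search the weighted form is typically folded into the training loss, while the constrained form is more common in evolutionary or RL searches that simply reject infeasible candidates.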
2. Hardware Cost Estimation Strategies
Accurate hardware metric estimation is a core component of HW-NAS and drives the fidelity of the search outcomes (Bachiri et al., 2024). Methods fall into three categories:
- Lookup Table (LUT) Profiling: Pre-characterize every primitive operator's cost on the target device, then aggregate per-architecture costs by summing primitive values (e.g., FBNet).
- Analytical Models: Estimate metrics based on closed-form formulas (e.g., summing FLOPs, memory accesses, or pipeline depth; as in MnasNet).
- Surrogate Predictors: Train regression models (e.g., XGBoost, MLPs) using empirical measurements to predict hardware cost from architecture encodings (e.g., ProxylessNAS, HyT-NAS).
LUT and analytical approaches offer computational efficiency but may have limited generalizability across devices. Surrogate predictors, when trained with sufficient measurement data, can support rapid evaluation and transferability but introduce model bias and require calibration (Ma et al., 31 Jul 2025, Sinha et al., 2023, Mecharbat et al., 2023).
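The LUT approach is simple enough to sketch concretely. The per-operator latencies below are hypothetical values standing in for measurements taken once on a target device, as in FBNet-style profiling:

```python
# Minimal sketch of LUT-based latency estimation (FBNet-style).
# Per-operator costs are pre-measured on the target device; a candidate
# architecture's cost is the sum of its primitive operators' entries.

OP_LATENCY_MS = {            # hypothetical measured values, in milliseconds
    "conv3x3_s1": 1.8,
    "conv5x5_s1": 3.1,
    "mbconv3_e6": 2.4,
    "skip":       0.0,
}

def estimate_latency(arch: list[str]) -> float:
    """Aggregate an architecture's latency by summing its operators' LUT entries."""
    return sum(OP_LATENCY_MS[op] for op in arch)
```

The limitation noted above follows directly from this sketch: the table is valid only for the device it was profiled on, and simple summation ignores operator fusion and scheduling effects on real hardware.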
3. Search Algorithms and Optimization Frameworks
HW-NAS leverages multiple classes of search algorithms, with the choice dictated by search space size and objective complexity (Bachiri et al., 2024):
- Differentiable NAS: Relaxes architectural choices to continuous variables, enabling joint optimization of weights and architecture parameters via gradient descent—typically includes hardware-aware regularization directly in the loss (FBNet, ProxylessNAS).
- Evolutionary Algorithms: Maintain a population of architectures, applying mutation and crossover, and select the Pareto front based on metrics such as accuracy and latency (e.g., MnasNet, SONATA (Bouzidi et al., 2024), MO-HDNAS (Sinha et al., 2024)).
- Reinforcement Learning: Trains an RNN controller to generate architectures, optimizing a reward that combines accuracy and hardware metrics (e.g., NASNet-A; Bachiri et al., 2024).
- Multi-Objective Bayesian Optimization: Employs acquisition functions such as Expected Hypervolume Improvement (EHVI) with GP or sparse GP surrogates for scalable search in large discrete spaces (Coflex).
Recent advances integrate meta-learning for cross-device transfer (HELP (Lee et al., 2021)) and LLM-driven prompting (PEL-NAS (Zhu et al., 1 Oct 2025)) to further improve efficiency and scalability.
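As a concrete sketch of the selection step shared by the evolutionary methods above (MnasNet-style), the following keeps the non-dominated candidates from a scored population; candidate names and scores are illustrative, not drawn from any cited system:

```python
# Hedged sketch of Pareto-front selection in evolutionary HW-NAS:
# from a population scored on (error, latency), keep only candidates
# that no other candidate beats on both objectives.

def pareto_front(population: list[tuple[str, float, float]]) -> list[str]:
    """population: (name, error, latency_ms) triples.
    Returns names of non-dominated candidates, in input order."""
    front = []
    for name, err, lat in population:
        dominated = any(
            (e2 <= err and l2 <= lat) and (e2 < err or l2 < lat)
            for _, e2, l2 in population
        )
        if not dominated:
            front.append(name)
    return front
```

In a full evolutionary loop this filter would be applied each generation, with mutation and crossover repopulating the candidate pool around the surviving front.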
4. Practical Frameworks, Benchmarks, and Empirical Outcomes
Numerous frameworks and benchmarks have accelerated HW-NAS research and deployment:
- HW-NAS-Bench is a public dataset providing measured hardware costs (latency, energy) for NAS-201 and FBNet spaces across six hardware platforms, enabling fair and reproducible benchmarking (Li et al., 2021).
- Coflex uses Sparse Gaussian Processes within Bayesian multi-objective optimization for co-searching neural and accelerator parameters, providing order-of-magnitude speedups (Ma et al., 31 Jul 2025).
- HyT-NAS extends NAS to hybrid convolution–transformer search spaces, integrating surrogate accuracy and latency predictors to enable efficient edge-device targeting (Mecharbat et al., 2023).
- MicroNAS applies differentiable NAS with LUT-based profiling and explicit peak-memory modeling for strict MCU compliance (King et al., 2023).
- PEL-NAS utilizes LLM-driven co-evolution and search space partitioning to overcome prompt mode-collapse and achieves diverse Pareto fronts with search times reduced to minutes (Zhu et al., 1 Oct 2025).
- One-Proxy NAS exploits latency ranking monotonicity for efficient device transfer: optimal Pareto fronts can often be recovered with only 30–50 on-device probes per target (Lu et al., 2021).
Empirically, methods such as HURRICANE (Zhang et al., 2019) and HW-EvRSNAS (Sinha et al., 2023) achieve significant improvements in accuracy and constraint satisfaction—with up to 8000× search time reduction and substantial device-level acceleration compared to prior art.
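The latency-ranking monotonicity that One-Proxy NAS exploits can be illustrated with a plain Spearman rank-correlation check between proxy- and target-device latencies; the measurement vectors in the test are hypothetical, and a high correlation suggests a Pareto front found on the proxy transfers with few on-device probes:

```python
# Hedged illustration of latency ranking monotonicity across devices.
# If two devices rank architectures' latencies almost identically,
# a search run against the cheap proxy device largely transfers.

def ranks(xs: list[float]) -> list[int]:
    """Rank positions (0 = smallest), ties broken by index."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation (assumes no ties)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```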
5. Extensions: Hardware/Software Co-Exploration and Joint Code Optimization
Traditional HW-NAS assumes a fixed target hardware/software stack. Recent work generalizes this to joint exploration of both neural architectures and hardware implementations (Jiang et al., 2019, Krestinskaya et al., 30 Sep 2025, Bachiri et al., 2024). These frameworks treat the combined search over network topologies, quantization policies, microarchitecture features, and compiler schedules as a coupled multi-objective problem.
- Co-exploration finds optimal pairs of networks and hardware mappings, shifting the Pareto front toward regimes unachievable under fixed hardware (Jiang et al., 2019; CIMNAS, Krestinskaya et al., 30 Sep 2025).
- NACOS (Neural Architecture and Compiler Optimizations co-Search) couples architecture search with automatic code optimization, allowing simultaneous or staged (two-stage) searches over network parameters and compiler schedules. Empirical findings demonstrate that joint search yields up to 30% greater speed or efficiency compared to independent optimizations (Bachiri et al., 2024).
6. Multi-Objective and Diversity-Preserving Strategies
State-of-the-art HW-NAS increasingly adopts true Pareto multi-objective optimization to avoid collapse to narrow cost/accuracy bands (Sinha et al., 2024):
- Representation similarity metrics (e.g., mutual information with a reference model) serve as accurate, training-free proxies for test accuracy, dramatically reducing search cost (Sinha et al., 2023).
- Diversity objectives (e.g., cost diversity) are incorporated into evolutionary frameworks (MO-HDNAS (Sinha et al., 2024)) to ensure that the final set of optimized architectures spans the full range of hardware budgets, producing a comprehensive trade-off frontier in a single run.
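A minimal sketch of such a cost-diversity step, in the spirit of MO-HDNAS but with hypothetical candidates and latency bins: keep the most accurate architecture per bin so the returned set spans the range of hardware budgets rather than collapsing to one cost band:

```python
# Illustrative cost-diversity selection: after multi-objective search,
# retain the most accurate architecture within each latency bin so the
# final set covers the full spectrum of hardware budgets.

def diverse_selection(cands: list[tuple[str, float, float]],
                      bin_edges: list[float]) -> dict:
    """cands: (name, accuracy, latency_ms) triples.
    Returns {(bin_lo, bin_hi): best_name} for each non-empty bin."""
    best = {}
    for name, acc, lat in cands:
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= lat < hi:
                if (lo, hi) not in best or acc > best[(lo, hi)][1]:
                    best[(lo, hi)] = (name, acc)
                break
    return {b: n for b, (n, _) in best.items()}
```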
7. Trends, Challenges, and Future Directions
Open challenges and ongoing directions in HW-NAS research include (Benmeziane et al., 2021, Bachiri et al., 2024):
- Cross-device generalization: Efficient transfer of hardware cost predictors and NAS solutions across diverse and unseen devices using meta-learning, proxy adaptation, and few-shot strategies (HELP (Lee et al., 2021), Multi-Predict (Akhauri et al., 2023), One Proxy (Lu et al., 2021)).
- Emerging Hardware Paradigms: Jointly searching over software/hardware spaces specific to novel accelerators, such as compute-in-memory (e.g., RRAM, SRAM), and integrating device-level non-idealities into the search loop (CIMNAS (Krestinskaya et al., 30 Sep 2025)).
- Scalable Surrogate Models: Sparse surrogates and zero-cost predictors are critical for scaling search to joint spaces comprising 10^18–10^85 configurations (Ma et al., 31 Jul 2025, Krestinskaya et al., 30 Sep 2025).
- Automated Hardware Profiling: High-resolution, device-agnostic profiling frameworks (PlatformX (Tu et al., 10 Oct 2025)) close the loop for energy–accuracy trade-offs and eliminate the need for extensive manual operator benchmarking.
- Diversity and Generality: Ensuring that search outcomes are robust across tasks (detection, segmentation, NLP) and that diversity in cost/accuracy is maintained to support application-driven deployment post hoc.
By integrating hardware metrics directly into the architecture search and leveraging advanced optimization, modeling, and adaptation techniques, HW-NAS has become a foundational technology for generating efficient, deployable neural networks across the wide spectrum of modern hardware targets.