ProxylessNAS: Hardware-Aware NAS

Updated 23 September 2025
  • ProxylessNAS is a neural architecture search technique that directly optimizes network architectures for target tasks and hardware without using proxy tasks.
  • It employs path binarization to select a single candidate operation per edge, drastically reducing memory usage and computational overhead.
  • The method integrates hardware-aware constraints through latency modeling, achieving state-of-the-art accuracy and efficiency on benchmarks like CIFAR-10 and ImageNet.

ProxylessNAS is a neural architecture search (NAS) methodology specifically designed to enable direct search for optimal network architectures on the target task and hardware, without relying on computationally cheaper proxies such as smaller datasets, reduced network depth, or shortened training epochs. ProxylessNAS addresses core limitations in differentiable NAS and reinforcement-learning-based NAS, particularly their prohibitive GPU memory usage and resource cost, and achieves state-of-the-art results on benchmark datasets while supporting hardware-aware and task-oriented model specialization.

1. Direct NAS on Target Task and Hardware

ProxylessNAS abandons the use of low-fidelity proxy tasks, which were necessary for early NAS schemes due to the immense computational cost of direct search on large benchmarks (e.g., ImageNet). Previous methods typically searched for architectures on surrogates and transferred them to the actual task, often introducing a significant suboptimality gap due to transfer discrepancies.

ProxylessNAS instead enables direct search on the production task/hardware by reframing architecture search as a path-level pruning challenge. The approach uses a large over-parameterized supernet, in which each edge in the computational graph contains a set of all candidate operations (for example, convolutions with various kernel sizes, expansions, and pooling). Throughout training, ProxylessNAS adaptively prunes redundant pathways by learning real-valued architecture parameters, thus selecting optimal paths for the specific deployment domain.
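
For concreteness, the following is a minimal PyTorch-style sketch of one such over-parameterized edge. The candidate operation pool and class names here are illustrative assumptions, not the authors' reference implementation; the original search space uses MobileNetV2-style inverted-residual blocks with varying kernel sizes and expansion ratios.

```python
import torch
import torch.nn as nn

def candidate_ops(channels):
    # Illustrative candidate pool; plain convolutions and pooling stand in
    # for the MobileNetV2-style blocks used in the actual search space.
    return nn.ModuleList([
        nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        nn.Conv2d(channels, channels, kernel_size=5, padding=2),
        nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
        nn.Identity(),  # skip-style option
    ])

class MixedEdge(nn.Module):
    """One edge of the over-parameterized supernet:
    N candidate operations plus N real-valued architecture parameters."""
    def __init__(self, channels):
        super().__init__()
        self.ops = candidate_ops(channels)
        # Real-valued architecture parameters alpha, one per candidate path.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))
```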

2. Path Binarization and Memory Efficiency

Conventional differentiable NAS techniques (such as DARTS) implement a convex combination of all candidate paths on an edge:

m_o(x) = \sum_i \frac{\exp(\alpha_i)}{\sum_j \exp(\alpha_j)} \, o_i(x)

where o_i denotes the i-th candidate operation and \alpha_i its architecture parameter. This requires evaluating all N candidates per forward pass and storing N sets of intermediate activations, resulting in N-fold increased memory consumption.
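
A short sketch of this weighted-sum forward pass, reusing the illustrative MixedEdge above, makes the memory cost explicit: every candidate is executed and its activations retained for backpropagation.

```python
import torch.nn.functional as F

def darts_forward(edge, x):
    # DARTS-style mixing: every candidate o_i(x) is evaluated and its
    # intermediate activations kept for backprop, so memory grows roughly
    # N-fold relative to a single-path network.
    p = F.softmax(edge.alpha, dim=0)                       # p_i
    return sum(p_i * op(x) for p_i, op in zip(p, edge.ops))
```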

ProxylessNAS resolves this by introducing path binarization. At each forward iteration, only one candidate operation is instantiated, using binary gates g sampled according to the softmax probabilities p_i = \exp(\alpha_i)/\sum_j \exp(\alpha_j):

m_o^{\text{Binary}}(x) = \sum_i g_i \cdot o_i(x)

with g_i \in \{0,1\} and \sum_i g_i = 1. This drastically reduces memory usage, matching that of a compact single-path model and permitting search on large-scale tasks such as ImageNet under practical computational budgets.
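
The corresponding binarized forward pass can be sketched as follows, again reusing the illustrative MixedEdge from Section 1; only the sampled candidate is executed.

```python
import torch

def binary_forward(edge, x):
    # Sample a single active path with probability p_i = softmax(alpha)_i;
    # only that candidate runs, so activation memory matches a compact
    # single-path model instead of the full supernet.
    p = torch.softmax(edge.alpha, dim=0)
    idx = torch.multinomial(p.detach(), num_samples=1).item()
    return edge.ops[idx](x), idx, p
```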

The architecture parameters \alpha are updated via a gradient-based approximation:

\frac{\partial L}{\partial \alpha_i} \approx \sum_j \frac{\partial L}{\partial g_j} \, p_j (\delta_{ij} - p_i)

This enables gradient-based optimization while maintaining discrete candidate selection. For non-differentiable objectives, such as directly measured hardware latency, REINFORCE-style estimators are used as an alternative.
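
A small helper illustrating this approximation, assuming the gate gradients \partial L/\partial g_j have been collected into a 1-D tensor alongside the path probabilities:

```python
def alpha_grad(dL_dg, p):
    # Implements dL/dalpha_i ≈ sum_j (dL/dg_j) * p_j * (delta_ij - p_i):
    # elementwise product, then subtract the probability-weighted total.
    v = dL_dg * p            # v_j = (dL/dg_j) * p_j
    return v - p * v.sum()   # result_i = v_i - p_i * sum_j v_j
```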

3. Hardware-Aware Specialization and Latency Modeling

Central to ProxylessNAS is hardware-aware model specialization, which incorporates target device constraints directly into the search process. Latency (or other hardware-specific metrics) is integrated as a differentiable regularization term:

\mathbb{E}[\text{latency}]_i = \sum_j p_i^j \cdot F(o_i^j)

where F(\cdot) is a (possibly learned) latency predictor for operation o_i^j. The network-wide expected latency is then included in the composed loss:

\text{Loss} = \text{Loss}_{CE} + \lambda_1 \|w\|_2^2 + \lambda_2 \mathbb{E}[\text{latency}]

This formulation enables the search process to discover architectures that are Pareto-optimal with respect to accuracy and inference latency on specific hardware, whether mobile devices, GPUs, or CPUs.
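
A sketch of this latency-regularized objective under the definitions above; the per-operation latency lookup table and the \lambda values are illustrative assumptions, not the paper's exact settings.

```python
def expected_latency(p, op_latencies):
    # E[latency]_i = sum_j p_i^j * F(o_i^j); F is approximated here by a
    # per-operation latency table profiled on the target device.
    return (p * op_latencies).sum()

def search_loss(ce_loss, weights, edge_probs, edge_latency_tables,
                lambda1=4e-5, lambda2=1e-2):
    # Loss = Loss_CE + lambda1 * ||w||_2^2 + lambda2 * E[latency]
    # (lambda1, lambda2 are illustrative values).
    l2 = sum((w ** 2).sum() for w in weights)
    latency = sum(expected_latency(p, t)
                  for p, t in zip(edge_probs, edge_latency_tables))
    return ce_loss + lambda1 * l2 + lambda2 * latency
```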

Empirical observations show that optimal architectures diverge by hardware: GPU-optimized models tend to be shallower and wider (exploiting high parallelism), while CPU-optimized models are deeper and narrower (matching serial compute constraints). Thus, hardware-aware NAS is essential for efficient deployment, and ProxylessNAS provides explicit mechanisms for this specialization.

4. Empirical Results and Performance Metrics

ProxylessNAS achieves state-of-the-art or highly competitive results on CIFAR-10 and ImageNet. On CIFAR-10, ProxylessNAS attains a test error of 2.08% with 5.7M parameters, compared to 2.13% for AmoebaNet-B with 34.9M parameters (6× fewer parameters). On ImageNet, ProxylessNAS models yield a 3.1% top-1 accuracy improvement over MobileNetV2, with a 1.2× GPU latency speed-up. In mobile-latency-constrained settings (e.g., 80 ms target), ProxylessNAS achieves similar or better accuracy than strong alternatives such as MnasNet, while using roughly 200× less GPU search time.

These metrics indicate the method's effectiveness not only in computational efficiency (GPU memory and hours) during the search phase, but also in producing compact, accuracy-efficient, and hardware-specialized architectures.

5. Algorithmic Details and Optimization

ProxylessNAS employs a supernet with per-edge candidate sets and architecture parameters α\alpha. During training, standard stochastic gradient descent optimizes both network weights ww and architecture parameters, with binary path selection ensuring a compact activation footprint.

Latency is made differentiable using the expectation over possible path selections, allowing its use as a loss regularizer. For discrete/non-differentiable objectives, ProxylessNAS supports REINFORCE-type updates. The alternating optimization between weights and architecture parameters progresses until convergence, at which point the candidate architecture with the highest accumulated probability is selected for standalone training and deployment.
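
The overall procedure can be summarized with the following hedged sketch; the supernet.loss helper, the data iterators, and the optimizer setup are assumptions layered on the earlier MixedEdge sketches rather than the authors' code.

```python
def search(supernet, weight_batches, arch_batches, w_opt, a_opt, steps):
    for _ in range(steps):
        # (1) Update network weights w on training data (w_opt holds only w):
        x, y = next(weight_batches)
        w_opt.zero_grad()
        supernet.loss(x, y).backward()   # assumed helper: CE over sampled paths
        w_opt.step()

        # (2) Update architecture parameters alpha on held-out data
        #     (a_opt holds only alpha; gradients reach alpha via the
        #     approximation from Section 2):
        x, y = next(arch_batches)
        a_opt.zero_grad()
        supernet.loss(x, y).backward()
        a_opt.step()

    # Derive the final compact architecture: keep the most probable
    # candidate on each edge, then retrain it from scratch.
    return [edge.alpha.argmax().item() for edge in supernet.edges]
```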

Unlike methods relying on a softmax-weighted superposition of all paths, binarized path sampling means that at test time, the "searched" architecture matches exactly (in structure and compute graphs) the configuration used during the final retraining phase, eliminating mismatch between search and evaluation.

6. Limitations, Extensions, and Future Directions

ProxylessNAS's reliance on human-designed search spaces (i.e., sets of MobileNetV2-like blocks for candidate operations) constrains exploration to operator families already believed to be efficient. Expanding search spaces towards less human-biased or more heterogeneous operation pools remains an open challenge. Furthermore, weight-sharing supernets can introduce bias favoring larger submodels due to shared training dynamics, and may not always rank optimal architectures reliably; this requires further investigation on ranking calibration and search space design.

Extensions of ProxylessNAS include integrating advanced candidate operations (e.g., MixConv mixed kernel convolutions), federated direct NAS for privacy-aware search over decentralized data, and improved latency predictors that generalize better across hardware. Future research aims to refine path-level pruning techniques, close the supernet–child architecture estimation gap, and incorporate additional resource constraints (energy, memory) into the optimization loop.

Finally, the connection between path-level pruning in ProxylessNAS and model compression/quantization techniques points to hybrid approaches, where NAS doubles as structured pruning or adaptive quantization of architectures for downstream deployment.

7. Broader Impact and Significance in NAS

ProxylessNAS represents a milestone in the evolution of neural architecture search by reconciling the demands of large-scale, hardware-aware architecture optimization with practical computational constraints. Its core innovations—memory-efficient path binarization and differentiable hardware regularization—allow NAS to transition from academic benchmarks to industry-relevant workflows, where direct task-targeted and device-specialized models are indispensable. Subsequent research has built upon these principles with more expressive search spaces, improved hardware modeling, distributed/federated optimization, and deeper calibration between search and final deployment performance, further reducing the gap between automated architecture search and real-world efficient machine learning.
