ProxylessNAS: Hardware-Aware NAS
- ProxylessNAS is a neural architecture search technique that directly optimizes network architectures for target tasks and hardware without using proxy tasks.
- It employs path binarization to select a single candidate operation per edge, drastically reducing memory usage and computational overhead.
- The method integrates hardware-aware constraints through latency modeling, achieving state-of-the-art accuracy and efficiency on benchmarks like CIFAR-10 and ImageNet.
ProxylessNAS is a neural architecture search (NAS) methodology designed to search for optimal network architectures directly on the target task and hardware, without relying on computationally cheaper proxies such as smaller datasets, reduced network depth, or shortened training epochs. It addresses core limitations of differentiable and reinforcement-learning-based NAS, particularly their prohibitive GPU memory usage and resource cost, and achieves state-of-the-art results on benchmark datasets while supporting hardware-aware, task-oriented model specialization.
1. Direct NAS on Target Task and Hardware
ProxylessNAS abandons the use of low-fidelity proxy tasks, which were necessary for early NAS schemes due to the immense computational cost of direct search on large benchmarks (e.g., ImageNet). Previous methods typically searched for architectures on surrogates and transferred them to the actual task, often introducing a significant suboptimality gap due to transfer discrepancies.
ProxylessNAS instead enables direct search on the production task/hardware by reframing architecture search as a path-level pruning challenge. The approach uses a large over-parameterized supernet, in which each edge in the computational graph contains a set of all candidate operations (for example, convolutions with various kernel sizes, expansions, and pooling). Throughout training, ProxylessNAS adaptively prunes redundant pathways by learning real-valued architecture parameters, thus selecting optimal paths for the specific deployment domain.
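As a concrete illustration of such a per-edge candidate set, here is a minimal PyTorch sketch assuming a MobileNetV2-style space of inverted-residual blocks with varying kernel sizes and expansion ratios; the helper names (`mbconv`, `candidate_ops`) are illustrative, not the paper's code.

```python
import torch.nn as nn

def mbconv(in_ch, out_ch, kernel, expand):
    """Inverted-residual block: 1x1 expand -> depthwise conv -> 1x1 project."""
    mid = in_ch * expand
    return nn.Sequential(
        nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(),
        nn.Conv2d(mid, mid, kernel, padding=kernel // 2, groups=mid, bias=False),
        nn.BatchNorm2d(mid), nn.ReLU6(),
        nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
    )

def candidate_ops(in_ch, out_ch):
    """All candidate operations held by one edge of the supernet."""
    return nn.ModuleList(
        [mbconv(in_ch, out_ch, k, e) for k in (3, 5, 7) for e in (3, 6)]
    )
```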
2. Path Binarization and Memory Efficiency
Conventional differentiable NAS techniques (such as DARTS) implement a convex combination of all candidate paths on an edge:

$$m_{\mathcal{O}}(x) = \sum_{i=1}^{N} \frac{\exp(\alpha_i)}{\sum_{j} \exp(\alpha_j)} \, o_i(x),$$

where $o_i$ denotes the $i$-th candidate operation and $\alpha_i$ its architecture parameter. This requires evaluating all $N$ candidates per forward pass and storing $N$ sets of intermediate activations, resulting in $N$-fold increased memory consumption.
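To make the memory cost concrete, here is a minimal PyTorch sketch of such a softmax-weighted edge (illustrative, not the DARTS reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedEdgeDARTS(nn.Module):
    """DARTS-style edge: softmax-weighted sum over ALL N candidates."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture params

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        # Every candidate is evaluated, so N sets of activations are retained
        # for backprop -- the N-fold memory cost described above.
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```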
ProxylessNAS resolves this by introducing path binarization. At each forward iteration, only one candidate operation is instantiated using binary gates $g = (g_1, \ldots, g_N)$ sampled according to the softmax probabilities $p_i = \exp(\alpha_i) / \sum_j \exp(\alpha_j)$:

$$m_{\mathcal{O}}^{\text{Binary}}(x) = \sum_{i=1}^{N} g_i \, o_i(x),$$

with $g_i \in \{0, 1\}$ and $\sum_i g_i = 1$. This drastically reduces memory usage, matching that of a compact single-path model and permitting search on large-scale tasks such as ImageNet under practical computational budgets.
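A corresponding sketch of a binarized edge, again illustrative rather than the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizedEdge(nn.Module):
    """ProxylessNAS-style edge: only ONE sampled candidate runs per forward."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture params

    def forward(self, x):
        probs = F.softmax(self.alpha, dim=0)
        idx = int(torch.multinomial(probs, 1))  # sample one path with prob p_i
        self.sampled = idx                      # kept for the alpha update step
        # A single op executes, so activation memory matches a one-path model.
        return self.ops[idx](x)
```

Because only the sampled operation's activations are stored, the search-time memory footprint is essentially independent of the number of candidates $N$.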
The architecture parameters are updated via a gradient-based method, estimating the gradient with respect to the discrete gates:

$$\frac{\partial L}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial p_j} \frac{\partial p_j}{\partial \alpha_i} \approx \sum_{j=1}^{N} \frac{\partial L}{\partial g_j} \frac{\partial p_j}{\partial \alpha_i} = \sum_{j=1}^{N} \frac{\partial L}{\partial g_j} \, p_j (\delta_{ij} - p_i),$$

where $\delta_{ij}$ equals 1 if $i = j$ and 0 otherwise. This enables gradient-based optimization while maintaining discrete candidate selection. For non-differentiable metrics, such as direct hardware latency, REINFORCE-style estimators are used as an alternative.
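The gate-gradient rule above can be sanity-checked in a few lines; `grad_g` here stands for the loss gradients with respect to the binary gates, obtained from the sampled forward pass (a sketch, not the paper's code):

```python
import torch

def alpha_grad(grad_g: torch.Tensor, probs: torch.Tensor) -> torch.Tensor:
    """Estimate dL/dalpha_i = sum_j dL/dg_j * p_j * (delta_ij - p_i).

    grad_g : gradients of the loss w.r.t. the binary gates g_j
    probs  : softmax probabilities p_j computed from alpha
    """
    v = grad_g * probs          # dL/dg_j * p_j
    return v - probs * v.sum()  # equals v_i - p_i * sum_j v_j, per the identity
```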
3. Hardware-Aware Specialization and Latency Modeling
Central to ProxylessNAS is hardware-aware model specialization, which incorporates target device constraints directly into the search process. Latency (or other hardware-specific metrics) is integrated as a differentiable regularization term. For the $i$-th edge with candidate operations $o_j^i$ and selection probabilities $p_j^i$, the expected latency is

$$\mathbb{E}[\text{latency}_i] = \sum_j p_j^i \cdot F(o_j^i),$$

where $F(o_j^i)$ is a (possibly learned) latency predictor for operation $o_j^i$. The network-wise expected latency, $\mathbb{E}[\text{latency}] = \sum_i \mathbb{E}[\text{latency}_i]$, is included in the composed loss:

$$\text{Loss} = \text{Loss}_{CE} + \lambda_1 \|w\|_2^2 + \lambda_2 \, \mathbb{E}[\text{latency}].$$
This formulation enables the search process to discover architectures that are Pareto-optimal with respect to accuracy and inference latency on specific hardware, whether mobile devices, GPUs, or CPUs; a minimal sketch of the regularizer follows below.
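To illustrate, here is a small sketch of the latency regularizer under assumed values: a fixed lookup table stands in for the predictor $F$, and the $\lambda$ weights are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-candidate latencies F(o_j) for one edge (e.g. milliseconds
# measured or predicted on the target device); values are purely illustrative.
LATENCY_MS = torch.tensor([3.1, 4.4, 5.9, 3.8, 5.2, 7.0])

def expected_edge_latency(alpha: torch.Tensor) -> torch.Tensor:
    """E[latency_i] = sum_j p_j * F(o_j); differentiable w.r.t. alpha."""
    return (F.softmax(alpha, dim=0) * LATENCY_MS).sum()

def composed_loss(ce_loss, alphas, weights, lam1=4e-5, lam2=0.01):
    """Loss = CE + lam1 * ||w||^2 + lam2 * E[latency]; lambdas are assumed."""
    latency = sum(expected_edge_latency(a) for a in alphas)  # sum over edges
    l2 = sum((w ** 2).sum() for w in weights)
    return ce_loss + lam1 * l2 + lam2 * latency
```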
Empirical observations show that optimal architectures diverge by hardware: GPU-optimized models tend to be shallower and wider (exploiting high parallelism), while CPU-optimized models are deeper and narrower (matching serial compute constraints). Thus, hardware-aware NAS is essential for efficient deployment, and ProxylessNAS provides explicit mechanisms for this specialization.
4. Empirical Results and Performance Metrics
ProxylessNAS achieves state-of-the-art or highly competitive results on CIFAR-10 and ImageNet. On CIFAR-10, ProxylessNAS attains a test error of 2.08% with 5.7M parameters, compared to 2.13% for AmoebaNet-B with 34.9M parameters (6× fewer parameters). On ImageNet, ProxylessNAS models yield a 3.1% top-1 accuracy improvement over MobileNetV2, with a 1.2× GPU latency speed-up. In mobile-latency-constrained settings (e.g., an 80 ms target), ProxylessNAS achieves similar or better accuracy than strong alternatives such as MnasNet, while using roughly 200× less GPU search time.
These metrics indicate the method's effectiveness not only in computational efficiency (GPU memory and GPU-hours) during the search phase, but also in producing compact, accurate, and hardware-specialized architectures.
5. Algorithmic Details and Optimization
ProxylessNAS employs a supernet with per-edge candidate sets and architecture parameters $\alpha$. During training, standard stochastic gradient descent optimizes both the network weights $w$ and the architecture parameters, with binary path selection ensuring a compact activation footprint.
Latency is made differentiable by taking an expectation over possible path selections, allowing its use as a loss regularizer. For discrete or non-differentiable objectives, ProxylessNAS supports REINFORCE-type updates. The alternating optimization between weights and architecture parameters proceeds until convergence, at which point the candidate architecture with the highest accumulated probability is selected for standalone training and deployment, as sketched below.
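A hypothetical skeleton of this alternating schedule; `update_alphas`, `expected_total_latency`, and the supernet's `edges` attribute are assumed helpers, not the paper's API.

```python
import torch.nn.functional as F

def search(supernet, train_iter, valid_iter, weight_optim, update_alphas,
           expected_total_latency, num_steps, lam2=0.01):
    """Alternating weight/architecture optimization (assumed loop structure)."""
    for _ in range(num_steps):
        # (1) Update weights w on training data; alphas stay frozen.
        x, y = next(train_iter)
        F.cross_entropy(supernet(x), y).backward()  # one sampled path per edge
        weight_optim.step(); weight_optim.zero_grad()

        # (2) Update alphas on validation data via the binarized-gate estimator.
        xv, yv = next(valid_iter)
        loss_a = F.cross_entropy(supernet(xv), yv) + lam2 * expected_total_latency()
        update_alphas(loss_a)

    # Derive the final child: keep the highest-probability op on each edge.
    return [edge.ops[int(edge.alpha.argmax())] for edge in supernet.edges]
```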
Unlike methods relying on a softmax-weighted superposition of all paths, binarized path sampling means that at test time, the "searched" architecture matches exactly (in structure and compute graphs) the configuration used during the final retraining phase, eliminating mismatch between search and evaluation.
6. Limitations, Extensions, and Future Directions
ProxylessNAS's reliance on human-designed search spaces (i.e., sets of MobileNetV2-like blocks for candidate operations) constrains exploration to operator families already believed to be efficient. Expanding search spaces toward less human-biased or more heterogeneous operation pools remains an open challenge. Furthermore, weight-sharing supernets can introduce bias favoring larger submodels due to shared training dynamics, and may not always rank optimal architectures reliably; this requires further investigation into ranking calibration and search space design.
Extensions of ProxylessNAS include integrating advanced candidate operations (e.g., MixConv mixed kernel convolutions), federated direct NAS for privacy-aware search over decentralized, heterogeneous data, and improved latency predictors that generalize better across hardware. Future research aims to refine path-level pruning techniques, close the supernet–child architecture estimation gap, and incorporate additional resource constraints (energy, memory) into the optimization loop.
Finally, the connection between path-level pruning in ProxylessNAS and model compression/quantization techniques points to hybrid approaches, where NAS doubles as structured pruning or adaptive quantization of architectures for downstream deployment.
7. Broader Impact and Significance in NAS
ProxylessNAS represents a milestone in the evolution of neural architecture search by reconciling the demands of large-scale, hardware-aware architecture optimization with practical computational constraints. Its core innovations—memory-efficient path binarization and differentiable hardware regularization—allow NAS to transition from academic benchmarks to industry-relevant workflows, where direct task-targeted and device-specialized models are indispensable. Subsequent research has built upon these principles with more expressive search spaces, improved hardware modeling, distributed/federated optimization, and deeper calibration between search and final deployment performance, further reducing the gap between automated architecture search and real-world efficient machine learning.