BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models (2003.11142v3)

Published 24 Mar 2020 in cs.CV

Abstract: Neural architecture search (NAS) has shown promising results discovering models that are both accurate and fast. For NAS, training a one-shot model has become a popular strategy to rank the relative quality of different architectures (child models) using a single set of shared weights. However, while one-shot model weights can effectively rank different network architectures, the absolute accuracies from these shared weights are typically far below those obtained from stand-alone training. To compensate, existing methods assume that the weights must be retrained, finetuned, or otherwise post-processed after the search is completed. These steps significantly increase the compute requirements and complexity of the architecture search and model deployment. In this work, we propose BigNAS, an approach that challenges the conventional wisdom that post-processing of the weights is necessary to get good prediction accuracies. Without extra retraining or post-processing steps, we are able to train a single set of shared weights on ImageNet and use these weights to obtain child models whose sizes range from 200 to 1000 MFLOPs. Our discovered model family, BigNASModels, achieve top-1 accuracies ranging from 76.5% to 80.9%, surpassing state-of-the-art models in this range including EfficientNets and Once-for-All networks without extra retraining or post-processing. We present ablative study and analysis to further understand the proposed BigNASModels.

Citations (293)

Summary

  • The paper presents a one-shot NAS method that directly extracts high-performance child models, achieving top-1 accuracies from 76.5% to 80.9% on ImageNet.
  • It leverages innovative techniques like the sandwich rule and inplace distillation to balance training across models of varying sizes.
  • Specialized initialization, adaptive learning rate schedules, and batch norm calibration stabilize training and simplify deployment on diverse hardware.

BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models

The paper "BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models" introduces a novel methodology for neural architecture search (NAS), emphasizing a streamlined approach in comparison to existing paradigms that require extensive post-processing. Traditionally, NAS involves training a one-shot model to rank various architectures using shared weights, with additional retraining or fine-tuning necessary to achieve stand-alone accuracies. BigNAS proposes an alternative strategy that negates the need for these additional computationally expensive steps.

BigNAS employs a single-stage model that is trained on ImageNet, allowing the direct extraction of high-quality child models with varying computational constraints (from 200 to 1000 MFLOPs) without post-training modifications. The models, termed BigNASModels, achieve superior performance metrics with top-1 accuracies ranging from 76.5% to 80.9%, exceeding state-of-the-art methods like EfficientNets and Once-for-All networks in the same computational range.
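To make the weight sharing concrete, the sketch below slices a child convolution out of a larger shared kernel by taking its leading input and output channels, in the style of the slimmable networks BigNAS builds on. The `SliceableConv2d` class and its channel counts are illustrative assumptions rather than the authors' implementation; BigNAS additionally varies depth, kernel size, and input resolution when forming child models.

```python
import torch
import torch.nn.functional as F

class SliceableConv2d(torch.nn.Conv2d):
    """Hypothetical convolution whose leading channels form a child model's
    weights; the full tensor belongs to the largest (single-stage) model."""

    def forward(self, x: torch.Tensor, out_width: int) -> torch.Tensor:
        in_width = x.shape[1]                               # child input width
        weight = self.weight[:out_width, :in_width]         # slice the shared kernel
        bias = self.bias[:out_width] if self.bias is not None else None
        return F.conv2d(x, weight, bias, self.stride, self.padding)

# The largest model owns a 64 -> 128 channel layer; a child uses only 48 -> 96 of it.
conv = SliceableConv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
x = torch.randn(1, 48, 32, 32)
y = conv(x, out_width=96)
print(y.shape)   # torch.Size([1, 96, 32, 32])
```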

Key Contributions and Techniques

BigNAS's primary innovation is training the one-shot model’s shared weights well enough that child models can be used directly, without retraining or fine-tuning, which substantially simplifies the NAS workflow. The paper introduces several techniques to manage the different learning dynamics of smaller and larger child models:

  1. Sandwich Rule and Inplace Distillation: Extending principles from slimmable networks, each training step samples the smallest child, the largest child, and a few random children from the single-stage model. The smaller children are supervised with the soft predictions of the largest child (inplace distillation), so networks of all sizes are trained simultaneously on the shared weights (see the training-step sketch after this list).
  2. Initialization and Convergence Strategies: Large single-stage models require specialized initialization and a modified learning rate schedule to train stably. The modified schedule addresses a convergence mismatch: larger child models peak early and begin to overfit while smaller ones still need more training, so the learning rate is kept from vanishing late in training (a schedule sketch also follows the list).
  3. Simplified Regularization: Regularization is applied only to the largest child model, which curbs overfitting in the large networks without aggravating underfitting in the smaller ones that share the same weights.
  4. Batch Norm Calibration: After the search, batch normalization statistics are recalibrated for each selected child model before deployment. This requires only a few forward passes and no gradient updates, and it keeps child models consistent across diverse deployment targets (a calibration sketch appears in the next section).
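Item 1 can be made concrete with a short training-step sketch. Nothing below is the authors' code: `supernet(images, config)` and `supernet.sample_config(...)` are assumed interfaces for running the shared weights under a sampled child architecture, and the number of random children per step is illustrative.

```python
import torch.nn.functional as F

def sandwich_train_step(supernet, images, labels, optimizer, n_random=2):
    """One optimizer step using the sandwich rule with inplace distillation.

    Assumed interfaces (not from the paper's code): supernet(images, config)
    runs the shared weights under child architecture `config`, and
    supernet.sample_config(mode) returns the largest ("max"), smallest
    ("min"), or a uniformly random child configuration.
    """
    optimizer.zero_grad()

    # The largest child is trained against the ground-truth labels.
    logits_big = supernet(images, supernet.sample_config("max"))
    ce_loss = F.cross_entropy(logits_big, labels)
    ce_loss.backward()

    # Its detached predictions become soft targets for every other child.
    soft_targets = F.softmax(logits_big.detach(), dim=-1)

    # The smallest child plus a few random children learn from the soft
    # targets (inplace distillation); gradients accumulate into the shared weights.
    configs = [supernet.sample_config("min")]
    configs += [supernet.sample_config("random") for _ in range(n_random)]
    for cfg in configs:
        logits = supernet(images, cfg)
        kd_loss = F.kl_div(F.log_softmax(logits, dim=-1), soft_targets,
                           reduction="batchmean")
        kd_loss.backward()

    optimizer.step()
    return ce_loss.item()
```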

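Item 2's modified schedule can be illustrated with a small helper: the learning rate decays as usual but is never allowed to fall below a small constant floor, so the smaller children continue to make progress late in training. The constants below are illustrative, not the paper's hyperparameters.

```python
def lr_with_constant_ending(step, base_lr=0.256, decay_rate=0.97,
                            decay_steps=2500, floor_ratio=0.05):
    """Exponential decay that flattens into a constant tail.

    All constants are illustrative; the point is only that the schedule never
    reaches zero, so smaller child models keep learning after the largest
    child has effectively converged.
    """
    lr = base_lr * (decay_rate ** (step / decay_steps))
    return max(lr, floor_ratio * base_lr)

# Typical use before each optimizer step:
#   for group in optimizer.param_groups:
#       group["lr"] = lr_with_constant_ending(global_step)
```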
Implications and Future Directions

BigNAS significantly reduces the complexity and computational cost associated with NAS by eliminating retraining and fine-tuning requirements. This simplification allows for more flexible deployment of models in varied hardware environments, such as edge devices with constraints on latency, memory, and processing power.
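The only post-training step BigNAS retains is the batch norm calibration from item 4 above, which needs a handful of forward passes and no gradient updates. The sketch below is a generic PyTorch recipe under that assumption, not the authors' code.

```python
import torch

@torch.no_grad()
def calibrate_batch_norm(child_model, calibration_loader, n_batches=50):
    """Re-estimate BatchNorm running statistics for a chosen child model.

    The shared weights are untouched; only the running mean/variance are
    recomputed on a few batches of training data before deployment.
    """
    for m in child_model.modules():
        if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d)):
            m.reset_running_stats()
            m.momentum = None            # accumulate a plain running average

    child_model.train()                  # BN updates its statistics in train mode
    for i, (images, _) in enumerate(calibration_loader):
        if i >= n_batches:
            break
        child_model(images)              # forward passes only, no optimizer involved
    child_model.eval()
```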

Theoretically, the work challenges the conventional understanding that child models must be retrained from scratch after the search to reach competitive accuracy. Practitioners now have a framework that leverages a single set of shared weights for immediate deployment across multiple architectural configurations, facilitating rapid experimentation and deployment.

Future developments could explore extending BigNAS's methodology to other domains beyond ImageNet classification, potentially integrating with self-supervised learning paradigms or expanding the range of searchable architectures. Additionally, refining techniques for more granular architecture selection might yield further efficiency gains.

In conclusion, BigNAS innovatively reduces NAS’s complexity and computational demands, offering a robust solution for scalable and efficient model deployment while setting the stage for future exploration into more generalized applications of NAS methods.