- The paper introduces the Once-for-All (OFA) methodology that decouples training from architecture search to efficiently deploy deep neural networks.
- It employs a progressive shrinking technique: one large network is trained once, then fine-tuned so that sub-networks of varying depth, width, kernel size, and resolution remain accurate.
- Experiments show higher accuracy at lower computation and energy cost across deployment targets ranging from mobile devices to GPUs.
Once-for-All: Train One Network and Specialize it for Efficient Deployment
The paper "Once-for-All: Train One Network and Specialize it for Efficient Deployment" by Han Cai et al. presents a significant contribution to the field of efficient deep learning model deployment. The authors introduce the Once-for-All (OFA) methodology, focusing on decoupling the neural network training process from the architecture search to optimize resource usage for deploying deep neural networks (DNNs) across diverse hardware platforms and efficiency constraints.
Problem Statement
The explosive increase in the complexity and size of neural networks has made it challenging to deploy them effectively across varying platforms and hardware configurations. Traditional approaches either rely on manual design or Neural Architecture Search (NAS), both of which require retraining a specialized model for every deployment scenario. This process results in substantial computational expenses and energy consumption, making it unsustainable for large-scale applications.
Methodology
The OFA approach proposes training a single, versatile network that can adapt to different architectural configurations without the need for retraining. This is achieved through a two-stage process:
- Training the Once-for-All Network: A single extensive network is trained once, encompassing a wide range of configurations in terms of depth, width, kernel size, and image resolution.
- Progressive Shrinking: A novel technique proposed by the authors, in which the full network is first trained at maximum depth, width, and kernel size, and then progressively fine-tuned to also support smaller sub-networks. This ordering avoids interference between sub-networks and preserves the accuracy of the smaller models (see the sketch below).
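The following minimal PyTorch-style sketch shows one such fine-tuning step on a toy elastic network, using knowledge distillation from the full network as the paper does (soft labels from the largest network guide the sub-networks). The `ElasticMLP` model, its dimensions, and the sampled values are illustrative stand-ins, not the authors' code:

```python
# One progressive-shrinking fine-tuning step with knowledge distillation
# (KD) from the full network. The elastic model here is a toy MLP; the
# actual OFA network is a MobileNetV3-like CNN.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticMLP(nn.Module):
    """Toy elastic network: depth and width are chosen per forward pass.

    Sub-networks reuse the first `width` units of each layer and the
    first `depth` layers, so every configuration shares one set of weights.
    """
    def __init__(self, in_dim=32, max_width=64, max_depth=4, n_classes=10):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(in_dim if i == 0 else max_width, max_width)
             for i in range(max_depth)])
        self.head = nn.Linear(max_width, n_classes)

    def forward(self, x, depth, width):
        for layer in self.layers[:depth]:
            x = F.relu(layer(x))
            # Keep the first `width` units, zero-pad back to max_width.
            x = F.pad(x[:, :width], (0, x.size(1) - width))
        return self.head(x)

net = ElasticMLP()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
x = torch.randn(8, 32)                        # dummy batch

# Teacher: the full (largest) network, run without gradient tracking.
with torch.no_grad():
    teacher_logits = net(x, depth=4, width=64)

# Student: a randomly sampled sub-network sharing the same weights.
student_logits = net(x, depth=random.choice([2, 3, 4]),
                     width=random.choice([16, 32, 64]))

# Soft-label distillation loss; the paper combines this with the
# ordinary true-label loss during fine-tuning.
loss = F.kl_div(F.log_softmax(student_logits, dim=1),
                F.softmax(teacher_logits, dim=1), reduction="batchmean")
opt.zero_grad(); loss.backward(); opt.step()
```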
Architecture Space
The architecture space of the OFA network is designed to cover multiple dimensions:
- Elastic Depth: a varying number of layers per unit.
- Elastic Width: a varying number of channels per layer.
- Elastic Kernel Size: convolution kernels that can shrink (e.g., 7 → 5 → 3).
- Elastic Resolution: multiple input image sizes.
This flexibility allows the OFA network to support more than 10^19 sub-networks that all share one set of weights, so serving many deployment scenarios requires storing only a single model.
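That count can be reproduced from the configuration choices reported in the paper; the short computation below is a back-of-the-envelope check rather than anything from the authors' code, assuming 5 units with per-unit depth in {2, 3, 4} and 9 per-layer options (3 kernel sizes × 3 width ratios):

```python
# Back-of-the-envelope size of the OFA sub-network space.
options_per_layer = 3 * 3                                  # kernel size x width ratio
per_unit = sum(options_per_layer ** d for d in (2, 3, 4))  # 81 + 729 + 6561
total = per_unit ** 5                                      # 5 independent units
print(f"{total:.2e}")                                      # ~2.2e19 sub-networks
```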
Training and Deployment
Training Procedure
The training of the OFA network is divided into stages:
- Initial training of the largest network.
- Progressive incorporation of elastic kernel sizes, depths, and widths, one dimension at a time (as sketched below).
- Fine-tuning at each stage to ensure higher accuracy for sub-networks.
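A schematic of that staged schedule is sketched below; the dimension values mirror the paper's search space, while the phase budgets, the 5-unit layout, and names such as `sample_config` are illustrative assumptions:

```python
import random

# Progressive-shrinking curriculum: start from the largest network, then
# enlarge the sampling space one elastic dimension at a time and fine-tune.
PHASES = [
    # (phase name,       kernel sizes, depths,    width ratios)
    ("full network",     [7],          [4],       [6]),
    ("elastic kernel",   [3, 5, 7],    [4],       [6]),
    ("elastic depth",    [3, 5, 7],    [2, 3, 4], [6]),
    ("elastic width",    [3, 5, 7],    [2, 3, 4], [3, 4, 6]),
]

def sample_config(kernels, depths, widths, n_units=5):
    """Sample one sub-network: a per-unit kernel/depth/width choice."""
    return [{"kernel": random.choice(kernels),
             "depth": random.choice(depths),
             "width": random.choice(widths)} for _ in range(n_units)]

for name, kernels, depths, widths in PHASES:
    for _ in range(3):  # stands in for this phase's fine-tuning budget
        cfg = sample_config(kernels, depths, widths)
        print(name, cfg)  # a gradient step on this sub-network would go here
```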
Deployment
For deploying a specialized sub-network for a given hardware constraint:
- Architecture Search: an evolutionary search guided by "neural-network twins" (a learned accuracy predictor and a per-device latency predictor), which selects a suitable sub-network at negligible cost compared with exhaustive search or retraining.
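A condensed sketch of this predictor-guided search follows; `predict_accuracy` and `predict_latency` are placeholder stand-ins for the trained accuracy predictor and the per-device latency predictor, and the mutation scheme and budgets are simplified assumptions:

```python
import random

KERNELS, DEPTHS, WIDTHS = [3, 5, 7], [2, 3, 4], [3, 4, 6]
N_UNITS = 5

def random_arch():
    return [(random.choice(KERNELS), random.choice(DEPTHS), random.choice(WIDTHS))
            for _ in range(N_UNITS)]

def mutate(arch, p=0.2):
    """Resample each unit's configuration with probability p."""
    return [unit if random.random() > p else
            (random.choice(KERNELS), random.choice(DEPTHS), random.choice(WIDTHS))
            for unit in arch]

# Stand-ins for the "neural-network twins": in the paper these are a small
# accuracy predictor trained on (architecture, accuracy) pairs and a
# latency predictor / lookup table built per target device.
def predict_accuracy(arch):
    return random.random()                    # placeholder score

def predict_latency(arch):
    return sum(d * w for _, d, w in arch)     # placeholder cost model

LATENCY_BUDGET = 60
population = []
while len(population) < 100:                  # feasible initial population
    arch = random_arch()
    if predict_latency(arch) <= LATENCY_BUDGET:
        population.append(arch)

for _ in range(20):                           # evolutionary iterations
    population.sort(key=predict_accuracy, reverse=True)
    parents = population[:25]                 # keep the fittest quarter
    children = [mutate(random.choice(parents)) for _ in range(75)]
    population = parents + [c for c in children
                            if predict_latency(c) <= LATENCY_BUDGET]

best = max(population, key=predict_accuracy)
print("best architecture:", best)
```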
Experimental Results
The effectiveness of the OFA methodology is extensively validated across diverse hardware platforms (e.g., mobile devices, GPUs, FPGAs) with varying latency and resource constraints. Key findings include:
- ImageNet Performance: models derived from OFA consistently outperform state-of-the-art NAS-based models in both accuracy and efficiency.
- Efficiency Gains: because training happens only once, the OFA approach cuts the computational cost and CO₂ emissions of supporting many deployment scenarios by orders of magnitude. For instance, OFA reaches 80.0% ImageNet top-1 accuracy with fewer than 600M MACs, matching EfficientNet's accuracy at substantially lower measured latency.
- Transferability: The architecture search and specialization of sub-networks using the OFA model demonstrate significant efficiency improvements across different hardware settings, from cloud-based GPUs to edge devices like mobile phones and FPGAs.
Implications and Future Developments
The OFA methodology not only addresses the immediate challenge of efficiently deploying DNNs but also sets a precedent for future research in the following areas:
- Automated Model Optimization: The decoupling of training and architecture search allows scalable and sustainable deployment across numerous platforms.
- Green AI: By significantly reducing the environmental impact of model training and deployment, the OFA method aligns with emerging concerns about the carbon footprint of AI research.
- Hardware-Aware Design: OFA's ability to tailor models to specific hardware constraints could drive further innovation in hardware-aware model design.
Given the profound implications for practical deployment, the OFA framework establishes a robust basis for future advancements in efficient AI deployment, promising more adaptive and resource-aware machine learning applications.