Blockwisely Supervised Neural Architecture Search with Knowledge Distillation
The paper "Blockwisely Supervised Neural Architecture Search with Knowledge Distillation" presents a novel approach to Neural Architecture Search (NAS), aiming to improve both efficiency and accuracy in automatically designing network architectures. The key innovation lies in modularizing the search space into blocks and employing knowledge distillation from a teacher model. This method significantly mitigates the challenges associated with traditional NAS methods, such as inaccurate architecture evaluation and inefficient convergence.
Overview
Traditional NAS methodologies often suffer from poor scalability and inefficient evaluation, sometimes producing architectures that are no better than randomly selected ones. These shortcomings stem largely from training shared parameters across a vast search space, which causes representation shift and incorrect architecture rankings. The proposed approach addresses these challenges with a blockwise NAS strategy that modularizes the large search space into smaller, manageable blocks. This ensures that candidate architectures are fully trained within each block, reducing representation shift and yielding accurate candidate ratings.
Methodology
Blockwise Neural Architecture Search: The authors propose dividing the network architecture into several blocks that are searched and trained independently. This modular approach drastically shrinks the search space handled at any one time, ensuring that candidate architectures within each block are fully trained and rated accurately. Such division allows comprehensive evaluation and fair training of every sub-model within a block, maintaining a high degree of fidelity between how architectures are rated during the search and how they perform when trained on their own.
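To make the search-space reduction concrete, here is a back-of-the-envelope sketch in Python. The layer, operation, and block counts are illustrative assumptions rather than the paper's actual search space; the point is only that a jointly coupled, exponential space collapses into a manageable number of per-block candidates that can each be trained and rated in full.

```python
# Illustrative numbers (assumptions, not the paper's search space):
# a supernet with 20 searchable layers, 4 candidate operations per layer,
# split into 5 blocks of 4 layers each.
num_layers = 20
ops_per_layer = 4
num_blocks = 5
layers_per_block = num_layers // num_blocks

# End-to-end search space: every layer choice is coupled with every other one.
whole_space = ops_per_layer ** num_layers             # 4**20 ~ 1.1e12 architectures

# Blockwise search: candidates only need to be enumerated within each block,
# because blocks are trained and rated independently of one another.
per_block_space = ops_per_layer ** layers_per_block   # 4**4 = 256 candidates per block
blockwise_evaluations = num_blocks * per_block_space  # 5 * 256 = 1280 evaluations

print(f"end-to-end candidate architectures: {whole_space:,}")
print(f"blockwise candidate evaluations:    {blockwise_evaluations:,}")
```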
Knowledge Distillation: The architecture search in each block is supervised using distilled knowledge from a teacher model. The innovative aspect here is the recognition that knowledge lies not only in network parameters but also in the architectural design itself. By distilling structural knowledge from the teacher model, the candidate architectures are guided towards superior designs that can effectively mimic the teacher's behavior while potentially exceeding its performance capabilities.
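In concrete terms, the block-level supervision amounts to training a candidate block to reproduce the teacher's feature maps at the same depth, and then using how well it does so as the candidate's rating. Below is a minimal PyTorch-style sketch under my own assumptions: an MSE loss on cached teacher feature maps, a toy convolutional candidate, and the hypothetical helper name train_candidate_block. The paper's exact loss, candidate encoding, and training schedule differ in detail.

```python
import torch
import torch.nn as nn

def train_candidate_block(student_block: nn.Module,
                          teacher_feats_in: torch.Tensor,
                          teacher_feats_out: torch.Tensor,
                          steps: int = 5,
                          lr: float = 1e-3) -> float:
    """Train one candidate block to mimic the teacher's block-level mapping.

    teacher_feats_in  -- teacher feature maps entering the block's depth range
    teacher_feats_out -- teacher feature maps leaving the block's depth range
    Returns the final distillation (MSE) loss, used to rate the candidate.
    """
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(student_block.parameters(), lr=lr, momentum=0.9)

    student_block.train()
    for _ in range(steps):
        optimizer.zero_grad()
        pred = student_block(teacher_feats_in)     # student sees the teacher's input features
        loss = criterion(pred, teacher_feats_out)  # ...and learns to match its output features
        loss.backward()
        optimizer.step()
    return loss.item()

if __name__ == "__main__":
    # Toy tensors standing in for cached teacher feature maps at the block boundaries.
    t_in = torch.randn(8, 32, 28, 28)
    t_out = torch.randn(8, 64, 14, 14)
    # One toy candidate architecture for this block.
    candidate = nn.Sequential(
        nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
    )
    print(f"candidate distillation loss: {train_candidate_block(candidate, t_in, t_out):.4f}")
```

The block-level loss is then what rates candidates within a block, so a lower distillation loss translates directly into a better rating.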
Parallelization via Teacher Model Input: Inspired by advances in transformer models in NLP, the method trains the blocks in parallel. Each student block takes the teacher model's feature maps from the corresponding preceding block as its input, so the blocks have no sequential dependency on one another; this improves efficiency without sacrificing learning depth.
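Because each student block consumes cached teacher feature maps rather than the output of another student block, the per-block searches have no sequential dependency and can be dispatched concurrently. The sketch below illustrates that decoupling with assumed, simplified pieces (the search_block stand-in, the candidate strings, and the process-pool dispatch are mine, not the paper's implementation); in a real run, search_block would train every candidate in its block against the teacher's feature maps, as in the previous sketch.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def search_block(block_id: int, candidates: list) -> dict:
    """Stand-in for per-block candidate training; returns a loss per candidate.

    A real implementation would load the teacher feature maps for `block_id`
    and train/evaluate each candidate against them; dummy losses are used here.
    """
    rng = random.Random(block_id)
    return {cand: rng.random() for cand in candidates}

if __name__ == "__main__":
    # Hypothetical per-block candidate encodings (e.g. kernel size / expansion ratio).
    blocks = {
        0: ["k3_e3", "k5_e6"],
        1: ["k3_e6", "k7_e3"],
        2: ["k5_e3", "k5_e6"],
    }

    # No block depends on another block's student output, so all searches run concurrently.
    with ProcessPoolExecutor() as pool:
        futures = {bid: pool.submit(search_block, bid, cands)
                   for bid, cands in blocks.items()}
        block_ratings = {bid: fut.result() for bid, fut in futures.items()}

    # Rating a full architecture then reduces to combining its per-block ratings.
    for bid, ratings in sorted(block_ratings.items()):
        best = min(ratings, key=ratings.get)
        print(f"block {bid}: best candidate {best} (loss {ratings[best]:.3f})")
```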
Results and Implications
The approach demonstrates substantial empirical improvements, achieving a state-of-the-art 78.4% top-1 accuracy on ImageNet in a mobile setting, a gain of about 2.1% over EfficientNet-B0. The paper also reports strong transfer performance on CIFAR-10 and CIFAR-100, reinforcing the searched models' ability to generalize across datasets.
The ability to exceed the performance of the teacher model highlights the promise of this method in scalable and practical application scenarios. It provides a robust framework for developing architectures that are not only effective but also computationally efficient—an essential consideration for deploying deep learning models in resource-constrained environments such as mobile devices.
Future Directions
This research invites further exploration of blockwise NAS, particularly in extending its applicability to other types of neural networks and to tasks beyond image classification. Future work might investigate integrating more sophisticated knowledge distillation techniques, incorporating forms of network knowledge beyond feature maps, and optimizing block configurations for specific tasks.
Moreover, exploring alternative parallelization strategies and improving the efficiency of blockwise evaluations could unlock faster and more cost-effective NAS. This paper lays a solid foundation for these advancements, offering a forward-looking perspective on automated neural architecture design.
In conclusion, the paper presents a compelling case for rethinking NAS through a structured and knowledge-driven lens, leveraging both the decomposition of complex tasks and the harnessing of architectural wisdom to achieve high-performance neural network models.