Overview of AutoFormer: Searching Transformers for Visual Recognition
This paper introduces AutoFormer, a one-shot architecture search framework designed to find strong vision transformer configurations for visual recognition tasks such as image classification. Vision transformers have demonstrated significant potential due to their ability to capture long-range dependencies, yet designing them remains difficult because many interacting choices, such as depth, embedding dimension, and number of attention heads, jointly determine performance.
Core Contributions
The authors present AutoFormer as a dedicated solution for automating the search for optimal transformer configurations, addressing two key challenges: determining the right balance of architectural parameters and efficiently exploring diverse model structures. AutoFormer introduces a weight entanglement strategy for supernet training that allows a multitude of subnets to perform comparably to counterparts trained independently from scratch.
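To make the entanglement idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of one entangled linear layer: a single weight tensor is allocated for the largest candidate dimensions, and every smaller subnet reuses a slice of it, so the overlapping portion receives gradient updates from all sampled subnets. The class name `EntangledLinear` and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class EntangledLinear(nn.Module):
    """One shared weight tensor sized for the largest candidate dimensions;
    smaller subnets slice it rather than keeping isolated copies, so the
    common portion is updated by every sampled subnet during training."""
    def __init__(self, max_in: int, max_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(max_out, max_in))
        self.bias = nn.Parameter(torch.zeros(max_out))
        nn.init.trunc_normal_(self.weight, std=0.02)

    def forward(self, x: torch.Tensor, in_dim: int, out_dim: int) -> torch.Tensor:
        # Slice the sub-matrix matching the currently sampled subnet.
        w = self.weight[:out_dim, :in_dim]
        b = self.bias[:out_dim]
        return nn.functional.linear(x, w, b)

layer = EntangledLinear(max_in=240, max_out=960)   # largest candidate widths
x = torch.randn(8, 197, 192)                       # subnet with embed dim 192
y = layer(x, in_dim=192, out_dim=768)              # MLP ratio 4 for this subnet
print(y.shape)                                     # torch.Size([8, 197, 768])
```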
Key Findings
- Performance Superiority: AutoFormer outperforms state-of-the-art models such as ViT and DeiT. Notably, the AutoFormer-tiny, -small, and -base models achieve ImageNet top-1 accuracies of 74.7%, 81.7%, and 82.4% with 5.7M, 22.9M, and 53.7M parameters, respectively.
- Efficient Search: Using an evolutionary search over the supernet's shared weights, the framework efficiently identifies promising transformer architectures that fit varying resource constraints; a minimal sketch of such a search loop follows this list.
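The search itself is a standard evolutionary loop over subnet configurations, ranked by accuracy measured with weights inherited from the trained supernet and filtered by a resource budget. The following Python sketch is a hedged illustration of that loop: `evaluate` is a stub standing in for inherited-weight validation accuracy, and the candidate values and the parameter-count proxy are illustrative rather than the paper's exact settings.

```python
import random

# Illustrative flat search space (the paper also varies choices per block).
SPACE = {"depth": [12, 13, 14], "embed_dim": [192, 216, 240],
         "num_heads": [3, 4], "mlp_ratio": [3.5, 4.0]}

def sample_config():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(cfg, prob=0.4):
    return {k: random.choice(SPACE[k]) if random.random() < prob else v
            for k, v in cfg.items()}

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in a}

def param_count_m(cfg):
    # Rough proxy in millions; a real search counts exact parameters.
    return cfg["depth"] * cfg["embed_dim"] ** 2 * (2 + cfg["mlp_ratio"]) / 1e6

def evaluate(cfg):
    # Placeholder: the real search scores each subnet's validation accuracy
    # using weights inherited from the trained supernet (no retraining).
    return random.random()

def evolve(generations=10, pop_size=20, budget_m=6.0):
    pop = [sample_config() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep only subnets satisfying the resource constraint, then rank them.
        feasible = [c for c in pop if param_count_m(c) <= budget_m]
        parents = sorted(feasible, key=evaluate, reverse=True)[: pop_size // 2]
        children = [mutate(random.choice(parents)) for _ in range(pop_size // 2)]
        children += [crossover(random.choice(parents), random.choice(parents))
                     for _ in range(pop_size // 2)]
        pop = parents + children
    return max(pop, key=evaluate)

print(evolve())
```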
Methodological Innovations
AutoFormer leverages a comprehensive search space encompassing variable factors, including embedding dimension, Q-K-V dimension, number of heads, MLP ratio, and network depth. The framework uses a supernet in which different transformer block configurations share weights, a key distinction from traditional NAS approaches, which keep isolated weights for different candidate operators in the same layer.
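As a rough illustration of how such a space can be encoded, the sketch below samples one subnet configuration: the embedding dimension is shared across the whole network, while per-block choices (here, head number and MLP ratio) vary with the sampled depth. The candidate values are illustrative of the tiny supernet's scale, not the paper's exact tables.

```python
import random

# Illustrative candidate values; the actual supernets use different tables.
SPACE = {
    "depth": [12, 13, 14],         # number of transformer blocks
    "embed_dim": [192, 216, 240],  # uniform across the whole network
    "num_heads": [3, 4],           # chosen independently per block
    "mlp_ratio": [3.5, 4.0],       # chosen independently per block
}

def sample_subnet(space=SPACE):
    depth = random.choice(space["depth"])
    return {
        "embed_dim": random.choice(space["embed_dim"]),
        "depth": depth,
        "num_heads": [random.choice(space["num_heads"]) for _ in range(depth)],
        "mlp_ratio": [random.choice(space["mlp_ratio"]) for _ in range(depth)],
    }

print(sample_subnet())
```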
Practical and Theoretical Implications
Practically, AutoFormer facilitates the design of sophisticated vision transformers without exhaustive manual tuning, offering competitive performance with reduced engineering effort. The method opens avenues for developing more adaptive models capable of aligning with specific hardware constraints and application needs.
Theoretically, the introduction of weight entanglement suggests a new way to train transformer supernets, with potential consequences for how weight sharing is approached when many candidate networks must be trained under limited compute. This concept could influence future research on model optimization, extending beyond vision to broader transformer applications.
Speculation on Future Developments
Given the increasing application of transformers in diverse computational fields beyond vision, AutoFormer's methodology may inspire similar frameworks across domains such as natural language processing and reinforcement learning. Future developments might focus on expanding the search space to integrate convolutional elements, offering a more unified model design approach.
In conclusion, AutoFormer's integration of architecture search with vision transformer design marks a notable advance in automated architecture optimization. Its contributions offer both immediate performance benefits and long-term research opportunities in efficient model development.