Overview of ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias
The paper introduces ViTAE, a novel architecture that integrates intrinsic inductive biases (IBs) into the vision transformer framework to improve its capability on visual tasks. The inherent challenge with existing vision transformers, such as ViT, is that they treat an image as a 1D sequence of tokens and therefore lack intrinsic inductive biases such as scale invariance and locality, which are naturally present in convolutional neural networks (CNNs). These biases are essential for efficiently modeling local visual structures and handling scale variance across different visual tasks. The authors propose ViTAE as an advanced vision transformer that incorporates these biases through architectural innovations, thereby enhancing its performance on image classification and downstream tasks.
Key Innovations and Architecture
ViTAE accomplishes its objectives through the introduction of two types of transformer cells: Reduction Cell (RC) and Normal Cell (NC).
- Reduction Cell (RC): Embeds multi-scale context and local information into tokens while downsampling the input. The cell applies multiple parallel convolutions with different dilation rates, embedding rich multi-scale context into each token, while a parallel convolution block provides the locality inductive bias, enabling the model to better learn local structural features (see the sketch after this list).
- Normal Cell (NC): Further models locality and long-range dependencies within these tokens. The NC shares the parallel convolution structure with the RC, efficiently fusing local and global features so that long-range dependencies are learned jointly with locality (a second sketch follows the next paragraph).
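To make the Reduction Cell concrete, the following is a minimal PyTorch-style sketch under stated assumptions: the module name `ReductionCellSketch`, the specific dilation rates, and the fusion scheme are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class ReductionCellSketch(nn.Module):
    """Minimal sketch of a ViTAE-style Reduction Cell.

    Parallel dilated convolutions embed multi-scale context into each token
    (scale-invariance bias); a parallel convolution branch adds locality.
    Names and hyper-parameters here are illustrative assumptions.
    """

    def __init__(self, in_ch, dim, dilations=(1, 2, 3, 4), num_heads=1):
        super().__init__()
        # Pyramid of strided convolutions with different dilation rates:
        # each branch sees a different receptive field at the same output size.
        self.pyramid = nn.ModuleList(
            nn.Conv2d(in_ch, dim, kernel_size=3, stride=2, padding=d, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv2d(dim * len(dilations), dim, kernel_size=1)
        # Parallel convolution branch: locality inductive bias.
        self.pcm = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        multi_scale = torch.cat([branch(x) for branch in self.pyramid], dim=1)
        feat = self.fuse(multi_scale)               # (B, dim, H/2, W/2)
        local = self.pcm(feat)                      # local-structure branch
        b, d, h, w = feat.shape
        tokens = self.norm(feat.flatten(2).transpose(1, 2))  # (B, N, dim)
        global_feat, _ = self.attn(tokens, tokens, tokens)
        # Fuse the attention output with the parallel local branch.
        out = global_feat + local.flatten(2).transpose(1, 2)
        return out, (h, w)
```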
The architecture stacks these cells to reduce and structure the input into tokens: Reduction Cells focus on multi-scale and local features, while Normal Cells balance local feature learning against the long-range dependency modeling typical of transformer architectures. This design allows ViTAE to incorporate the inductive biases that CNNs would otherwise have to learn from data, while retaining the attention mechanism's capacity to capture global patterns.
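Under the same assumptions, a Normal Cell can be sketched as a transformer block with a parallel depth-wise convolution branch alongside attention, and a toy composition shows how Reduction and Normal Cells stack; again, this is an illustrative sketch rather than the paper's exact code.

```python
class NormalCellSketch(nn.Module):
    """Minimal sketch of a ViTAE-style Normal Cell: self-attention and a
    parallel depth-wise convolution branch share the input, so long-range
    dependencies and locality are learned jointly (illustrative only)."""

    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pcm = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, tokens, hw):  # tokens: (B, N, dim)
        h, w = hw
        b, n, d = tokens.shape
        x = self.norm1(tokens)
        global_feat, _ = self.attn(x, x, x)
        # Parallel convolution over the tokens reshaped back to a 2D map.
        local = self.pcm(x.transpose(1, 2).reshape(b, d, h, w))
        tokens = tokens + global_feat + local.flatten(2).transpose(1, 2)
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens


# Toy composition: one Reduction Cell followed by two Normal Cells.
rc = ReductionCellSketch(in_ch=3, dim=64)
ncs = nn.ModuleList(NormalCellSketch(64) for _ in range(2))
tokens, hw = rc(torch.randn(2, 3, 32, 32))
for nc in ncs:
    tokens = nc(tokens, hw)
print(tokens.shape)  # torch.Size([2, 256, 64])
```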
Results and Implications
Empirically, ViTAE demonstrates superior performance over baseline vision transformers such as T2T-ViT and other contemporary models such as DeiT and LocalViT. On the ImageNet benchmark, ViTAE outperforms these models with fewer parameters, illustrating both its efficiency and effectiveness. For instance, ViTAE-T achieves 75.3% top-1 accuracy with only 4.8 million parameters, surpassing models with considerably larger parameter counts. The paper reports consistent improvements across various benchmarks, showing that incorporating intrinsic IBs can significantly enhance data efficiency and training speed without sacrificing performance.
The implications of this architecture are noteworthy for advancing vision transformers’ efficiency in scenarios with limited data and computational resources. Practically, it implies that transformers can be designed to include beneficial properties traditionally attributed to CNNs, thus offering a more holistic approach to feature extraction. This demonstrates the potential for vision transformers with intrinsic inductive bias to achieve high performance without the heavy data and training regime typically required by pure transformer models.
Future Directions
The architecture's reliance on convolutional operations suggests exploring other intrinsic biases beyond scale invariance and locality, such as viewpoint invariance or contextual awareness, which could further bolster model performance across diverse tasks.
Moreover, the paper opens avenues for scaling the ViTAE model to larger configurations and datasets, which could potentially harness the transformer’s strengths for even greater performance gains in tasks involving complex or large-scale data interactions.
This research highlights the successful synergy that can be achieved by weaving together the principles of CNNs and transformers, offering a strategic blueprint for advancing AI systems in fields requiring visual recognition and understanding.