- The paper introduces LayerScale, a learnable diagonal matrix applied to the output of each residual block, which stabilizes training and allows image transformers to benefit from greater depth.
- The paper presents CaiT, an architecture built on class-attention layers that separate the self-attention between image patches from the class-attention stage that summarizes the image into a class embedding for classification.
- Empirical evaluations demonstrate that the approach reaches state-of-the-art ImageNet performance with reduced FLOPs and parameters compared to traditional CNNs.
Introduction to Image Transformers
Recent advances in computer vision have led to significant progress in image classification. The success of deep learning models, particularly residual architectures such as ResNet, has been noteworthy. Residual architectures update the network's internal representation additively: each layer adds the output of a residual branch to its input, and this branch plays a pivotal role in optimization. Vision Transformers (ViTs), which have emerged as a potent alternative to conventional convolutional neural networks (CNNs), follow the same residual structure while replacing convolutions with self-attention. Nevertheless, how to optimize these models as they grow deeper has so far not been thoroughly studied.
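For reference, a pre-norm ViT block applies two such residual updates per layer, one for self-attention and one for the feed-forward network. The sketch below uses assumed notation (η for layer normalization, SA for self-attention, FFN for the feed-forward network):

```latex
x'_{l}  = x_{l}  + \mathrm{SA}\!\left(\eta(x_{l})\right), \qquad
x_{l+1} = x'_{l} + \mathrm{FFN}\!\left(\eta(x'_{l})\right)
```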
Enhancing Transformer Training with LayerScale
Researchers have formulated LayerScale, an approach that improves the training dynamics of deep image transformers. A learnable diagonal matrix is applied to the output of each residual block, that is, to the vector produced by the self-attention or feed-forward branch before it is added back to the main path. This per-channel scaling refines training and supports the stable convergence of models with greater depth, improving on earlier methods for training deep residual networks. For instance, a transformer trained with this approach and without any external data attains 86.5% top-1 accuracy on ImageNet, reaching state-of-the-art performance with fewer FLOPs and parameters.
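A minimal sketch of the idea in PyTorch, assuming the diagonal matrix is stored as a learnable per-channel vector initialized to a small constant; the module name `LayerScale` follows the paper, but `init_eps` and the surrounding block structure shown in the comments are illustrative:

```python
import torch
import torch.nn as nn

class LayerScale(nn.Module):
    """Per-channel scaling of a residual branch output by a learnable
    diagonal matrix, stored as a vector initialized to a small constant."""
    def __init__(self, dim: int, init_eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(init_eps * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to multiplying by diag(gamma) on the channel dimension,
        # broadcast over the batch and token dimensions.
        return self.gamma * x

# Schematic use inside a pre-norm transformer block (illustrative names):
#   x = x + layer_scale_1(self_attention(norm1(x)))
#   x = x + layer_scale_2(feed_forward(norm2(x)))
```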
Introducing Class-Attention Layers
Another significant contribution of the paper is the introduction of class-attention layers within the transformer, an architecture dubbed CaiT (Class-Attention in Image Transformers). CaiT separates the self-attention between image patches from the class-attention stage that summarizes the processed patches into a class embedding used for the final prediction. This separation removes the contradictory objective of asking the same layers both to guide the interactions between patches and to compile the information needed by the classifier, and it keeps the class-attention stage inexpensive, since the class embedding is only inserted in the final layers, where the patch embeddings are no longer updated.
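Below is a hedged sketch of what such a class-attention layer can look like in PyTorch: only the class token forms queries, while keys and values come from the class token concatenated with the patch tokens, so only the class embedding is updated. The module name, argument names, and tensor layout are illustrative assumptions, not the paper's reference code:

```python
import torch
import torch.nn as nn

class ClassAttention(nn.Module):
    """Illustrative class-attention layer: the class token attends to the
    (class + patch) tokens, and only the class token is updated."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)       # queries from the class token only
        self.kv = nn.Linear(dim, dim * 2)  # keys/values from all tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_cls: torch.Tensor, x_patches: torch.Tensor) -> torch.Tensor:
        # x_cls: (B, 1, C) class token, x_patches: (B, N, C) patch tokens
        B, N, C = x_patches.shape
        H, d = self.num_heads, C // self.num_heads
        tokens = torch.cat([x_cls, x_patches], dim=1)           # (B, 1+N, C)
        q = self.q(x_cls).reshape(B, 1, H, d).transpose(1, 2)   # (B, H, 1, d)
        kv = self.kv(tokens).reshape(B, 1 + N, 2, H, d).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]                                     # (B, H, 1+N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale           # (B, H, 1, 1+N)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, 1, C)       # updated class token
        return self.proj(out)
```

In the paper's design, a small number of such layers are appended after the patch self-attention stack, and the class embedding they produce is fed to the classifier.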
Empirical Validation and Model Analysis
The researchers rigorously evaluated their approach, confirming the advantages provided by LayerScale and the specialized class-attention architecture. The CaiT models established new benchmarks on ImageNet, on ImageNet-Real (reassessed labels), and on ImageNet-V2 matched frequency. The experimental analysis also included control experiments, such as the impact of different LayerScale initialization strategies, which provided additional insight into the benefits of the proposed methods.
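One takeaway from those ablations is that the LayerScale diagonal is initialized to a small constant that shrinks as the network gets deeper. A hypothetical helper capturing the depth-dependent schedule reported in the paper (treat the exact thresholds and values as an assumption) might look like this:

```python
def layerscale_init_eps(depth: int) -> float:
    """Depth-dependent initial value for the LayerScale diagonal (approximate
    schedule from the paper's ablations). Smaller values keep each residual
    branch's contribution close to zero at initialization in deeper models."""
    if depth <= 18:
        return 0.1
    if depth <= 24:
        return 1e-5
    return 1e-6
```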
Conclusion: A New Frontier in Image Classification
The findings of this paper mark a pivotal moment in image classification, showing that transformer networks can rival and even surpass conventional CNNs in both accuracy and efficiency. By dissecting the difficulties of deep transformer training and introducing solutions such as LayerScale and CaiT, the researchers have paved the way for future advances in the domain. With the released code and models, the community is well equipped to further explore and refine these architectures.