- The paper introduces a hybrid architecture that blends Transformer and convolutional methods to improve visual recognition while addressing overfitting with limited training data.
- It applies systematic modifications such as global average pooling, step-wise patch embedding, and a stage-wise design to improve model stability and performance.
- Comparative experiments show Visformer outperforming models such as DeiT-S and ResNet-50 on ImageNet, providing a roadmap for hybrid vision architectures in practical applications.
Visformer: The Vision-friendly Transformer
The paper "Visformer: The Vision-friendly Transformer" explores the potential of Transformers to enhance visual recognition tasks, alongside presenting the Visformer architecture. The authors highlight the increasing utility of Transformer-based models in vision problems, juxtaposed against the traditionally convolution-oriented approaches. This paper makes critical observations about the limitations of existing Transformer models, particularly their susceptibility to over-fitting in scenarios of limited training data, and proposes the Visformer architecture as a solution.
Architectural Innovations
Central to this research is a controlled, eight-step transition from a Transformer-based model to a convolution-based model, with each step evaluated under both a base and an elite training setting to derive insights for optimizing visual recognition models:
- Token vs. Pooling: The first step removes the classification token and replaces it with global average pooling over the output tokens, a modification shown to improve base performance significantly.
- Step-wise Patch Embedding: Large-patch flattening is broken down into a sequence of smaller patch embeddings, which better preserves positional priors within each patch.
- Stage-wise Design: By structuring the network into stages, akin to residual networks, the architecture leverages local priors in vision data. The paper underscores the importance of such designs for robust training.
- Normalization Techniques: Replacing LayerNorm with BatchNorm borrows a standard CNN component and improves learning stability and performance.
- Convolution Integration: Introducing 3×3 convolutions into the Transformer blocks helps capture local context, which is advantageous for processing high-resolution features (see the sketch after this list).
- Position Embedding Removal: Dropping position embeddings has only a marginal impact, indicating that explicit position encoding becomes largely redundant once convolutions supply local spatial context.
- Feed-forward Layer Dynamics: Self-attention becomes less efficient and less effective when the token count is very large, motivating the use of feed-forward (convolution-based) blocks in the highest-resolution stage.
- Network Shape Adjustments: The final step adjusts the overall network shape toward that of conventional convolutional networks, redistributing depth, width, and computation across stages.
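The sketch below (PyTorch, a minimal illustration rather than the authors' reference implementation) puts several of these modifications together: a step-wise patch-embedding stem, BatchNorm in place of LayerNorm, a 3×3 group convolution inside the feed-forward branch, and a global-average-pooling head instead of a classification token. Module names, channel widths, depth, and the group setting are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class StepwisePatchEmbed(nn.Module):
    """Embed patches in two smaller steps (e.g. 4x4 then 2x2) instead of one
    large flattening, which better preserves positional priors inside patches."""
    def __init__(self, in_ch=3, mid_ch=32, embed_dim=192):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=4, stride=4),      # first 4x4 step
            nn.BatchNorm2d(mid_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, embed_dim, kernel_size=2, stride=2),  # second 2x2 step
            nn.BatchNorm2d(embed_dim),
        )

    def forward(self, x):                      # x: (B, 3, H, W)
        return self.stem(x)                    # (B, embed_dim, H/8, W/8)


class ConvMLP(nn.Module):
    """Feed-forward branch with a 3x3 group convolution to inject local context."""
    def __init__(self, dim, expansion=3, groups=8):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Conv2d(dim, hidden, kernel_size=1)
        self.conv = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=groups)
        self.fc2 = nn.Conv2d(hidden, dim, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.fc2(self.act(self.conv(self.act(self.fc1(x)))))


class Attention2d(nn.Module):
    """Multi-head self-attention over a 2D feature map (tokens = spatial positions)."""
    def __init__(self, dim, num_heads=6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)       # (B, H*W, C) token sequence
        t, _ = self.attn(t, t, t, need_weights=False)
        return t.transpose(1, 2).reshape(b, c, h, w)


class VisformerStyleBlock(nn.Module):
    """Residual block using BatchNorm instead of LayerNorm; attention is optional
    so that the earliest, highest-resolution stage can use the MLP branch only."""
    def __init__(self, dim, num_heads=6, use_attention=True):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            self.norm1 = nn.BatchNorm2d(dim)
            self.attn = Attention2d(dim, num_heads)
        self.norm2 = nn.BatchNorm2d(dim)
        self.mlp = ConvMLP(dim)

    def forward(self, x):
        if self.use_attention:
            x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))


class TinyVisformerLike(nn.Module):
    """Stem + blocks + global-average-pooling head (no classification token)."""
    def __init__(self, num_classes=1000, dim=192, depth=4):
        super().__init__()
        self.embed = StepwisePatchEmbed(embed_dim=dim)
        self.blocks = nn.Sequential(*[VisformerStyleBlock(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        x = self.blocks(self.embed(x))
        x = x.mean(dim=(2, 3))                 # global average pooling over positions
        return self.head(x)


if __name__ == "__main__":
    logits = TinyVisformerLike(num_classes=10)(torch.randn(2, 3, 224, 224))
    print(logits.shape)                        # torch.Size([2, 10])
```

Passing use_attention=False to the blocks of the earliest, highest-resolution stage mirrors the feed-forward-only design discussed above.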
Comparative Analysis with DeiT and ResNets
The Visformer architecture, evaluated against models such as DeiT-S and ResNet-50, proves robust under both the base and the elite training settings, achieving superior ImageNet accuracy under comparable computational budgets. Its robustness is particularly evident on reduced subsets of the training data, where it maintains stable recognition accuracy and outperforms both its pure-Transformer and convolutional predecessors.
VisformerV2 and FP16 Overflow Issues
Building on Visformer, the paper introduces VisformerV2, which optimizes configurations such as stage allocation and the depth-width balance. Further enhancements correct overflow issues that arise when Transformers are run with half-precision (FP16) arithmetic: by revising how the attention scores are scaled, VisformerV2 avoids overflow without a computational penalty.
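The snippet below illustrates one common strategy consistent with this description, under the assumption that the fix amounts to applying the 1/√d factor to the queries before the dot product rather than to the scores afterward; the exact scheme used in VisformerV2 may differ. The intermediate logits are cast to FP16 here purely to show which entries would overflow in a genuine half-precision run.

```python
import torch


def attention_scores_naive(q, k, dim_head):
    # Scale AFTER the dot product: in a half-precision forward pass the raw
    # q @ k^T logits are stored in float16 and can exceed its ~65504 maximum
    # (becoming inf) before the 1/sqrt(d) factor is ever applied.
    raw = (q @ k.transpose(-2, -1)).half()     # simulate FP16 storage of the logits
    return raw / dim_head ** 0.5


def attention_scores_safe(q, k, dim_head):
    # Scale the queries BEFORE the dot product: mathematically identical, but
    # every intermediate logit is 1/sqrt(d) smaller and stays in FP16 range.
    return ((q / dim_head ** 0.5) @ k.transpose(-2, -1)).half()


if __name__ == "__main__":
    torch.manual_seed(0)
    d = 64
    # Exaggerated activation magnitudes, chosen only to mimic statistics that
    # push FP16 toward overflow; real behaviour depends on the trained network.
    q = torch.randn(1, 6, 196, d) * 64
    k = torch.randn(1, 6, 196, d) * 64
    naive = attention_scores_naive(q, k, d)
    safe = attention_scores_safe(q, k, d)
    print("overflowed entries (scale after):", torch.isinf(naive).sum().item())
    print("overflowed entries (scale before):", torch.isinf(safe).sum().item())
```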
Implications and Future Directions
The insights gained from transitioning between convolutional and Transformer architectures establish a roadmap for hybrid model development. Visformer not only optimizes current networks but also sets the stage for downstream tasks such as object detection on the COCO dataset. Its key contribution is achieving both a high lower bound, i.e., strong performance with limited data and simple training settings, and a high upper bound, i.e., strong performance with extensive data and computational resources.
The implications of this research are profound in both theoretical modeling and practical applications of AI, as it suggests an integrative approach for future vision architectures. Researchers would benefit from exploring further refinements and applications of such hybrid models, addressing challenges in dynamic data environments and expanding their utility across varying computational scales.