Vision Transformer for Small-Size Datasets (2112.13492v1)

Published 27 Dec 2021 in cs.CV

Abstract: Recently, the Vision Transformer (ViT), which applied the transformer structure to the image classification task, has outperformed convolutional neural networks. However, the high performance of the ViT results from pre-training using a large-size dataset such as JFT-300M, and its dependence on a large dataset is interpreted as due to low locality inductive bias. This paper proposes Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA), which effectively solve the lack of locality inductive bias and enable it to learn from scratch even on small-size datasets. Moreover, SPT and LSA are generic and effective add-on modules that are easily applicable to various ViTs. Experimental results show that when both SPT and LSA were applied to the ViTs, the performance improved by an average of 2.96% in Tiny-ImageNet, which is a representative small-size dataset. Especially, Swin Transformer achieved an overwhelming performance improvement of 4.08% thanks to the proposed SPT and LSA.

Authors (3)
  1. Seung Hoon Lee (4 papers)
  2. Seunghyun Lee (60 papers)
  3. Byung Cheol Song (11 papers)
Citations (201)

Summary

  • The paper introduces innovative Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA) methods to address ViT's limitations on small datasets.
  • It demonstrates performance boosts with nearly 3% accuracy improvement on Tiny-ImageNet and roughly 4% on CIFAR-100 through enhanced spatial and attention strategies.
  • These approaches pave the way for efficient transformer models in domains with limited data, such as medical imaging, by reinforcing local inductive biases.

An Analytical Perspective on "Vision Transformer for Small-Size Datasets"

The Vision Transformer (ViT), which brought the transformer architecture to image classification, has surpassed performance benchmarks set by traditional Convolutional Neural Networks (CNNs). However, this success relies on pre-training with very large datasets, which becomes a significant limitation when only small datasets are available. The paper "Vision Transformer for Small-Size Datasets" by Seung Hoon Lee, Seunghyun Lee, and Byung Cheol Song addresses this limitation by proposing two novel methodologies: Shifted Patch Tokenization (SPT) and Locality Self-Attention (LSA).

Key Contributions

The paper identifies two main challenges that impair the performance of standard ViTs when dealing with small datasets: poor tokenization and sub-optimal attention mechanisms. These issues primarily stem from inadequate locality inductive biases. To counter these challenges, the authors propose SPT and LSA.

  1. Shifted Patch Tokenization (SPT):
    • SPT enhances the tokenization step by injecting spatial information before patch extraction. Inspired by the Temporal Shift Module, SPT concatenates each input image with several diagonally shifted copies of itself and tokenizes the result. This effectively enlarges the receptive field of each token, allowing a richer embedding of spatial relationships among neighboring pixels and thereby counteracting the low locality inductive bias of ViTs (a rough sketch follows this list).
  2. Locality Self-Attention (LSA):
    • LSA refines the self-attention mechanism by sharpening the distribution of attention scores. It replaces the fixed softmax scaling with a learnable temperature and applies diagonal masking, which excludes self-token relations, so that attention concentrates more strongly on informative inter-token relations, enhancing locality and improving overall attention efficacy (see the sketch after this list).
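
A rough PyTorch sketch of Shifted Patch Tokenization is shown below. It only illustrates the idea described above (half-patch diagonal shifts concatenated with the input along the channel axis, followed by patch flattening, layer normalization, and a linear projection); the module name, shift offsets, and dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def spatial_shift(x, dx, dy):
    """Shift a batch of images by (dx, dy) pixels, zero-filling the exposed border."""
    _, _, H, W = x.shape
    out = torch.zeros_like(x)
    src_y = slice(max(-dy, 0), H - max(dy, 0))
    dst_y = slice(max(dy, 0), H - max(-dy, 0))
    src_x = slice(max(-dx, 0), W - max(dx, 0))
    dst_x = slice(max(dx, 0), W - max(-dx, 0))
    out[:, :, dst_y, dst_x] = x[:, :, src_y, src_x]
    return out


class ShiftedPatchTokenization(nn.Module):
    """Concatenate the image with four diagonally shifted copies, then patchify and project."""

    def __init__(self, in_channels=3, patch_size=8, embed_dim=192):
        super().__init__()
        self.patch_size = patch_size
        patch_dim = in_channels * 5 * patch_size * patch_size  # original + 4 shifted copies
        self.norm = nn.LayerNorm(patch_dim)
        self.proj = nn.Linear(patch_dim, embed_dim)

    def forward(self, x):                                  # x: (B, C, H, W)
        s = self.patch_size // 2
        diagonals = [(-s, -s), (s, -s), (-s, s), (s, s)]   # four diagonal half-patch shifts
        x = torch.cat([x] + [spatial_shift(x, dx, dy) for dx, dy in diagonals], dim=1)
        # (B, 5C, H, W) -> (B, num_patches, 5C * P * P)
        patches = F.unfold(x, kernel_size=self.patch_size, stride=self.patch_size).transpose(1, 2)
        return self.proj(self.norm(patches))


tokens = ShiftedPatchTokenization()(torch.randn(2, 3, 64, 64))  # -> (2, 64, 192)
```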
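Similarly, Locality Self-Attention can be sketched as ordinary multi-head self-attention with two changes: the fixed 1/sqrt(d) scaling is replaced by a learnable temperature, and the diagonal of the score matrix is masked before the softmax. The snippet below is a minimal illustration under those assumptions; the head count, dimensions, and exact temperature parameterization are placeholders rather than the paper's code.

```python
import torch
import torch.nn as nn


class LocalitySelfAttention(nn.Module):
    """Multi-head self-attention with a learnable temperature and diagonal masking."""

    def __init__(self, dim=192, num_heads=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Learnable temperature, initialised at the usual 1/sqrt(d) scaling.
        self.temperature = nn.Parameter(torch.tensor(self.head_dim ** -0.5))
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                                  # x: (B, N, dim)
        B, N, _ = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)               # each (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        # Diagonal masking: drop each token's attention to itself before the softmax.
        mask = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float("-inf")).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(out)


attended = LocalitySelfAttention()(torch.randn(2, 64, 192))  # -> (2, 64, 192)
```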

Experimental Insights

Empirical studies conducted with the Tiny-ImageNet dataset reveal substantial performance improvements when applying SPT and LSA. For instance, integrating both techniques results in an average accuracy increase of 2.96%, with a notable 4.08% improvement in the case of Swin Transformer.

Further testing on various datasets demonstrated consistent gains: applying SPT and LSA to the baseline ViT improved accuracy by roughly 4% on both CIFAR-100 and Tiny-ImageNet. Experiments on a mid-sized dataset, ImageNet, also showed improvements, though to a slightly lesser degree, underscoring how these techniques reduce the need for large-scale pre-training data.

Theoretical and Practical Implications

The methodologies proposed could significantly influence the development of efficient vision transformers, particularly for use in settings with limited data. These include applications in specialized fields like medical imaging, where high-quality labeled data are often scarce. The introduction of stronger locality inductive biases through SPT and LSA also extends the theoretical understanding of ViT architectures and how they might adapt more efficiently to data constraints.

Future investigations could explore further modifications and combinations of these techniques, including adaptations for various data modalities and the reduction of computational overheads. Moreover, examining the integration of SPT and LSA into hybrid architectures that bridge CNN and transformer elements might yield insights into creating more versatile and robust models.

Conclusion

By innovatively addressing the locality inductive bias deficiency in ViTs, the paper makes significant strides toward the effective application of transformer-based architectures on smaller datasets. The demonstrated improvements affirm the transformative potential of SPT and LSA, setting a foundation for further exploration and optimization in transformer models within the computer vision field.