- The paper introduces a dense relative localization task that significantly enhances VT performance on small datasets.
- The study systematically compares second-generation VT architectures, showing that models which perform similarly on ImageNet degrade to very different degrees when training data is scarce.
- Extensive experiments show accuracy gains that in some small-dataset settings exceed 40 percentage points, broadening the practical applicability of Visual Transformers.
Overview of "Efficient Training of Visual Transformers with Small Datasets"
This paper addresses a pivotal challenge in deploying Visual Transformers (VTs) for computer vision tasks: their substantial data requirements. While VTs are acclaimed for modeling global image relationships and offering large representational capacity, they lack the convolutional inductive biases of Convolutional Neural Networks (CNNs), which makes them harder to train effectively without large amounts of data. This research investigates how different VTs behave when trained on small datasets and introduces a novel auxiliary self-supervised task that makes their training more robust.
The authors present three primary contributions. First, they empirically evaluate several second-generation VT architectures under limited data conditions. Although these architectures perform comparably on large-scale datasets such as ImageNet, their accuracy varies considerably on smaller datasets. This exposes a key limitation of current VT designs: without convolutional structure, they need abundant data to learn the local visual properties that CNNs capture by construction.
Second, the authors propose an auxiliary self-supervised task aimed at improving the training efficiency of VTs. The task, termed "dense relative localization," exploits densely sampled spatial relations within an image: the network must predict the relative 2D distance between pairs of randomly sampled final token embeddings. It requires no extra labels, adds only a negligible computational overhead, and encourages the token embeddings to encode spatial information.
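To make the idea concrete, below is a minimal PyTorch-style sketch of what such a dense relative localization head could look like. The class name `DenseRelativeLocHead`, the MLP sizes, the number of sampled pairs, and the offset normalization are illustrative assumptions for this summary, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DenseRelativeLocHead(nn.Module):
    """Sketch of a dense relative localization auxiliary head.

    Given the k x k grid of final token embeddings produced by a VT, it
    samples random pairs of tokens, concatenates their embeddings, and
    regresses the 2D offset between the two grid positions with a small MLP.
    """

    def __init__(self, embed_dim: int, hidden_dim: int = 256, num_pairs: int = 64):
        super().__init__()
        self.num_pairs = num_pairs
        # Small MLP: concatenated pair of token embeddings -> 2D offset.
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, k, k, C) grid of final token embeddings."""
        B, k, _, C = tokens.shape
        flat = tokens.reshape(B, k * k, C)

        # Sample `num_pairs` random position pairs per image.
        idx_a = torch.randint(0, k * k, (B, self.num_pairs), device=tokens.device)
        idx_b = torch.randint(0, k * k, (B, self.num_pairs), device=tokens.device)

        emb_a = torch.gather(flat, 1, idx_a.unsqueeze(-1).expand(-1, -1, C))
        emb_b = torch.gather(flat, 1, idx_b.unsqueeze(-1).expand(-1, -1, C))

        # Ground-truth offsets on the token grid (normalization is an
        # illustrative choice here).
        ua, va = idx_a // k, idx_a % k
        ub, vb = idx_b // k, idx_b % k
        target = torch.stack(((ua - ub) / k, (va - vb) / k), dim=-1).float()

        pred = self.mlp(torch.cat([emb_a, emb_b], dim=-1))
        # L1 regression of the predicted relative 2D displacement.
        return nn.functional.l1_loss(pred, target)
```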
Third, through extensive experimentation involving diverse VT architectures and datasets, the authors demonstrate that integrating their self-supervised task consistently improves VT performance, with accuracy gains being particularly notable under constrained data regimes.
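In practice, such an auxiliary objective is added to the standard supervised loss as a weighted sum. The sketch below illustrates one way a training step might combine the two; the two-output `model` interface, the `drloc_head` from the previous sketch, and the `aux_weight=0.1` coefficient are assumptions made for illustration, and the best weight is likely dataset-dependent.

```python
import torch
import torch.nn.functional as F

def training_step(model, drloc_head, images, labels, optimizer, aux_weight=0.1):
    """Hypothetical training step: supervised cross-entropy plus the
    dense relative localization loss, weighted by `aux_weight`."""
    logits, token_grid = model(images)   # assumed outputs: (B, num_classes), (B, k, k, C)
    ce_loss = F.cross_entropy(logits, labels)
    aux_loss = drloc_head(token_grid)    # auxiliary head from the sketch above
    loss = ce_loss + aux_weight * aux_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

Because the auxiliary loss only adds a lightweight MLP over already-computed token embeddings, the extra cost per step is small relative to the transformer forward pass.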
Strong Numerical Results and Bold Claims
The paper provides compelling numerical support for the proposed method. Adding the auxiliary task yields significant accuracy improvements, in some small-dataset scenarios exceeding 40 percentage points. This dramatic gain underscores how effectively the self-supervised objective compensates for the data inefficiency of many VT architectures.
Implications and Future Prospects
The introduction of a self-supervised objective tailored to VT architectures has notable theoretical and practical implications. Theoretically, it sheds light on the ability of VTs to learn spatial structure without relying on convolutional inductive biases. Practically, it can broaden the applicability of VTs to domains with limited access to large labeled datasets, such as specialized medical imaging or niche industrial applications.
As the field progresses, future research might integrate additional self-supervised objectives, for example combining dense relative localization with contrastive or clustering-based methods. Examining how the approach scales to larger VT models or alternative architectures could further establish its role in VT training strategies.
Concluding Remarks
This paper effectively highlights the data efficiency challenges faced by Visual Transformers and offers an innovative auxiliary task as a tangible solution. By bridging the gap between data-hungry VT architectures and their practical deployment scenarios, it paves the way for more versatile and adaptable VT applications in real-world settings.