Overview of "Escaping the Big Data Paradigm with Compact Transformers"
The paper "Escaping the Big Data Paradigm with Compact Transformers" addresses the prevalent notion that Transformers are inherently unsuitable for small datasets due to their dependence on large amounts of training data and extensive computational resources. It identifies the growing trend of using larger parameter models trained on expansive datasets, which excludes researchers with limited resources and constrains applications in domains where data availability is scarce.
Key Contributions
The authors introduce a family of compact Transformer models (ViT-Lite, CVT, and CCT) designed to perform efficiently on smaller datasets, thereby bridging the gap between CNNs and Transformers:
- ViT-Lite: A smaller variant of the Vision Transformer (ViT), made viable for small-scale learning by reducing depth and patch size.
- Compact Vision Transformer (CVT): Builds on ViT-Lite with a sequence pooling strategy (SeqPool) that pools over the output tokens, eliminating the class token typically used in Transformer classifiers.
- Compact Convolutional Transformer (CCT): Replaces patch-based tokenization with a convolutional tokenizer, which preserves local spatial information, accepts flexible input image sizes, and reduces the reliance on positional embeddings (a minimal sketch of the tokenizer and sequence pooling follows this list).
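The following is a minimal PyTorch sketch of the two components described above, a convolutional tokenizer and sequence pooling. The module names, layer sizes, and the use of nn.TransformerEncoder as the backbone are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal, simplified sketch of CCT-style components (illustrative sizes, not the paper's exact setup).
import torch
import torch.nn as nn


class ConvTokenizer(nn.Module):
    """Maps an image to a token sequence with a small conv block, so no fixed
    patch grid or positional embedding is strictly required."""
    def __init__(self, in_channels=3, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.conv(x)                         # (B, D, H', W')
        return x.flatten(2).transpose(1, 2)      # (B, N, D) token sequence


class SeqPool(nn.Module):
    """Attention-weighted pooling over output tokens, replacing the class token."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.attn = nn.Linear(embed_dim, 1)

    def forward(self, tokens):                   # tokens: (B, N, D)
        weights = self.attn(tokens).softmax(dim=1)             # (B, N, 1)
        return (weights.transpose(1, 2) @ tokens).squeeze(1)   # (B, D)


# Usage sketch: tokenize, encode with any Transformer, pool, classify.
tokenizer, pool = ConvTokenizer(), SeqPool()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2
)
head = nn.Linear(256, 10)

images = torch.randn(8, 3, 32, 32)               # e.g. a CIFAR-10-sized batch
logits = head(pool(encoder(tokenizer(images))))  # (8, 10)
```

In the actual CCT variants the tokenizer may stack several convolution blocks and the encoder is deeper; the point of the sketch is only that pooling over tokens stands in for the class token and the convolutional tokenizer stands in for fixed patch embedding.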
Strong Numerical Results
The proposed models demonstrate compelling performance:
- CIFAR-10: CCT reaches 98% accuracy with only 3.7M parameters, surpassing much larger CNNs such as ResNet50 while using significantly fewer resources.
- Flowers-102: Sets a new state-of-the-art result with 99.76% top-1 accuracy, demonstrating its efficacy on small, high-resolution datasets.
- ImageNet: Delivers competitive performance, reaching 82.71% top-1 accuracy with 29% fewer parameters than ViT.
Implications and Future Developments
The primary implication of this work is that it empowers researchers working with constrained computational resources, promoting inclusivity in AI research. CCTs enable high performance on small datasets, broadening the application scope in areas such as medicine and rare-event scientific research, where collecting large datasets is impractical.
Theoretically, the work suggests that architectural innovations, such as convolutional tokenization and sequence pooling, can mitigate the prevalent data dependency issue in Transformer models. The model’s reduced reliance on positional embeddings encourages further exploration of hybrid architectures blending advantages from both CNNs and Transformers.
Future research could refine these architectures further and extend them to other domains such as NLP, where brief experiments included in the paper already indicate promising results. Investigations could also focus on deploying these compact models in real-time applications or on edge devices, capitalizing on their reduced computational demands.
In conclusion, the paper provides a significant step towards making advanced Transformer architectures more accessible and applicable across various domains with limited datasets, promoting broader engagement and innovation in AI research.