
Better plain ViT baselines for ImageNet-1k

Published 3 May 2022 in cs.CV (arXiv:2205.01580v1)

Abstract: It is commonly accepted that the Vision Transformer model requires sophisticated regularization techniques to excel at ImageNet-1k scale data. Surprisingly, we find this is not the case and standard data augmentation is sufficient. This note presents a few minor modifications to the original Vision Transformer (ViT) vanilla training setting that dramatically improve the performance of plain ViT models. Notably, 90 epochs of training surpass 76% top-1 accuracy in under seven hours on a TPUv3-8, similar to the classic ResNet50 baseline, and 300 epochs of training reach 80% in less than one day.

Citations (99)

Summary

  • The paper's main contribution is showing that simple training tweaks and standard data augmentation are enough to boost plain ViT performance past 76% top-1 accuracy after 90 epochs, in under seven hours on a TPUv3-8.
  • The authors implement methodological changes like reducing batch size, replacing the class token with global average pooling, and adopting fixed sine-cosine positional embeddings.
  • Experimental results demonstrate that these streamlined adjustments enable competitive performance against methods like ResNet50 while lowering computational complexity.
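One of the bulleted changes, replacing the learned class token with global average pooling (GAP) of the final patch tokens, can be sketched in a few lines. This is an illustrative NumPy sketch, not the authors' implementation; the function name `gap_head` and the plain linear classifier are assumptions for illustration.

```python
import numpy as np

def gap_head(tokens, w, b):
    """Classification head using global average pooling (GAP).

    Instead of reading the prediction off a learned [cls] token, the
    encoder's final patch tokens are averaged and fed to a linear
    classifier (illustrative sketch, not the paper's code).

    tokens: (num_patches, dim) final-layer patch tokens
    w:      (dim, num_classes) classifier weights
    b:      (num_classes,)     classifier bias
    """
    pooled = tokens.mean(axis=0)  # (dim,) average over all patch tokens
    return pooled @ w + b         # (num_classes,) logits
```

The appeal of this change is that it removes one learned token and makes the head depend symmetrically on every patch, which the paper reports as one of the modifications that improves the plain baseline.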

Better Plain ViT Baselines for ImageNet-1k

The paper "Better plain ViT baselines for ImageNet-1k" presents an intriguing reevaluation of the Vision Transformer (ViT) training framework on the ImageNet-1k dataset. Contrary to prevailing assumptions that effective ViT performance requires sophisticated regularization and vast pre-training datasets, this study demonstrates that standard data augmentation alone can significantly enhance performance.

Key Contributions

The authors identify and implement a series of minor modifications to the ViT training regimen that markedly improve its efficacy. The primary focus is to simplify the ViT baseline while retaining competitive performance when compared to established methods like ResNet50 and other contemporary ViT implementations.

  1. Training Efficiency: The study achieves over 76% top-1 accuracy after 90 training epochs, in under seven hours on a TPUv3-8. Extended to 300 epochs, the model reaches 80% top-1 accuracy in less than a day.
  2. Simplified Training Regime: The authors leverage several straightforward adjustments:
    • Batch size reduction from 4096 to 1024.
    • Adoption of global average pooling (GAP) instead of a class token.
    • Fixed 2D sine-cosine positional embeddings in place of learned embeddings.
    • Incorporation of RandAugment and Mixup for data augmentation, albeit at conservative levels.
  3. Baseline ViT Setup: A detailed experimental setup maintaining the original architecture underscores the simplicity and effectiveness of carefully chosen hyperparameters without resorting to advanced augmentation or regularization techniques such as dropout, stochastic depth, or high-resolution fine-tuning.
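The fixed 2D sine-cosine positional embeddings from item 2 can be generated once at initialization and never trained. The following is a minimal NumPy sketch under common conventions (sin/cos features at geometrically spaced frequencies, a quarter of the embedding width per coordinate-function pair); the function name, `temperature` default, and feature ordering are assumptions for illustration, not the authors' exact code.

```python
import numpy as np

def posemb_sincos_2d(h, w, dim, temperature=10000.0):
    """Fixed 2D sine-cosine positional embeddings for an h x w patch grid.

    Each patch position (y, x) is encoded with sin/cos features at
    multiple frequencies, dim // 4 features per (coordinate, function)
    pair, so no positional parameters are learned. Requires dim % 4 == 0.
    Returns an array of shape (h * w, dim).
    """
    assert dim % 4 == 0, "embedding width must be divisible by 4"
    y, x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Geometrically spaced inverse frequencies, one per feature slot.
    omega = 1.0 / temperature ** (np.arange(dim // 4) / (dim // 4))
    y = y.reshape(-1, 1) * omega  # (h*w, dim//4) scaled row coordinates
    x = x.reshape(-1, 1) * omega  # (h*w, dim//4) scaled column coordinates
    return np.concatenate(
        [np.sin(x), np.cos(x), np.sin(y), np.cos(y)], axis=1
    )
```

Because the table is a pure function of the grid size, it can be recomputed for any input resolution, which is part of why fixed embeddings are an attractive simplification over a learned positional table.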

Experimental Validation

The empirical results, represented graphically in Figure 1 of the paper, reveal the efficiency of the proposed adjustments. Comparisons with ResNet50 and other ViT variants confirm the enhanced performance of the refined baseline. Furthermore, Table 1 evaluates the impact of individual modifications using ablation studies, underscoring their cumulative beneficial effect on model accuracy.

The enhancements are also benchmarked on evaluation sets beyond the standard ImageNet validation split, with consistent gains on ImageNet-ReaL and ImageNet-V2 demonstrating their robustness.

Implications and Future Directions

The implications of these findings are twofold:

  • Simplification in Model Training: The study shows that simplicity in model training can yield substantial performance benefits, challenging the community to rethink complex methodologies and thereby potentially reducing computational costs and broadening access to high-performing models.
  • Practicality for Wider Usage: By reaching competitive accuracy with streamlined methods, the research paves the way for more efficient deployment of ViT models in practical applications, especially where computational resources are constrained.

Given the current trajectory, future research could explore the scalability of these simplified methods in other domains and with diverse datasets. Additionally, it invites further exploration into optimizing hyperparameter settings and architectural choices without amplifying complexity unnecessarily.

In conclusion, this paper provides a valuable contribution to the landscape of computer vision by offering a well-founded, simplified approach to ViT training that promises both practical and theoretical benefits. This could inspire a reassessment of entrenched practices regarding model training methodologies within the research community.
