Scaling Vision Transformers (2106.04560v2)

Published 8 Jun 2021 in cs.CV, cs.AI, and cs.LG

Abstract: Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results, therefore, understanding a model's scaling properties is a key to designing future generations effectively. While the laws for scaling Transformer LLMs have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy. The model also performs well for few-shot transfer, for example, reaching 84.86% top-1 accuracy on ImageNet with only 10 examples per class.

Scaling Vision Transformers

The paper presents a comprehensive study of the scaling properties of Vision Transformers (ViT), focusing on how compute, model size, and dataset size affect performance on image classification tasks. The study spans ViT models ranging from five million to two billion parameters, datasets from one million to three billion training images, and compute budgets from less than one TPUv3 core-day to beyond 10,000 core-days.

Key Findings

The paper's core contribution is the characterization of the performance-compute frontier for ViT models, showing that performance follows a double-saturating power law as a function of total training compute. In particular:

  1. Compute, Model, and Data Scaling: Increased compute allocation, model size, and dataset size collectively yield improved representation quality. However, there is evidence of performance saturation at the highest compute levels.
  2. Bottleneck Analysis: Representation quality is often bottlenecked by model size at lower compute levels and by dataset size at higher compute levels. Smaller models do not benefit significantly from additional data or compute, while large models continue to improve with increased data.
  3. Sample Efficiency: Larger models are more sample efficient, achieving similar performance with fewer images seen during training. This trend holds across different evaluation setups, including few-shot transfer and full dataset fine-tuning.
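
To make the double-saturating power law above concrete, one common way to write such a frontier (the functional form and symbols here are an illustration consistent with the paper's description, not its fitted values) is

    E(C) = a (C + d)^{-b} + c

where E is the downstream error, C is total training compute, a and b set the scale and slope of the power-law region, c is the irreducible error that remains even with unbounded compute (saturation at the high end), and d shifts the curve so that error also flattens at very small compute budgets (saturation at the low end).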

Method Improvements

The paper also introduces several methodological improvements to optimize the training of large ViT models:

  1. Decoupled Weight Decay: Decoupling the weight decay applied to the head (the final linear layer) from that applied to the body (all remaining layers) yields significant gains in few-shot transfer performance; a sketch follows this list.
  2. Memory Optimization: Removing the extra class token and instead pooling patch tokens with a Multihead Attention Pooling (MAP) head reduces memory usage without sacrificing accuracy; a second sketch follows this list.
  3. Memory-Efficient Optimizers: The paper evaluates and adapts memory-efficient optimizers, notably a modified Adafactor, to keep the optimizer state tractable for billions of model parameters.
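
A minimal sketch of the decoupled weight decay from item 1, using PyTorch parameter groups. The attribute name "head", the optimizer choice (AdamW rather than the paper's modified Adafactor), and the decay values are illustrative assumptions, not the paper's exact recipe.

    import torch
    from torch import nn

    def build_optimizer(model: nn.Module,
                        head_wd: float = 3.0,    # placeholder: stronger decay on the head
                        body_wd: float = 0.03,   # placeholder: weaker decay on the body
                        lr: float = 1e-3) -> torch.optim.Optimizer:
        """Apply separate weight decay to the head (final linear layer) and the body."""
        head_params, body_params = [], []
        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue
            # Assumes the classification head is registered under the attribute name "head".
            (head_params if name.startswith("head") else body_params).append(param)
        return torch.optim.AdamW(
            [{"params": body_params, "weight_decay": body_wd},
             {"params": head_params, "weight_decay": head_wd}],
            lr=lr,
        )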
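
Item 2's Multihead Attention Pooling can be sketched as a single learned query that cross-attends over the patch tokens. The class name, head count, and the omission of the MLP block that typically follows the attention are simplifications, not the paper's exact head.

    import torch
    from torch import nn

    class MAPHead(nn.Module):
        """Multihead Attention Pooling: one learned probe token attends over all patch tokens."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.probe = nn.Parameter(torch.randn(1, 1, dim) * 0.02)      # learned query token
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, num_patches, dim) -- no class token required.
            query = self.probe.expand(tokens.shape[0], -1, -1)            # (batch, 1, dim)
            pooled, _ = self.attn(query, tokens, tokens)                  # cross-attention over patches
            return self.norm(pooled).squeeze(1)                           # (batch, dim) image embedding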

Numerical Results

Remarkable numerical results demonstrate the efficacy of the approach:

  • The two-billion-parameter model, ViT-G/14, achieves a new state-of-the-art 90.45% top-1 accuracy on ImageNet.
  • Few-shot transfer is strong as well: 84.86% top-1 accuracy on ImageNet with only 10 labeled examples per class (a sketch of a linear few-shot probe follows this list).
  • ViT-G/14 also outperforms previous models across a range of other benchmarks, establishing new state-of-the-art results on multiple image classification tasks.
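
The 10-shot ImageNet number comes from fitting a linear probe on frozen features. Below is a minimal sketch of one common protocol, closed-form ridge regression onto one-hot targets; the regularization strength and the exact target encoding are assumptions rather than the paper's precise evaluation recipe.

    import numpy as np

    def linear_fewshot_probe(train_feats: np.ndarray, train_labels: np.ndarray,
                             test_feats: np.ndarray, num_classes: int,
                             l2: float = 1e-3) -> np.ndarray:
        """Fit a linear classifier on frozen features via closed-form ridge regression."""
        targets = np.eye(num_classes)[train_labels]                  # one-hot targets, (n_train, num_classes)
        gram = train_feats.T @ train_feats + l2 * np.eye(train_feats.shape[1])
        weights = np.linalg.solve(gram, train_feats.T @ targets)     # (dim, num_classes)
        scores = test_feats @ weights                                # class scores for the test set
        return scores.argmax(axis=1)                                 # predicted labels

With 10 examples per class on ImageNet, train_feats would hold 10,000 frozen embeddings produced by the pretrained ViT.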

Practical and Theoretical Implications

The paper has several implications for the future of vision transformers and AI research:

  1. Model Scaling: The clear trade-offs identified between model size, compute, and data provide a roadmap for developing even larger and more effective vision transformers.
  2. Sample Efficiency: The greater sample efficiency of larger models could drive future AI applications in scenarios where labeled data is scarce.
  3. Generalization of Scaling Laws: Extending these scaling laws to other vision tasks and models could generalize the findings, aiding in the design of more robust and efficient AI systems.

Future Directions

Several future research directions emerge from this paper:

  • Scaling Beyond Current Limits: Exploring models beyond two billion parameters and datasets larger than three billion images could uncover new insights.
  • Real-World Applications: Applying these scaling laws and new models in real-world applications and fine-tuning for specific tasks could yield significant practical benefits.
  • Cross-Domain Generalization: Extending the scaling laws to other vision tasks, such as object detection or segmentation, would generalize the findings and potentially lead to breakthroughs in these fields.

Conclusion

The paper provides a rigorous analysis of the scaling properties of Vision Transformers, offering valuable insights into how compute, model size, and data interact to govern performance. The findings have significant implications for the future of AI research and practice, guiding the efficient design of next-generation vision models.

Authors (4)
  1. Xiaohua Zhai
  2. Alexander Kolesnikov
  3. Neil Houlsby
  4. Lucas Beyer
Citations (941)