When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations (2106.01548v3)

Published 3 Jun 2021 in cs.CV and cs.LG

Abstract: Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pre-training and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rates). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3\% and +11.0\% top-1 accuracy on ImageNet for ViT-B/16 and Mixer-B/16, respectively, with the simple Inception-style preprocessing). We show that the improved smoothness attributes to sparser active neurons in the first few layers. The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pre-training or strong data augmentations. Model checkpoints are available at \url{https://github.com/google-research/vision_transformer}.

Summary

  • The paper demonstrates that the SAM optimizer reduces the need for large-scale pre-training, boosting top-1 ImageNet accuracy by +5.3% for ViT-B/16 and +11.0% for Mixer-B/16 over the same models trained without SAM.
  • The study reveals that applying SAM enhances model robustness, as shown by a 9.9% increase in ImageNet-C accuracy for ViT-B/16.
  • The research highlights that efficient training with SAM can eliminate the dependence on heavy data augmentations, benefiting resource-constrained deep learning applications.

Evaluation of Vision Transformers and MLP-Mixers without Pre-Training or Strong Data Augmentations

The paper explores the performance of Vision Transformers (ViTs) and Multi-Layer Perceptron Mixers (MLP-Mixers) when trained without pre-training on large datasets or strong data augmentations. Traditionally, these models depend on such strategies to match the accuracy and robustness of established convolution-based architectures like ResNets, a dependency that imposes heavy data and computational demands. The paper adopts the recently proposed sharpness-aware minimization (SAM) optimizer to address these challenges by smoothing the loss landscape during training.
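For reference, SAM's objective, introduced by Foret et al. and used here as-is, is a min-max problem over an ℓ2 neighborhood of radius ρ around the weights, with the inner maximization approximated by a single normalized gradient step:

```latex
% SAM objective: minimize the worst-case loss within an l2 ball of radius rho.
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} L_{\mathrm{train}}(w + \epsilon)

% One-step approximation of the inner maximizer:
\hat{\epsilon}(w) = \rho \, \frac{\nabla_w L_{\mathrm{train}}(w)}{\left\| \nabla_w L_{\mathrm{train}}(w) \right\|_2}
```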

Methodology

The authors analyze ViTs and MLP-Mixers through the lens of loss-landscape geometry, observing that these models converge to sharp local minima, which can hurt generalization. The primary goal is to reduce the reliance on large-scale pre-training and complex data augmentations by employing the SAM optimizer. SAM minimizes not only the training error but also the sharpness of the loss around the current weights, steering optimization toward flatter minima that tend to generalize better. A minimal sketch of one SAM update is shown below.
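The following is a minimal PyTorch sketch of a single SAM update, intended only to illustrate the two-pass structure; the helper `sam_step` and its signature are hypothetical, and the paper's own experiments use a different (JAX-based) codebase.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05):
    """One sharpness-aware update (hypothetical helper): perturb the weights by
    epsilon = rho * g / ||g||, take the gradient at the perturbed point, then
    descend from the original weights using that gradient."""
    # First forward-backward pass: gradient g at the current weights w.
    base_optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Global l2 norm of the gradient across all parameters.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)

    # Perturb in place: w <- w + epsilon.
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = p.grad * (rho / (grad_norm + 1e-12))
            p.add_(e)
            perturbations.append((p, e))

    # Second forward-backward pass: gradient at w + epsilon.
    base_optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # Restore w, then step with the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

Because each update takes two forward-backward passes (one at w, one at w + ε), SAM roughly doubles per-step compute relative to the base optimizer.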

Key Results

The results are compelling, demonstrating:

  • Improved Performance Without Pre-Training: ViTs and MLP-Mixers trained with SAM achieve substantial gains in top-1 accuracy on ImageNet, surpassing similar-sized ResNets when trained from scratch. Specifically, ViT-B/16 and Mixer-B/16 improve by +5.3% and +11.0% in top-1 accuracy, respectively, over their counterparts trained without SAM.
  • Enhanced Robustness: The models are markedly more robust on adversarial and corrupted inputs; for instance, ViT-B/16 gains 9.9% in accuracy on ImageNet-C, indicating a robustness improvement attributable to SAM.
  • Efficiency Gains: The models achieve these improvements without the need for massive datasets for pre-training or sophisticated augmentation strategies, offering significant efficiency gains.

Theoretical Implications

The findings challenge the conventional dependency on pre-training and data augmentations, positing that optimization strategies like SAM can effectively train architectures that lack strong inductive biases. This insight matters for the computational efficiency of deploying deep learning models, particularly when data and compute are constrained.

Future Directions

This research opens opportunities to develop more efficient training paradigms that leverage optimization techniques over pre-training or data-heavy augmentations. Potential areas of exploration include:

  • Exploration of SAM's Parameters: Further investigation into SAM's hyperparameters, chiefly the neighborhood radius ρ, could optimize its application across different architectures and scales (a sweep sketch follows this list).
  • Broader Applications: Extending this approach to other types of models and tasks could provide a broader understanding of its applicability.
  • Real-Time and Resource-Constrained Environments: Applying these methods in environments with limited resources may reveal practical insights and drive real-world deployment of these models.
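As a starting point for the first direction, a sweep over SAM's radius ρ might look like the sketch below. It reuses the hypothetical `sam_step` from the Methodology section; `build_model_and_optimizer`, `loss_fn`, `train_loader`, `evaluate`, and `num_epochs` are assumed user-defined, and the ρ grid is illustrative rather than taken from the paper.

```python
# Hypothetical sweep over SAM's neighborhood radius rho. All helpers
# (build_model_and_optimizer, loss_fn, train_loader, evaluate) and num_epochs
# are assumed user-defined; the rho grid is illustrative, not from the paper.
for rho in (0.01, 0.05, 0.1, 0.2):
    model, optimizer = build_model_and_optimizer()
    for epoch in range(num_epochs):
        for inputs, targets in train_loader:
            sam_step(model, loss_fn, inputs, targets, optimizer, rho=rho)
    print(f"rho={rho}: val accuracy={evaluate(model):.3f}")
```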

Conclusion

This research underscores the potential of optimization techniques, such as SAM, in enhancing model robustness and accuracy without the computational overhead of traditional training approaches. The implications for both theoretical research and practical deployment of vision-based architectures are significant, paving the way for more efficient and versatile AI systems.

By addressing the key challenges and proposing a viable alternative, this work contributes to the ongoing evolution of neural architecture design and training methodologies.
