
Early Convolutions Help Transformers See Better (2106.14881v3)

Published 28 Jun 2021 in cs.CV

Abstract: Vision transformer (ViT) models exhibit substandard optimizability. In particular, they are sensitive to the choice of optimizer (AdamW vs. SGD), optimizer hyperparameters, and training schedule length. In comparison, modern convolutional neural networks are easier to optimize. Why is this the case? In this work, we conjecture that the issue lies with the patchify stem of ViT models, which is implemented by a stride-p p*p convolution (p=16 by default) applied to the input image. This large-kernel plus large-stride convolution runs counter to typical design choices of convolutional layers in neural networks. To test whether this atypical design choice causes an issue, we analyze the optimization behavior of ViT models with their original patchify stem versus a simple counterpart where we replace the ViT stem by a small number of stacked stride-two 3*3 convolutions. While the vast majority of computation in the two ViT designs is identical, we find that this small change in early visual processing results in markedly different training behavior in terms of the sensitivity to optimization settings as well as the final model accuracy. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance (by ~1-2% top-1 accuracy on ImageNet-1k), while maintaining flops and runtime. The improvement can be observed across the wide spectrum of model complexities (from 1G to 36G flops) and dataset scales (from ImageNet-1k to ImageNet-21k). These findings lead us to recommend using a standard, lightweight convolutional stem for ViT models in this regime as a more robust architectural choice compared to the original ViT model design.

Citations (681)

Summary

  • The paper demonstrates that using early convolution layers in ViT models significantly improves optimization stability by reducing sensitivity to hyperparameters.
  • The study shows a 1-2% increase in top-1 accuracy on ImageNet, validating the effectiveness of a traditional convolutional stem over the patchify approach.
  • The paper finds that convolution-based early processing accelerates convergence, offering practical benefits for both extensive architecture search and fine-tuning.

Early Convolutions Help Transformers See Better: An In-Depth Analysis

This paper investigates the optimizability issues of Vision Transformer (ViT) models, which have shown sensitivity to optimizer choice, hyperparameter tuning, and training schedule length compared to Convolutional Neural Networks (CNNs). The authors propose that the problem stems primarily from the ViT patchify stem, typically implemented as a stride-16, 16×16 convolution applied to the input image. This atypical design choice contrasts with the stride-two 3×3 convolutions commonly found in CNNs, which favor local processing.

To test their hypothesis, the authors replace the ViT's patchify stem with a traditional convolutional stem built from a small stack of stride-two 3×3 convolutions. The replacement, while a minor alteration affecting only the initial processing layers, markedly improves training behavior and model accuracy.
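Below is a minimal sketch of the two stem designs, assuming a PyTorch-style implementation; the channel widths, embedding dimension, and use of BatchNorm/ReLU are illustrative placeholders rather than the paper's exact configuration. The point it illustrates is that both stems reduce a 224×224 image to the same 14×14 grid of token embeddings, so everything downstream of the stem is unchanged.

```python
import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """Original ViT stem: a single stride-16, 16x16 convolution that maps each
    non-overlapping patch directly to a token embedding."""
    def __init__(self, embed_dim=384):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, embed_dim) token sequence

class ConvStem(nn.Module):
    """Convolutional stem: a stack of stride-two 3x3 convolutions (each followed
    here by BatchNorm and ReLU) that also reduces resolution by 16x, ending with
    a 1x1 projection to the embedding dimension. Widths are placeholders."""
    def __init__(self, embed_dim=384, widths=(48, 96, 192, 384)):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in widths:                  # four stride-2 stages: 224 -> 14
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        layers.append(nn.Conv2d(in_ch, embed_dim, kernel_size=1))
        self.stem = nn.Sequential(*layers)

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.stem(x)                       # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)    # (B, 196, embed_dim), same shape as above

# Both stems yield the same 14x14 token grid, so the transformer blocks that
# follow are identical; only the early visual processing differs.
tokens_p = PatchifyStem()(torch.randn(2, 3, 224, 224))
tokens_c = ConvStem()(torch.randn(2, 3, 224, 224))
assert tokens_p.shape == tokens_c.shape == (2, 196, 384)
```

Because the token grid and embedding dimension match, a stem of this kind can be swapped into an existing ViT without touching the transformer blocks or positional embeddings.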

Key Findings

  • Optimization Stability: Substituting the patchify stem with a convolutional counterpart yields ViT models, referred to as ViT_C, with considerably enhanced optimization stability. This includes reduced sensitivity to hyperparameters such as learning rate and weight decay, as well as the ability to converge reliably under both AdamW and SGD optimizers.
  • Performance Gains: The convolutional stem yields an increase of 1-2% in top-1 accuracy on ImageNet-1k. This improvement holds across varying model sizes (roughly 1G to 36G flops) and dataset scales, from ImageNet-1k to ImageNet-21k, indicating its robustness and efficacy.
  • Convergence Speed: Models with the convolutional stem converge faster than the original patchify-based ViT models (ViT_P). The reduced time to convergence is particularly advantageous for extensive architecture search or quick prototyping.
  • Comparative Advantage: When pre-training on larger datasets such as ImageNet-21k and fine-tuning on ImageNet-1k, the ViT_C models outperform contemporary CNNs, countering previous observations in which the original ViT struggled to match state-of-the-art CNNs.

Implications and Future Directions

The paper suggests a shift towards integrating early-stage convolutions into the design of ViT models, which can capture the optimization stability and performance advantages of CNNs while retaining the representational capabilities of transformer-based architectures. These insights carry significant implications for developing hybrid models that leverage the strengths of both CNNs and Transformers.

Future research could explore the underlying reasons for these improvements at a theoretical level. Additionally, experiments on larger and more complex datasets and models would help ascertain whether these benefits persist. Moreover, investigating the integration of other convolutional components within deeper layers of ViT models might further enhance their performance.

Conclusion

The study underscores the importance of early visual processing in transformer models and illustrates how even minor modifications, such as incorporating a few convolutional layers, can yield significant gains in optimizability and accuracy. This work not only contributes to improving the design and training of ViT models but also opens avenues for building more robust AI systems that effectively combine the best practices of CNNs and Transformers.
