ConvNets Match Vision Transformers at Scale (2310.16764v1)

Published 25 Oct 2023 in cs.CV, cs.LG, and cs.NE

Abstract: Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to datasets on the web-scale. We challenge this belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of images often used for training foundation models. We consider pre-training compute budgets between 0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width from the NFNet model family. We observe a log-log scaling law between held out loss and compute budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.

Authors (4)
  1. Samuel L. Smith (27 papers)
  2. Andrew Brock (21 papers)
  3. Leonard Berrada (14 papers)
  4. Soham De (38 papers)
Citations (16)

Summary

ConvNets Match Vision Transformers at Scale

The paper "ConvNets Match Vision Transformers at Scale," authored by Samuel L. Smith, Andrew Brock, Leonard Berrada, and Soham De from Google DeepMind, critically examines the prevalent assertion in computer vision that Vision Transformers (ViTs) surpass Convolutional Neural Networks (ConvNets) at scale. This paper explores an extensive empirical comparison between a ConvNet architecture and ViTs, utilizing substantial computational resources for pre-training on a large-scale dataset.

Introduction

ConvNets have been foundational in computer vision, achieving early successes and dominating benchmarks for nearly a decade. With the advent of ViTs, however, the field has shifted towards transformer-based architectures for image recognition. The current consensus holds that ViTs exhibit superior scaling properties when trained on large datasets collected from the web. This paper challenges that view by rigorously evaluating the NFNet model family, a state-of-the-art ConvNet architecture, pre-trained on the JFT-4B dataset and compared against ViTs under equivalent computational budgets.

Methodology

NFNet models in configurations ranging from F0 to F7+ were pre-trained on the JFT-4B dataset, which contains roughly 4 billion labelled images. Pre-training was carried out across compute budgets spanning 0.4k to 110k TPU-v4 core hours. The training methodology followed established NFNet practice, using SGD with momentum, Adaptive Gradient Clipping (AGC), and distinct image resolutions during training and evaluation.
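For readers unfamiliar with AGC, the sketch below illustrates the idea in PyTorch: each parameter's gradient is rescaled whenever its unit-wise norm exceeds a small fraction of the corresponding parameter norm. This is a minimal sketch of the general technique, not the authors' implementation; the clipping factor and epsilon shown are generic illustrative defaults rather than the paper's settings.

```python
import torch

def unitwise_norm(x: torch.Tensor) -> torch.Tensor:
    """L2 norm per output unit: per-row for linear weights, per-filter for conv filters."""
    if x.ndim <= 1:                 # biases and scalar gains: treat as a single unit
        return x.norm(p=2)
    dims = tuple(range(1, x.ndim))  # reduce over every dimension except the output one
    return x.norm(p=2, dim=dims, keepdim=True)

@torch.no_grad()
def adaptive_gradient_clip(parameters, clip_factor=0.01, eps=1e-3):
    """Rescale each unit's gradient so its norm is at most clip_factor * max(||w||, eps)."""
    for p in parameters:
        if p.grad is None:
            continue
        w_norm = unitwise_norm(p)
        g_norm = unitwise_norm(p.grad)
        max_norm = clip_factor * w_norm.clamp(min=eps)
        # Scale factors are capped at 1, so gradients below the threshold are untouched.
        scale = (max_norm / g_norm.clamp(min=1e-6)).clamp(max=1.0)
        p.grad.mul_(scale)
```

In a training loop, adaptive_gradient_clip(model.parameters()) would be called after loss.backward() and before the step of an SGD-with-momentum optimizer.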

A clear log-log scaling law emerged between held-out validation loss and compute budget, akin to trends observed in language modelling with transformers. The paper also explored the optimal epoch budget and learning rate across model sizes, following the principle that both should be scaled together with the compute budget, echoing earlier findings in language modelling.
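As a concrete illustration of what fitting such a scaling law involves, the NumPy sketch below performs a linear fit in log-log space, so that loss ≈ A · compute^b. The function name and inputs are assumptions for illustration, not the paper's analysis code.

```python
import numpy as np

def fit_power_law(compute_hours: np.ndarray, held_out_loss: np.ndarray):
    """Fit log(loss) = b * log(compute) + log(A), i.e. loss ≈ A * compute**b.

    A straight line in log-log space with negative slope b corresponds to
    held-out loss falling as a power law in the pre-training compute budget.
    """
    slope, intercept = np.polyfit(np.log(compute_hours), np.log(held_out_loss), deg=1)
    return slope, np.exp(intercept)  # (exponent b, prefactor A)
```

Given per-model compute budgets (e.g. in TPU-v4 core hours) and the corresponding held-out losses, the fitted exponent b would be negative when loss decreases with compute.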

Results

After fine-tuning the pre-trained NFNets on ImageNet, the NFNets were on par with their ViT counterparts in Top-1 accuracy. Notably, the NFNet-F7+ model achieved a Top-1 accuracy of 90.4% when fine-tuned with repeated augmentation, a substantial improvement over previous NFNet results obtained without extra pre-training data. This parity held when comparing models with similar pre-training compute budgets, highlighting the efficacy of ConvNets at scale.
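Repeated augmentation means that each sampled image appears several times within a batch under independent random augmentations. The PyTorch-style sampler below is a minimal sketch of that idea; the class name, repeat count, and structure are illustrative assumptions and do not reflect the authors' fine-tuning code.

```python
import torch
from torch.utils.data import Sampler

class RepeatedAugmentSampler(Sampler):
    """Yield each index several times so a batch contains multiple
    independently augmented views of the same image."""

    def __init__(self, dataset, num_repeats: int = 4):
        self.dataset = dataset
        self.num_repeats = num_repeats

    def __iter__(self):
        order = torch.randperm(len(self.dataset)).tolist()
        for idx in order:
            # Each repeated index passes through the dataset's random
            # augmentation pipeline again, yielding a different view.
            for _ in range(self.num_repeats):
                yield idx

    def __len__(self):
        return len(self.dataset) * self.num_repeats
```

Pairing this sampler with a standard ImageNet DataLoader and a random augmentation transform gives batches in which each image contributes several augmented copies.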

Analysis

The results underscore that, with sufficient computational resources and data, ConvNets can match the performance of ViTs. This finding contests the prevailing assumption that ViTs inherently possess superior scaling properties. The linear trend in the log-log relationship between validation loss and compute budget prompts a re-evaluation of current biases towards transformer architectures.

Discussion

The paper's implications are twofold. Practically, it suggests that researchers can still rely on ConvNets for competitive performance in large-scale vision tasks, provided adequate compute and data. Theoretically, it challenges the narrative favoring transformers, advocating for a more nuanced view that considers the critical role of compute and data irrespective of the model architecture. Future developments in AI will likely explore hybrid models, leveraging the strengths of both ConvNets and transformers.

Conclusion

In conclusion, the paper robustly demonstrates that ConvNets can indeed match the performance of ViTs at scale, challenging the orthodoxy in contemporary computer vision research. By providing rigorous empirical evidence, the paper invites the research community to re-assess the comparative advantages of these architectures, potentially fostering a more balanced approach to developing future AI models.
