- The paper demonstrates that supervised backbones like ConvNeXt-Base, SwinV2-Base, and CLIP ViT-Base consistently outperform other models across diverse computer vision tasks.
- The study reveals that vision transformers benefit significantly from larger data scales, while self-supervised learning proves competitive when compared against supervised pretraining on similar architectures and data.
- The analysis underscores a performance-scaling trade-off and advocates for standardized benchmarks to guide the development of universal backbone architectures.
Insights from "Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks"
The paper "Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks" provides a thorough empirical assessment of diverse pretrained deep learning models across a wide range of computer vision tasks. The study evaluates neural network backbones, including convolutional networks, vision transformers, and other modern architectures, trained with different pretraining strategies such as supervised learning, self-supervised learning, and vision-language pretraining.
The paper benchmarks over 1,500 training runs of various models across tasks such as image classification, object detection and segmentation, out-of-distribution generalization, and image retrieval. Through these experiments, the study attempts to delineate the suitability of different pretrained models for specific downstream tasks, considering factors such as architecture, pretraining methodology, and dataset size.
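A common protocol in benchmarks like this is to freeze each pretrained backbone and fit a lightweight linear probe on its extracted features for every downstream task, so that backbones can be compared on equal footing. The sketch below illustrates that idea in miniature; the synthetic feature generator and the ridge-regression probe are illustrative stand-ins, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_backbone_features(n, dim, n_classes):
    """Stand-in for features produced by a frozen pretrained backbone.
    In practice these would come from a forward pass over the dataset;
    here we draw well-separated Gaussian clusters for illustration."""
    y = rng.integers(0, n_classes, size=n)
    centers = rng.normal(size=(n_classes, dim))
    X = centers[y] + 0.3 * rng.normal(size=(n, dim))
    return X, y

def linear_probe_predict(X_tr, y_tr, X_te, n_classes, l2=1e-2):
    """Closed-form ridge regression onto one-hot labels: a cheap
    linear probe for comparing frozen feature extractors."""
    Y = np.eye(n_classes)[y_tr]                          # one-hot targets
    A = X_tr.T @ X_tr + l2 * np.eye(X_tr.shape[1])
    W = np.linalg.solve(A, X_tr.T @ Y)                   # (dim, n_classes)
    return (X_te @ W).argmax(axis=1)

# Generate one pool of "backbone features" and split into train/test.
X, y = synthetic_backbone_features(800, 64, 5)
X_tr, y_tr, X_te, y_te = X[:600], y[:600], X[600:], y[600:]

pred = linear_probe_predict(X_tr, y_tr, X_te, 5)
acc = (pred == y_te).mean()
print(f"linear-probe accuracy: {acc:.3f}")
```

Because the probe is closed-form and the backbone stays frozen, swapping in a different backbone only changes the feature matrices, which is what makes large sweeps of this kind tractable.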
Main Findings:
- Superior Architectures: The study identifies supervised ConvNeXt-Base, supervised SwinV2-Base (trained on ImageNet-21k), and CLIP ViT-Base as the top-performing models across multiple tasks. These models outperform other architectures, including vanilla vision transformers, especially in the smaller model category where ConvNeXt-Tiny and SwinV2-Tiny emerged as effective choices.
- Advantages of Scale and Pretraining: Vision transformers (ViTs) benefit more from increased data and model scale than convolutional neural networks (CNNs). Furthermore, while supervised learning on large datasets remains dominant, self-supervised learning (SSL) methods are competitive when matched on architecture and pretraining data, suggesting that future SSL work should be pursued with larger datasets and modern architectures.
- Applicability Across Tasks: The study finds strong positive correlations between model performance on different tasks and settings, indicating that the best-performing backbones in one task may also excel in others. This underscores the potential of universal models or general-purpose backbone architectures.
- Performance-Scaling Trade-offs: Better-performing backbones tend to be slower. The negative correlation between inference throughput and aggregate z-score performance indicates that accuracy gains typically come at higher computational cost.
- Retrieval Tasks and Unique Challenges: Notably, retrieval performance correlated less strongly with other tasks, driven in part by the weak retrieval results of generatively pretrained models such as MAE and Stable Diffusion.
- Adversarial Robustness: CNNs and SSL-pretrained models are generally more adversarially robust than their supervised counterparts.
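The z-score aggregation used to compare backbones across heterogeneous tasks, and the throughput correlation behind the trade-off finding, can be sketched with toy numbers. The scores and throughputs below are invented for illustration, not the paper's measurements:

```python
import numpy as np

# Hypothetical per-task accuracies (rows: backbones, columns: tasks)
# and throughputs; illustrative numbers only.
scores = np.array([
    [0.84, 0.51, 0.62],   # large, slow backbone
    [0.82, 0.49, 0.60],
    [0.78, 0.45, 0.55],
    [0.74, 0.42, 0.52],   # small, fast backbone
])
throughput = np.array([120.0, 180.0, 400.0, 650.0])  # images/sec

# Z-score each task column so tasks with different metric scales
# contribute equally, then average per backbone.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
agg = z.mean(axis=1)

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks
    (assumes no ties, which holds for the toy data above)."""
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    return np.corrcoef(ra, rb)[0, 1]

rho = spearman(agg, throughput)
print(rho)  # negative: in this toy data, faster backbones score lower
```

Averaging z-scores rather than raw metrics prevents a task with a wide metric range (e.g. retrieval mAP) from dominating one with a narrow range, which is why this kind of normalization is standard when ranking models across many benchmarks.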
Implications and Future Directions:
This paper underscores the importance of comparing pretrained backbones under standardized, controlled conditions to derive meaningful insights for computer vision systems. The results suggest avenues for improving SSL methods through larger datasets and more sophisticated architectures. Additionally, with evidence supporting the use of foundation models, practitioners can push further toward universal backbone architectures that perform consistently well across diverse applications.
Future research could explore even larger vision transformer architectures and their transfer capabilities at greater scales, potentially unveiling new practical upper limits of these models. Moreover, the interplay of generative pretraining techniques with discriminative tasks merits deeper inquiry, given the mixed results observed with models like Stable Diffusion.
Overall, "Battle of the Backbones" provides a comprehensive evaluation that will aid researchers and practitioners alike in making informed choices in model selection and pretraining strategies for a broad array of computer vision tasks. The thoroughness of this benchmark lays a foundation for expanding the horizons of pretrained models in AI and highlights the evolving landscape of backbone architectures.