When Do We Not Need Larger Vision Models? (2403.13043v2)

Published 19 Mar 2024 in cs.CV

Abstract: Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation. Notably, S$^2$ achieves state-of-the-art performance in detailed understanding of MLLM on the V* benchmark, surpassing models such as GPT-4V. We examine the conditions under which S$^2$ is a preferred scaling approach compared to scaling on model size. While larger models have the advantage of better generalization on hard examples, we show that features of larger vision models can be well approximated by those of multi-scale smaller models. This suggests most, if not all, of the representations learned by current large pre-trained models can also be obtained from multi-scale smaller models. Our results show that a multi-scale smaller model has comparable learning capacity to a larger model, and pre-training smaller models with S$^2$ can match or even exceed the advantage of larger models. We release a Python package that can apply S$^2$ on any vision model with one line of code: https://github.com/bfshi/scaling_on_scales.

When Do We Not Need Larger Vision Models?

Introduction

The pursuit of ever-larger vision models has been a dominant trend in artificial intelligence research, driven by the belief that scaling up model size directly translates into better performance across a spectrum of visual understanding tasks. Through an extensive analysis, this paper introduces an alternative scaling strategy, Scaling on Scales (S^2), challenging the conventional wisdom that "bigger is always better." It demonstrates that strategically scaling the image input, without proportionally increasing model parameters, can not only compete with but, in certain instances, surpass the performance of larger counterparts.

The Concept of S^2

S^2 diverges from traditional model scaling by manipulating the input scale rather than the complexity of the model itself. By running a pre-trained vision model over multiple image scales, S^2 yields a multi-scale representation that captures a broad spectrum of visual detail, from granular to global. Notably, these enriched representations are obtained without any change to the model architecture or increase in parameter count. The procedure interpolates images to varying scales, extracts features at each scale with the same frozen model, and then pools and concatenates the resulting features into a comprehensive multi-scale representation.
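
To make the procedure concrete, the sketch below implements a minimal multi-scale forward pass in PyTorch. It is an illustration under stated assumptions, not the released package: the function name `multiscale_features`, the `base_size` default, and the assumption that the backbone returns a spatial feature map of shape (B, C, h, w) are choices made here for clarity.

```python
import torch
import torch.nn.functional as F

def multiscale_features(backbone, x, scales=(1, 2), base_size=224):
    """Hedged sketch of an S^2-style multi-scale forward pass.

    Assumptions (not from the original paper's code): `backbone` is a frozen
    model mapping a (B, 3, base_size, base_size) batch to a (B, C, h, w)
    spatial feature map, and `x` is a (B, 3, base_size, base_size) batch.
    """
    feats = []
    for s in scales:
        size = base_size * s
        xs = F.interpolate(x, size=(size, size), mode="bicubic", align_corners=False)
        if s == 1:
            f = backbone(xs)                      # (B, C, h, w) at the base scale
        else:
            # Split the up-scaled image into s*s sub-images of the base size,
            # run the same frozen backbone on each, then stitch the maps back.
            subs = xs.unfold(2, base_size, base_size).unfold(3, base_size, base_size)
            subs = subs.permute(0, 2, 3, 1, 4, 5).reshape(-1, x.shape[1], base_size, base_size)
            f_sub = backbone(subs)                # (B*s*s, C, h, w)
            B, C, h, w = x.shape[0], f_sub.shape[1], f_sub.shape[2], f_sub.shape[3]
            f_sub = f_sub.reshape(B, s, s, C, h, w).permute(0, 3, 1, 4, 2, 5)
            f = f_sub.reshape(B, C, s * h, s * w)
            # Pool the stitched map back to the base spatial size so scales align.
            f = F.adaptive_avg_pool2d(f, (h, w))
        feats.append(f)
    # Concatenate scales along the channel dimension: the model itself never grows.
    return torch.cat(feats, dim=1)
```

The Python package linked in the abstract wraps this general pattern so it can be applied to an existing vision model with a single call.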

Empirical Validation

Extensive experiments across several benchmarks, including classification, segmentation, depth estimation, multimodal LLMs, and robotic manipulation, demonstrate the efficacy of S^2. Remarkably, models enhanced with S^2 consistently matched or exceeded the performance of their larger counterparts, showing that S^2 is a scalable and efficient alternative to growing model size alone. This is illustrated by the state-of-the-art results on the V* benchmark for detailed visual understanding in multimodal LLMs, where S^2-scaled models outperformed GPT-4V and other commercial models.

Analyzing Model Performance and Capacity

A deeper investigation into why larger models sometimes win points to their better generalization on rare or ambiguous examples. However, an analysis of the representational overlap between smaller models with S^2 and larger models shows that the former can approximate the features of the latter quite effectively. This finding, which indicates a similar learning capacity between smaller S^2 models and larger models, suggests that with appropriate training strategies, smaller models could match or exceed the generalization capabilities and performance efficiency of their larger counterparts.
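
One way to probe this kind of claim is to ask how well a simple linear map can recover a large model's features from the multi-scale features of a smaller model. The snippet below is an illustrative check under that assumption, not the paper's exact protocol; the function name, the added bias column, and the use of R^2 as the reconstruction score are choices made here.

```python
import torch

def reconstruction_fit(feats_small_ms, feats_large):
    """Illustrative reconstruction check (hypothetical helper, not the paper's code).

    feats_small_ms: (N, d_small) multi-scale features from the smaller model
    feats_large:    (N, d_large) features from the larger model
    Returns R^2 of a least-squares linear map from the former to the latter.
    """
    # Append a bias column, then solve min_W ||X W - feats_large||^2.
    X = torch.cat([feats_small_ms, torch.ones(feats_small_ms.shape[0], 1)], dim=1)
    W = torch.linalg.lstsq(X, feats_large).solution
    pred = X @ W
    resid = ((feats_large - pred) ** 2).sum()
    total = ((feats_large - feats_large.mean(dim=0)) ** 2).sum()
    return 1.0 - (resid / total).item()   # close to 1 => features well approximated

# Example with random stand-ins for real model outputs:
# r2 = reconstruction_fit(torch.randn(1000, 768 * 2), torch.randn(1000, 1024))
```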

Practical Implications and Future Outlook

The findings reopen the discussion of how best to scale models for visual understanding. By offering an alternative that sidesteps the computational and resource demands of larger models, S^2 opens new possibilities for efficient and scalable AI development. It points to a future where scaling input dimensions, such as image scale, could be as impactful as, if not more than, scaling model size. This invites further exploration into scale-selective processing and parallel processing of a single image, promising directions that could redefine efficiency and performance benchmarks in visual computing.

Conclusion

Scaling on Scales (S^2) emerges as a compelling paradigm, challenging the enduring convention of equating model performance with model size. Through rigorous analysis and empirical evidence, this work shows the potential of S^2 to redefine the metrics of efficiency and performance in visual understanding tasks, heralding a shift towards more pragmatic and resource-conscious approaches to developing AI models.

Authors (5)
  1. Baifeng Shi
  2. Ziyang Wu
  3. Maolin Mao
  4. Xin Wang
  5. Trevor Darrell
Citations (28)