
Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models (2310.08577v3)

Published 12 Oct 2023 in cs.CV, cs.CL, and cs.LG

Abstract: Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data-removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding. Code and datasets are released at https://github.com/bethgelab/DataTypeIdentification.

Analysis of Data-Type Understanding in Vision-Language Models

The paper "Visual data-type understanding does not emerge from scaling vision-LLMs" presents an insightful investigation into the capabilities of current vision-LLMs (VLMs) in identifying visual data-types. This problem, as defined by the authors, involves recognizing alterations to images that affect style, geometric orientation, or pixel quality without altering the semantic content. This task has practical applications in domains such as data curation and autonomous vision, where distinguishing between naturally occurring changes and artifacts is critical.

Key Findings

The authors introduce two datasets, SyntheticTypeIdent and NaturalTypeIdent, to evaluate VLMs on visual data-type identification. These datasets cover 27 data-types across four broad categories: geometric, pixel, style, and semantic. The primary focus is on the inherent limitations of VLMs in discerning these data-types. Through zero-shot evaluations of various models, including contrastive VLMs like CLIP and auto-regressive models like IDEFICS, the paper reveals pronounced differences in performance depending on the data-type in question.
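As a minimal sketch of how such a zero-shot probe works for a contrastive model, the snippet below scores one image against a handful of candidate data-type descriptions with CLIP via the Hugging Face transformers API. The prompt wording and the chosen checkpoint are illustrative assumptions, not necessarily the templates or models evaluated in the paper.

```python
# Hedged sketch of zero-shot data-type identification with a contrastive VLM.
# The prompts below are a hypothetical subset of the 27 data-types.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "A photo of an animal.",
    "A rotated photo of an animal.",
    "A noisy photo of an animal.",
    "A cartoon of an animal.",
    "A pencil sketch of an animal.",
]

image = Image.open("animal_rotated.jpg").convert("RGB")  # placeholder test image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# The prompt with the highest image-text similarity is the predicted data-type.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for prompt, p in zip(prompts, probs.tolist()):
    print(f"{p:.3f}  {prompt}")
```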

Critically, the paper highlights that:

  1. While VLMs like CLIP demonstrate competence in recognizing style-related data-types, they falter on simpler data-types arising from basic manipulations, such as image rotations or additive noise.
  2. Model scaling, often seen as an avenue for enhancing performance, yields marginal improvements for contrastively-trained models and might even degrade performance for large auto-regressive models.
  3. The observed scaling trends suggest that orders-of-magnitude increases in parameter counts would be needed to reach practically useful levels of data-type identification.

Implications and Future Directions

The findings expose a clear limitation of model scaling as a means to improve robustness and flexibility in understanding visual data-types. This has significant implications for the deployment of VLMs in real-world applications, where understanding the context, and not merely the content, of an image is essential.

From a theoretical perspective, the paper highlights a disconnect in the compositional understanding capabilities of VLMs. Despite advancements in LLMs demonstrating robust compositional reasoning in text, this capability does not transfer seamlessly to vision-language tasks. Thus, while large language priors provide powerful semantic grounding, they fall short of enabling the fine-grained visual understanding necessary for nuanced tasks like data-type identification.

Practically, these insights call for a reevaluation of training paradigms. The paper shows that models exhibit significant improvements on the task only when data-type information is integrated into the training process, for example by incorporating it into captions during fine-tuning. This observation opens up new research avenues focused on enhancing training data curation and architectural modifications to better capture data-type characteristics.
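A minimal sketch of what such data-type-aware caption construction could look like follows; the phrase templates, field names, and data-type labels are assumptions made for illustration, not the paper's exact fine-tuning recipe.

```python
# Sketch: fold the data-type into the caption used for fine-tuning, so the
# training signal explicitly mentions the alteration. Templates and field
# names below are hypothetical.
from typing import TypedDict


class Sample(TypedDict):
    image_path: str
    caption: str
    data_type: str  # e.g. "rotated", "gaussian_noise", "cartoon"


DATA_TYPE_PHRASES = {  # illustrative mapping, not the paper's exact wording
    "rotated": "a rotated image of",
    "gaussian_noise": "a noisy image of",
    "cartoon": "a cartoon of",
}


def augment_caption(sample: Sample) -> str:
    """Prepend the data-type phrase to the original caption."""
    phrase = DATA_TYPE_PHRASES.get(sample["data_type"], "an image of")
    return f"{phrase} {sample['caption']}"


example: Sample = {
    "image_path": "animal_rotated.jpg",
    "caption": "a zebra standing in tall grass",
    "data_type": "rotated",
}
print(augment_caption(example))
# -> "a rotated image of a zebra standing in tall grass"
```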

Furthermore, the established datasets and methodologies for the benchmark form a solid foundation for future research aimed at optimizing VLMs for diverse visual contexts. Implementing data-type aware training processes and potentially novel augmentation techniques could spur the development of more versatile models.

In conclusion, the paper provides a comprehensive analysis of the current limitations of VLMs in understanding visual data-types and calls for strategic shifts in training methodologies. These adjustments are vital for advancing model generality and utility, particularly in critical applications such as autonomous systems and large-scale data management.

Authors (4)
  1. Vishaal Udandarao
  2. Max F. Burg
  3. Samuel Albanie
  4. Matthias Bethge