Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models
Abstract: Recent advances in the development of vision-language models (VLMs) are yielding remarkable success in recognizing visual semantic content, including impressive instances of compositional image understanding. Here, we introduce the novel task of Visual Data-Type Identification, a basic perceptual skill with implications for data curation (e.g., noisy data removal from large datasets, domain-specific retrieval) and autonomous vision (e.g., distinguishing changing weather conditions from camera lens staining). We develop two datasets consisting of animal images altered across a diverse set of 27 visual data-types, spanning four broad categories. An extensive zero-shot evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced performance landscape. While VLMs are reasonably good at identifying certain stylistic data-types, such as cartoons and sketches, they struggle with simpler data-types arising from basic manipulations like image rotations or additive noise. Our findings reveal that (i) model scaling alone yields marginal gains for contrastively-trained models like CLIP, and (ii) there is a pronounced drop in performance for the largest auto-regressively trained VLMs like OpenFlamingo. This finding points to a blind spot in current frontier VLMs: they excel in recognizing semantic content but fail to acquire an understanding of visual data-types through scaling. By analyzing the pre-training distributions of these models and incorporating data-type information into the captions during fine-tuning, we achieve a significant enhancement in performance. By exploring this previously uncharted task, we aim to set the stage for further advancing VLMs to equip them with visual data-type understanding. Code and datasets are released at https://github.com/bethgelab/DataTypeIdentification.
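The abstract notes that VLMs struggle with data-types produced by basic image manipulations such as rotation and additive noise. As an illustrative sketch (not the authors' exact dataset-generation pipeline), two such transforms can be implemented as deterministic image operations; the function names and parameter values here are hypothetical choices for demonstration:

```python
# Sketch of two "simple" visual data-type transforms mentioned in the abstract:
# image rotation and additive Gaussian noise. Parameter values (45 degrees,
# sigma=25) are illustrative assumptions, not the paper's settings.
import numpy as np
from PIL import Image


def rotate(img: Image.Image, degrees: float = 90.0) -> Image.Image:
    """Rotate the image; expand=True enlarges the canvas to fit the result."""
    return img.rotate(degrees, expand=True)


def add_gaussian_noise(img: Image.Image, sigma: float = 25.0, seed: int = 0) -> Image.Image:
    """Add zero-mean Gaussian pixel noise with standard deviation `sigma`."""
    rng = np.random.default_rng(seed)
    arr = np.asarray(img, dtype=np.float32)
    noisy = arr + rng.normal(0.0, sigma, size=arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))


if __name__ == "__main__":
    # Stand-in for an animal photo: a 64x64 mid-gray RGB image.
    img = Image.new("RGB", (64, 64), color=(128, 128, 128))
    rotated = rotate(img, 45.0)
    noisy = add_gaussian_noise(img)
    print(rotated.size, noisy.size)
```

A zero-shot evaluation would then ask a VLM to name the applied transform (e.g., via prompts like "a rotated photo of an animal" vs. "a noisy photo of an animal"), which is the kind of discrimination the paper reports as difficult.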
- Revisiting neural scaling laws in language and vision. Advances in Neural Information Processing Systems, 35:22300–22312, 2022.
- More context, less distraction: Visual classification by inferring and conditioning on contextual attributes. arXiv preprint arXiv:2308.01313, 2023.
- Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
- Sourav Banerjee. Animal image dataset: 90 different animals, 2023. URL https://www.kaggle.com/datasets/iamsouravbanerjee/animal-image-dataset-90-different-animals. Accessed: 2023-07-10.
- Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in neural information processing systems, 32, 2019.
- Fuyu-8b: A multimodal architecture for ai agents, 2023. URL https://www.adept.ai/blog/fuyu-8b. Accessed: 2023-11-18.
- Breaking common sense: Whoops! a vision-and-language benchmark of synthetic and compositional images. arXiv preprint arXiv:2303.07274, 2023.
- On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Pug: Photorealistic and semantically controllable synthetic data for representation learning. arXiv preprint arXiv:2308.03977, 2023.
- The representational hierarchy in human and artificial visual systems in the presence of object-scene regularities. PLOS Computational Biology, 19(4):e1011086, 2023.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Going beyond nouns with vision & language models using synthetic data. arXiv preprint arXiv:2303.17590, 2023.
- A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020.
- Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829, 2023.
- Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE, 2009.
- A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
- Carla: An open urban driving simulator. In Conference on robot learning, pp. 1–16. PMLR, 2017.
- Dense and aligned captions (dac) promote compositional reasoning in vl models. arXiv preprint arXiv:2305.19595, 2023.
- Data determines distributional robustness in contrastive language image pre-training (clip). In International Conference on Machine Learning, pp. 6216–6234. PMLR, 2022.
- A brief review of domain adaptation. Advances in data science and information engineering: proceedings from ICDATA 2020 and IKE 2020, pp. 877–894, 2021.
- Leo Gao. Multiple choice normalization in lm evaluation, 2023. URL https://blog.eleuther.ai/multiple-choice-normalization/. Accessed: 2023-11-18.
- Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
- Signal processing for computer vision, 1995.
- Adbench: Anomaly detection benchmark. Advances in Neural Information Processing Systems, 35:32142–32159, 2022.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
- A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349, 2021.
- Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
- Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality. arXiv preprint arXiv:2306.14610, 2023.
- A survey on contrastive self-supervised learning. Technologies, 9(1):2, 2020.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pp. 4904–4916. PMLR, 2021.
- Text encoders are performance bottlenecks in contrastive vision-language models. arXiv preprint arXiv:2305.14897, 2023.
- Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023a.
- Grounding language models to images for multimodal inputs and outputs. arXiv preprint arXiv:2301.13823, 2023b.
- Wilds: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pp. 5637–5664. PMLR, 2021.
- Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
- Introducing idefics: An open reproduction of state-of-the-art visual language model, 2023. URL https://huggingface.co/blog/idefics. Accessed: 2023-09-18.
- Does clip bind concepts? probing compositionality in large image models. arXiv preprint arXiv:2212.10537, 2022.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
- Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023b.
- Chunyuan Li. Large multimodal models: Notes on cvpr 2023 tutorial. arXiv preprint arXiv:2306.14895, 2023.
- Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pp. 5542–5550, 2017.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023c.
- Gligen: Open-set grounded text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22511–22521, 2023d.
- Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
- Visual chirality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12295–12303, 2020.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
- Contrastive vision-language alignment makes efficient instruction learner. arXiv preprint arXiv:2311.17945, 2023b.
- Energy-based out-of-distribution detection. Advances in neural information processing systems, 33:21464–21475, 2020.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786, 2021.
- Crepe: Can vision-language foundation models reason compositionally? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10910–10921, 2023.
- On exposing the challenging long tail in future prediction of traffic actors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13147–13157, 2021.
- Wedge: A multi-weather autonomous driving dataset built from generative vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3317–3326, 2023.
- Visual classification via description from large language models. arXiv preprint arXiv:2210.07183, 2022.
- Data augmentation for improving deep learning in image classification problem. In 2018 international interdisciplinary PhD workshop (IIPhDW), pp. 117–122. IEEE, 2018.
- On interaction between augmentations and corruptions in natural corruption robustness. Advances in Neural Information Processing Systems, 34:3571–3583, 2021.
- NLP Team MosaicML. Introducing mpt-7b: A new standard for open-source, commercially usable llms, 2023. URL https://www.mosaicml.com/blog/mpt-7b. Accessed: 2023-09-18.
- On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. Advances in neural information processing systems, 14, 2001.
- Quality not quantity: On the interaction between dataset design and robustness of clip. Advances in Neural Information Processing Systems, 35:21455–21469, 2022.
- The role of context in object recognition. Trends in cognitive sciences, 11(12):520–527, 2007.
- Teaching clip to count to ten. arXiv preprint arXiv:2302.12066, 2023.
- Deep learning for anomaly detection: A review. ACM computing surveys (CSUR), 54(2):1–38, 2021.
- Valse: A task-independent benchmark for vision and language models centered on linguistic phenomena. arXiv preprint arXiv:2112.07566, 2021.
- Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 1406–1415, 2019.
- The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.
- Large language models sensitivity to the order of options in multiple-choice questions. arXiv preprint arXiv:2308.11483, 2023.
- Combined scaling for zero-shot transfer learning. Neurocomputing, pp. 126658, 2023.
- Are multimodal models robust to image and text perturbations? arXiv preprint arXiv:2212.08044, 2022.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
- Automatic data augmentation for generalization in reinforcement learning. Advances in Neural Information Processing Systems, 34:5402–5415, 2021.
- On the connection between pre-training data diversity and fine-tuning robustness. arXiv preprint arXiv:2307.12532, 2023.
- Substance or style: What does your image embedding know? arXiv preprint arXiv:2307.05610, 2023.
- Foolbox: A python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017.
- Fixing data augmentation to improve adversarial robustness. arXiv preprint arXiv:2103.01946, 2021a.
- Data augmentation can improve robustness. Advances in Neural Information Processing Systems, 34:29935–29948, 2021b.
- Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
- Wandering within a world: Online contextualized few-shot learning. arXiv preprint arXiv:2007.04546, 2020.
- Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14318–14328, 2022.
- Multitask prompted training enables zero-shot task generalization, 2022.
- Is a caption worth a thousand images? a study on representation learning. In The Eleventh International Conference on Learning Representations, 2022.
- Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022.
- Improving robustness against common corruptions by covariate shift adaptation. Advances in neural information processing systems, 33:11539–11551, 2020.
- Effective robustness against natural distribution shifts for models with different training data. arXiv preprint arXiv:2302.01381, 2023.
- A survey on image data augmentation for deep learning. Journal of big data, 6(1):1–48, 2019.
- Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015.
- Measuring robustness to natural distribution shifts in image classification. Advances in Neural Information Processing Systems, 33:18583–18599, 2020.
- Improving deep learning with generic data augmentation. In 2018 IEEE symposium series on computational intelligence (SSCI), pp. 1542–1547. IEEE, 2018.
- Winoground: Probing vision and language models for visio-linguistic compositionality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5238–5248, 2022.
- Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Sus-x: Training-free name-only transfer of vision-language models. arXiv preprint arXiv:2211.16198, 2022.
- Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
- A survey on semi-supervised learning. Machine learning, 109(2):373–440, 2020.
- Dissecting image crops. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9741–9750, 2021.
- Vladimir N Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Can linguistic knowledge improve multimodal alignment in vision-language pretraining? arXiv preprint arXiv:2308.12898, 2023.
- Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019.
- Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
- Compositional generalization from first principles. arXiv preprint arXiv:2307.05596, 2023.
- When are lemons purple? the concept association bias of clip. arXiv preprint arXiv:2212.12043, 2022.
- Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334, 2021.
- What you see is what you read? improving text-image alignment evaluation. arXiv preprint arXiv:2305.10400, 2023.
- mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- Universal domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2720–2729, 2019.
- Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- When and why vision-language models behave like bags-of-words, and what to do about it? In The Eleventh International Conference on Learning Representations, 2022.
- Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pp. 818–833. Springer, 2014.
- Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18123–18133, 2022.
- Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
- Maximum-entropy adversarial data augmentation for improved generalization and robustness. Advances in Neural Information Processing Systems, 33:14435–14447, 2020.
- Vl-checklist: Evaluating pre-trained vision-language models with objects, attributes and relations. arXiv preprint arXiv:2207.00221, 2022.
- Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 13001–13008, 2020.
- Long-tail prediction uncertainty aware trajectory planning for self-driving vehicles. In 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), pp. 1275–1282. IEEE, 2022.
- Ood-probe: A neural interpretation of out-of-domain generalization. arXiv preprint arXiv:2208.12352, 2022.