Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models (2402.08473v1)
Abstract: Transformer-based models have dominated natural language processing and other areas in recent years due to their superior (zero-shot) performance on benchmark datasets. However, these models remain poorly understood because of their complexity and size. While probing-based methods are widely used to examine specific properties, the structure of the representation space has not been systematically characterized; consequently, it is unclear how such models generalize, and overgeneralize, to new inputs beyond their evaluation datasets. In this paper, using a new gradient-descent optimization method, we explore the embedding space of a commonly used vision-language model. On the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification accuracy, it fails systematic evaluations completely. Using a linear approximation, we provide a framework that explains these striking differences. We also obtain similar results with a second model, supporting the conclusion that our findings apply to other transformer models with continuous inputs. Finally, we propose a robust way to detect the modified images.
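The abstract names the core procedure only at a high level: gradient descent in input space is used to find images whose embeddings land near a chosen target, even though the images themselves barely change visually. Below is a minimal sketch of one plausible form of such an embedding-space search, assuming an OpenCLIP ViT-B/32 checkpoint as the "commonly used vision-language model" and a cosine-distance objective; the model choice, loss, and hyperparameters (`steps`, `lr`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import open_clip

# Load a vision-language model. The paper says only "a commonly used
# vision-language model"; OpenCLIP ViT-B/32 is an assumption here.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def match_embedding(x_src, x_tgt, steps=500, lr=1e-2):
    """Gradient-descent search for a perturbation delta such that
    encode_image(x_src + delta) approaches the embedding of x_tgt,
    while x_src + delta stays visually close to x_src.
    x_src, x_tgt: preprocessed image batches of shape (1, 3, 224, 224).
    """
    with torch.no_grad():
        z_tgt = model.encode_image(x_tgt)
        z_tgt = z_tgt / z_tgt.norm(dim=-1, keepdim=True)

    delta = torch.zeros_like(x_src, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        z = model.encode_image(x_src + delta)
        z = z / z.norm(dim=-1, keepdim=True)
        loss = 1.0 - (z * z_tgt).sum()  # cosine distance to the target
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x_src + delta).detach()
```

An image produced this way is assigned the target's label by zero-shot prompting yet still looks like the source class, which is precisely the gap between zero-shot accuracy and systematic evaluation that the abstract highlights.

The "linear approximation" framework is likewise only named, not derived, in the abstract. A first-order expansion of the image encoder f around a clean input x is one natural reading (an assumption about the framework's form, not the paper's stated derivation):

```latex
f(x + \delta) \;\approx\; f(x) + J_f(x)\,\delta,
\qquad
\|f(x + \delta) - f(x)\|_2 \;\le\; \sigma_{\max}\!\big(J_f(x)\big)\,\|\delta\|_2
```

Under this reading, a perturbation delta aligned with the top singular directions of the Jacobian J_f(x) can move the embedding by nearly sigma_max times its own norm while remaining visually negligible, so near-perfect zero-shot accuracy on clean benchmark images can coexist with complete failure under a systematic search of each image's input neighborhood.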