Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models (2402.08473v1)

Published 13 Feb 2024 in cs.CV, cs.AI, and cs.LG

Abstract: Transformer-based models have dominated natural language processing and other areas in recent years due to their superior (zero-shot) performance on benchmark datasets. However, these models remain poorly understood due to their complexity and size. While probing-based methods are widely used to study specific properties, the structure of the representation space is not systematically characterized; consequently, it is unclear how such models generalize and overgeneralize to new inputs beyond benchmark datasets. In this paper, based on a new gradient descent optimization method, we explore the embedding space of a commonly used vision-language model. Using the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification performance, it fails systematic evaluations completely. Using a linear approximation, we provide a framework to explain the striking differences. We obtain similar results with a different model, supporting that our findings apply to other transformer models with continuous inputs. We also propose a robust way to detect the modified images.
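
To make the contrast in the abstract concrete, here is a minimal sketch of both evaluation modes, assuming an OpenCLIP ViT-B/32 checkpoint; the model name, prompt template, and optimization hyperparameters are illustrative assumptions, not the authors' exact setup. `zero_shot_label` implements the standard zero-shot protocol (predict the class whose text embedding is nearest in cosine similarity), and `drift_embedding` shows the kind of gradient-descent probe the paper uses to explore the embedding space: pixel-level optimization that pulls an image's embedding toward another class's text embedding.

```python
# Hedged sketch (not the authors' code) of the two evaluations the abstract
# contrasts, using the open_clip package. Model, checkpoint, prompts, and
# optimization hyperparameters below are illustrative assumptions.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()
for p in model.parameters():        # freeze weights; only the input is optimized
    p.requires_grad_(False)

# The ten Imagenette classes, phrased as natural-language prompts.
classes = ["tench", "English springer", "cassette player", "chain saw",
           "church", "French horn", "garbage truck", "gas pump",
           "golf ball", "parachute"]
with torch.no_grad():
    text_feats = model.encode_text(
        tokenizer([f"a photo of a {c}" for c in classes]))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def zero_shot_label(x):
    """Zero-shot prediction for a preprocessed image tensor of shape
    (1, 3, H, W): nearest class text embedding by cosine similarity."""
    with torch.no_grad():
        f = model.encode_image(x)
        f = f / f.norm(dim=-1, keepdim=True)
    return (f @ text_feats.T).argmax(dim=-1).item()

def drift_embedding(x, target_class, steps=200, lr=1e-2):
    """Systematic probe: gradient descent on the pixels so the image's
    embedding moves toward the text embedding of a different class."""
    x = x.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        f = model.encode_image(x)
        f = f / f.norm(dim=-1, keepdim=True)
        loss = -(f * text_feats[target_class]).sum()  # raise cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```

Images optimized this way can flip the zero-shot prediction while remaining visually close to the original, which is the gap the abstract reports between near-perfect benchmark accuracy and complete failure under systematic evaluation.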
