Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models (2402.08473v1)
Abstract: Transformer-based models have dominated natural language processing and other areas in recent years due to their superior (zero-shot) performance on benchmark datasets. However, these models remain poorly understood because of their complexity and size. While probing-based methods are widely used to examine specific properties, the structure of the representation space has not been systematically characterized; consequently, it is unclear how such models generalize, and overgeneralize, to new inputs beyond their evaluation datasets. In this paper, using a new gradient-descent optimization method, we explore the embedding space of a commonly used vision-language model. On the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification accuracy, it fails systematic evaluations completely. Using a linear approximation, we provide a framework that explains these striking differences. We also obtain similar results with a second model, supporting the conclusion that our findings apply to other transformer models with continuous inputs. Finally, we propose a robust way to detect the modified images.
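The abstract names the core procedure only at a high level: gradient descent in input space is used to find images whose embeddings land near a chosen target, even though the images themselves barely change visually. Below is a minimal sketch of one plausible form of such an embedding-space search, assuming an OpenCLIP ViT-B/32 checkpoint as the "commonly used vision-language model" and a cosine-distance objective; the model choice, loss, and hyperparameters (`steps`, `lr`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import open_clip

# Load a vision-language model. The paper says only "a commonly used
# vision-language model"; OpenCLIP ViT-B/32 is an assumption here.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()

def match_embedding(x_src, x_tgt, steps=500, lr=1e-2):
    """Gradient-descent search for a perturbation delta such that
    encode_image(x_src + delta) approaches the embedding of x_tgt,
    while x_src + delta stays visually close to x_src.
    x_src, x_tgt: preprocessed image batches of shape (1, 3, 224, 224).
    """
    with torch.no_grad():
        z_tgt = model.encode_image(x_tgt)
        z_tgt = z_tgt / z_tgt.norm(dim=-1, keepdim=True)

    delta = torch.zeros_like(x_src, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        z = model.encode_image(x_src + delta)
        z = z / z.norm(dim=-1, keepdim=True)
        loss = 1.0 - (z * z_tgt).sum()  # cosine distance to the target
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x_src + delta).detach()
```

An image produced this way is assigned the target's label by zero-shot prompting yet still looks like the source class, which is precisely the gap between zero-shot accuracy and systematic evaluation that the abstract highlights.

The "linear approximation" framework is likewise only named, not derived, in the abstract. A first-order expansion of the image encoder f around a clean input x is one natural reading (an assumption about the framework's form, not the paper's stated derivation):

```latex
f(x + \delta) \;\approx\; f(x) + J_f(x)\,\delta,
\qquad
\|f(x + \delta) - f(x)\|_2 \;\le\; \sigma_{\max}\!\big(J_f(x)\big)\,\|\delta\|_2
```

Under this reading, a perturbation delta aligned with the top singular directions of the Jacobian J_f(x) can move the embedding by nearly sigma_max times its own norm while remaining visually negligible, so near-perfect zero-shot accuracy on clean benchmark images can coexist with complete failure under a systematic search of each image's input neighborhood.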