What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases (2404.02415v1)

Published 3 Apr 2024 in cs.CV

Abstract: Vision-language (VL) models, pretrained on colossal image-text datasets, have attained broad VL competence that is difficult to evaluate. A common belief is that a small number of VL skills underlie the variety of VL tests. In this paper, we perform a large-scale transfer learning experiment aimed at discovering latent VL skills from data. We reveal interesting characteristics that have important implications for test suite design. First, generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths. Second, we demonstrate that factor analysis successfully identifies reasonable yet surprising VL skill factors, suggesting benchmarks could leverage similar analyses for task selection. Finally, we present a new dataset, OLIVE (https://github.com/jq-zh/olive-dataset), which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested. Our findings contribute to the design of balanced and broad-coverage vision-language evaluation methods.
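
The paper's central methodological step is to run a large-scale transfer-learning experiment and then apply factor analysis to the resulting task-performance data to surface latent VL skill factors. As a minimal, hedged sketch of that kind of analysis (not the authors' exact pipeline; the score matrix, task count, and factor count below are invented stand-ins for illustration), one can fit a varimax-rotated factor model over a model-by-task score matrix and inspect which tasks load on each latent factor:

```python
# Illustrative sketch only: recovering latent "skill" factors from a matrix of
# task scores via varimax-rotated factor analysis. The score matrix here is
# random stand-in data, not results from the paper.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Hypothetical scores: rows = transfer-learning runs, columns = VL tasks
# (e.g., VQA, captioning, chart QA, ...). Shape and values are made up.
n_runs, n_tasks, n_factors = 60, 20, 4
scores = rng.normal(size=(n_runs, n_tasks))

# Standardize each task column so loadings are comparable across tasks.
scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Fit factor analysis with varimax rotation to obtain interpretable loadings
# of tasks on a small number of latent factors.
fa = FactorAnalysis(n_components=n_factors, rotation="varimax", random_state=0)
fa.fit(scores)

# loadings[f, t]: how strongly task t loads on latent factor f.
loadings = fa.components_
for f in range(n_factors):
    top_tasks = np.argsort(-np.abs(loadings[f]))[:3]
    print(f"factor {f}: top-loading task indices {top_tasks.tolist()}")
```

With real transfer-learning scores in place of the random matrix, the top-loading tasks per factor are what one would inspect to interpret each latent skill.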

Authors (6)
  1. Anthony Meng Huat Tiong (7 papers)
  2. Junqi Zhao (7 papers)
  3. Boyang Li (106 papers)
  4. Junnan Li (56 papers)
  5. Steven C. H. Hoi (94 papers)
  6. Caiming Xiong (337 papers)
Citations (5)