What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases (2404.02415v1)

Published 3 Apr 2024 in cs.CV

Abstract: Vision-language (VL) models, pretrained on colossal image-text datasets, have attained broad VL competence that is difficult to evaluate. A common belief is that a small number of VL skills underlie the variety of VL tests. In this paper, we perform a large-scale transfer learning experiment aimed at discovering latent VL skills from data. We reveal interesting characteristics that have important implications for test suite design. First, generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths. Second, we demonstrate that factor analysis successfully identifies reasonable yet surprising VL skill factors, suggesting benchmarks could leverage similar analyses for task selection. Finally, we present a new dataset, OLIVE (https://github.com/jq-zh/olive-dataset), which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested. Our findings contribute to the design of balanced and broad-coverage vision-language evaluation methods.
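
The paper's central methodological step is to run a large-scale transfer-learning experiment and then apply factor analysis to the resulting task-performance data to surface latent VL skill factors. As a minimal, hedged sketch of that kind of analysis (not the authors' exact pipeline; the score matrix, task count, and factor count below are invented stand-ins for illustration), one can fit a varimax-rotated factor model over a model-by-task score matrix and inspect which tasks load on each latent factor:

```python
# Illustrative sketch only: recovering latent "skill" factors from a matrix of
# task scores via varimax-rotated factor analysis. The score matrix here is
# random stand-in data, not results from the paper.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Hypothetical scores: rows = transfer-learning runs, columns = VL tasks
# (e.g., VQA, captioning, chart QA, ...). Shape and values are made up.
n_runs, n_tasks, n_factors = 60, 20, 4
scores = rng.normal(size=(n_runs, n_tasks))

# Standardize each task column so loadings are comparable across tasks.
scores = (scores - scores.mean(axis=0)) / scores.std(axis=0)

# Fit factor analysis with varimax rotation to obtain interpretable loadings
# of tasks on a small number of latent factors.
fa = FactorAnalysis(n_components=n_factors, rotation="varimax", random_state=0)
fa.fit(scores)

# loadings[f, t]: how strongly task t loads on latent factor f.
loadings = fa.components_
for f in range(n_factors):
    top_tasks = np.argsort(-np.abs(loadings[f]))[:3]
    print(f"factor {f}: top-loading task indices {top_tasks.tolist()}")
```

With real transfer-learning scores in place of the random matrix, the top-loading tasks per factor are what one would inspect to interpret each latent skill.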

Authors (6)
  1. Anthony Meng Huat Tiong (7 papers)
  2. Junqi Zhao (7 papers)
  3. Boyang Li (106 papers)
  4. Junnan Li (56 papers)
  5. Steven C. H. Hoi (94 papers)
  6. Caiming Xiong (337 papers)
Citations (5)