Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs (2401.06209v2)

Published 11 Jan 2024 in cs.CV

Abstract: Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of LLMs. However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.

Introduction to Multimodal LLMs

The field of artificial intelligence has taken a notable leap forward with the integration of visual data into LLMs, giving rise to Multimodal LLMs (MLLMs). These systems have demonstrated their ability to understand images, answer visual questions, and follow instructions that involve visual content. Among the most prominent is GPT-4V(ision), which has set a new standard in the field.

Exploring Visual Limitations

Despite these strides, a closer examination shows that MLLMs still suffer from basic failures of visual understanding. The paper traces these failures to the visual representations learned during pre-training, since most MLLMs build on Contrastive Language-Image Pre-Training (CLIP) vision encoders. It deliberately identifies pairs of visually distinct images that the CLIP model nonetheless embeds as nearly identical. These "CLIP-blind pairs" are used to construct the Multimodal Visual Patterns (MMVP) benchmark for probing the visual aptitude of MLLMs. Evaluated on this benchmark, the models, including GPT-4V, frequently fail on simple visual questions, suggesting that current visual representation learning is not yet sufficient for robust multimodal understanding.
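
The pair-mining idea lends itself to a compact illustration. Below is a minimal sketch of how CLIP-blind pairs could be discovered, assuming access to a CLIP image encoder and a vision-only self-supervised encoder such as DINOv2; the encoder callables, threshold values, and function names are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of CLIP-blind pair mining (illustrative thresholds and
# encoder callables; not the paper's exact configuration).
import torch
import torch.nn.functional as F

def find_clip_blind_pairs(images, clip_encode, dino_encode,
                          clip_thresh=0.95, dino_thresh=0.6):
    """Return index pairs that CLIP treats as near-duplicates but a
    vision-only SSL encoder (e.g. DINOv2) clearly separates."""
    with torch.no_grad():
        clip_emb = F.normalize(clip_encode(images), dim=-1)  # [N, D1]
        dino_emb = F.normalize(dino_encode(images), dim=-1)  # [N, D2]

    clip_sim = clip_emb @ clip_emb.T  # cosine similarity in CLIP space
    dino_sim = dino_emb @ dino_emb.T  # cosine similarity in DINOv2 space

    pairs = []
    n = len(images)
    for i in range(n):
        for j in range(i + 1, n):
            # "CLIP-blind": high CLIP similarity, low vision-only similarity.
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                pairs.append((i, j, clip_sim[i, j].item(), dino_sim[i, j].item()))
    return pairs
```

Each surviving pair is then turned into a benchmark item by writing a question about the visual difference between the two images, so that a model must genuinely see the difference to answer both correctly.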

Visual Patterns Challenging MLLMs

Further analysis distilled nine visual patterns that consistently trip up CLIP-based MLLMs, such as orientation, counting, and viewpoint. Scaling up the models resolved only some of these limitations. More importantly, the study found a significant correlation: the visual patterns that the CLIP model struggles to recognize are largely the same ones on which the MLLMs perform poorly. This correlation suggests that constraints inherent to the vision components of these models, rather than the language side, may be limiting their capabilities.
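
The pair-level scoring behind this comparison is simple: a model receives credit for a pair only when it answers the questions for both images correctly, which rules out lucky guesses on two-way choices. A minimal sketch follows; the result-record field names are illustrative assumptions.

```python
# Sketch of pair-level scoring: a pair counts as correct only if the
# questions for *both* of its images are answered correctly.
from collections import defaultdict

def mmvp_pair_accuracy(results):
    """results: iterable of dicts such as
       {"pair_id": 3, "image": "a", "correct": True}  (fields assumed)."""
    by_pair = defaultdict(list)
    for r in results:
        by_pair[r["pair_id"]].append(r["correct"])
    # Credit a pair only when every question in it was answered correctly.
    credited = sum(all(answers) for answers in by_pair.values())
    return credited / len(by_pair)
```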

Improving Visual Grounding in MLLMs

The paper does not stop at diagnosing deficiencies; it also proposes a path toward improvement. A Mixture-of-Features (MoF) approach mixes features from a vision-only self-supervised model (such as DINOv2) with CLIP features, either additively or by interleaving the two sets of visual tokens, strengthening the visual representation without sacrificing the model's language instruction-following ability.
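
A rough sketch of the two MoF variants is given below, assuming both encoders emit token sequences of the same length and that each is projected into the LLM's embedding space by a linear adapter. The module names, dimensions, and the project-then-mix ordering are simplifying assumptions, not the paper's exact implementation.

```python
# Hedged sketch of the two Mixture-of-Features variants; shapes, adapter
# design, and encoder choices are assumptions for illustration.
import torch
import torch.nn as nn

class AdditiveMoF(nn.Module):
    """Linearly mix projected CLIP and SSL (e.g. DINOv2) features with ratio alpha."""
    def __init__(self, clip_dim, ssl_dim, llm_dim, alpha=0.5):
        super().__init__()
        self.proj_clip = nn.Linear(clip_dim, llm_dim)
        self.proj_ssl = nn.Linear(ssl_dim, llm_dim)
        self.alpha = alpha

    def forward(self, clip_tokens, ssl_tokens):  # both [B, T, *]
        return (self.alpha * self.proj_clip(clip_tokens)
                + (1.0 - self.alpha) * self.proj_ssl(ssl_tokens))

class InterleavedMoF(nn.Module):
    """Interleave projected CLIP and SSL tokens along the sequence axis,
    preserving spatial order, so the LLM sees both views of every patch."""
    def __init__(self, clip_dim, ssl_dim, llm_dim):
        super().__init__()
        self.proj_clip = nn.Linear(clip_dim, llm_dim)
        self.proj_ssl = nn.Linear(ssl_dim, llm_dim)

    def forward(self, clip_tokens, ssl_tokens):  # both [B, T, *]
        c = self.proj_clip(clip_tokens)           # [B, T, llm_dim]
        s = self.proj_ssl(ssl_tokens)             # [B, T, llm_dim]
        mixed = torch.stack((c, s), dim=2)        # [B, T, 2, llm_dim]
        return mixed.flatten(1, 2)                # [B, 2T, llm_dim]
```

In the paper's experiments, shifting the additive mixing ratio toward the self-supervised features improves visual grounding but erodes instruction following, whereas interleaving the two token streams improves grounding while preserving instruction-following ability, at the cost of a longer visual token sequence.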

Conclusion and Future Directions

This paper shines a light on a pertinent issue in today's advanced MLLMs: their visual understanding is not as strong as is often assumed. Even models celebrated for their breadth and sophistication, such as GPT-4V, make elementary mistakes when interpreting visual information. The research suggests that the way forward includes improved or alternative visual representation methods together with more diverse evaluation benchmarks. Addressing these visual shortcomings lays the groundwork for MLLMs that capture the 'big picture' in both the literal and figurative sense.

Authors (6)
  1. Shengbang Tong (25 papers)
  2. Zhuang Liu (63 papers)
  3. Yuexiang Zhai (18 papers)
  4. Yi Ma (188 papers)
  5. Yann LeCun (173 papers)
  6. Saining Xie (60 papers)
Citations (179)