PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns (2403.13315v3)
Abstract: Large multimodal models extend the impressive capabilities of large language models (LLMs) by integrating multimodal understanding abilities. However, it is not clear how well they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of 2000 puzzle instances based on abstract patterns. With this dataset, we evaluate large multimodal models on abstract patterns built from fundamental concepts, including colors, numbers, sizes, and shapes. Through experiments on state-of-the-art large multimodal models, we find that they do not generalize well even to simple abstract patterns. Notably, GPT-4V achieves a score of only 46.4% on single-concept puzzles, showing that state-of-the-art models struggle on our dataset. To diagnose the reasoning challenges in large multimodal models, we progressively guide the models with our ground-truth reasoning explanations for visual perception, inductive reasoning, and deductive reasoning. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities. Through this work, we hope to shed light on the limitations of large multimodal models and how they can better emulate human cognitive processes in the future. Our data and code are available at https://puzzlevqa.github.io.
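The abstract describes a progressive-guidance diagnosis: each puzzle is first posed without help, then ground-truth explanations for visual perception, inductive reasoning, and deductive reasoning are supplied in turn, so the remaining errors isolate the bottleneck stage. A minimal sketch of how such an evaluation loop could be wired up is shown below; the instance field names (`question`, `options`, `perception`, `inductive`, `deductive`, `answer`, `image`) and the `query_model` callable are illustrative assumptions, not the released PuzzleVQA schema, which is documented at https://puzzlevqa.github.io.

```python
import json


def load_puzzles(path: str) -> list[dict]:
    """Read one JSON puzzle instance per line (schema assumed, not official)."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def build_prompt(instance: dict, guidance_stages: list[str]) -> str:
    """Compose a multiple-choice prompt, optionally injecting ground-truth hints."""
    parts = [instance["question"], "Options: " + ", ".join(instance["options"])]
    for stage in guidance_stages:
        # e.g. "perception", then "perception" + "inductive", then all three
        parts.append(instance[stage])
    return "\n".join(parts)


def evaluate(puzzles: list[dict], query_model, guidance_stages: list[str]) -> float:
    """Accuracy when the model receives the listed ground-truth stages as guidance."""
    correct = 0
    for instance in puzzles:
        prompt = build_prompt(instance, guidance_stages)
        prediction = query_model(image=instance["image"], prompt=prompt)
        correct += prediction.strip() == instance["answer"]
    return correct / len(puzzles)


# Progressive diagnosis as described in the abstract: compare accuracy with no
# guidance, then with perception, inductive, and deductive explanations added.
# for stages in ([], ["perception"], ["perception", "inductive"],
#                ["perception", "inductive", "deductive"]):
#     print(stages, evaluate(puzzles, query_model, stages))
```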
Authors: Yew Ken Chia, Vernon Toh Yan Han, Deepanway Ghosal, Lidong Bing, Soujanya Poria