
Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models (2407.19474v2)

Published 28 Jul 2024 in cs.CV and cs.CL

Abstract: Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, thereby alleviating the need for further information. This example illustrates how subtle visual cues can challenge our cognitive skills and demonstrates the complexity of interpreting visual scenarios. To study these skills, we present Visual Riddles, a benchmark aimed to test vision and language models on visual riddles requiring commonsense and world knowledge. The benchmark comprises 400 visual riddles, each featuring a unique image created by a variety of text-to-image models, question, ground-truth answer, textual hint, and attribution. Human evaluation reveals that existing models lag significantly behind human performance, which is at 82% accuracy, with Gemini-Pro-1.5 leading with 40% accuracy. Our benchmark comes with automatic evaluation tasks to make assessment scalable. These findings underscore the potential of Visual Riddles as a valuable resource for enhancing vision and language models' capabilities in interpreting complex visual scenarios.

Visual Riddles: A Commonsense and World Knowledge Challenge for Large Vision and Language Models

The paper "Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and LLMs" introduces a comprehensive benchmark designed to evaluate the interpretative and reasoning capabilities of vision and LLMs (VLMs). The core of this benchmark, referred to as Visual Riddles, comprises 400 visual riddles that combine synthetic images with intricate questions requiring substantial commonsense and world knowledge.

Core Contributions and Methodology

The Visual Riddles benchmark departs from existing benchmarks in the field by generating unique images with diverse text-to-image models rather than relying on pre-existing datasets. This design choice allows for greater creativity and a broader spectrum of everyday scenarios, thereby posing a more substantial challenge to VLMs.

Each visual riddle in the benchmark includes:

  • An Image: Synthetic and generated specifically for the task, designed to include subtle clues that are essential for solving the riddle.
  • A Question: Aimed at testing the model's ability to fuse visual context with commonsense reasoning.
  • A Ground-Truth Answer: The correct answer for benchmarking purposes.
  • A Textual Hint: Provided to guide models towards crucial visual elements.
  • Attributions: Source links to factual knowledge supporting the answer, enhancing the difficulty and depth of the benchmark.
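
To make the record structure concrete, the sketch below shows how a single riddle entry with these fields might be represented in Python. The class and field names are illustrative assumptions rather than the dataset's actual schema, and the example values are invented to mirror the mosquito scenario from the abstract.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VisualRiddle:
    """One benchmark entry; field names are illustrative, not the official schema."""
    image_path: str            # synthetic image generated by a text-to-image model
    question: str              # question requiring commonsense or world knowledge
    ground_truth_answer: str   # reference answer used for scoring
    hint: str                  # textual hint pointing to the crucial visual cue
    attributions: List[str] = field(default_factory=list)  # source links supporting the answer

# Invented example echoing the mosquito scenario from the abstract.
riddle = VisualRiddle(
    image_path="images/arm_scratch.png",
    question="Why is the person scratching their arm?",
    ground_truth_answer="A mosquito is visible nearby, so they were likely bitten.",
    hint="Look closely at the small insect next to the arm.",
    attributions=["https://example.org/mosquito-bite-facts"],  # placeholder URL
)
```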

The paper details the creation process involving a group of designers who hand-curated the riddles to ensure a diversity of scenarios. Images were peer-reviewed to maintain quality and challenge levels, with each incorporating commonsense or world knowledge cues.

Performance Evaluation and Key Findings

To evaluate the benchmark, the authors conducted extensive experiments using state-of-the-art vision-language models, including LLaVA, InstructBLIP, GPT-4, and Gemini-Pro variants. Human performance was also benchmarked through a rigorous annotation process on crowdsourcing platforms such as Amazon Mechanical Turk.

Experimental Framework

  • Open-Ended VQA Task: Models were presented with images and asked open-ended questions. Human annotators evaluated the correctness of models' answers.
  • Multiple-Choice VQA Task: Model performance was assessed by selecting the correct answer from a list of options, with automatic scoring based on accuracy.
  • Automatic Evaluation Tasks: Models' abilities to evaluate open-ended responses were tested in both reference-free and reference-based scenarios, identifying the best performers for auto-rating purposes.
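
As a rough illustration of how these settings can be scored, the sketch below shows an accuracy computation for the multiple-choice task and a prompt builder for a reference-based LLM auto-rater. Both functions are assumptions for illustration, not the authors' actual evaluation code.

```python
from typing import List

def multiple_choice_accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of riddles where the selected option matches the correct one."""
    assert len(predictions) == len(answers) and answers, "need one prediction per riddle"
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def reference_based_judge_prompt(question: str, candidate: str, reference: str) -> str:
    """Builds a prompt asking an LLM judge to rate an open-ended answer against
    the ground truth (the reference-based setting). The wording is hypothetical."""
    return (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate answer convey the same key information as the reference? "
        "Reply with exactly one word: correct or incorrect."
    )
```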

Numerical Performance

Humans consistently outperformed models, achieving an accuracy of 82% in open-ended VQA tasks compared to the best model, Gemini-Pro-1.5, which achieved 40%. Multiple-choice tasks yielded marginally better model performance but still highlighted a considerable gap, with GPT-4 achieving the highest accuracy at 45%.

The paper also notes marked improvements when textual hints and attributions were provided, which helped models disambiguate options in the multiple-choice task. However, models often defaulted to “cannot determine” when faced with ambiguity.

Theoretical and Practical Implications

The substantial performance gap between humans and models underscores persistent deficiencies in VLMs' ability to effectively integrate commonsense reasoning and world knowledge with visual interpretation. This work emphasizes the need for further development of multimodal AI models that can handle complex scenarios resembling real-world situations.

The integration of intricate textual hints and factual attributions in the benchmark introduces a higher cognitive burden, aligning the challenge more closely with human-like reasoning. This reinforces the importance of context-aware and knowledge-integrated AI systems, fostering advancements beyond mere visual recognition and towards sophisticated understanding and inference abilities.

Prospects and Future Developments

The Visual Riddles benchmark sets a new standard for evaluating VLMs, pushing for advancements that bridge the gap between human cognition and AI capabilities. Future research directions might include:

  • Broader Scenario Coverage: Expanding the variety and complexity of scenarios tested within the benchmark.
  • Model Interpretability: Enhancing models' abilities to explain their reasoning processes and decisions.
  • Integration with Additional Modalities: Incorporating auditory or kinesthetic cues to craft more holistic multimodal benchmarks.

In conclusion, Visual Riddles presents a rigorous testbed for future vision and LLMs, driving innovation in the integration of visual perception with extensive commonsense and world knowledge, thereby advancing the field towards more human-like AI reasoning capabilities.

For further details and access to the dataset, the authors have made the Visual Riddles dataset, code, and leaderboard publicly available at https://visual-riddles.github.io/.
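
For local experimentation, a minimal loading sketch is shown below. It assumes the dataset is distributed via the Hugging Face Hub; the repository identifier is a placeholder, so consult the project page above for the actual hosting details and field names.

```python
from datasets import load_dataset

# Placeholder repository id -- see https://visual-riddles.github.io/ for the
# dataset's actual download location and splits.
dataset = load_dataset("visual-riddles/visual_riddles")
print(dataset)  # inspect available splits and features (image, question, answer, hint, ...)
```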

Authors (9)
  1. Nitzan Bitton-Guetta
  2. Aviv Slobodkin
  3. Aviya Maimon
  4. Eliya Habba
  5. Royi Rassin
  6. Yonatan Bitton
  7. Idan Szpektor
  8. Amir Globerson
  9. Yuval Elovici