Introduction
In the domain of AI, Large Multimodal Models (LMMs) play a pivotal role in understanding and processing complex combinations of text and visual content. This evaluation concentrates on the performance of two state-of-the-art LMMs, GPT-4V and Gemini, using an online Visual Question Answering (VQA) setting derived from real-world user interactions on the Stack Exchange platform. By drawing on these interactions, the paper tests the general capabilities of the models across various criteria, including how well their answers align with actual user needs.
Dataset Overview
Researchers used the VQAonline dataset, which comprises natural-language questions, contexts, images, and verified answers collected from the Stack Exchange platform, to create a robust environment for evaluating the models. A filtering process ensures a focus on higher-quality questions, and a subset was chosen to manage the expensive, rate-limited API calls to the evaluated LMMs. The final evaluation subset consists of 1,903 samples, providing both depth and variety of content.
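A minimal sketch of how such an evaluation subset might be drawn is shown below. The file name, field layout, and random-sampling strategy are assumptions for illustration, not the paper's actual selection procedure.

```python
import json
import random

def load_samples(path):
    """Load VQAonline-style records from a JSON Lines file (hypothetical layout)."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def sample_subset(samples, size=1903, seed=0):
    """Draw a fixed-size random subset to keep API costs and rate limits manageable."""
    rng = random.Random(seed)
    return rng.sample(samples, min(size, len(samples)))

samples = load_samples("vqaonline.jsonl")  # hypothetical file name
eval_subset = sample_subset(samples)
```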
Evaluation Methodology
The methodology is designed to assess the zero-shot performance of GPT-4V and Gemini without advanced prompt-engineering techniques. Model-generated answers are rated for correctness against ground-truth answers using GPT-4, and these ratings are validated by checking their agreement with human expert judgments. In addition, metadata such as user intention and the required image processing capability is generated for each question, enabling a more fine-grained analysis.
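The sketch below illustrates this evaluation loop under stated assumptions: `query_lmm` and `rate_with_gpt4` are hypothetical stand-ins for the actual model and judge API calls, and the judge prompt is an illustrative example rather than the prompt used in the paper.

```python
# Hypothetical sketch of the zero-shot answer-then-judge loop. `query_lmm`
# sends the question, context, and image to GPT-4V or Gemini; `rate_with_gpt4`
# asks GPT-4 to judge the answer. Neither is a real API signature.

JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Reply with 1 if the candidate answer is correct, otherwise 0."
)

def evaluate_sample(sample, query_lmm, rate_with_gpt4):
    # Zero-shot: the question, context, and image are passed as-is,
    # with no few-shot examples or prompt engineering.
    candidate = query_lmm(
        question=sample["question"],
        context=sample["context"],
        image=sample["image"],
    )
    # GPT-4 acts as the judge, rating the candidate against the ground truth.
    rating = rate_with_gpt4(
        JUDGE_PROMPT.format(
            question=sample["question"],
            reference=sample["answer"],
            candidate=candidate,
        )
    )
    return {"candidate": candidate, "correct": rating.strip() == "1"}
```

Keeping one structured record per sample makes it straightforward to cross-check the GPT-4 ratings against human expert judgments on a shared subset and to aggregate accuracy by the generated metadata.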
Evaluation Results
The assessment revealed the following insights; a sketch of the per-category breakdown behind them follows the list.
- GPT-4V and Gemini displayed commendable proficiency in domains such as language learning and economics, while struggling with topics such as Puzzling and LEGO-related queries.
- GPT-4V surpassed Gemini on social science and natural science topics and was notably strong on non-visual questions, i.e., those that do not require image processing.
- Questions tagged under user intention as 'Identification' were difficult for both models, highlighting an area for potential improvement.
- In terms of image processing capabilities, GPT-4V struggled most with feature extraction tasks, whereas Gemini showed particular strength in scene understanding.
- Both models performed worst on questions requiring expert knowledge, suggesting a need for more specialized, domain-specific training.
- The most challenging image types for both models included "3D Renderings" and "Sheet Music," suggesting a gap in interpreting these image variants.
- The performance of both models declined as question difficulty increased.
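An illustrative sketch of the per-category breakdown behind these findings is shown below. The column names and toy rows are assumptions for illustration, not the paper's actual schema or data.

```python
import pandas as pd

# Toy per-sample results table; in practice each row would hold the GPT-4
# correctness rating plus the generated metadata for one evaluated question.
results = pd.DataFrame([
    {"topic": "economics", "intention": "Explanation", "image_type": "Chart",
     "difficulty": "easy", "correct": True},
    {"topic": "puzzling", "intention": "Identification", "image_type": "Sheet Music",
     "difficulty": "hard", "correct": False},
])

# Accuracy broken down by each metadata facet.
for facet in ["topic", "intention", "image_type", "difficulty"]:
    accuracy = results.groupby(facet)["correct"].mean()
    print(f"Accuracy by {facet}:\n{accuracy}\n")
```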
Conclusion
This investigation provides a comprehensive analysis of the capabilities and limitations of GPT-4V and Gemini on an authentic, user-generated VQA dataset. The findings offer significant insights into areas where these LMMs excel and where they fall short. While the analysis confirms the models' adeptness at handling a range of topics and questions, it also indicates the need for improvement in processing complex images and handling expert knowledge areas. Future work aims to refine the analysis methods and incorporate additional datasets to deepen the understanding of LMM performance.