An Evaluation of Gemini Pro's Visual Expertise: Competing with GPT-4V
The paper "A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise" presents a thorough preliminary exploration of Gemini, a newly released multimodal LLM (MLLM) by Google. This paper investigates whether Gemini has the potential to surpass GPT-4V(ision) from OpenAI, renowned for its advanced multi-modal reasoning capabilities. The assessment spans four primary domains: fundamental perception, advanced cognition, challenging vision tasks, and expert capacities, employing both qualitative and quantitative methodologies.
Fundamental Perception
Fundamental perception is assessed across multiple sub-categories, including object-centric, scene-level, and knowledge-based perception. The paper finds that both Gemini and GPT-4V excel at recognizing fundamental visual elements and responding to basic prompts. Discrepancies emerge in spatial-relationship recognition and object counting, where GPT-4V occasionally outperforms Gemini at describing complex, cluttered environments; even so, both models struggle with finer-grained tasks such as distinguishing left from right or counting objects in crowded scenes.
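The counting gap is straightforward to quantify with an exact-match check. The following sketch is a hypothetical scoring helper, not code from the paper: it extracts the first numeral from a free-form model answer and compares it against the ground-truth count for each question.

```python
import re

def counting_accuracy(answers, true_counts):
    """Exact-match accuracy for object-counting questions.

    answers: free-form model responses, e.g. "There are 4 apples."
    true_counts: the ground-truth integer count for each question.
    """
    correct = 0
    for answer, truth in zip(answers, true_counts):
        match = re.search(r"\d+", answer)  # first numeral in the response
        if match and int(match.group()) == truth:
            correct += 1
    return correct / len(true_counts) if true_counts else 0.0

# Example: three counting questions, two answered correctly.
answers = ["There are 4 apples.", "I count 7 people.", "2 dogs are visible."]
print(counting_accuracy(answers, [4, 6, 2]))  # ~0.667
```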
Advanced Cognition
Advanced cognition tests emphasize understanding and reasoning over more complex visual stimuli. Tasks include reasoning over text-rich images, problem-solving in disciplines such as mathematics and physics, abstract visual reasoning, and emotion understanding. The authors note that while both Gemini and GPT-4V demonstrate significant capabilities, GPT-4V tends to generate more detailed intermediate reasoning steps, whereas Gemini often provides concise answers, which can be advantageous for quick information retrieval. Both models, however, struggle with optical illusions, advanced abstract reasoning tasks such as Raven's Progressive Matrices, and mathematical calculations that depend on precise OCR.
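The contrast between concise answers and detailed intermediate steps can be probed directly at the prompt level. The snippet below is only an illustrative sketch, not the paper's protocol: it builds two variants of the same visual question, one requesting a bare answer and one requesting explicit step-by-step reasoning, so the two response styles can be compared side by side.

```python
# Two prompt styles for the same visual question: one asks for the answer only,
# the other asks the model to show its intermediate reasoning first.
DIRECT_TEMPLATE = "Answer with a single word or number: {question}"
COT_TEMPLATE = (
    "Reason step by step about the image, then state the final answer "
    "on its own line prefixed with 'Answer:'. Question: {question}"
)

def build_prompts(question: str) -> dict:
    """Return both prompt variants for a given visual question."""
    return {
        "direct": DIRECT_TEMPLATE.format(question=question),
        "chain_of_thought": COT_TEMPLATE.format(question=question),
    }

print(build_prompts("How many traffic lights are visible?"))
```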
Challenging Vision Tasks
In this domain, tasks such as object detection, phrase localization, and video action recognition are explored. Neither Gemini nor GPT-4V achieves the precision of dedicated vision models in object detection, and both struggle with accurately producing bounding boxes for objects based on referring expressions. Notably, GPT-4V's strategy of leveraging external tools for vision tasks underscores its limitations in direct visual reasoning. Both models show competency in video action recognition, though GPT-4V consistently provides richer narrative details.
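Phrase localization is typically scored by the intersection-over-union (IoU) between a predicted bounding box and the ground-truth box, with IoU of at least 0.5 usually counted as a hit. The helper below is a generic sketch of that standard metric, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box counts as correct if it overlaps the ground truth enough.
pred, gt = (0, 0, 100, 100), (10, 10, 110, 110)
print(iou(pred, gt), iou(pred, gt) >= 0.5)  # ~0.68 True
```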
Expert Capacity
Explorations of expert capacity cover diverse areas, including autonomous driving, defect detection, medical diagnosis, economic analysis, surveillance, remote sensing, and robot motion planning. Results indicate that Gemini and GPT-4V can adequately interpret basic visual information and make appropriate decisions. However, challenges persist in fine-grained tasks such as reading low-resolution traffic signs, detecting subtle defects, accurately diagnosing complex medical images, and providing precise economic forecasts. GPT-4V also occasionally declines to answer, citing privacy or safety concerns.
Quantitative Assessment on MME Benchmark
The MME benchmark provides a quantitative comparison across perception and cognition tasks. Overall, Gemini achieves a slightly higher combined score than GPT-4V, 1933.4 versus 1926.6. Gemini's balanced performance across subtasks contrasts with GPT-4V's particularly strong results on cognition tasks. Interestingly, the open-source SPHINX model is competitive on perception tasks but falls short in cognition scenarios, highlighting the remaining gap between open-source models and proprietary systems.
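For context on how these totals are built: as the benchmark's authors describe it, each MME image comes with two yes/no questions, and a subtask's score is the sum of question-level accuracy and image-level accuracy (both questions answered correctly), so each subtask contributes at most 200 points; the figures quoted above are sums over subtasks. The sketch below reimplements that scoring rule and is not code released with the benchmark or the paper.

```python
from collections import defaultdict

def mme_subtask_score(records):
    """MME-style score for one subtask.

    records: list of (image_id, is_correct) pairs, two yes/no questions per image.
    Returns question-level accuracy plus image-level accuracy (both questions
    right), each as a percentage, so the maximum score is 200.
    """
    per_image = defaultdict(list)
    for image_id, is_correct in records:
        per_image[image_id].append(is_correct)

    accuracy = 100.0 * sum(c for _, c in records) / len(records)
    accuracy_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)
    return accuracy + accuracy_plus

# Two images, four questions: 3/4 questions correct, 1/2 images fully correct.
records = [("img1", True), ("img1", True), ("img2", True), ("img2", False)]
print(mme_subtask_score(records))  # 75.0 + 50.0 = 125.0
```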
Conclusion and Future Directions
The paper concludes that while Gemini presents a robust challenge to GPT-4V, the two models exhibit distinct strengths and weaknesses. Gemini excels at concise, rapid responses, making it well suited to applications requiring quick decision-making; GPT-4V's detailed intermediate steps enhance transparency and explainability, which are vital for complex problem-solving. The paper also identifies issues common to both models, including limited sensitivity to spatial information, imperfect OCR accuracy, lapses in logical consistency, and a lack of robustness to prompt variations.
Future work should focus on improving fine-grained visual representations, aligning multimodal information more effectively to mitigate hallucinations, and strengthening the reasoning capabilities of the underlying LLMs. While significant strides have been made toward artificial general intelligence, the journey remains a long one, requiring continued advances in MLLMs.
In summary, this paper provides an insightful comparative analysis of Gemini and GPT-4V, contributing valuable data to the study of MLLMs and setting the stage for subsequent advances in visual understanding capabilities.