An Evaluation of Gemini Pro's Visual Expertise: Competing with GPT-4V
The paper "A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise" presents a thorough preliminary exploration of Gemini, a newly released multimodal LLM (MLLM) by Google. This paper investigates whether Gemini has the potential to surpass GPT-4V(ision) from OpenAI, renowned for its advanced multi-modal reasoning capabilities. The assessment spans four primary domains: fundamental perception, advanced cognition, challenging vision tasks, and expert capacities, employing both qualitative and quantitative methodologies.
Fundamental Perception
Fundamental perception is assessed across multiple sub-categories, including object-centric, scene-level, and knowledge-based perception. The paper finds that both Gemini and GPT-4V excel at recognizing fundamental visual elements and responding to basic prompts. Discrepancies emerge in spatial-relationship recognition and object counting, where GPT-4V occasionally outperforms Gemini at describing complex, cluttered environments; even so, both models struggle with finer-grained tasks such as distinguishing left from right or counting objects in crowded scenes.
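The counting gap is straightforward to quantify with an exact-match check. The following sketch is a hypothetical scoring helper, not code from the paper: it extracts the first numeral from a free-form model answer and compares it against the ground-truth count for each question.

```python
import re

def counting_accuracy(answers, true_counts):
    """Exact-match accuracy for object-counting questions.

    answers: free-form model responses, e.g. "There are 4 apples."
    true_counts: the ground-truth integer count for each question.
    """
    correct = 0
    for answer, truth in zip(answers, true_counts):
        match = re.search(r"\d+", answer)  # first numeral in the response
        if match and int(match.group()) == truth:
            correct += 1
    return correct / len(true_counts) if true_counts else 0.0

# Example: three counting questions, two answered correctly.
answers = ["There are 4 apples.", "I count 7 people.", "2 dogs are visible."]
print(counting_accuracy(answers, [4, 6, 2]))  # ~0.667
```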
Advanced Cognition
Advanced cognition tests emphasize understanding and reasoning over more complex visual stimuli. Tasks include reasoning over text-rich images, problem-solving in disciplines such as mathematics and physics, abstract visual reasoning, and emotion understanding. The authors note that while both Gemini and GPT-4V demonstrate significant capabilities, GPT-4V tends to generate more detailed intermediate reasoning steps, whereas Gemini often provides concise answers, which can be advantageous for quick information retrieval. Both models, however, struggle with optical illusions, advanced abstract reasoning tasks such as Raven's Progressive Matrices, and mathematical calculations that depend on precise OCR.
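The contrast between concise answers and detailed intermediate steps can be probed directly at the prompt level. The snippet below is only an illustrative sketch, not the paper's protocol: it builds two variants of the same visual question, one requesting a bare answer and one requesting explicit step-by-step reasoning, so the two response styles can be compared side by side.

```python
# Two prompt styles for the same visual question: one asks for the answer only,
# the other asks the model to show its intermediate reasoning first.
DIRECT_TEMPLATE = "Answer with a single word or number: {question}"
COT_TEMPLATE = (
    "Reason step by step about the image, then state the final answer "
    "on its own line prefixed with 'Answer:'. Question: {question}"
)

def build_prompts(question: str) -> dict:
    """Return both prompt variants for a given visual question."""
    return {
        "direct": DIRECT_TEMPLATE.format(question=question),
        "chain_of_thought": COT_TEMPLATE.format(question=question),
    }

print(build_prompts("How many traffic lights are visible?"))
```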
Challenging Vision Tasks
In this domain, tasks such as object detection, phrase localization, and video action recognition are explored. Neither Gemini nor GPT-4V achieves the precision of dedicated vision models in object detection, and both struggle with accurately producing bounding boxes for objects based on referring expressions. Notably, GPT-4V's strategy of leveraging external tools for vision tasks underscores its limitations in direct visual reasoning. Both models show competency in video action recognition, though GPT-4V consistently provides richer narrative details.
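Phrase localization is typically scored by the intersection-over-union (IoU) between a predicted bounding box and the ground-truth box, with IoU of at least 0.5 usually counted as a hit. The helper below is a generic sketch of that standard metric, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlapping rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box counts as correct if it overlaps the ground truth enough.
pred, gt = (0, 0, 100, 100), (10, 10, 110, 110)
print(iou(pred, gt), iou(pred, gt) >= 0.5)  # ~0.68 True
```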
Expert Capacity
Explorations of expert capacity cover diverse areas, including autonomous driving, defect detection, medical diagnosis, economic analysis, surveillance, remote sensing, and robot motion planning. Results indicate that Gemini and GPT-4V can adequately interpret basic visual information and make appropriate decisions. However, challenges persist in fine-grained tasks such as reading low-resolution traffic signs, detecting subtle defects, accurately diagnosing complex medical images, and providing precise economic forecasts. GPT-4V also occasionally declines to answer, citing privacy or safety concerns.
Quantitative Assessment on MME Benchmark
The MME benchmark provides a quantitative comparison across perception and cognition tasks. Overall, Gemini achieves a slightly higher combined score than GPT-4V, 1933.4 versus 1926.6. Gemini's balanced performance across subtasks contrasts with GPT-4V's particularly strong results on cognition tasks. Interestingly, the open-source SPHINX model is competitive on perception tasks but falls short in cognition scenarios, highlighting the remaining gap between open-source models and proprietary systems.
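For context on how these totals are built: as the benchmark's authors describe it, each MME image comes with two yes/no questions, and a subtask's score is the sum of question-level accuracy and image-level accuracy (both questions answered correctly), so each subtask contributes at most 200 points; the figures quoted above are sums over subtasks. The sketch below reimplements that scoring rule and is not code released with the benchmark or the paper.

```python
from collections import defaultdict

def mme_subtask_score(records):
    """MME-style score for one subtask.

    records: list of (image_id, is_correct) pairs, two yes/no questions per image.
    Returns question-level accuracy plus image-level accuracy (both questions
    right), each as a percentage, so the maximum score is 200.
    """
    per_image = defaultdict(list)
    for image_id, is_correct in records:
        per_image[image_id].append(is_correct)

    accuracy = 100.0 * sum(c for _, c in records) / len(records)
    accuracy_plus = 100.0 * sum(all(v) for v in per_image.values()) / len(per_image)
    return accuracy + accuracy_plus

# Two images, four questions: 3/4 questions correct, 1/2 images fully correct.
records = [("img1", True), ("img1", True), ("img2", True), ("img2", False)]
print(mme_subtask_score(records))  # 75.0 + 50.0 = 125.0
```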
Conclusion and Future Directions
The paper concludes that while Gemini presents a robust challenge to GPT-4V, the two models exhibit distinct strengths and weaknesses. Gemini excels at concise, rapid responses, making it well suited to applications requiring quick decision-making; GPT-4V's detailed intermediate steps enhance transparency and explainability, which are vital for complex problem-solving. The paper also identifies issues common to both models, including limited sensitivity to spatial information, imperfect OCR accuracy, lapses in logical consistency, and a lack of robustness to prompt variations.
Future work should focus on improving fine-grained visual representations, aligning multimodal information more effectively to mitigate hallucinations, and strengthening the reasoning capabilities of the underlying LLMs. While significant strides have been made toward artificial general intelligence, the journey remains a long one, requiring continued advances in MLLMs.
In summary, this paper provides an insightful comparative analysis of Gemini and GPT-4V, contributing valuable data to the study of MLLMs and setting the stage for subsequent advances in visual understanding capabilities.