Evaluation of GPT-4o and Multimodal Foundation Models on Standard Computer Vision Tasks
- The paper introduces a prompt chaining framework that decomposes vision tasks into text-promptable sub-tasks to harness the strengths of multimodal models.
- The paper finds that GPT-4o excels at semantic classification but underperforms in object detection and dense geometric prediction.
- The paper shows that while multimodal foundation models serve as capable generalists, they lag behind specialist vision models on precision metrics.
This paper presents a comprehensive empirical evaluation of leading multimodal foundation models (MFMs), including GPT-4o, Gemini 1.5 Pro and 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, and Llama 3.2, on a suite of classical computer vision tasks: image classification, object detection, semantic segmentation, grouping, depth estimation, and surface normal prediction. The evaluation is conducted on established datasets such as ImageNet, COCO, and Hypersim, with a focus on quantifying the visual understanding capabilities of MFMs relative to state-of-the-art vision specialist models.
Methodology: Prompt Chaining for Vision Tasks
A central challenge addressed in this work is the incompatibility between the text-based output interfaces of most MFMs and the structured outputs required by standard vision tasks (e.g., segmentation masks, bounding boxes, dense depth maps). The authors introduce a prompt chaining framework that decomposes each vision task into a sequence of text-promptable sub-tasks, leveraging the models' strengths in image classification and visual reasoning. For example:
- Object Detection: The image is recursively partitioned into grids, and the model is prompted to report whether the target object is present in each cell, progressively narrowing down its location (first sketch below).
- Semantic Segmentation: Images are over-segmented into superpixels, and the model is prompted to classify each superpixel, with multi-scale context provided to improve accuracy (second sketch below).
- Depth and Surface Normals: Pairwise comparisons between superpixels are used to infer relative depth or normal orientation, and a global ranking is then recovered via optimization (third sketch below).
This approach enables a standardized, API-compatible benchmarking protocol for MFMs, facilitating direct comparison with vision specialists under controlled algorithmic constraints.
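As a concrete illustration of the recursive-grid chain, the following minimal sketch narrows down an object's location with yes/no queries per grid cell. The `ask_model` callback is a hypothetical stand-in for an MFM API call, and the 2x2 split schedule is an assumption; the paper's actual prompts and grid schedule may differ.

```python
from PIL import Image

def localize(image: Image.Image, target: str, ask_model, box=None, depth=0, max_depth=3):
    """Recursively narrow a bounding box for `target` using yes/no queries on grid cells.

    ask_model(crop, target) -> bool is a placeholder for an MFM call that answers whether
    the target object is visible in the crop. Returns the finest-level cells that still
    contain the target; their tight union approximates the final bounding box.
    """
    if box is None:
        box = (0, 0, image.width, image.height)
    if depth == max_depth:
        return [box]
    left, top, right, bottom = box
    hits = []
    for gy in range(2):
        for gx in range(2):  # split the current box into a 2x2 grid of cells
            cell = (left + (right - left) * gx // 2,
                    top + (bottom - top) * gy // 2,
                    left + (right - left) * (gx + 1) // 2,
                    top + (bottom - top) * (gy + 1) // 2)
            if ask_model(image.crop(cell), target):
                hits.extend(localize(image, target, ask_model, cell, depth + 1, max_depth))
    return hits
```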
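For segmentation, a similar sketch labels each superpixel with a classification call and paints the result back into a dense label map. SLIC is used here only for illustration and `classify_crop` is a hypothetical placeholder for the model prompt; the paper's exact over-segmentation and multi-scale prompting details may differ.

```python
import numpy as np
from PIL import Image
from skimage.segmentation import slic

def segment_with_mfm(image: Image.Image, class_names, classify_crop, n_segments=200):
    """Label each superpixel via an MFM classification call and return a dense H x W map.

    classify_crop(crop, class_names) -> int is a placeholder for the model prompt;
    the paper additionally supplies multi-scale context around each superpixel.
    """
    rgb = np.asarray(image.convert("RGB"))
    superpixels = slic(rgb, n_segments=n_segments, start_label=0)
    labels = np.zeros(superpixels.shape, dtype=np.int64)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        ys, xs = np.nonzero(mask)
        # Crop a window around the superpixel and ask the model which class it belongs to.
        crop = image.crop((int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1))
        labels[mask] = classify_crop(crop, class_names)
    return labels
```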
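For relative depth, the pairwise judgments must be aggregated into a global ordering. The least-squares formulation below is one simple way to do this and is an assumption on our part; the paper states only that a global ranking is obtained via optimization.

```python
import numpy as np

def aggregate_pairwise_depth(num_superpixels, farther_pairs):
    """Aggregate pairwise judgments into a per-superpixel relative-depth score.

    farther_pairs: list of (i, j) pairs meaning superpixel j was judged farther than i.
    Solves s[j] - s[i] ~= 1 in the least-squares sense, with s[0] pinned to 0 to fix
    the arbitrary offset. Higher score = farther, up to a monotone rescaling.
    """
    rows = len(farther_pairs) + 1
    A = np.zeros((rows, num_superpixels))
    b = np.zeros(rows)
    for r, (i, j) in enumerate(farther_pairs):
        A[r, i], A[r, j] = -1.0, 1.0
        b[r] = 1.0
    A[-1, 0] = 1.0  # gauge-fixing row: anchor the first superpixel at depth score 0
    scores, *_ = np.linalg.lstsq(A, b, rcond=None)
    return scores
```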
Experimental Results
Image Classification
MFMs, particularly GPT-4o and Gemini 2.0 Flash, achieve strong results across ImageNet and its variants, demonstrating robustness to distribution shifts and corruptions. However, all MFMs fall short of the top-performing vision specialists (e.g., Model Soups ViT-G, OpenCLIP H), with a gap of approximately 7–14 percentage points in top-1 accuracy on ImageNet.
Object Detection
All MFMs underperform relative to specialist models such as Co-DETR and DETR. GPT-4o achieves the highest MFM performance (AP50: 60.62), but this remains significantly below the specialist baseline (Co-DETR AP50: 91.30). The prompt chaining approach is shown to be critical; direct regression of bounding box coordinates by MFMs yields substantially worse results.
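For context, AP50 counts a predicted box as correct only when its intersection-over-union (IoU) with a matching ground-truth box of the same class reaches 0.5; the snippet below shows that criterion in isolation (not the full COCO evaluation protocol).

```python
def box_iou(a, b):
    """IoU of two boxes given as (left, top, right, bottom)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def matches_at_ap50(pred_box, gt_box):
    # AP50 criterion: a detection counts as a true positive if IoU >= 0.5
    return box_iou(pred_box, gt_box) >= 0.5
```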
Semantic Segmentation and Grouping
MFMs exhibit nontrivial segmentation capabilities, with GPT-4o again leading among MFMs (mIoU: 44.89), but a substantial gap persists compared to OneFormer (mIoU: 65.52). For grouping, GPT-4o and Gemini 2.0 Flash perform best among MFMs, but all models lag behind the Segment Anything Model (SAM).
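The mIoU figures above average per-class intersection-over-union over the classes present; a minimal sketch of the standard metric (not the paper's exact evaluation script) is:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in either the prediction or the ground truth."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```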
Depth and Surface Normal Prediction
Performance on geometric tasks is notably weaker. All MFMs are outperformed by Omnidata and 4M-21, with the gap more pronounced than in semantic tasks. Reasoning-focused models (e.g., o4-mini, o1, o3) show relative improvements in geometric reasoning, particularly in surface normal prediction, where they correct common failure modes observed in GPT-4o.
Prompt Sensitivity and Cost
Prompt chaining significantly improves performance over direct prompting, especially for structured tasks. Better-performing models (e.g., GPT-4o) exhibit lower sensitivity to prompt variations. The evaluation framework incurs substantial computational and monetary cost due to the large number of API calls required for dense predictions.
Image Generation Capabilities
Preliminary analysis of GPT-4o's native image generation reveals that outputs tend to be semantic recreations rather than precise per-pixel edits, leading to hallucinations and spatial misalignments. This limits the direct applicability of current image generation features for dense vision tasks.
Key Findings and Claims
- MFMs trail vision specialists: No MFM is competitive with state-of-the-art specialist models on any of the standard vision tasks evaluated.
- MFMs are strong generalists: Despite being trained primarily on image-text data, they achieve respectable performance across a wide range of tasks.
- Semantic understanding exceeds geometric reasoning: MFMs perform better on tasks requiring semantic classification than on those requiring geometric or 3D understanding.
- Prompt chaining is essential: Decomposing tasks into sub-tasks aligned with the models' strengths is necessary for extracting maximal performance.
- Reasoning models show promise for geometric tasks: Recent models with explicit reasoning capabilities (e.g., o1, o3, o4-mini) demonstrate improved performance on depth and surface normal estimation.
- Image generation outputs are not yet suitable for dense vision evaluation: Current models hallucinate and misalign outputs, indicating a need for further research.
Implications and Future Directions
Practical Implications
- Benchmarking: The prompt chaining framework provides a standardized method for evaluating MFMs on vision tasks, enabling fair comparison with vision specialists and across MFMs.
- Deployment: While MFMs can serve as generalist vision systems, their performance is insufficient for applications requiring high-precision geometric understanding or dense predictions.
- Prompt Engineering: Careful design of prompt chains is critical for leveraging MFM capabilities in structured vision tasks; direct prompting is inadequate.
Theoretical Implications
- Training Regimes: The observed gap between MFMs and vision specialists suggests that current multimodal pretraining (dominated by image-text pairs) is insufficient for learning fine-grained geometric representations.
- Model Architecture: The success of reasoning-augmented MFMs on geometric tasks points to the value of explicit reasoning modules or training objectives targeting 3D understanding.
Future Developments
- Closing the Gap: Future MFMs may incorporate direct supervision on dense vision tasks, or integrate architectural inductive biases from vision specialists, to bridge the performance gap.
- Efficient Evaluation: Reducing the computational and monetary cost of benchmarking dense vision tasks with MFMs is an open challenge, potentially addressable via more efficient prompt chaining or model-side adaptations.
- Robust Image Generation: Improving the fidelity and spatial alignment of image generation outputs is necessary for MFMs to be evaluated (and deployed) on dense prediction tasks without prompt chaining.
Conclusion
This work establishes a rigorous, extensible benchmark for quantifying the visual understanding of MFMs on standard computer vision tasks. The results highlight both the versatility and current limitations of MFMs, particularly in geometric reasoning and dense prediction. The prompt chaining methodology and open-source evaluation suite provide a foundation for tracking progress as future MFMs evolve toward more comprehensive visual intelligence.