How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks (2507.01955v1)

Published 2 Jul 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Multimodal foundation models, such as GPT-4o, have recently made remarkable progress, but it is not clear where exactly these models stand in terms of understanding vision. In this paper, we benchmark the performance of popular multimodal foundation models (GPT-4o, o4-mini, Gemini 1.5 Pro and Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, Llama 3.2) on standard computer vision tasks (semantic segmentation, object detection, image classification, depth and surface normal prediction) using established datasets (e.g., COCO, ImageNet and its variants, etc). The main challenges to performing this are: 1) most models are trained to output text and cannot natively express versatile domains, such as segments or 3D geometry, and 2) many leading models are proprietary and accessible only at an API level, i.e., there is no weight access to adapt them. We address these challenges by translating standard vision tasks into equivalent text-promptable and API-compatible tasks via prompt chaining to create a standardized benchmarking framework. We observe that 1) the models are not close to the state-of-the-art specialist models at any task. However, 2) they are respectable generalists; this is remarkable as they are presumably trained on primarily image-text-based tasks. 3) They perform semantic tasks notably better than geometric ones. 4) While the prompt-chaining techniques affect performance, better models exhibit less sensitivity to prompt variations. 5) GPT-4o performs the best among non-reasoning models, securing the top position in 4 out of 6 tasks, 6) reasoning models, e.g. o3, show improvements in geometric tasks, and 7) a preliminary analysis of models with native image generation, like the latest GPT-4o, shows they exhibit quirks like hallucinations and spatial misalignments.

Summary

  • The paper introduces a prompt chaining framework that decomposes dense vision tasks into manageable, text-based sub-tasks.
  • The study finds that GPT-4o performs relatively well on semantic tasks such as classification (77.2% top-1 on ImageNet) but falls well short of specialist models on geometric tasks.
  • The results indicate that reasoning augmentation and refined prompt engineering can partially bridge the gap in complex, structured vision challenges.

Evaluation of GPT-4o and Multimodal Foundation Models on Standard Computer Vision Tasks

This paper presents a comprehensive empirical evaluation of leading multimodal foundation models (MFMs), including GPT-4o, Gemini 1.5 Pro and 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, and Llama 3.2, on a suite of classical computer vision tasks: image classification, object detection, semantic segmentation, grouping, depth estimation, and surface normal prediction. It addresses the challenge of benchmarking MFMs, whose primary output modality is text, on tasks that traditionally require dense, structured outputs such as pixel-wise segmentation or 3D geometry prediction.

Methodology

The authors introduce a prompt chaining framework that decomposes each vision task into a sequence of sub-tasks, each solvable via text-based prompts. For example, object detection is reformulated as a recursive grid search, where the model is queried about the presence of objects in image regions, progressively narrowing down bounding boxes. Semantic segmentation is approached by clustering images into superpixels and classifying each group, leveraging the models' relative strength in image classification. Depth and surface normal estimation are cast as pairwise region ranking problems, with global rankings inferred via optimization.
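
To make the detection chain concrete, the following is a minimal sketch of how such a recursive grid search could be structured. It is not the authors' released code: `ask_model` is a hypothetical stand-in for an API call that answers, in text, whether a target class appears in a cropped region, and the split factor and stopping size are illustrative assumptions rather than the paper's exact settings.

```python
# Illustrative sketch of a recursive grid-search detection chain (not the paper's code).
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixel coordinates


def recursive_grid_search(
    ask_model: Callable[[Box, str], bool],  # hypothetical: True if `label` is visible in the crop
    region: Box,
    label: str,
    min_size: float = 80.0,  # assumed stopping resolution
    splits: int = 2,         # assumed 2x2 split per recursion step
) -> List[Box]:
    """Return candidate boxes by recursively querying the model on sub-regions."""
    x0, y0, x1, y1 = region
    # Stop refining once the region is small enough; treat it as a detection.
    if (x1 - x0) <= min_size or (y1 - y0) <= min_size:
        return [region]

    found: List[Box] = []
    w, h = (x1 - x0) / splits, (y1 - y0) / splits
    for i in range(splits):
        for j in range(splits):
            sub: Box = (x0 + i * w, y0 + j * h, x0 + (i + 1) * w, y0 + (j + 1) * h)
            if ask_model(sub, label):  # text-promptable presence check on the crop
                found.extend(recursive_grid_search(ask_model, sub, label, min_size, splits))
    # If no sub-region answers positively but the parent did, keep the parent box.
    return found or [region]


if __name__ == "__main__":
    # Toy stand-in: "detect" a synthetic object occupying the top-left corner of the image.
    def fake_ask(box: Box, label: str) -> bool:
        x0, y0, _, _ = box
        return x0 < 300 and y0 < 300  # pretend the object lives in [0, 300) x [0, 300)

    boxes = recursive_grid_search(fake_ask, (0.0, 0.0, 640.0, 480.0), "dog")
    print(len(boxes), boxes[:2])
```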

This approach enables the evaluation of MFMs through their public APIs, without requiring access to model weights or internal representations. The framework is not proposed as a practical solution for deploying MFMs in production vision pipelines, but rather as a standardized benchmarking tool to assess and compare their visual understanding capabilities.

Experimental Results

1. Performance Relative to Vision Specialists

Across all tasks, MFMs—despite their generalist training—consistently underperform compared to state-of-the-art vision specialist models. For instance, in image classification on ImageNet, GPT-4o achieves 77.2% top-1 accuracy, trailing behind specialist models like Model Soups ViT-G (90.94%). In object detection (COCO), GPT-4o attains an AP of 31.87, compared to 80.23 for Co-DETR. For semantic segmentation (COCO), GPT-4o reaches 44.89 mIoU, while OneFormer achieves 65.52.

2. Semantic vs. Geometric Tasks

A key finding is that MFMs perform notably better on semantic tasks (classification, segmentation, grouping) than on geometric tasks (depth, surface normals). For depth estimation, the best MFMs achieve a Spearman correlation of ~0.54, compared to 0.95 for Omnidata. For surface normals, most MFMs fail to achieve positive correlation along the horizontal axis, indicating a lack of robust 3D understanding.
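
As a rough illustration of how such pairwise queries can be scored against ground truth, the sketch below aggregates synthetic "which region is closer" judgments into a global ordering via simple win counts (a simplification; the paper infers the global ranking via optimization) and evaluates it with Spearman correlation. All data and the judgment function are synthetic stand-ins.

```python
# Illustrative scoring sketch for pairwise depth ranking (synthetic data, not the paper's pipeline).
from itertools import combinations
import random

from scipy.stats import spearmanr

random.seed(0)

n_regions = 20
true_depth = [random.uniform(1.0, 10.0) for _ in range(n_regions)]  # synthetic ground-truth depths


def noisy_pairwise_judgment(i: int, j: int, error_rate: float = 0.2) -> bool:
    """Hypothetical stand-in for the model's answer: True if region i is judged closer than j."""
    correct = true_depth[i] < true_depth[j]
    return correct if random.random() > error_rate else not correct


# Aggregate pairwise answers into per-region "judged closer" win counts.
wins = [0] * n_regions
for i, j in combinations(range(n_regions), 2):
    if noisy_pairwise_judgment(i, j):
        wins[i] += 1
    else:
        wins[j] += 1

# More wins = judged closer more often. Negating the win counts gives a score that
# increases with distance, so it can be correlated directly with ground-truth depth.
rho, _ = spearmanr([-w for w in wins], true_depth)
print(f"Spearman correlation with ground-truth depth: {rho:.2f}")
```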

3. Model Ranking and Prompt Sensitivity

GPT-4o is the strongest non-reasoning MFM, ranking first in 4 out of 6 tasks. Gemini 2.0 Flash and 1.5 Pro follow, with Claude 3.5 Sonnet, Qwen2-VL, and Llama 3.2 trailing. The paper also evaluates reasoning-augmented models (o1, o3, o4-mini), which show improved performance on geometric tasks, suggesting that explicit reasoning capabilities can partially compensate for the lack of direct geometric supervision.

Prompt chaining significantly outperforms direct prompting for structured tasks, and better-performing models exhibit less sensitivity to prompt variations. However, all models remain sensitive to prompt design, especially for tasks requiring fine-grained localization or geometric reasoning.

4. Image Generation Capabilities

Preliminary analysis of GPT-4o's native image generation reveals that, when tasked with producing dense outputs (e.g., segmentation masks), the model tends to generate semantic recreations rather than precise, spatially aligned edits. This leads to hallucinations and misalignments, limiting the utility of current image generation features for dense prediction tasks.

5. Generalization and Data Contamination

To address concerns about data contamination, the authors evaluate MFMs on "in-the-wild" images released after the models' training cutoffs. The models demonstrate reasonable generalization, but the performance gap with vision specialists persists.

6. Computational and Cost Considerations

The prompt chaining framework incurs significant computational and monetary costs due to the large number of API calls required for dense tasks. While this is acceptable for benchmarking, it is prohibitive for practical deployment.
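
As a purely illustrative back-of-the-envelope, the snippet below tallies API calls for a single image under assumed chain parameters (superpixel count, pairwise comparisons, grid-search depth); the actual figures depend on the chain configuration and are not taken from the paper.

```python
# Back-of-the-envelope API call count for one image, using purely illustrative
# parameters (not figures from the paper).
n_superpixels = 400          # assumed segmentation granularity (one classification query each)
n_region_pairs = 300         # assumed pairwise depth/normal comparisons
detection_queries = 4 ** 4   # assumed worst case: 2x2 grid search, four levels deep

total_calls = n_superpixels + n_region_pairs + detection_queries
print(f"Illustrative API calls for one image across tasks: {total_calls}")
```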

Implications and Future Directions

Practical Implications

  • MFMs as Generalists: While MFMs like GPT-4o are not competitive with vision specialists on dense prediction tasks, their respectable performance as generalists—especially on semantic tasks—suggests utility in scenarios where task diversity and flexibility are prioritized over peak accuracy.
  • Prompt Engineering: The strong dependence on prompt design highlights the need for systematic prompt optimization and possibly automated prompt search for robust benchmarking and deployment.
  • API-Only Evaluation: The prompt chaining methodology provides a viable path for benchmarking closed-source MFMs, enabling fair comparison with open-weight and specialist models.

Theoretical Implications

  • Semantic Bias: The superior performance on semantic tasks indicates that current MFM training regimes, dominated by image-text data, bias models toward semantic understanding at the expense of geometric reasoning.
  • Reasoning Augmentation: The improvements observed in reasoning-augmented models (o1, o3, o4-mini) on geometric tasks suggest that explicit reasoning modules or training objectives may help bridge the gap in 3D understanding.

Limitations and Open Questions

  • Efficiency: The high inference cost of prompt chaining precludes its use in real-time or large-scale applications.
  • Upper Bounds: The framework's reliance on superpixels and region-based queries imposes algorithmic constraints, but control baselines (oracle predictions fed through the same chain) show that the chain, particularly at finer granularity, admits substantially stronger results, indicating that current MFM performance is not bottlenecked by the evaluation method.
  • Image Generation: Current image generation capabilities in MFMs are insufficient for dense prediction tasks due to spatial misalignment and hallucinations. Further research is needed to align generative outputs with structured vision requirements.

Future Developments

  • Direct Training on Vision Tasks: Closing the gap with vision specialists will likely require MFMs to be trained directly on dense vision tasks, possibly via multi-task or multi-modal objectives that include pixel-wise supervision.
  • Unified Benchmarks: The open-sourcing of the evaluation framework will facilitate standardized benchmarking of future MFMs, enabling the community to track progress across both semantic and geometric axes.
  • Automated Prompt Optimization: Developing automated or learning-based prompt optimization strategies could further unlock the latent capabilities of MFMs on structured tasks.

Conclusion

This work establishes a rigorous, extensible benchmark for evaluating the visual understanding of multimodal foundation models on standard computer vision tasks. The results provide a clear, quantitative baseline for the current generation of MFMs, highlight the semantic-geometric performance gap, and point toward promising directions for model and benchmark development. The findings underscore the need for continued research into both model architectures and evaluation methodologies to realize the full potential of multimodal AI systems in vision-centric domains.