- The paper introduces a prompt chaining framework that decomposes dense vision tasks into manageable, text-based sub-tasks.
- The study finds that GPT-4o performs respectably on semantic tasks such as classification (77.2% top-1 on ImageNet) but lags well behind specialist models, particularly on geometric tasks.
- The results indicate that reasoning augmentation and careful prompt engineering can partially close the gap on complex, structured vision tasks.
Evaluation of GPT-4o and Multimodal Foundation Models on Standard Computer Vision Tasks
This paper presents a comprehensive empirical evaluation of leading multimodal foundation models (MFMs), including GPT-4o, Gemini 1.5 Pro and 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, and Llama 3.2, on a suite of classical computer vision tasks: image classification, object detection, semantic segmentation, grouping, depth estimation, and surface normal prediction. The paper addresses the challenge of benchmarking MFMs—whose primary output modality is text—on tasks that traditionally require dense, structured outputs, such as pixel-wise segmentation or 3D geometry prediction.
Methodology
The authors introduce a prompt chaining framework that decomposes each vision task into a sequence of sub-tasks, each solvable via text-based prompts. For example, object detection is reformulated as a recursive grid search, where the model is queried about the presence of objects in image regions, progressively narrowing down bounding boxes. Semantic segmentation is approached by clustering images into superpixels and classifying each group, leveraging the models' relative strength in image classification. Depth and surface normal estimation are cast as pairwise region ranking problems, with global rankings inferred via optimization.
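To make the query pattern concrete, below is a minimal sketch of the recursive grid-search idea behind the detection chain, written as generic Python rather than the authors' released code. The names `refine_box`, `merge_boxes`, `query_presence`, and the `oracle` stub are hypothetical stand-ins for a text-prompted MFM call, and merging all positive cells into a single box is a deliberate simplification.

```python
# Sketch of a recursive grid search driven by yes/no presence queries.
# `query_presence` is a hypothetical wrapper around a text-based MFM API call.
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


def refine_box(
    image,
    label: str,
    box: Box,
    query_presence: Callable[..., bool],
    grid: int = 2,
    min_size: int = 32,
) -> List[Box]:
    """Recursively subdivide `box`, keeping only cells where the model reports `label`."""
    x0, y0, x1, y1 = box
    if min(x1 - x0, y1 - y0) <= min_size:
        return [box]  # small enough: treat as a positive leaf cell

    positives: List[Box] = []
    cell_w = (x1 - x0) / grid
    cell_h = (y1 - y0) / grid
    for i in range(grid):
        for j in range(grid):
            cell: Box = (
                int(x0 + i * cell_w),
                int(y0 + j * cell_h),
                int(x0 + (i + 1) * cell_w),
                int(y0 + (j + 1) * cell_h),
            )
            if query_presence(image, label, cell):  # one text-based API call per cell
                positives.extend(
                    refine_box(image, label, cell, query_presence, grid, min_size)
                )
    return positives


def merge_boxes(boxes: List[Box]) -> Box:
    """Union of all positive leaf cells as a single bounding-box estimate."""
    if not boxes:
        raise ValueError("no positive cells found")
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))


if __name__ == "__main__":
    # Usage example with an "oracle" standing in for the real MFM API call:
    # pretend the queried object occupies the box (40, 40, 90, 100).
    target: Box = (40, 40, 90, 100)

    def oracle(_image, _label: str, cell: Box) -> bool:
        cx0, cy0, cx1, cy1 = cell
        tx0, ty0, tx1, ty1 = target
        return cx0 < tx1 and tx0 < cx1 and cy0 < ty1 and ty0 < cy1  # overlap test

    cells = refine_box(None, "dog", (0, 0, 256, 256), oracle)
    print(merge_boxes(cells))  # coarse box covering the target, e.g. (32, 32, 96, 128)
```

Each recursion level trades additional API calls for finer localization, the same cost-versus-granularity trade-off discussed in the cost section below.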
This approach enables the evaluation of MFMs through their public APIs, without requiring access to model weights or internal representations. The framework is not proposed as a practical solution for deploying MFMs in production vision pipelines, but rather as a standardized benchmarking tool to assess and compare their visual understanding capabilities.
Experimental Results
1. Overall Performance Gap
Across all tasks, MFMs, despite their generalist training, consistently underperform state-of-the-art vision specialists. In image classification on ImageNet, GPT-4o achieves 77.2% top-1 accuracy, trailing specialist models such as Model Soups ViT-G (90.94%). In object detection on COCO, GPT-4o attains an AP of 31.87, versus 80.23 for Co-DETR. In semantic segmentation on COCO, GPT-4o reaches 44.89 mIoU, while OneFormer achieves 65.52.
2. Semantic vs. Geometric Tasks
A key finding is that MFMs perform notably better on semantic tasks (classification, segmentation, grouping) than on geometric tasks (depth, surface normals). For depth estimation, the best MFMs achieve a Spearman correlation of ~0.54, compared to 0.95 for Omnidata. For surface normals, most MFMs fail to achieve positive correlation along the horizontal axis, indicating a lack of robust 3D understanding.
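As a concrete illustration of how the depth numbers above are scored, here is a small sketch with made-up values that aggregates hypothetical pairwise "which region is closer?" answers into a global ordering and evaluates it with `scipy.stats.spearmanr`. A simple win-count aggregation stands in for the paper's optimization step, and none of the numbers come from the paper.

```python
# Sketch: from hypothetical pairwise depth answers to a Spearman score.
import numpy as np
from scipy.stats import spearmanr

# closer[i, j] = 1 if the model answered that region i is closer than region j.
closer = np.array([
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
])

# Aggregate: regions that win more comparisons are ranked as closer
# (a win-count proxy for the paper's global-ranking optimization).
wins = closer.sum(axis=1)
pred_rank = (-wins).argsort().argsort()  # 0 = closest predicted region

# Illustrative ground-truth mean depth per region (smaller = closer).
gt_depth = np.array([0.9, 2.7, 0.5, 4.1, 1.8])
gt_rank = gt_depth.argsort().argsort()

rho, _ = spearmanr(pred_rank, gt_rank)
print(f"Spearman correlation: {rho:.2f}")  # 1.00 here: a perfectly consistent ordering
```

A correlation near 1.0 means the recovered ordering matches the ground-truth depth ordering; the ~0.54 reported for the best MFMs corresponds to a substantially noisier ordering.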
3. Model Ranking and Prompt Sensitivity
GPT-4o is the strongest non-reasoning MFM, ranking first in 4 out of 6 tasks. Gemini 2.0 Flash and 1.5 Pro follow, with Claude 3.5 Sonnet, Qwen2-VL, and Llama 3.2 trailing. The paper also evaluates reasoning-augmented models (o1, o3, o4-mini), which show improved performance on geometric tasks, suggesting that explicit reasoning capabilities can partially compensate for the lack of direct geometric supervision.
Prompt chaining significantly outperforms direct prompting for structured tasks, and better-performing models exhibit less sensitivity to prompt variations. However, all models remain susceptible to prompt design, especially for tasks requiring fine-grained localization or geometric reasoning.
4. Image Generation Capabilities
Preliminary analysis of GPT-4o's native image generation reveals that, when tasked with producing dense outputs (e.g., segmentation masks), the model tends to generate semantic recreations rather than precise, spatially aligned edits. This leads to hallucinations and misalignments, limiting the utility of current image generation features for dense prediction tasks.
5. Generalization and Data Contamination
To address concerns about data contamination, the authors evaluate MFMs on "in-the-wild" images released after the models' training cutoffs. The models demonstrate reasonable generalization, but the performance gap with vision specialists persists.
6. Computational and Cost Considerations
The prompt chaining framework incurs significant computational and monetary costs due to the large number of API calls required for dense tasks. While this is acceptable for benchmarking, it is prohibitive for practical deployment.
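For intuition only, the following back-of-envelope sketch uses assumed parameters, not values reported in the paper, to show how API-call counts grow with output density.

```python
# Illustrative call-count estimate under assumed parameters; the numbers only
# show how costs scale with the density of the required output.
superpixels_per_image = 400   # assumed segmentation granularity
detection_grid = 2            # assumed grid factor per recursion level
detection_levels = 4          # assumed recursion depth of the grid search
depth_region_pairs = 200      # assumed number of pairwise depth comparisons

seg_calls = superpixels_per_image  # one classification prompt per superpixel
det_calls_worst_case = sum(
    detection_grid**2 * (detection_grid**2) ** level for level in range(detection_levels)
)  # 4 + 16 + 64 + 256 = 340 if every cell tests positive
depth_calls = depth_region_pairs

print(f"segmentation: ~{seg_calls} calls/image")
print(f"detection (worst case): ~{det_calls_worst_case} calls/object")
print(f"depth ranking: ~{depth_calls} calls/image")
```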
Implications and Future Directions
Practical Implications
- MFMs as Generalists: While MFMs like GPT-4o are not competitive with vision specialists on dense prediction tasks, their respectable performance as generalists—especially on semantic tasks—suggests utility in scenarios where task diversity and flexibility are prioritized over peak accuracy.
- Prompt Engineering: The strong dependence on prompt design highlights the need for systematic prompt optimization and possibly automated prompt search for robust benchmarking and deployment.
- API-Only Evaluation: The prompt chaining methodology provides a viable path for benchmarking closed-source MFMs, enabling fair comparison with open-weight and specialist models.
Theoretical Implications
- Semantic Bias: The superior performance on semantic tasks indicates that current MFM training regimes, dominated by image-text data, bias models toward semantic understanding at the expense of geometric reasoning.
- Reasoning Augmentation: The improvements observed in reasoning-augmented models (o1, o3, o4-mini) on geometric tasks suggest that explicit reasoning modules or training objectives may help bridge the gap in 3D understanding.
Limitations and Open Questions
- Efficiency: The high inference cost of prompt chaining precludes its use in real-time or large-scale applications.
- Upper Bounds: The framework's reliance on superpixels and region-based queries imposes algorithmic constraints. However, control baselines that route oracle answers through the same chain reach far higher scores, and finer granularity improves results further, indicating that current MFM performance is not bottlenecked by the evaluation method itself.
- Image Generation: Current image generation capabilities in MFMs are insufficient for dense prediction tasks due to spatial misalignment and hallucinations. Further research is needed to align generative outputs with structured vision requirements.
Future Developments
- Direct Training on Vision Tasks: Closing the gap with vision specialists will likely require MFMs to be trained directly on dense vision tasks, possibly via multi-task or multi-modal objectives that include pixel-wise supervision.
- Unified Benchmarks: The open-sourcing of the evaluation framework will facilitate standardized benchmarking of future MFMs, enabling the community to track progress across both semantic and geometric axes.
- Automated Prompt Optimization: Developing automated or learning-based prompt optimization strategies could further unlock the latent capabilities of MFMs on structured tasks.
Conclusion
This work establishes a rigorous, extensible benchmark for evaluating the visual understanding of multimodal foundation models on standard computer vision tasks. The results provide a clear, quantitative baseline for the current generation of MFMs, highlight the semantic-geometric performance gap, and point toward promising directions for model and benchmark development. The findings underscore the need for continued research into both model architectures and evaluation methodologies to realize the full potential of multimodal AI systems in vision-centric domains.