Evaluation of GPT-4o and Multimodal Foundation Models on Standard Computer Vision Tasks
- The paper introduces a prompt chaining framework that decomposes vision tasks into text-promptable sub-tasks to harness the strengths of multimodal models.
- The paper finds that GPT-4o excels at semantic classification but underperforms in object detection and dense geometric prediction.
- The paper shows that while multimodal foundation models serve as capable generalists, they lag behind specialist vision models on precision metrics.
This paper presents a comprehensive empirical evaluation of leading multimodal foundation models (MFMs), including GPT-4o, Gemini 1.5 Pro and 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL, and Llama 3.2, on a suite of classical computer vision tasks: image classification, object detection, semantic segmentation, grouping, depth estimation, and surface normal prediction. The evaluation is conducted on established datasets such as ImageNet, COCO, and Hypersim, with a focus on quantifying the visual understanding capabilities of MFMs relative to state-of-the-art vision specialist models.
Methodology: Prompt Chaining for Vision Tasks
A central challenge addressed in this work is the incompatibility between the text-based output interfaces of most MFMs and the structured outputs required by standard vision tasks (e.g., segmentation masks, bounding boxes, dense depth maps). The authors introduce a prompt chaining framework that decomposes each vision task into a sequence of text-promptable sub-tasks, leveraging the models' strengths in image classification and visual reasoning. For example:
- Object Detection: The image is recursively partitioned into grids, and the model is prompted to report whether the target object is present in each cell, progressively narrowing down its location (first sketch below).
- Semantic Segmentation: Images are over-segmented into superpixels, and the model is prompted to classify each superpixel, with multi-scale context provided to improve accuracy (second sketch below).
- Depth and Surface Normals: Pairwise comparisons between superpixels are used to infer relative depth or normal orientation, and a global ranking is then recovered via optimization (third sketch below).
This approach enables a standardized, API-compatible benchmarking protocol for MFMs, facilitating direct comparison with vision specialists under controlled algorithmic constraints.
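As a concrete illustration of the recursive-grid chain, the following minimal sketch narrows down an object's location with yes/no queries per grid cell. The `ask_model` callback is a hypothetical stand-in for an MFM API call, and the 2x2 split schedule is an assumption; the paper's actual prompts and grid schedule may differ.

```python
from PIL import Image

def localize(image: Image.Image, target: str, ask_model, box=None, depth=0, max_depth=3):
    """Recursively narrow a bounding box for `target` using yes/no queries on grid cells.

    ask_model(crop, target) -> bool is a placeholder for an MFM call that answers whether
    the target object is visible in the crop. Returns the finest-level cells that still
    contain the target; their tight union approximates the final bounding box.
    """
    if box is None:
        box = (0, 0, image.width, image.height)
    if depth == max_depth:
        return [box]
    left, top, right, bottom = box
    hits = []
    for gy in range(2):
        for gx in range(2):  # split the current box into a 2x2 grid of cells
            cell = (left + (right - left) * gx // 2,
                    top + (bottom - top) * gy // 2,
                    left + (right - left) * (gx + 1) // 2,
                    top + (bottom - top) * (gy + 1) // 2)
            if ask_model(image.crop(cell), target):
                hits.extend(localize(image, target, ask_model, cell, depth + 1, max_depth))
    return hits
```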
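For segmentation, a similar sketch labels each superpixel with a classification call and paints the result back into a dense label map. SLIC is used here only for illustration and `classify_crop` is a hypothetical placeholder for the model prompt; the paper's exact over-segmentation and multi-scale prompting details may differ.

```python
import numpy as np
from PIL import Image
from skimage.segmentation import slic

def segment_with_mfm(image: Image.Image, class_names, classify_crop, n_segments=200):
    """Label each superpixel via an MFM classification call and return a dense H x W map.

    classify_crop(crop, class_names) -> int is a placeholder for the model prompt;
    the paper additionally supplies multi-scale context around each superpixel.
    """
    rgb = np.asarray(image.convert("RGB"))
    superpixels = slic(rgb, n_segments=n_segments, start_label=0)
    labels = np.zeros(superpixels.shape, dtype=np.int64)
    for sp in np.unique(superpixels):
        mask = superpixels == sp
        ys, xs = np.nonzero(mask)
        # Crop a window around the superpixel and ask the model which class it belongs to.
        crop = image.crop((int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1))
        labels[mask] = classify_crop(crop, class_names)
    return labels
```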
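For relative depth, the pairwise judgments must be aggregated into a global ordering. The least-squares formulation below is one simple way to do this and is an assumption on our part; the paper states only that a global ranking is obtained via optimization.

```python
import numpy as np

def aggregate_pairwise_depth(num_superpixels, farther_pairs):
    """Aggregate pairwise judgments into a per-superpixel relative-depth score.

    farther_pairs: list of (i, j) pairs meaning superpixel j was judged farther than i.
    Solves s[j] - s[i] ~= 1 in the least-squares sense, with s[0] pinned to 0 to fix
    the arbitrary offset. Higher score = farther, up to a monotone rescaling.
    """
    rows = len(farther_pairs) + 1
    A = np.zeros((rows, num_superpixels))
    b = np.zeros(rows)
    for r, (i, j) in enumerate(farther_pairs):
        A[r, i], A[r, j] = -1.0, 1.0
        b[r] = 1.0
    A[-1, 0] = 1.0  # gauge-fixing row: anchor the first superpixel at depth score 0
    scores, *_ = np.linalg.lstsq(A, b, rcond=None)
    return scores
```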
Experimental Results
Image Classification
MFMs, particularly GPT-4o and Gemini 2.0 Flash, achieve strong results across ImageNet and its variants, demonstrating robustness to distribution shifts and corruptions. However, all MFMs fall short of the top-performing vision specialists (e.g., Model Soups ViT-G, OpenCLIP H), with a gap of approximately 7–14 percentage points in top-1 accuracy on ImageNet.
Object Detection
All MFMs underperform relative to specialist models such as Co-DETR and DETR. GPT-4o achieves the highest MFM performance (AP50: 60.62), but this remains significantly below the specialist baseline (Co-DETR AP50: 91.30). The prompt chaining approach is shown to be critical; direct regression of bounding box coordinates by MFMs yields substantially worse results.
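For context, AP50 counts a predicted box as correct only when its intersection-over-union (IoU) with a matching ground-truth box of the same class reaches 0.5; the snippet below shows that criterion in isolation (not the full COCO evaluation protocol).

```python
def box_iou(a, b):
    """IoU of two boxes given as (left, top, right, bottom)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def matches_at_ap50(pred_box, gt_box):
    # AP50 criterion: a detection counts as a true positive if IoU >= 0.5
    return box_iou(pred_box, gt_box) >= 0.5
```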
Semantic Segmentation and Grouping
MFMs exhibit nontrivial segmentation capabilities, with GPT-4o again leading among MFMs (mIoU: 44.89), but a substantial gap persists compared to OneFormer (mIoU: 65.52). For grouping, GPT-4o and Gemini 2.0 Flash perform best among MFMs, but all models lag behind the Segment Anything Model (SAM).
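The mIoU figures above average per-class intersection-over-union over the classes present; a minimal sketch of the standard metric (not the paper's exact evaluation script) is:

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in either the prediction or the ground truth."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred == c, gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```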
Depth and Surface Normal Prediction
Performance on geometric tasks is notably weaker. All MFMs are outperformed by Omnidata and 4M-21, with the gap more pronounced than in semantic tasks. Reasoning-focused models (e.g., o4-mini, o1, o3) show relative improvements in geometric reasoning, particularly in surface normal prediction, where they correct common failure modes observed in GPT-4o.
Prompt Sensitivity and Cost
Prompt chaining significantly improves performance over direct prompting, especially for structured tasks. Better-performing models (e.g., GPT-4o) exhibit lower sensitivity to prompt variations. The evaluation framework incurs substantial computational and monetary cost due to the large number of API calls required for dense predictions.
Image Generation Capabilities
Preliminary analysis of GPT-4o's native image generation reveals that outputs tend to be semantic recreations rather than precise per-pixel edits, leading to hallucinations and spatial misalignments. This limits the direct applicability of current image generation features for dense vision tasks.
Key Findings and Claims
- MFMs trail vision specialists: No MFM is competitive with state-of-the-art specialist models on any of the standard vision tasks evaluated.
- MFMs are strong generalists: Despite being trained primarily on image-text data, they achieve respectable performance across a wide range of tasks.
- Semantic understanding exceeds geometric reasoning: MFMs perform better on tasks requiring semantic classification than on those requiring geometric or 3D understanding.
- Prompt chaining is essential: Decomposing tasks into sub-tasks aligned with the models' strengths is necessary for extracting maximal performance.
- Reasoning models show promise for geometric tasks: Recent models with explicit reasoning capabilities (e.g., o1, o3, o4-mini) demonstrate improved performance on depth and surface normal estimation.
- Image generation outputs are not yet suitable for dense vision evaluation: Current models hallucinate and misalign outputs, indicating a need for further research.
Implications and Future Directions
Practical Implications
- Benchmarking: The prompt chaining framework provides a standardized method for evaluating MFMs on vision tasks, enabling fair comparison with vision specialists and across MFMs.
- Deployment: While MFMs can serve as generalist vision systems, their performance is insufficient for applications requiring high-precision geometric understanding or dense predictions.
- Prompt Engineering: Careful design of prompt chains is critical for leveraging MFM capabilities in structured vision tasks; direct prompting is inadequate.
Theoretical Implications
- Training Regimes: The observed gap between MFMs and vision specialists suggests that current multimodal pretraining (dominated by image-text pairs) is insufficient for learning fine-grained geometric representations.
- Model Architecture: The success of reasoning-augmented MFMs on geometric tasks points to the value of explicit reasoning modules or training objectives targeting 3D understanding.
Future Developments
- Closing the Gap: Future MFMs may incorporate direct supervision on dense vision tasks, or integrate architectural inductive biases from vision specialists, to bridge the performance gap.
- Efficient Evaluation: Reducing the computational and monetary cost of benchmarking dense vision tasks with MFMs is an open challenge, potentially addressable via more efficient prompt chaining or model-side adaptations.
- Robust Image Generation: Improving the fidelity and spatial alignment of image generation outputs is necessary for MFMs to be evaluated (and deployed) on dense prediction tasks without prompt chaining.
Conclusion
This work establishes a rigorous, extensible benchmark for quantifying the visual understanding of MFMs on standard computer vision tasks. The results highlight both the versatility and current limitations of MFMs, particularly in geometric reasoning and dense prediction. The prompt chaining methodology and open-source evaluation suite provide a foundation for tracking progress as future MFMs evolve toward more comprehensive visual intelligence.