Identify a usable bounding-box coordinate format for certain commercial MLLMs

Identify an alternative bounding-box coordinate output format that yields usable visual grounding predictions from GPT-5, Claude-Sonnet-4.5, and Grok-4 on the GroundingME benchmark. Under the unified evaluation prompt, both absolute pixel coordinates and 0–999 normalized relative coordinates produced substantially displaced and distorted boxes.

Background

The authors attempted to evaluate several commercial models (GPT-5, Claude-Sonnet-4.5, Grok-4) on GroundingME but encountered unusable bounding-box outputs. Using the benchmark's unified prompt template, they interpreted the outputs both as absolute pixel coordinates and as 0–999 normalized relative coordinates, and observed severe displacement and distortion under either reading.
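The two readings of a raw 4-tuple can be made concrete with a small sketch. The helper name `decode_box` and the example values are illustrative, not from the GroundingME codebase; the assumed raw layout is `[x0, y0, x1, y1]`.

```python
def decode_box(box, img_w, img_h, mode):
    """Convert a raw 4-tuple from a model into absolute pixel coordinates.

    mode="absolute": values are already pixels and pass through unchanged.
    mode="norm999":  values lie on a 0-999 grid and are scaled to the
                     actual image size.
    """
    x0, y0, x1, y1 = box
    if mode == "absolute":
        return (x0, y0, x1, y1)
    if mode == "norm999":
        return (x0 / 999 * img_w, y0 / 999 * img_h,
                x1 / 999 * img_w, y1 / 999 * img_h)
    raise ValueError(f"unknown mode: {mode}")

# The same raw output yields very different boxes per interpretation,
# which is why the evaluation checks both readings.
raw = (120, 80, 640, 480)
print(decode_box(raw, img_w=1280, img_h=960, mode="absolute"))
print(decode_box(raw, img_w=1280, img_h=960, mode="norm999"))
```

If neither decoded box overlaps the ground truth, the output is unusable under both conventions, which is the failure mode the authors report.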

They note that Gemini-2.5 benefited from a prompt adjustment to a different coordinate order, suggesting that output format sensitivity can be critical. However, for GPT-5, Claude-Sonnet-4.5, and Grok-4, they were unable to find any coordinate format that produced usable results, leaving unresolved how to elicit correct bounding-box coordinates from these systems in this evaluation setting.
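Since the Gemini-2.5 fix was a coordinate-order change, one plausible search procedure is to probe each model with prompts for several candidate conventions. The format names and prompt wording below are assumptions for illustration, not the benchmark's actual templates:

```python
# Candidate coordinate conventions to probe a model with. The instruction
# strings are hypothetical stand-ins for the benchmark's prompt template.
CANDIDATE_FORMATS = {
    "xyxy_abs": "Return the box as [x0, y0, x1, y1] in pixels.",
    "yxyx_abs": "Return the box as [y0, x0, y1, x1] in pixels.",
    "xyxy_999": "Return the box as [x0, y0, x1, y1] on a 0-999 grid.",
    "yxyx_999": "Return the box as [y0, x0, y1, x1] on a 0-999 grid.",
    "xywh_abs": "Return the box as [x, y, width, height] in pixels.",
}

def build_prompts(expression):
    """Attach each candidate format instruction to one grounding query."""
    base = f'Locate "{expression}" in the image. '
    return {name: base + instr for name, instr in CANDIDATE_FORMATS.items()}

for name, prompt in build_prompts("the red mug on the left").items():
    print(name, "->", prompt)
```

Per the paper, no variant of this kind produced usable boxes from GPT-5, Claude-Sonnet-4.5, or Grok-4, which is what leaves the question open.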

References

From cases in \cref{tab:commercial_case}, we observe that the coordinates produced by these models using the unified prompt template (\cref{tab:eval_prompt}) suffered from substantial displacement and distortion, regardless of whether the outputs are interpreted as absolute pixel coordinates (red bounding box) or 0–999 normalized relative coordinates (blue bounding box). Furthermore, we failed to find an alternative coordinate format that yields usable results for these models.

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation (arXiv:2512.17495, Li et al., 19 Dec 2025), Appendix, Section "Commercial Model Notes" (around Table \cref{tab:commercial_case})