Identify a usable bounding-box coordinate format for certain commercial MLLMs
Identify an alternative bounding-box coordinate output format that yields usable visual grounding predictions from GPT-5, Claude-Sonnet-4.5, and Grok-4 on the GroundingME benchmark, given that both absolute pixel coordinates and 0–999 normalized relative coordinates produced substantially displaced and distorted boxes under the unified evaluation prompt.
References
From the cases in \cref{tab:commercial_case}, we observe that the coordinates these models produce under the unified prompt template (\cref{tab:eval_prompt}) suffer from substantial displacement and distortion, regardless of whether the output is interpreted as absolute pixel coordinates (red bounding box) or 0--999 normalized relative coordinates (blue bounding box). Furthermore, we were unable to find an alternative coordinate format that yields usable results for these models.
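To make the two interpretations concrete, the following is a minimal sketch of how a single raw model output could be decoded both ways and scored against a ground-truth box. It is not the benchmark's evaluation code: the function names, the `(x1, y1, x2, y2)` box layout, and the choice to divide by 999 (so that 999 maps to the image edge; some pipelines divide by 1000 instead) are all assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def as_absolute(raw: Box, width: int, height: int) -> Box:
    """Interpret raw values as pixel coordinates, clipped to the image."""
    x1, y1, x2, y2 = raw
    return (
        min(max(x1, 0), width),
        min(max(y1, 0), height),
        min(max(x2, 0), width),
        min(max(y2, 0), height),
    )


def as_relative(raw: Box, width: int, height: int, scale: float = 999.0) -> Box:
    """Interpret raw values as 0..scale normalized coordinates.

    The default scale of 999 is an assumption; pass 1000.0 to try the
    other common convention.
    """
    x1, y1, x2, y2 = raw
    return (
        x1 / scale * width,
        y1 / scale * height,
        x2 / scale * width,
        y2 / scale * height,
    )


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_both(raw: Box, truth: Box, width: int, height: int) -> dict:
    """Score one raw prediction under both coordinate interpretations."""
    return {
        "absolute": iou(as_absolute(raw, width, height), truth),
        "relative": iou(as_relative(raw, width, height), truth),
    }
```

If neither entry returned by `score_both` reaches the chosen IoU threshold across many samples, the output is unusable under both interpretations, which is the situation described above. Testing a candidate alternative format would amount to adding another decoder alongside `as_absolute` and `as_relative`.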