Clarify the semantics of GPT-4o "logprobs" outputs

Determine the precise semantics and computation of the "logprobs" values returned by the GPT-4o API for output tokens when GPT-4o serves as a visual question answering judge in text-to-image evaluation (e.g., when computing VQAScore). In particular, establish whether these values correspond to calibrated log-probabilities over the model's token distribution and how decoding settings (such as temperature or nucleus sampling) affect them, so that probability-based metric calculations are reproducible and comparable.

Background

Many text-to-image evaluation methods such as VQAScore rely on a visual question answering model’s token-level probabilities (often provided as log probabilities) to compute soft alignment scores. While GPT-4o is widely used as a proprietary VQA judge, its closed-source nature complicates reproducibility and interpretability of these probability-based metrics.
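To make the dependence on logprobs concrete, the sketch below shows one common way a VQAScore-style soft score is derived from per-token log-probabilities: sum the probability mass the judge assigns to "Yes"-like answer tokens. The function name, the example logprob values, and the choice of which token variants count as "Yes" are illustrative assumptions, not a claim about any specific implementation; note also that an API typically exposes only the top-k candidates, so the score is a lower bound on the true P(Yes).

```python
import math

def vqascore_from_top_logprobs(top_logprobs, yes_tokens=("yes",)):
    """Sum the probability mass assigned to 'Yes'-like answer tokens.

    `top_logprobs` maps candidate token strings to their reported
    log-probabilities for a single output position. Tokens are matched
    case-insensitively after stripping whitespace, since tokenizers often
    produce variants like "Yes" and " yes". Illustrative sketch only.
    """
    return sum(
        math.exp(lp)
        for tok, lp in top_logprobs.items()
        if tok.strip().lower() in yes_tokens
    )

# Hypothetical top-k logprobs for the first answer token:
example = {"Yes": -0.10, " yes": -3.00, "No": -4.00, "Maybe": -6.00}
score = vqascore_from_top_logprobs(example)  # exp(-0.10) + exp(-3.00)
```

Whether this arithmetic is meaningful hinges on exactly the question posed above: the exponentiated values only behave like probabilities if the API's "logprobs" are genuine log-probabilities over the model's token distribution.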

In the appendix, the authors explicitly note uncertainty about what GPT-4o’s API returns as token "logprobs," raising a concrete question about the definition and calibration of these values. Resolving this would clarify how to interpret GPT-4o-derived scores and ensure fair comparison across evaluation methods and models.
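One concrete way the ambiguity matters: if temperature scaling were applied to the logits before the returned values are computed, the reported "logprobs" would describe a sharpened or flattened distribution rather than the model's unmodified one. The sketch below illustrates standard temperature rescaling of a log-distribution; it is not a claim about GPT-4o's actual behavior, which is precisely what is unknown.

```python
import math

def rescale_with_temperature(logprobs, temperature):
    """Renormalize a log-distribution after dividing by `temperature`.

    Purely illustrative: shows how reported 'logprobs' would differ
    depending on whether temperature is applied before or after the
    values are surfaced -- the ambiguity at issue for a closed API.
    """
    scaled = [lp / temperature for lp in logprobs]
    log_z = math.log(sum(math.exp(s) for s in scaled))  # log normalizer
    return [s - log_z for s in scaled]

base = [math.log(0.9), math.log(0.1)]        # a normalized distribution
sharp = rescale_with_temperature(base, 0.5)  # lower T sharpens it
probs = [math.exp(lp) for lp in sharp]       # approx. [0.988, 0.012]
```

At temperature 1.0 the distribution is unchanged, so two runs at different temperatures that return identical logprobs would suggest the API reports pre-temperature values; differing logprobs would suggest the opposite.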

References

It is further worth noting that as GPT-4o is closed-source, it is unclear exactly what is being returned as the "logprobs" of various output tokens.

GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation (2512.16853 - Kamath et al., 18 Dec 2025) in Appendix, Section "Evaluation Methods with Different Underlying VQA Models"