Clarify the semantics of GPT-4o "logprobs" outputs
Determine precisely what the "logprobs" values returned by the GPT-4o API for output tokens represent, and how they are computed, when GPT-4o serves as a visual question answering (VQA) judge in text-to-image evaluation (e.g., for computing VQAScore). In particular, establish whether these values are calibrated log-probabilities over the model's token distribution and how decoding settings affect them, so that metric calculations are reproducible and comparable.
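To make the ambiguity concrete, the following is a minimal sketch of how a VQAScore-style value might be derived from the returned "logprobs", assuming the score is taken as the probability of a "Yes" answer on the first output token. The function name, the dictionary input format, and the example values are hypothetical and are not taken from the paper or from any API response. The sketch shows two plausible readings of the same numbers: the raw exponentiated logprob, and a version renormalized over the "Yes"/"No" answers only.

```python
import math

def yes_probability(top_logprobs, renormalize=False):
    """Hypothetical helper: turn first-token logprobs into a "Yes" score.

    top_logprobs: dict mapping candidate token strings to the logprob
    values reported for the first output token (assumed format).
    """
    # Sum over surface variants such as "Yes", " yes", "YES".
    p_yes = sum(math.exp(lp) for tok, lp in top_logprobs.items()
                if tok.strip().lower() == "yes")
    p_no = sum(math.exp(lp) for tok, lp in top_logprobs.items()
               if tok.strip().lower() == "no")
    if not renormalize:
        # Reading 1: trust the value as a probability over the full vocabulary.
        return p_yes
    # Reading 2: restrict the distribution to the two answer options.
    total = p_yes + p_no
    return p_yes / total if total > 0 else float("nan")

# Illustrative, made-up values (not real GPT-4o output).
example = {"Yes": math.log(0.6), "No": math.log(0.3), "The": math.log(0.05)}
raw_score = yes_probability(example)
renormalized_score = yes_probability(example, renormalize=True)
```

The two readings give different scores for the same response (about 0.60 versus about 0.67 here), and neither can be called correct until it is known what distribution the returned values describe. Differences of this kind are one way that results can fail to be comparable across evaluation setups.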
References
It is further worth noting that as GPT-4o is closed-source, it is unclear exactly what is being returned as the "logprobs" of various output tokens.
— GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
(2512.16853 - Kamath et al., 18 Dec 2025) in Appendix, Section "Evaluation Methods with Different Underlying VQA Models"