DISCODE: Distribution-Aware Score Decoder
- The paper introduces DISCODE, a framework that blends LVLM token probabilities with a Gaussian prior to align scores with human annotations.
- DISCODE employs an analytic, test-time adaptation algorithm that recalibrates decoder parameters per sample without the need for finetuning or extra data.
- Empirical validation on the MCEval benchmark shows DISCODE's superior performance, achieving 83.6% accuracy in predicting human-preferred captions across diverse domains.
Distribution-Aware Score Decoder (DISCODE) is a test-time adaptive evaluation framework designed to robustly score image–caption pairs using large vision-language models (LVLMs) under domain shift, without requiring finetuning or held-out calibration data. DISCODE addresses two shortcomings of previous approaches to reference-free caption evaluation: their susceptibility to “symbolic bias” and their lack of robustness in out-of-domain settings. At its core, DISCODE blends the frozen LVLM’s token probabilities with a unimodal Gaussian prior, using an analytic solution to generate evaluation distributions better aligned with the statistical structure of human scores (Inoue et al., 16 Dec 2025).
1. Background and Motivation
Conventional automatic image caption evaluators commonly exploit frozen LVLMs by extracting the output token distribution over discrete scoring tokens, then “smoothing” these distributions (e.g., via direct token averaging). Prior metrics such as FLEUR and G-VEval illustrate this approach, which is limited by two issues: (1) symbolic bias, where particular score tokens are over-represented due to model artifacts, and (2) a mismatch with the empirical score distribution generated by human annotators, which tends to be unimodal and approximately Gaussian, an effect attributed to the Central Limit Theorem. These pathologies are exacerbated under domain shift, where training and evaluation domains diverge (e.g., from photographs to sketches or infographics), making existing reference-free metrics unreliable. DISCODE was introduced to remedy these deficits by adaptively recalibrating the evaluation score distribution, per example, so that it jointly reflects the LVLM’s own token output and an idealized Gaussian prior capturing human judgment tendencies (Inoue et al., 16 Dec 2025).
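To make the baseline concrete, the following is a minimal sketch of score smoothing by direct token averaging, as described above. The logits, the ten-point scale, and the function name `smoothed_score` are illustrative assumptions, not values or code from FLEUR, G-VEval, or the DISCODE paper.

```python
import math

def smoothed_score(token_logits, token_values):
    """Return (expected score, probabilities) for a set of scoring tokens."""
    # Softmax restricted to the scoring tokens, in a numerically stable form.
    peak = max(token_logits)
    exps = [math.exp(x - peak) for x in token_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Probability-weighted average of the numeric score values.
    expected = sum(p * v for p, v in zip(probs, token_values))
    return expected, probs

# Hypothetical logits for score tokens "0".."9". Token "7" is the most likely,
# while token "9" carries a spurious bump of the kind symbolic bias can produce.
logits = [0.1, 0.0, 0.2, 0.3, 0.5, 1.0, 1.5, 3.0, 1.2, 2.5]
values = list(range(10))
score, probs = smoothed_score(logits, values)
```

Because the average is taken over the raw token distribution, any probability mass that a model artifact places on an unrelated token (here, "9") shifts the score directly; this is the failure mode that a unimodal prior is meant to damp.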
2. Mathematical Framework: Adaptive Test-Time Loss and Analytic Solution
For a given image–caption pair, the LVLM outputs a probability distribution over a discrete set of scoring tokens, one of which is the most probable token, and exposes its decoder feature. DISCODE parameterizes a new decoder head that maps this feature to a candidate evaluation distribution. At test time, the parameters of this head are set by minimizing the Adaptive Test-Time (ATT) loss, which is built from the following components:
- a cross-entropy term between the candidate distribution and the LVLM’s native output distribution,
- a discrete Gaussian prior centered at the LVLM’s most probable scoring token,
- a weighted Kullback–Leibler divergence term, which regularizes the candidate distribution toward this prior,
- a mixing weight that adapts the trust in the LVLM’s prediction versus the unimodal prior, set according to where the most probable token lies relative to the midpoint of the scoring scale.
Minimizing the ATT loss admits a closed-form solution for the adapted decoder parameters, expressed in terms of the original LVLM vocabulary head parameters. This formulation ensures the final evaluation distribution is both faithful to the model's internal evidence and regularized toward the empirical statistics of human raters. No iterative optimization is necessary; the procedure is strictly analytical and performed per test example (Inoue et al., 16 Dec 2025).
3. Test-Time Adaptation Algorithm
The DISCODE workflow executes as follows for each evaluation instance:
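- Prompt the LVLM to generate the raw score token, then extract both its output distribution over the scoring tokens and its decoder feature.
- Compute the discrete Gaussian prior centered at the raw score token.
- Calculate the mixing parameter according to the position of the raw score token on the scale.
- Derive the adapted decoder parameters via the analytic formula.
- Compute the final evaluation distribution from the adapted decoder and the decoder feature.
- Report the score computed from this distribution.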
This test-time adaptation is strictly example-specific, adds less than 1% inference overhead, and requires no access to held-out data or finetuning steps (Inoue et al., 16 Dec 2025).
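The workflow above can be sketched end to end as follows. This is a simplified illustration under several loudly stated assumptions: it blends in probability space instead of updating decoder-head parameters, it uses a made-up linear schedule for the mixing weight, it fixes the prior width `sigma` arbitrarily, and it reports the expected value of the adapted distribution as the score. None of these choices is taken from the paper, and the name `discode_like_score` is hypothetical.

```python
import numpy as np

def discode_like_score(probs, sigma=1.0):
    """Toy per-sample recalibration; returns (score, adapted distribution, lam)."""
    probs = np.asarray(probs, dtype=float)
    n = probs.size
    if n < 2:
        raise ValueError("need at least two scoring tokens")
    idx = np.arange(n)

    # Step 1: the raw score token is the most probable scoring token.
    top = int(np.argmax(probs))

    # Step 2: discrete Gaussian prior centered at the raw score token.
    prior = np.exp(-0.5 * ((idx - top) / sigma) ** 2)
    prior /= prior.sum()

    # Step 3: HYPOTHETICAL schedule for the mixing weight, in [0.5, 1.0],
    # depending only on the distance of the raw token from the scale midpoint.
    midpoint = (n - 1) / 2.0
    lam = 1.0 - 0.5 * abs(top - midpoint) / midpoint

    # Steps 4-5: stand-in for the analytic decoder update, done here as a
    # convex combination of the LVLM distribution and the prior.
    adapted = lam * probs + (1.0 - lam) * prior

    # Step 6: ASSUMED reporting rule, the expected token index.
    score = float(idx @ adapted)
    return score, adapted, lam

example_probs = [0.02, 0.03, 0.05, 0.10, 0.40, 0.05, 0.05, 0.25, 0.03, 0.02]
score, adapted, lam = discode_like_score(example_probs)
```

Every step is a handful of vector operations on a distribution over the scoring tokens, which is consistent with the small inference overhead reported for the method.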
4. Experimental Validation: MCEval Benchmark and Comparative Analysis
To rigorously assess domain-robustness, the authors introduced the Multi-domain Caption Evaluation (MCEval) benchmark. MCEval consists of human-annotated caption comparisons in six distinct domains: photographs, paintings, line sketches, QuickDraw doodles, clip art, and infographics. Each image is paired with four candidate captions, curated and judged by multiple annotators; in total the benchmark comprises 6,000 images and 18,000 evaluation pairs (including reference-based variants). The principal metric is the accuracy in predicting human-preferred captions (Inoue et al., 16 Dec 2025).
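As an illustration of the principal metric, the sketch below computes preference accuracy over caption pairs: a pair counts as correct when the metric ranks the two captions in the same order as the human annotators. The tuple layout, the tie-handling rule, and the example scores are assumptions for illustration and are not taken from the MCEval protocol.

```python
def preference_accuracy(pairs):
    """pairs: iterable of (score_a, score_b, human_prefers_a) tuples."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no evaluation pairs given")
    # A pair is correct when the metric's ordering matches the human choice.
    # Ties (score_a == score_b) are treated as a prediction for caption b.
    correct = sum((sa > sb) == prefers_a for sa, sb, prefers_a in pairs)
    return correct / len(pairs)

# Hypothetical metric scores for four caption pairs.
example_pairs = [
    (0.82, 0.41, True),
    (0.30, 0.65, False),
    (0.55, 0.60, True),   # the metric disagrees with the annotators here
    (0.90, 0.20, True),
]
accuracy = preference_accuracy(example_pairs)
```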
Empirical results show that DISCODE (using LLaVA-Next-72B as backbone) achieves an overall accuracy of 83.6%, surpassing FLEUR (82.5%), G-VEval/GPT-4o (81.0%), CLIP-Score (75.3%), and PAC-Score (69.1%). The per-domain absolute improvement over FLEUR ranges from +0.5% (infographics) to +2.2% (QuickDraw). On established real-image benchmarks (Flickr8k-Expert, Pascal-50S, Composite), DISCODE matches or slightly exceeds the strongest prior LVLM-based metrics. Ablation experiments confirm the importance of both the cross-entropy and divergence terms, as well as of the dynamic, position-dependent mixing weight; replacing the weighted KL with Jensen–Shannon or another divergence leads to inferior results (Inoue et al., 16 Dec 2025).
5. Strengths, Limitations, and Applicability
Strengths of DISCODE include:
- Finetuning-free operation, with no gradient steps required for calibration,
- Fully analytical, per-sample adaptation yielding negligible computational overhead,
- Superior robustness under severe domain shift relative to prior methods,
- Applicability to any open-source LVLM with accessible decoder features.
The principal limitations are:
- Necessity of access to the LVLM’s decoder feature vector and classifier head, rendering DISCODE inapplicable to closed-source black-box models such as GPT-4o,
- Configuration is tailored for discrete rating scales, with open-ended feedback beyond the current scope.
A plausible implication is that DISCODE's analytic, plug-in test-time recalibration can serve as a template for robustifying LVLM-based automatic metrics in other contexts where alignment to human-like distributional output is desirable (Inoue et al., 16 Dec 2025).
6. Extensions and Broader Research Connections
The DISCODE framework admits several avenues for extension:
- Incorporation of richer, potentially multimodal priors (e.g., referencing explicit reference captions),
- Dynamic, metadata-driven calibration of adaptation parameters such as the mixing weight and the prior variance,
- Applications to other evaluation paradigms such as VQA, dialogue response ratings, or hybrid reference-free/reference-based evaluation,
- Potential adaptation, as demonstrated by the RefDISCODE supplement, to hybrid settings that already outperform many traditional reference-based metrics.
DISCODE exemplifies the broader class of “distribution-aware score decoders”—approaches that tailor their output distribution at test time to reflect both the model’s own beliefs and the statistical structure of target application domains, improving cross-domain generalization (Inoue et al., 16 Dec 2025).