
DISCODE: Distribution-Aware Score Decoder

Updated 23 December 2025
  • The paper introduces DISCODE, a framework that blends LVLM token probabilities with a Gaussian prior to align scores with human annotations.
  • DISCODE employs an analytic, test-time adaptation algorithm that recalibrates decoder parameters per sample without the need for finetuning or extra data.
  • Empirical validation on the MCEval benchmark shows DISCODE's superior performance, achieving 83.6% accuracy in predicting human-preferred captions across diverse domains.

Distribution-Aware Score Decoder (DISCODE) is a test-time adaptive evaluation framework designed to robustly score image–caption pairs using large vision-language models (LVLMs) under domain shift, without requiring finetuning or held-out calibration data. DISCODE directly addresses shortcomings in previous approaches to reference-free caption evaluation, particularly their susceptibility to “symbolic bias” and their lack of robustness in out-of-domain settings. At its core, DISCODE blends the frozen LVLM’s token probabilities with a unimodal Gaussian prior, using an analytic solution to generate evaluation distributions better aligned with the statistical structure of human scores (Inoue et al., 16 Dec 2025).

1. Background and Motivation

Conventional automatic image caption evaluators commonly exploit frozen LVLMs by extracting the output token distribution over discrete scoring tokens, then “smoothing” these distributions (e.g., via direct token averaging). This approach is illustrated by prior metrics such as FLEUR and G-VEval, but is limited by two issues: (1) symbolic bias—where particular score tokens are over-represented due to model artifacts, and (2) mismatch to the empirical score distribution generated by human annotators, which tends to be Gaussian and unimodal due to the Central Limit Theorem. These pathologies are exacerbated under domain shift, where training and evaluation domains diverge (e.g., from photographs to sketches or infographics), making existing reference-free metrics unreliable. DISCODE was introduced to remedy these deficits by adaptively recalibrating the evaluation score distribution, per example, to jointly consider the LVLM’s own token output and an idealized Gaussian prior reflecting human judgment tendencies (Inoue et al., 16 Dec 2025).

2. Mathematical Framework: Adaptive Test-Time Loss and Analytic Solution

For a given image–caption pair, let $S = \{0, 1, \ldots, 9\}$ be the discrete set of scoring tokens. The LVLM outputs a distribution $p_0(s)$ over $S$ (with $s_0$ as the most probable token) and exposes its decoder feature $h_T \in \mathbb{R}^d$. DISCODE parameterizes a new decoder head $\psi_\theta(h) = \mathrm{softmax}(W^\top h + b)$ to produce a candidate evaluation distribution $p(s) = \psi_\theta(h_T; \theta)$. At test time, $\theta$ is set by minimizing the Adaptive Test-Time (ATT) loss

$$\mathcal{L}_\mathrm{ATT}(\theta; h_T) = H(p, p_0) + D_\alpha(p\|q),$$

where:

  • $H(p, p_0)$ is the cross-entropy between the candidate $p$ and the LVLM’s native output $p_0$,
  • $q$ is a discrete Gaussian prior centered at $s_0$: $q(s) \propto \exp\left(-\frac{(s - s_0)^2}{2}\right)$,
  • $D_\alpha(p\|q) = (1-\alpha)H(p, q) - \alpha H(p, p)$ is a weighted Kullback–Leibler divergence,
  • $\alpha \in [0,1]$ adapts the trust in the LVLM’s prediction versus the unimodal prior, specifically $\alpha = \left[\sqrt{2\pi\sigma^2}\exp\left(-(s_0 - \mu)^2/(2\sigma^2)\right)\right]^{-1}$ with $\mu$ the midpoint and $\sigma^2 = 0.1$.
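The terms of the loss can be made concrete with a small numerical sketch. The helper names (`discrete_gaussian_prior`, `cross_entropy`, `att_loss`) and the toy distribution `p0` are illustrative assumptions rather than code from the paper, and $\alpha$ is passed in as a plain argument instead of being computed.

```python
import numpy as np

SCORES = np.arange(10)  # scoring tokens S = {0, ..., 9}

def discrete_gaussian_prior(s0, scores=SCORES):
    """q(s) proportional to exp(-(s - s0)^2 / 2), normalized over the score set."""
    logits = -0.5 * (scores - s0) ** 2
    w = np.exp(logits - logits.max())
    return w / w.sum()

def cross_entropy(p, r, eps=1e-12):
    """H(p, r) = -sum_s p(s) log r(s); eps guards against log(0)."""
    return -np.sum(p * np.log(r + eps))

def att_loss(p, p0, q, alpha):
    """L_ATT = H(p, p0) + (1 - alpha) * H(p, q) - alpha * H(p, p)."""
    return (cross_entropy(p, p0)
            + (1 - alpha) * cross_entropy(p, q)
            - alpha * cross_entropy(p, p))

# Hypothetical LVLM output with a spurious secondary peak at token 0.
p0 = np.array([0.20, 0.01, 0.01, 0.02, 0.03, 0.05, 0.10, 0.40, 0.15, 0.03])
s0 = int(np.argmax(p0))          # most probable score token
q = discrete_gaussian_prior(s0)  # unimodal prior centered at s0
print(s0, att_loss(p0, p0, q, alpha=0.5))
```

Here $H(p, p)$ is simply the entropy of $p$, so the last term rewards candidates that do not collapse onto a single token.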

Minimization of $\mathcal{L}_\mathrm{ATT}$ admits a closed-form solution:

$$\hat{W} = \frac{1}{\alpha}V,\qquad \hat{b} = \frac{1-\alpha}{\alpha}\log q + \frac{1}{\alpha}c,$$

where $(V, c)$ are the original LVLM vocabulary head parameters. This formulation ensures the final evaluation distribution $p(s)$ is both faithful to the model's internal evidence and regularized toward the empirical statistics of human raters. No iterative optimization is necessary; the procedure is strictly analytical and performed per test example (Inoue et al., 16 Dec 2025).
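One way to see where the closed form comes from is a short Lagrangian argument. The sketch below is a reconstruction, not the paper's own derivation. It assumes that $\alpha > 0$, that the native output is $p_0 = \mathrm{softmax}(V^\top h_T + c)$ over the score tokens, and that optimizing over $\theta$ is equivalent to optimizing over $p$ on the probability simplex (for fixed $h_T$, the bias $b$ alone can produce any strictly positive $p$).

```latex
\begin{align*}
\mathcal{L}_\mathrm{ATT}
  &= -\sum_{s} p(s)\log p_0(s)
     \;-\; (1-\alpha)\sum_{s} p(s)\log q(s)
     \;+\; \alpha\sum_{s} p(s)\log p(s), \\
0 &= \frac{\partial}{\partial p(s)}
     \Big[\mathcal{L}_\mathrm{ATT} + \lambda\Big(\sum_{s'} p(s') - 1\Big)\Big]
   = -\log p_0(s) - (1-\alpha)\log q(s) + \alpha\big(\log p(s) + 1\big) + \lambda, \\
\log p(s)
  &= \frac{1}{\alpha}\log p_0(s) + \frac{1-\alpha}{\alpha}\log q(s) + \text{const}, \\
p &= \mathrm{softmax}\!\Big(\tfrac{1}{\alpha}V^\top h_T
     + \tfrac{1}{\alpha}c + \tfrac{1-\alpha}{\alpha}\log q\Big)
   = \mathrm{softmax}\big(\hat{W}^\top h_T + \hat{b}\big).
\end{align*}
```

The first two terms are linear in $p$ and the third is $\alpha$ times the negative entropy, so for $\alpha > 0$ the objective is strictly convex and the stationary point is its unique minimizer. Setting $\alpha = 1$ recovers $p = p_0$.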

3. Test-Time Adaptation Algorithm

The DISCODE workflow executes as follows for each evaluation instance:

  1. Prompt the LVLM to generate the raw score token $s_0 \in S$, and extract both $p_0(s)$ and $h_T$.
  2. Compute the discrete Gaussian prior $q(s)$ centered at $s_0$.
  3. Calculate the mixing parameter $\alpha$ according to the position of $s_0$ on the scale.
  4. Derive the adapted decoder parameters $\hat{W}$ and $\hat{b}$ via the analytic formula.
  5. Compute the final evaluation distribution $p(s) = \mathrm{softmax}(\hat{W}^\top h_T + \hat{b})$.
  6. Report the score $\hat{s} = \mathbb{E}_{s\sim p}[s]$.

This test-time adaptation is strictly example-specific, adds less than 1% inference overhead, and requires no access to held-out data or finetuning steps (Inoue et al., 16 Dec 2025).
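The workflow can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `discode_score` is hypothetical, `V` and `c` are assumed to be the head parameters already restricted to the ten score tokens, the feature and weights are random stand-ins for real LVLM outputs, and step 3 is left to the caller by passing `alpha` in directly.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def discode_score(h_T, V, c, alpha, scores=np.arange(10)):
    """Per-sample sketch of the workflow above.

    h_T: (d,) decoder feature; V: (d, |S|) and c: (|S|,) are assumed to be
    the head parameters restricted to the score tokens; alpha in (0, 1]
    is supplied by the caller (step 3 is not reproduced here).
    """
    p0 = softmax(V.T @ h_T + c)                      # step 1: native distribution
    s0 = scores[np.argmax(p0)]                       #         most probable token
    log_q = -0.5 * (scores - s0) ** 2                # step 2: Gaussian log-prior
    log_q = log_q - np.log(np.exp(log_q).sum())      #         normalize q
    W_hat = V / alpha                                # step 4: closed-form head
    b_hat = (1 - alpha) / alpha * log_q + c / alpha
    p = softmax(W_hat.T @ h_T + b_hat)               # step 5: adapted distribution
    return float(p @ scores), p, p0                  # step 6: expected score

# Random stand-ins for a real LVLM feature and head (illustrative only).
rng = np.random.default_rng(0)
d = 16
h_T = rng.normal(size=d)
V = rng.normal(size=(d, 10)) * 0.3
c = rng.normal(size=10) * 0.1
score, p, p0 = discode_score(h_T, V, c, alpha=0.5)
print(round(score, 3), int(np.argmax(p0)))
```

Because the adapted head is a rescaling of the original one plus a bias shift, the only extra work per sample is one small matrix-vector product and a softmax over the score tokens.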

4. Experimental Validation: MCEval Benchmark and Comparative Analysis

To assess domain robustness rigorously, the authors introduced the Multi-domain Caption Evaluation (MCEval) benchmark. MCEval consists of human-annotated caption comparisons in six distinct domains: photographs, paintings, line sketches, QuickDraw doodles, clip art, and infographics. Each image has four candidate captions, curated and judged by multiple annotators, for a total of 6,000 images and 18,000 evaluation pairs (including reference-based variants). The principal metric is accuracy in predicting human-preferred captions (Inoue et al., 16 Dec 2025).
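The accuracy metric can be illustrated with a short sketch. The function name `pairwise_accuracy`, the toy scores, and the choice to count ties as errors are assumptions made for illustration; the benchmark's exact tie-handling protocol is not described here.

```python
def pairwise_accuracy(scores_a, scores_b, human_prefers_a):
    """Fraction of caption pairs where the metric agrees with the human choice.

    scores_a / scores_b: metric scores for the two captions of each pair.
    human_prefers_a: True where annotators preferred caption A.
    Ties are counted as errors (an assumption of this sketch).
    """
    n = len(scores_a)
    if n == 0 or not (n == len(scores_b) == len(human_prefers_a)):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(
        1
        for a, b, pref in zip(scores_a, scores_b, human_prefers_a)
        if a != b and (a > b) == pref
    )
    return correct / n

# Toy example: two agreements, one tie, one disagreement.
scores_a = [7.2, 3.1, 5.0, 8.4]
scores_b = [6.5, 4.0, 5.0, 2.0]
prefs = [True, False, True, False]
accuracy = pairwise_accuracy(scores_a, scores_b, prefs)
print(accuracy)  # 0.5
```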

Empirical results show that DISCODE (using LLaVA-Next-72B as backbone) achieves an overall accuracy of 83.6%, surpassing FLEUR (82.5%), G-VEval/GPT-4o (81.0%), CLIP-Score (75.3%), and PAC-Score (69.1%). Per domain, the absolute improvement over FLEUR ranges from +0.5 percentage points (infographics) to +2.2 percentage points (QuickDraw). On established real-image benchmarks (Flickr8k-Expert, Pascal-50S, Composite), DISCODE matches or slightly exceeds the strongest prior LVLM-based metrics. Ablation experiments confirm the importance of both the cross-entropy and divergence terms, as well as the dynamic, position-dependent $\alpha$; replacing the weighted KL with Jensen–Shannon or $\beta$-divergence leads to inferior results (Inoue et al., 16 Dec 2025).

5. Strengths, Limitations, and Applicability

Strengths of DISCODE include:

  • Finetuning-free operation, with no gradient steps required for calibration,
  • Fully analytical, per-sample adaptation yielding negligible computational overhead,
  • Superior robustness under severe domain shift relative to prior methods,
  • Applicability to any open-source LVLM with accessible decoder features.

The principal limitations are:

  • Necessity of access to the LVLM’s feature vector $h_T$ and classifier head $(V, c)$, rendering DISCODE inapplicable to closed-source black-box models such as GPT-4o,
  • Configuration is tailored for discrete rating scales, with open-ended feedback beyond the current scope.

A plausible implication is that DISCODE's analytic, plug-in test-time recalibration can serve as a template for robustifying LVLM-based automatic metrics in other contexts where alignment to human-like distributional output is desirable (Inoue et al., 16 Dec 2025).

6. Extensions and Broader Research Connections

The DISCODE framework admits several avenues for extension:

  • Incorporation of richer, potentially multimodal priors (e.g., referencing explicit reference captions),
  • Dynamic, metadata-driven calibration of adaptation parameters such as $\alpha$ and prior variance,
  • Applications to other evaluation paradigms such as VQA, dialogue response ratings, or hybrid reference-free/reference-based evaluation,
  • Potential adaptation, as demonstrated by the RefDISCODE supplement, to hybrid settings that already outperform many traditional reference-based metrics.

DISCODE exemplifies the broader class of “distribution-aware score decoders”—approaches that tailor their output distribution at test time to reflect both the model’s own beliefs and the statistical structure of target application domains, improving cross-domain generalization (Inoue et al., 16 Dec 2025).

