XAIGID-RewardBench: Multimodal Reward Evaluation

Updated 3 July 2026
  • XAIGID-RewardBench is a large-scale, human-annotated benchmark designed to assess multimodal reward models that judge both image authenticity verdicts and explanation quality.
  • It structures evaluation using preference-based triplets composed of an image, two candidate responses (verdict and explanation), and a human-annotated four-class label.
  • Empirical results reveal a human-model accuracy gap of around 9.5 percentage points, highlighting challenges in logical coherence and resistance to verbosity.

XAIGID-RewardBench is a large-scale, human-annotated benchmark designed to evaluate whether multimodal LLMs (MLLMs) can serve as robust, discriminative reward models (“judges”) for explainable AI-generated image detection. Introduced by Yang et al. (2025), it provides a ground-truth platform for assessing whether current MLLMs can accurately adjudicate between the explanations that accompany automated “real/fake” decisions on image authenticity. Its distinguishing feature is that it evaluates not only the detection verdicts but also the interpretability and reasoning quality of the associated explanations, operationalizing reward modeling for multimodal, human-like explanations in a way not previously realized in the alignment/LLM literature (Yang et al., 15 Nov 2025).

1. Benchmark Structure and Task Definition

XAIGID-RewardBench centers on preference-based evaluation over triplets, each constructed as follows:

  • An image $I$ (either “real” or “fake”).
  • Two candidate detector responses, $r_a = (y_a, e_a)$ and $r_b = (y_b, e_b)$, each comprising a one-word verdict ($y \in \{\mathrm{real}, \mathrm{fake}\}$) and a free-form explanation $e$.
  • A human-annotated label $H(t)$ indicating which response is superior: “First,” “Second,” “Tie,” or “BothBad.”

A reward model (judge) receives the tuple $(I, r_a, r_b)$ and is tasked to select which response is better, with the key innovation being the explicit assessment of explanation quality beyond mere verdict correctness. Verdicts and explanations are assessed holistically using a reduced rubric distilled to a four-way label rather than multi-criteria scores, thus balancing annotation efficiency against expressivity.
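The triplet and judge interface described above can be sketched as a small data structure. This is a minimal illustration and not the benchmark's released schema: the class names (`Verdict`, `Preference`, `Response`, `Triplet`), the field names, the use of a file path for the image, and the `always_first` placeholder judge are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Verdict(str, Enum):
    REAL = "real"
    FAKE = "fake"


class Preference(str, Enum):
    FIRST = "First"
    SECOND = "Second"
    TIE = "Tie"
    BOTH_BAD = "BothBad"


@dataclass(frozen=True)
class Response:
    verdict: Verdict   # y: one-word verdict
    explanation: str   # e: free-form explanation


@dataclass(frozen=True)
class Triplet:
    image_path: str          # I (representing the image by a path is an assumption)
    r_a: Response
    r_b: Response
    human_label: Preference  # H(t)


# A judge sees only (I, r_a, r_b); it never sees the human label.
Judge = Callable[[str, Response, Response], Preference]


def always_first(image_path: str, r_a: Response, r_b: Response) -> Preference:
    """Trivial placeholder judge, useful only as a sanity baseline."""
    return Preference.FIRST


example = Triplet(
    image_path="example.jpg",  # hypothetical path
    r_a=Response(Verdict.FAKE, "The hand has six fingers."),
    r_b=Response(Verdict.REAL, "The lighting looks natural."),
    human_label=Preference.FIRST,
)
prediction = always_first(example.image_path, example.r_a, example.r_b)
print(prediction == example.human_label)
```

Keeping the human label on the triplet but out of the judge's signature makes it harder to leak the gold label into the model's input by accident.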

Triplets are generated using an array of real images (COCO 2017) and images synthesized by ten state-of-the-art generative models (e.g., Stable Diffusion 1.5/3.5, Imagen 3/4, Glide, GPT-Image-1) as well as detector/policy models spanning multiple MLLMs (e.g., Qwen-VL 2.5, Gemini 2.5 Flash/Pro, GPT-4o) (Yang et al., 15 Nov 2025).

2. Dataset Composition and Annotation Process

XAIGID-RewardBench comprises 3,000 expertly annotated triplets (3 per image × 1,000 images; 500 real and 500 fake). The fake images are drawn evenly from the ten synthesis models. Detector responses are obtained from seven leading MLLMs, ensuring diversity in both detection policy and explanatory style.
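The composition figures above can be checked with a few lines of arithmetic. The constant names are illustrative; the per-generator count simply follows from the statement that fake images are evenly split across the ten synthesis models.

```python
N_IMAGES = 1000          # 500 real + 500 fake
N_REAL = 500
N_FAKE = 500
TRIPLETS_PER_IMAGE = 3
N_GENERATORS = 10        # synthesis models for the fake images

total_triplets = N_IMAGES * TRIPLETS_PER_IMAGE
fake_per_generator = N_FAKE // N_GENERATORS

print(total_triplets)      # 3000
print(fake_per_generator)  # 50
```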

The annotation protocol emphasizes quality and consistency:

  • Each triplet is labeled by human annotators using the four-class standard as above.
  • Various bias controls are enforced: randomization of response order, stopwatch-based annotation filtering (<15 sec dropped), and manually adjudicated tie-breaks.
  • The annotation rubric covers logical coherence, distortion coverage, insight, self-consistency, counterargument quality, and relevance, but is distilled down to the four final labels to maximize inter-annotator reliability and operational throughput.
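Two of the bias controls listed above, response-order randomization and the time-based filter, can be sketched as follows. This is an assumed implementation for illustration, not the authors' annotation tooling: the function names, the `seconds` field, and the 50% swap probability are all hypothetical choices.

```python
import random

# When responses are shown in swapped order, a label given on the displayed
# order must be mapped back to the original (a, b) order.
SWAP = {"First": "Second", "Second": "First", "Tie": "Tie", "BothBad": "BothBad"}


def randomize_order(r_a, r_b, rng):
    """Return (shown_first, shown_second, swapped) with a 50% swap chance."""
    swapped = rng.random() < 0.5
    return (r_b, r_a, swapped) if swapped else (r_a, r_b, swapped)


def to_canonical(label, swapped):
    """Map a label on the displayed order back to the original order."""
    return SWAP[label] if swapped else label


def filter_fast(annotations, min_seconds=15.0):
    """Drop annotations completed in under min_seconds (the <15 sec rule)."""
    return [a for a in annotations if a["seconds"] >= min_seconds]


rng = random.Random(0)  # seeded only to make the example reproducible
shown_first, shown_second, swapped = randomize_order("resp_a", "resp_b", rng)
canonical = to_canonical("First", swapped)
kept = filter_fast([{"seconds": 9.0}, {"seconds": 15.0}, {"seconds": 42.0}])
print(len(kept))
```

Note that “Tie” and “BothBad” are symmetric, so only “First” and “Second” change under a swap.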

Inter-annotator agreement (IAA) is measured on a 100-triplet subset: humans achieve 68.0% four-way IAA and 98.3% agreement on clear-winner cases. The latter figure is treated as the human upper bound for model performance (Yang et al., 15 Nov 2025).

3. Evaluation Metrics and Protocol

Two primary evaluation axes are implemented:

  • Detector (policy) accuracy ($\mathrm{Acc}_{\mathrm{det}}$): Fraction of policy model verdicts ($y$) matching the ground-truth label of the image across the test set.

$$\mathrm{Acc}_{\mathrm{det}}(\pi) = \frac{1}{|\mathcal{I}|} \sum_{I \in \mathcal{I}} \mathbf{1}[y(\pi(I)) = l(I)]$$
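The detector accuracy formula above translates directly into code. This is a minimal sketch in which verdicts and ground-truth labels are plain strings; the function name `detector_accuracy` is an assumption for illustration.

```python
from typing import Sequence


def detector_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of verdicts y(pi(I)) that equal the ground-truth label l(I)."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


acc = detector_accuracy(
    ["real", "fake", "fake", "real"],
    ["real", "fake", "real", "real"],
)
print(acc)  # 0.75
```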

  • Judge (reward model) accuracy: The fraction of model predictions matching human-labeled preference on triplets.

    • 4-way accuracy: agreement with the human label over all triplets, counting all four classes.

    • 2-way accuracy: agreement on the “clear winner” subset, where the human label is not “Tie” or “BothBad.”
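The two judge accuracies can be written out under assumed notation that does not appear in the surviving text: $\mathcal{T}$ for the set of triplets, $J(t)$ for the judge's predicted label, and $\mathcal{T}_{\mathrm{cw}} = \{t \in \mathcal{T} : H(t) \in \{\mathrm{First}, \mathrm{Second}\}\}$ for the clear-winner subset. With these, a plausible reading is $\mathrm{Acc}_{4} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathbf{1}[J(t) = H(t)]$ and $\mathrm{Acc}_{2} = \frac{1}{|\mathcal{T}_{\mathrm{cw}}|} \sum_{t \in \mathcal{T}_{\mathrm{cw}}} \mathbf{1}[J(t) = H(t)]$. The sketch below implements that reading; whether the judge is restricted to two options in the 2-way setting is not stated here, so the code simply counts exact matches on the subset.

```python
from typing import Sequence

CLEAR_WINNER = {"First", "Second"}


def four_way_accuracy(judge: Sequence[str], human: Sequence[str]) -> float:
    """Exact-match agreement over all triplets and all four labels."""
    if len(judge) != len(human) or not human:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def two_way_accuracy(judge: Sequence[str], human: Sequence[str]) -> float:
    """Exact-match agreement restricted to triplets with a clear human winner."""
    if len(judge) != len(human):
        raise ValueError("inputs must be of equal length")
    pairs = [(j, h) for j, h in zip(judge, human) if h in CLEAR_WINNER]
    if not pairs:
        raise ValueError("no clear-winner triplets in the input")
    return sum(j == h for j, h in pairs) / len(pairs)


human = ["First", "Second", "Tie", "BothBad", "First"]
judge = ["First", "First", "Tie", "Tie", "First"]
acc4 = four_way_accuracy(judge, human)
acc2 = two_way_accuracy(judge, human)
print(acc4, acc2)
```

In the toy data the judge matches three of five labels overall and two of the three clear-winner cases.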

The highest accuracy is obtained by models that can both (a) replicate human discriminatory power in distinguishing explanation quality and (b) avoid being misled by semantically vacuous, verbose, or irrelevant reasoning chains.

4. Empirical Results and Analysis

On the released XAIGID-RewardBench test set:

  • The highest judge (reward model) accuracy is 88.76% on 2-way (clear winner) evaluation, achieved by Gemini 2.5 Pro, versus a human upper-bound of 98.3% (gap ≈ 9.5 p.p.).
  • The best 4-way accuracy (including ties/poor pairs) is 68.92% (Gemini 2.5 Flash).
  • By contrast, GPT-4o, Gemma 3n, and Qwen 2.5 trail the best 2-way score by roughly 3 to 6.5 percentage points despite strong underlying language and vision performance, and Qwen 2.5 7B falls to 29.37% on 4-way accuracy (Yang et al., 15 Nov 2025).

| Model (Judge) | 4-way Acc (%) | 2-way Acc (%) |
|---|---|---|
| Gemini 2.5 Pro | 68.35 | 88.76 |
| Gemini 2.5 Flash | 68.92 | 87.26 |
| GPT-4o | 60.62 | 85.59 |
| Gemma 3n 4B | 66.88 | 84.54 |
| Qwen 2.5 7B | 29.37 | 82.28 |

This gap of about 9.5 percentage points between human agreement and the best judge underscores a notable limitation in current MLLM judgment fidelity, particularly when the competing explanations require careful reasoning to tell apart.
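The headline gap can be recomputed from the reported numbers. The dictionary below copies the table values; the variable names are illustrative.

```python
HUMAN_TWO_WAY = 98.3  # human agreement on clear-winner cases (%)

results = {
    "Gemini 2.5 Pro":   {"four_way": 68.35, "two_way": 88.76},
    "Gemini 2.5 Flash": {"four_way": 68.92, "two_way": 87.26},
    "GPT-4o":           {"four_way": 60.62, "two_way": 85.59},
    "Gemma 3n 4B":      {"four_way": 66.88, "two_way": 84.54},
    "Qwen 2.5 7B":      {"four_way": 29.37, "two_way": 82.28},
}

best_two_way = max(results, key=lambda m: results[m]["two_way"])
best_four_way = max(results, key=lambda m: results[m]["four_way"])
gap = round(HUMAN_TWO_WAY - results[best_two_way]["two_way"], 2)

print(best_two_way, gap)   # Gemini 2.5 Pro 9.54
print(best_four_way)       # Gemini 2.5 Flash
```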

5. Typical Failure Modes and Diagnostic Findings

Analysis of judge errors reveals several recurring failure modes:

  • Downplaying critical artifacts: Judges sometimes excuse severe flaws (e.g., missed artifacts in fake images) as mere “hiccups.”
  • Irrelevant detail inclusion: Judges give weight to details that are accurate but causally unrelated to the authenticity decision, which skews the preference toward the wrong response.
  • Over-generalization: Models mistake large realistic image regions for global realism, overlooking small, critical irregularities.
  • Length/verbosity bias: Longer explanations are erroneously equated with quality, even when they are logically inconsistent or hallucinated.

These findings indicate that increasing model size or raw detection accuracy is not sufficient; explicit improvements to the chains of visual-linguistic reasoning and causal attribution are necessary.
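One way to probe the length/verbosity bias described above is to look at the judge's errors on clear-winner triplets and ask how often the wrongly preferred response was the longer one. This diagnostic is not taken from the source; it is a hypothetical sketch, and the record keys (`len_a`, `len_b`, `human`, `judge`) are assumed names. A rate well above 0.5 would be consistent with a length bias, though it would not prove one.

```python
def length_bias_rate(records):
    """Among judge errors on clear-winner triplets, return the fraction where
    the judge preferred the longer explanation. Returns None if no such errors."""
    errors = 0
    chose_longer = 0
    for r in records:
        if r["human"] not in ("First", "Second"):
            continue  # only clear-winner cases
        if r["judge"] not in ("First", "Second") or r["judge"] == r["human"]:
            continue  # only cases where the judge picked the other response
        if r["len_a"] == r["len_b"]:
            continue  # length cannot explain the error
        errors += 1
        judge_len = r["len_a"] if r["judge"] == "First" else r["len_b"]
        human_len = r["len_a"] if r["human"] == "First" else r["len_b"]
        chose_longer += judge_len > human_len
    return chose_longer / errors if errors else None


toy = [
    {"len_a": 100, "len_b": 20, "human": "Second", "judge": "First"},
    {"len_a": 30, "len_b": 80, "human": "First", "judge": "Second"},
    {"len_a": 90, "len_b": 10, "human": "Second", "judge": "Second"},
    {"len_a": 10, "len_b": 50, "human": "Second", "judge": "First"},
    {"len_a": 40, "len_b": 60, "human": "Tie", "judge": "Second"},
]
rate = length_bias_rate(toy)
print(rate)
```

In the toy data, three errors fall on clear-winner triplets and the judge favored the longer explanation in two of them.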

6. Comparison and Complementarity with Other RewardBench Efforts

While XAIGID-RewardBench is the first to systematize the evaluation of reward models for explanations in AI-generated image detection, it is conceptually related to earlier language-model-centric RewardBench efforts (Lambert et al., 2024). Notably:

  • Standard RewardBench focuses on text-only LLM reward models scoring (prompt, chosen, rejected) tuples with subtle, verifiable distinctions (e.g., “micro-errors” in math/code).
  • XAIGID-RewardBench generalizes this schema to the multimodal, explainability-centric regime, integrating both decision verdicts and natural language explanations.
  • A plausible implication is that XAI benchmarks built using this design could be extended to, or hybridized with, attribution and rationale-based RLHF evaluation pipelines, further promoting explainability and trust in automated decision support tools (Yang et al., 15 Nov 2025; Lambert et al., 2024).

7. Availability, Practical Usage, and Future Directions

XAIGID-RewardBench, along with code and data, is publicly available at https://github.com/RewardBench/XAIGID-RewardBench. Models can be evaluated by running the reward model (judge) over the benchmark’s structured (image, response A, response B, label) records and computing both 2-way and 4-way agreement against the human gold labels.
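The evaluation procedure described above can be sketched end to end. Nothing here reflects the repository's actual file format or API: the record keys, the `judge_fn` callable, the `stub_judge`, and the label-normalization step are all assumptions, and unparseable judge outputs are simply counted as wrong.

```python
import re

_CANON = {"first": "First", "second": "Second", "tie": "Tie", "bothbad": "BothBad"}


def normalize_label(raw):
    """Map free-text judge output (e.g. ' both bad. ') to a canonical label, or None."""
    key = re.sub(r"[^a-z]", "", raw.lower())
    return _CANON.get(key)


def evaluate(judge_fn, records):
    """Run judge_fn over records and return 4-way and 2-way agreement."""
    if not records:
        raise ValueError("no records to evaluate")
    preds = [
        normalize_label(judge_fn(r["image"], r["response_a"], r["response_b"]))
        for r in records
    ]
    gold = [r["label"] for r in records]
    four = sum(p == g for p, g in zip(preds, gold)) / len(gold)
    clear = [(p, g) for p, g in zip(preds, gold) if g in ("First", "Second")]
    two = sum(p == g for p, g in clear) / len(clear) if clear else float("nan")
    return {"four_way": four, "two_way": two}


def stub_judge(image, response_a, response_b):
    """Stand-in for a real MLLM call; always answers ' first. '."""
    return " first. "


records = [  # hypothetical in-memory records
    {"image": "a.jpg", "response_a": "fake: ...", "response_b": "real: ...", "label": "First"},
    {"image": "b.jpg", "response_a": "real: ...", "response_b": "fake: ...", "label": "Second"},
    {"image": "c.jpg", "response_a": "real: ...", "response_b": "real: ...", "label": "Tie"},
]
scores = evaluate(stub_judge, records)
print(scores)
```

Swapping `stub_judge` for a function that queries an actual model is the only change needed to reuse the loop.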

The results highlight that achieving human-comparable evaluative robustness for explanations in the multimodal setting remains an open challenge. Future directions include developing reward models with explicit logical-chain auditing, counterfactual reasoning, and better resistance to verbose or stylistically misleading explanations, none of which existing MLLM architectures have yet solved. Further, integrating rationale-specific supervision (e.g., token-level justifications, error localization) may reduce the persistent human-model gap of about 9.5 percentage points.

XAIGID-RewardBench thus provides a foundational benchmark for quantitatively tracking progress toward truly explainable, trustworthy, and robustly aligned MLLM reward models in the era of explainable AI-generated content detection (Yang et al., 15 Nov 2025).
