VL-RewardBench: Evaluating VL Reward Models
- VL-RewardBench is a comprehensive benchmark designed to rigorously evaluate vision-language reward models through multimodal tasks and human-verified datasets.
- It covers key areas including visual perception, hallucination detection, and complex reasoning, with detailed curation and evaluation protocols.
- Repeated-evaluation protocols and critic-training analyses on the benchmark reveal practical performance challenges and guide model improvements.
VL-RewardBench is a comprehensive, human-verified benchmark designed to rigorously evaluate the judgment capabilities of vision-language reward models (VL-GenRMs) across challenging, realistic multimodal tasks. Its suite structure and dataset curation, covering perception, hallucination detection, and reasoning, make it a central stress test for recent generative and discriminative reward models in the vision-language alignment ecosystem (Li et al., 26 Nov 2024).
1. Scope and Foundational Principles
VL-RewardBench was introduced to address deficiencies in prior vision-language reward model evaluation—namely, over-reliance on AI-annotated preferences, limited coverage of perceptual hallucination, and insufficient task difficulty to differentiate top models. The benchmark targets vision-language generative reward models used as "judges" for RLHF pipelines and alignment, providing a testbed with human-verified preference labels spanning open-ended queries, hallucination detection, and complex multimodal reasoning (Li et al., 26 Nov 2024).
The design criteria are:
- Diverse and realistic multimodal use cases.
- Sufficient difficulty to probe model bottlenecks.
- Objective, human-verified ground truth labels.
VL-RewardBench is thus both a mirror for current VL-GenRM development and a benchmark for alignment-targeted optimization.
2. Dataset Structure and Curation Pipeline
VL-RewardBench comprises 1,250 preference pairs, each consisting of an image, a natural language prompt, and two candidate answers (one "winner", one "loser"):
- General Multimodal Queries (183 pairs, 14.7%): User-style instructions drawn from WildVision and VLFeedback. Example tasks include scene description and attribute querying.
- Visual Hallucination Detection (749 pairs, 59.9%): Requires judges to penalize details not grounded in the image; pairs sourced from hallucination-focused datasets such as POVID, RLAIF-V, and RLHF-V.
- Complex Multimodal Reasoning (318 pairs, 25.4%): Encompasses knowledge-intensive and mathematical inference requiring CoT-style evaluation, drawn from MMMU-Pro and MathVerse.
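For concreteness, a single benchmark instance can be represented with a record like the one below. This is an illustrative schema only; the field names are assumptions rather than the official release format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferencePair:
    """One VL-RewardBench instance: an image, a prompt, and two ranked responses.

    Field names are illustrative, not the official release schema.
    """
    image_path: str                   # path or URL of the input image
    prompt: str                       # natural-language query about the image
    chosen: str                       # human-verified "winner" response
    rejected: str                     # human-verified "loser" response
    category: str                     # "general", "hallucination", or "reasoning"
    error_tag: Optional[str] = None   # e.g. "existence", "recognition", "attribute", "counting"
```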
The curation process involves:
- Mass collection of candidates from existing datasets (e.g., >6,000 WildVision queries, >3,000 hallucination samples, >3,000 reasoning samples).
- Ensemble filtering using four weak VLLMs, retaining only pairs universally misjudged by these models (high adversarial value).
- Multi-stage human verification to ensure unambiguous ground truth, removal of stylistic or low-quality samples, and categorical error tagging (existence, recognition, attribute, counting, or other).
AI-aided preference labeling is employed for reasoning-intensive tasks, with further human filtering to discard ambiguous or doubly incorrect pairs.
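The ensemble-filtering criterion can be sketched as follows: a candidate pair is retained only if every model in a small pool of weak VLLM judges prefers the wrong answer. The `judges` callables are placeholders for the four weak models; the exact prompts and models used in the paper are not reproduced here.

```python
from typing import Callable, Iterable, List

# Each judge takes (image_path, prompt, answer_a, answer_b) and returns "A" or "B".
Judge = Callable[[str, str, str, str], str]

def filter_adversarial_pairs(pairs: Iterable, judges: List[Judge]) -> list:
    """Keep only pairs that *all* weak judges get wrong (high adversarial value).

    `pairs` is expected to hold records like the PreferencePair sketched above.
    Illustrative reconstruction of the filtering criterion, not the authors' code.
    """
    kept = []
    for pair in pairs:
        # The human-verified winner is always shown in position A here; a judge
        # is "wrong" if it prefers B (the rejected answer).
        verdicts = [j(pair.image_path, pair.prompt, pair.chosen, pair.rejected) for j in judges]
        if all(v == "B" for v in verdicts):
            kept.append(pair)
    return kept
```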
Dataset summary:
| Subset | Count | % of Total | Primary Sources |
|---|---|---|---|
| General Multimodal | 183 | 14.7 | WildVision, VLFeedback |
| Hallucination Detection | 749 | 59.9 | POVID, RLAIF-V, RLHF-V |
| Complex Reasoning | 318 | 25.4 | MMMU-Pro, MathVerse |
3. Evaluation Protocol and Metrics
VL-RewardBench uses an LLM-as-a-Judge setup: for each test instance, the reward model is given the image and prompt along with the two candidate responses in randomized order, and it must assign a higher score or preference to the "winning" answer as determined during annotation (a minimal sketch of this judging loop follows the list below).
Key procedures:
- Each evaluation is repeated K times (typically K=5) per pair to control for positional bias, with majority voting used to determine the model's final judgment.
- Fixed decoding parameters (temperature=0.2, top-p=0.2) ensure consistent evaluation.
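A minimal sketch of this judging loop, assuming a generic `query_judge(image_path, question) -> str` call standing in for the VL-GenRM under evaluation (the prompt template below is illustrative, not the benchmark's exact wording):

```python
import random
from collections import Counter

def judge_pair(query_judge, image_path: str, prompt: str,
               chosen: str, rejected: str, k: int = 5) -> bool:
    """Return True if the judge prefers the human-verified winner (majority of K votes).

    Decoding is assumed fixed at temperature=0.2, top_p=0.2 inside query_judge.
    """
    votes = []
    for _ in range(k):
        # Randomize candidate order on every repetition to control for positional bias.
        flip = random.random() < 0.5
        a, b = (rejected, chosen) if flip else (chosen, rejected)
        question = (
            f"{prompt}\n\nResponse A: {a}\n\nResponse B: {b}\n\n"
            "Which response is better? Answer with 'A' or 'B'."
        )
        verdict = query_judge(image_path, question).strip().upper()[:1]
        # Map the verdict back to chosen/rejected given the order flip.
        votes.append((verdict == "B") if flip else (verdict == "A"))
    # Majority vote over the K repeated judgments gives the final decision.
    return Counter(votes).most_common(1)[0][0]
```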
Reported metrics:
- Overall Accuracy: Fraction of all 1,250 preference pairs on which the model prefers the human-verified winning response.
- Macro-Average Accuracy: Arithmetic mean of the three per-category accuracies (general, hallucination, reasoning), compensating for class imbalance across subsets.
- Downstream Correlation: Pearson correlation between VL-RewardBench accuracy and Best-of-N (BoN) sampling gain on downstream tasks (notably, MMMU-Pro).
Accuracy is always measured on pairwise preference selection; the random-guess baseline is 50% (Li et al., 26 Nov 2024, Wang et al., 12 May 2025).
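The reported metrics follow directly from these definitions; a minimal reconstruction is shown below (the per-pair correctness flags and per-category grouping are assumed inputs):

```python
from statistics import mean
from typing import Dict, List

def overall_accuracy(correct: List[bool]) -> float:
    """Fraction of all preference pairs where the judge picked the winner."""
    return sum(correct) / len(correct)

def macro_average_accuracy(correct_by_category: Dict[str, List[bool]]) -> float:
    """Unweighted mean of per-category accuracies, compensating for the
    imbalance between the general, hallucination, and reasoning subsets."""
    return mean(sum(flags) / len(flags) for flags in correct_by_category.values())

def pearson_r(x: List[float], y: List[float]) -> float:
    """Pearson correlation, e.g. between benchmark accuracy and BoN gain on MMMU-Pro."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```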
4. Core Difficulty Dimensions
VL-RewardBench is engineered to expose specific weaknesses in reward models:
- Visual Perception: The majority of model failures, in both open- and closed-source models, stem from basic scene understanding (existence, recognition, and attribute misclassification errors). For example, GPT-4o-mini has a 67.9% error rate on "existence" tasks (Li et al., 26 Nov 2024).
- Hallucination Detection: 59.9% of examples require the judge to identify or punish hallucinated details. This remains a critical challenge, with open-source models performing near or below chance on this task.
- Reasoning: Although multimodal reasoning is traditionally perceived as the hardest category, reasoning error rates average 41.8%, lower than perception-related error rates for most models, indicating that core perception is now the stronger bottleneck.
- Inference-Time Effects: Inference-time scaling (e.g., repeated judgments or test-time ensembling) offers variable benefit; a sketch of such a vote-count sweep follows this list. For GPT-4o, macro-accuracy increases from 60.3% (K=1) to 62.7% (K=7), while for Qwen2-VL-72B, performance actually degrades with more votes (Li et al., 26 Nov 2024).
- Model Scaling and Critic Training: Scaling up model size yields only limited returns; e.g., moving from Qwen2-VL-7B to 72B reduces counting errors by 18.2 points but reasoning errors by only 6.0. Judgment-focused ("critic") training (instruction-finetuning on evaluation data) substantially boosts accuracy; LLaVA-OneVision-7B-ov improves from ~38% to 52.9% accuracy in the pointwise critic regime (+14.7 points) (Li et al., 26 Nov 2024).
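The vote-count sweep mentioned above can be reproduced in outline by varying K in the judging loop from Section 3; `pairs_by_category` maps each subset name to its preference-pair records and `query_judge` is the same placeholder judge call (illustrative only).

```python
def macro_accuracy_vs_k(query_judge, pairs_by_category, ks=(1, 3, 5, 7)):
    """Report macro-average accuracy for each vote count K, mirroring the
    K=1 vs K=7 comparison described above. Reuses judge_pair from Section 3."""
    results = {}
    for k in ks:
        per_category = []
        for pairs in pairs_by_category.values():
            correct = [
                judge_pair(query_judge, p.image_path, p.prompt, p.chosen, p.rejected, k=k)
                for p in pairs
            ]
            per_category.append(sum(correct) / len(correct))
        results[k] = sum(per_category) / len(per_category)
    return results
```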
5. Benchmark Results and Model Performance
Peer-reviewed evaluations on VL-RewardBench report:
| Model | Macro-Average | Overall Accuracy | Reasoning | Hallucination | General |
|---|---|---|---|---|---|
| GPT-4o | 62.4% | 65.8% | 70.5% | 67.6% | 49.1% |
| Gemini-1.5-Pro | 62.5% | 67.2% | 64.2% | 72.5% | 50.8% |
| Claude 3.5 Sonnet | 53.6% | 55.3% | 62.3% | 55.0% | 43.4% |
| Qwen2-VL-72B | 43.0% | 39.5% | 58.0% | 32.8% | 38.1% |
| Llama-3.2-90B (best open-source) | 53.9% | 56.2% | 61.7% | 57.3% | 42.6% |
Many 7B-scale open models operate at the random choice baseline (≈50%) or below. Critic-trained or judgment-tuned models (e.g., LLaVA-Critic) exhibit strong improvement over vanilla counterparts (Li et al., 26 Nov 2024, Wang et al., 12 May 2025).
6. Downstream Impact and Correlation with Real-World Utility
VL-RewardBench performance is strongly predictive of downstream Best-of-N sampling effectiveness on non-trivial VL tasks:
- Pearson’s r between VL-RewardBench accuracy and BoN gain on MMMU-Pro exceeds 0.9 across leading models.
- Example: With GPT-4o as judge, BoN sampling lifts LLaVA-OneVision-7B-ov accuracy from 35.7% to 52.5% on MMMU-Pro (Δ=16.8 points), with similar predictive trends across other models (Li et al., 26 Nov 2024).
This supports the use of VL-RewardBench as a practical model-selection and validation tool in RLHF training, dataset curation, and deployment pipelines.
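A minimal sketch of Best-of-N selection with a VL-GenRM as the ranker. `generate_candidates` and `score_response` stand in for the policy VLLM and the reward model; the paper evaluates judges on pairwise preferences, so the pointwise scoring shown here is a simplification, not the authors' exact procedure.

```python
def best_of_n(generate_candidates, score_response, image_path: str, prompt: str, n: int = 8) -> str:
    """Draw N candidate responses from the policy model and return the one
    the reward model scores highest (Best-of-N sampling).

    generate_candidates(image_path, prompt, n) -> list[str]
    score_response(image_path, prompt, response) -> float
    Both are placeholders for the actual model APIs.
    """
    candidates = generate_candidates(image_path, prompt, n)
    return max(candidates, key=lambda r: score_response(image_path, prompt, r))
```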
7. Analytical Insights and Best Practices
Analysis of VL-RewardBench results yields these conclusions:
- Improvement in reward model judgment is more efficiently achieved via targeted critic training than by scaling raw model size.
- Perceptual bottlenecks (existence, recognition, attribute errors) are now the major limiting factor, surpassing even reasoning in difficulty.
- Vanilla self-consistency or repeated voting is not always beneficial and may decrease performance, in contrast to text-only evaluation paradigms.
- Blending multimodal and text-based preference data, and leveraging specialized critic datasets, is necessary for robust generalization.
- VL-RewardBench should be used alongside human-annotated validation sets to mitigate systematic bias.
- VL-RewardBench offers a means to co-evolve reward models and generation models: improved reward models select better generations, and stronger VLLMs enable richer benchmarking (Li et al., 26 Nov 2024, Wang et al., 12 May 2025, Zhang et al., 19 Sep 2025).
References:
(Li et al., 26 Nov 2024, Wang et al., 12 May 2025, Lin et al., 2 Dec 2025, Zhang et al., 19 Sep 2025, Yasunaga et al., 20 Feb 2025)