MMRB2: Unified Benchmark for Reward Models
- Multimodal RewardBench 2 (MMRB2) is a unified benchmark designed for systematic evaluation of reward models in mixed text and image scenarios.
- It employs rigorous ensemble filtering and expert human annotation to deliver high-consensus, nontrivial multimodal preference data.
- Performance metrics expose significant modality biases and alignment gaps, guiding future improvements in omni-modal reward modeling.
Multimodal RewardBench 2 (MMRB2) is a large-scale, unified benchmark for the systematic evaluation and development of reward models ("judges") operating over omni-modal generative architectures, with particular emphasis on interleaved text and image processing. MMRB2 addresses the limitations of previous benchmarks by providing high-consensus, expert-curated preference data for both multimodal understanding and generation. The benchmark is constructed to stress-test reward models on four principal subtasks using rigorous filtering and annotation protocols. Leading commercial and open-source judges—including Gemini 3 Pro, GPT-5, and Qwen3-VL-32B—are systematically compared against human-level annotation, revealing significant gaps and modality biases that point to future research needs in RM design and evaluation (Hu et al., 18 Dec 2025).
1. Motivation and Theoretical Foundations
MMRB2 responds to two core deficiencies in reward modeling for LLMs. First, current RMs are text- or vision-centric and do not generalize to omni-modal architectures that interleave text and images within a single dialogue, as typified by "omni models" (e.g., GPT-4o, Gemini 3 Pro). Second, the lack of reliable, multimodal preference data impedes precise model alignment for use-cases requiring complex cross-modal reasoning, narrative, or plan synthesis. Existing automatic metrics, such as CLIPScore or TIFA, are shown to miss fine-grained errors in composition, style, attribute binding, or context integration, driving the need for manual, high-fidelity evaluation protocols (Hu et al., 18 Dec 2025).
2. Benchmark Structure and Subtask Specification
MMRB2 is organized around four core subtasks. Each subtask comprises 1,000 carefully selected instances, each pairing a prompt with two candidate responses, spanning practical and challenging scenario designs:
| Subtask | Description | Judged Aspects |
|---|---|---|
| Text-to-Image | Prompted image synthesis from textual input | Object composition, spatial, details |
| Image Editing | Targeted edits to input images per instructions | Edit faithfulness, region integrity |
| Interleaved Generation | Mixed text–image output (multi-step guides, etc.) | Coherence, planning, cross-modal |
| Multimodal Reasoning | Visual puzzles or inference (sketches) | Correctness, stepwise reasoning |
Each subtask draws prompts, stratified across 21 public and newly created datasets (e.g., WISE, DreamBench, ISG-Bench, VisuLogic), and responses are harvested from 23 state-of-the-art models and agentic pipelines, both API-accessed and open-source.
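To make the data layout concrete, the following is a minimal sketch of one plausible instance schema; the class and field names (`PreferencePair`, `Response`, etc.) are illustrative assumptions, not the released format:

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

Subtask = Literal["text_to_image", "image_editing",
                  "interleaved_generation", "multimodal_reasoning"]

@dataclass
class Response:
    """One candidate: interleaved text segments and generated images."""
    model_name: str                       # which of the 23 source models produced it
    text_segments: List[str] = field(default_factory=list)
    image_paths: List[str] = field(default_factory=list)

@dataclass
class PreferencePair:
    """A single MMRB2-style prompt-pair instance (illustrative schema)."""
    subtask: Subtask
    prompt: str
    source_dataset: str                   # e.g., "WISE", "DreamBench", "VisuLogic"
    response_a: Response
    response_b: Response
    human_label: Literal["A", "B"]        # majority preference of 3 expert annotators
    annotator_scores: Optional[List[float]] = None  # 7-point Likert ratings
```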
3. Annotation Pipeline and Consensus Filtering
To maximize labeling fidelity, MMRB2 adopts a hybrid filtering and annotation protocol:
- Ensemble Filtering: Nine advanced multimodal judges (GPT-5, Gemini 2.5/3 Pro/Flash, GPT-4o, Gemma, Qwen2.5-VL, etc.) score each prompt–response pair twice, in both orientations (A vs B, B vs A); the ensemble step is sketched in code after this list.
- Trivial Pair Pruning: Pairs with ≥90% model agreement are excluded, ensuring the retained cohort is nontrivial and challenging for synthetic judges.
- Human Annotation: Three expert annotators review each pair via a detailed rubric (faithfulness, quality, consistency) on a 7-point Likert scale. Pairs with high inter-annotator spread (>4) or mean ratings near indifference (3.0–4.0) are removed. Task-level human agreement exceeds 95.3%, yielding 4,000 high-consensus preference pairs in total.
- Editor’s term: “Positionally consistent dual evaluation” is systematically applied to counteract model position bias, resulting in 8,000 binary judgments per round.
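The filtering logic above can be sketched compactly. A minimal sketch, assuming the judge interface `judge(prompt, first, second)` and reusing the illustrative `PreferencePair` schema from Section 2; the 90% agreement cutoff, Likert spread limit, and indifference band follow the protocol described above, while all function names are assumptions:

```python
import statistics

AGREEMENT_CUTOFF = 0.90         # trivial-pair pruning threshold from the paper
LIKERT_SPREAD_MAX = 4           # maximum allowed inter-annotator spread
INDIFFERENCE_BAND = (3.0, 4.0)  # mean ratings dropped as near-indifferent

def ensemble_votes(judges, pair):
    """Positionally consistent dual evaluation: query every judge twice,
    once per orientation, to cancel position bias."""
    votes = []
    for judge in judges:
        # Assumed interface: judge(prompt, first, second) -> "first" | "second"
        v1 = judge(pair.prompt, pair.response_a, pair.response_b)
        votes.append("A" if v1 == "first" else "B")
        v2 = judge(pair.prompt, pair.response_b, pair.response_a)  # swapped order
        votes.append("A" if v2 == "second" else "B")
    return votes

def is_trivial(votes):
    """Prune pairs on which >=90% of ensemble votes already agree."""
    top = max(votes.count("A"), votes.count("B"))
    return top / len(votes) >= AGREEMENT_CUTOFF

def passes_human_filter(likert_scores):
    """Keep pairs with tight annotator agreement and a clear preference."""
    spread = max(likert_scores) - min(likert_scores)
    mean = statistics.mean(likert_scores)
    if spread > LIKERT_SPREAD_MAX:
        return False                      # annotators disagree too much
    lo, hi = INDIFFERENCE_BAND
    return not (lo <= mean <= hi)         # drop near-indifferent pairs
```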
4. Evaluation Metrics and Correlative Methodology
MMRB2 evaluates judges using strict concordance with human majority preference. Binary accuracy per subtask is computed as

$$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\!\left[\hat{y}_i = y_i\right],$$

where $\hat{y}_i$ is the judge’s prediction and $y_i$ the human consensus label. To further probe downstream utility, judges are used in “Best-of-N” sampling for candidate generation on four aligned evaluation suites, with the top model output per prompt selected according to RM scores. The mean gain in downstream task metrics (e.g., GenAI-Bench, EMMA) is then analyzed, with Pearson’s $r$ computed between judge MMRB2 accuracy and average downstream improvement. The observed correlation validates MMRB2 as a robust predictive benchmark for real-world model selection efficacy.
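A minimal sketch of the concordance metric and correlation analysis, assuming per-subtask lists of judge predictions, human labels, and downstream gains (all names illustrative):

```python
from math import sqrt

def judge_accuracy(preds, labels):
    """Binary concordance with the human majority label (Acc above)."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def best_of_n(candidates, rm_score):
    """Best-of-N selection: keep the candidate the reward model scores highest."""
    return max(candidates, key=rm_score)

def pearson_r(xs, ys):
    """Pearson correlation, e.g., between per-judge MMRB2 accuracy (xs)
    and mean downstream Best-of-N gain (ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```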
5. Comparative Results and Performance Analysis
The following table summarizes core accuracy results across subtasks for leading judges, all of which remain well below the human agreement level (>90%):
| Judge | T2I | Editing | Interleaved | Reasoning |
|---|---|---|---|---|
| Gemini 3 Pro | 74.4 | 74.9 | 76.4 | 79.5 |
| GPT-5 | 70.5 | 73.8 | 74.4 | 70.2 |
| Gemini 2.5 Pro | 70.5 | 71.3 | 75.1 | 66.6 |
| Qwen3-VL-32B | 64.1 | 67.3 | 70.5 | 56.6 |
| GPT-4o | 60.3 | 65.0 | 61.5 | 51.9 |
Gemini 3 Pro consistently leads at 74–80% accuracy, yet remains 10–16 points below human consensus, highlighting persistent alignment gaps. GPT-4o underperforms even open-source models such as Qwen3-VL-32B across all subtasks, suggesting it is becoming obsolete as an automatic evaluator. Judging same-model output pairs is 5–12 points harder than judging cross-model pairs, indicating a fine-grained discrimination gap.
6. Insights, Modality Biases, and Limitations
Analysis uncovers modality-specific biases and outstanding research issues:
- Image-Presence Bias: Judges favor responses containing images by 28–49 points on mixed-modal reasoning, even when human consensus selects text-only answers, revealing significant overreliance on visual modalities (a measurement sketch follows this list).
- Reward Model Specialization: Preference-trained RMs (ImageReward, VQAScore, UnifiedReward) improve over basic heuristics but lag behind leading multimodal judges, likely due to training distribution and architectural constraints.
- Voting/Scaling Effects: Majority voting over repeated judging runs improves judge accuracy by only ~1% for proprietary models, offering limited robustness gains.
- A plausible implication is that future architectures should explicitly debias cross-modal evaluation and amplify discrimination between nuanced, attribute-differentiated outputs.
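As a concrete reading of the image-presence bias above, the hypothetical helpers below (reusing the illustrative `PreferencePair` schema from Section 2) compute the point gap between a judge's and humans' rates of preferring image-bearing responses:

```python
def has_image(pair, choice):
    """True if the selected response ('A' or 'B') contains any image."""
    resp = pair.response_a if choice == "A" else pair.response_b
    return bool(resp.image_paths)

def image_presence_bias(pairs, judge_preds):
    """Gap between judge and human rates of preferring the image-bearing
    response, in percentage points.

    Assumes each pair contrasts an image-bearing response with a
    text-only one; `judge_preds` is a list of 'A'/'B' judgments.
    """
    judge_img = sum(has_image(p, pred) for p, pred in zip(pairs, judge_preds))
    human_img = sum(has_image(p, p.human_label) for p in pairs)
    return 100 * (judge_img - human_img) / len(pairs)
```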
7. Future Work and Open Research Directions
MMRB2 points toward several strategic avenues for next-generation reward model research:
- Architectures designed for enhanced same-model discrimination and fine-grained attribute evaluation.
- Debiasing mechanisms for image-presence and cross-modal preferences, potentially via contrastive multimodal losses.
- Expansion beyond text and image to agentic multi-turn dialogues, additional modalities (video, audio), safety/bias-sensitive annotation, and multilingual preference modeling.
- Joint RLHF on high-fidelity, multi-modal preference data to approach human-level judgment and drive alignment in generalist, omni-modal generative systems.
MMRB2 establishes a comprehensive, predictive, and technically rigorous foundation for the evaluation and development of reward models guiding next-generation interleaved text–image generative architectures (Hu et al., 18 Dec 2025), positioning it as a candidate de facto standard for future omni-model and judge research in practical multimodal AI deployment.