Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models (2502.14191v1)

Published 20 Feb 2025 in cs.CV and cs.AI

Abstract: Reward models play an essential role in training vision-language models (VLMs) by assessing output quality to enable aligning with human preferences. Despite their importance, the research community lacks comprehensive open benchmarks for evaluating multimodal reward models in VLMs. To address this gap, we introduce Multimodal RewardBench, an expert-annotated benchmark covering six domains: general correctness, preference, knowledge, reasoning, safety, and visual question-answering. Our dataset comprises 5,211 annotated (prompt, chosen response, rejected response) triplets collected from various VLMs. In evaluating a range of VLM judges, we find that even the top-performing models, Gemini 1.5 Pro and Claude 3.5 Sonnet, achieve only 72% overall accuracy. Notably, most models struggle in the reasoning and safety domains. These findings suggest that Multimodal RewardBench offers a challenging testbed for advancing reward model development across multiple domains. We release the benchmark at https://github.com/facebookresearch/multimodal_rewardbench.

Summary

  • The paper introduces a holistic benchmark assessing VLM reward models across dimensions like correctness, reasoning, safety, and specialized knowledge.
  • The paper demonstrates that top models achieve around 72% accuracy, revealing notable challenges in reasoning and safety domains.
  • The paper highlights that scaling model parameters boosts performance unevenly, underscoring the need for targeted improvements and tailored training strategies.

Overview

The paper "Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision LLMs" (2502.14191) presents an expert-annotated benchmark specifically designed to evaluate how well reward models can align vision-LLMs (VLMs) with human preferences. The paper introduces a comprehensive evaluation framework that addresses several crucial dimensions of VLM performance, including general correctness, preference, domain-specific knowledge, logical reasoning, safety, and visual question answering (VQA). This work provides a systematic approach to benchmarking reward models with implications for both academic research and industrial applications.

Methodology

The proposed benchmark is constructed by collecting multimodal prompts and responses from a wide range of established datasets—such as VisIT-Bench, NoCaps, MMMU-Pro, and MathVista—and pairing them into triplets of the form (prompt, chosen response, rejected response). Expert human annotators then evaluate these triplets under rigorous guidelines. In particular, the annotation process emphasizes:

  • Inter-annotator Agreement: High consistency in ratings, especially for dimensions like correctness, preference, reasoning, and safety.
  • Pilot Tasks Refinement: Iterative improvement of annotation guidelines to minimize major errors and omissions.
  • Category-Specific Annotations: Different scales and scoring mechanisms are applied to long-form generative tasks (e.g., detailed visual instruction following) versus short-form VQA tasks.

This comprehensive design enables the assessment of both generative and discriminative capabilities of VLM reward models across six well-defined categories.
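
To make the evaluation protocol concrete, the following is a minimal sketch of the triplet format and a pairwise judging call, assuming a generic chat-style VLM judge. The field names, the judge prompt, and the query_vlm_judge helper are illustrative assumptions, not the benchmark's exact interface.

    # Illustrative sketch of a (prompt, chosen, rejected) triplet and a pairwise
    # judging call. Names and prompt wording are assumptions, not the official API.
    from dataclasses import dataclass

    @dataclass
    class Triplet:
        prompt: str       # text instruction or question
        image_path: str   # image accompanying the multimodal prompt
        chosen: str       # response preferred by expert annotators
        rejected: str     # response judged worse by the annotators
        domain: str       # correctness, preference, knowledge, reasoning, safety, or vqa

    def judge_pair(triplet: Triplet, query_vlm_judge) -> bool:
        """Ask a VLM judge to pick the better response and report whether it
        agrees with the human-chosen one."""
        judge_prompt = (
            f"Question: {triplet.prompt}\n"
            f"Response A: {triplet.chosen}\n"
            f"Response B: {triplet.rejected}\n"
            "Which response is better? Answer 'A' or 'B'."
        )
        verdict = query_vlm_judge(judge_prompt, image=triplet.image_path)
        return verdict.strip().upper().startswith("A")

In practice, the order of the chosen and rejected responses would be randomized per example to control for position bias in the judge.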

Benchmark Description

The benchmark spans six distinct domains:

  1. General Correctness: Evaluates factual and logical consistency in long-form generative outputs and complex instruction-following tasks.
  2. General Preference: Focuses on subjective human-preference judgments, capturing nuances that pure correctness evaluations may miss.
  3. Knowledge: Tests domain-specific expertise in areas like STEM, humanities, business, and medicine, with an emphasis on depth and correctness of responses.
  4. Reasoning: Assesses the chain-of-thought and deductive capabilities, particularly for mathematical problem-solving and programmatic reasoning.
  5. Safety: Targets the detection and mitigation of biases, toxicity, and harmful content.
  6. Visual Question Answering: Evaluates the ability to directly interpret and logically respond to visual inputs.

The dataset comprises 5,211 annotated triplets, providing a varied corpus to stress test multimodal reward models comprehensively.
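
Given such triplets, a judge's score reduces to pairwise accuracy, reported per domain and overall. The sketch below shows one way to aggregate the decisions, reusing the hypothetical Triplet and judge_pair helpers from the earlier sketch.

    # Aggregate judge decisions into per-domain and overall pairwise accuracy.
    from collections import defaultdict

    def evaluate_judge(triplets, query_vlm_judge):
        correct, total = defaultdict(int), defaultdict(int)
        for t in triplets:
            total[t.domain] += 1
            if judge_pair(t, query_vlm_judge):  # from the sketch above
                correct[t.domain] += 1
        per_domain = {d: correct[d] / total[d] for d in total}
        overall = sum(correct.values()) / sum(total.values())
        return per_domain, overall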

Key Findings

Evaluating a range of state-of-the-art VLM judges on Multimodal RewardBench reveals several noteworthy findings:

  • Aggregate Performance: The best-performing models, such as Gemini 1.5 Pro and Claude 3.5 Sonnet, achieved an overall accuracy of approximately 72% on the benchmark. This indicates a significant gap between current automated assessments and human-level performance.
  • Domain-Specific Limitations: Models consistently struggle in the reasoning and safety domains, with shortcomings in multi-step computation, deductive logic, and robust detection of biased or toxic content.
  • Model Scaling Effects: Scaling model parameters—for example, moving from Llama-3.2-Vision-Instruct 11B to its 90B counterpart—improves accuracy overall, though the gains are uneven across the six domains.

These results emphasize that even top-tier VLM reward models require substantial improvements in reasoning robustness and safety mitigations to achieve performance parity with human evaluators.

Practical Implications

For practitioners and researchers, the implications of this benchmark are multifold:

  • Holistic Model Evaluation: Multimodal RewardBench offers a unified, rigorous testbed for evaluating deployed VLM systems. When integrating reward models into large-scale systems, the benchmark can serve as a diagnostic tool to identify and remediate specific performance bottlenecks (a simple diagnostic sketch follows this list).
  • Iterative Model Training: The benchmark's diverse domains indicate that optimizing reward signals might require tailored loss functions or separate sub-modules for reasoning and safety tasks. Techniques such as multi-task learning or domain-specific fine-tuning could be critical next steps.
  • Scaling Considerations: While larger models generally perform better, the uneven gains across categories call for careful trade-off analyses. Resource allocation should account for the fact that improvements in generation quality from scaling may not translate proportionally into better reasoning or safety judgments.
  • Deployment Strategies: For production VLM deployments, the benchmark underscores the importance of continuous monitoring using detailed testing in safety-critical domains. Post-deployment auditing mechanisms should leverage benchmarks like Multimodal RewardBench to mitigate risks associated with biased or toxic outputs.
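
As a concrete illustration of the diagnostic use mentioned above, the following minimal sketch flags weak domains from the hypothetical per-domain accuracies produced by the earlier evaluation sketch. The threshold and the example numbers are illustrative only, not values from the paper.

    # Flag domains whose judge accuracy falls below a chosen threshold, e.g. to
    # prioritize targeted fine-tuning or extra pre-deployment review.
    def flag_weak_domains(per_domain_accuracy, threshold=0.75):
        return sorted(d for d, acc in per_domain_accuracy.items() if acc < threshold)

    # Made-up numbers for illustration only.
    per_domain = {"correctness": 0.78, "preference": 0.76, "knowledge": 0.74,
                  "reasoning": 0.62, "safety": 0.60, "vqa": 0.77}
    print(flag_weak_domains(per_domain))  # ['knowledge', 'reasoning', 'safety']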

Future Directions

The benchmark provides a robust framework for future research, particularly in exploring:

  • Modular Architectures: Investigating approaches where separate modules are responsible for knowledge, reasoning, and safety could help address the performance deficits concentrated in the reasoning and safety domains.
  • Advanced Reward Modeling: Enhanced reward signals, perhaps leveraging reinforcement learning from human feedback (RLHF) tailored to the six domains, could drive further improvements (a generic pairwise preference loss is sketched after this list).
  • Cross-Domain Transfer: Techniques from domain adaptation and transfer learning might be employed to generalize improvements from one domain (e.g., general correctness) to more challenging areas like complex reasoning or safety.
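
To ground the reward-modeling direction, here is a minimal sketch of the standard pairwise (Bradley-Terry) preference loss commonly used to train reward models on (chosen, rejected) pairs. This is the generic formulation, not the paper's training recipe; the dummy scores stand in for a hypothetical reward_model(prompt, image, response) scorer.

    # Standard pairwise preference loss: L = -log sigmoid(r_chosen - r_rejected),
    # averaged over the batch. Requires PyTorch.
    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(chosen_scores: torch.Tensor,
                                 rejected_scores: torch.Tensor) -> torch.Tensor:
        return -F.logsigmoid(chosen_scores - rejected_scores).mean()

    # Dummy scores standing in for a reward model's outputs on triplets.
    chosen = torch.tensor([1.2, 0.7, 2.0])
    rejected = torch.tensor([0.3, 0.9, 1.1])
    print(pairwise_preference_loss(chosen, rejected))  # scalar loss tensor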

Conclusion

"Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision LLMs" systematically addresses the multifaceted challenges of evaluating VLM reward models by establishing a benchmark that combines detailed human annotations with diverse performance metrics. The comprehensive nature of the evaluation highlights clear strengths in current state-of-the-art systems while also identifying critical gaps in reasoning and safety. The benchmark's numerical results, particularly the 72% overall accuracy ceiling for top-performing models, serve as a benchmark for future improvements. This work thus provides a solid foundation for both refining existing reward models and guiding the development of next-generation VLMs with a focus on holistic performance metrics.
