
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models (2411.17451v2)

Published 26 Nov 2024 in cs.CV and cs.CL

Abstract: Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline that combines sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe VL-GenRMs' limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for advancing VL-GenRMs.

Summary

  • The paper introduces VL-RewardBench, a benchmark of 1,250 curated examples designed to rigorously evaluate vision-language generative reward models.
  • It reveals significant limitations in current models, with GPT-4o achieving only 65.4% accuracy and open-source models performing near chance levels.
  • Benchmark accuracy correlates strongly (Pearson's r > 0.9) with downstream MMMU-Pro accuracy under Best-of-N sampling, underscoring its practical value for advancing multimodal AI research.

Analyzing VL-RewardBench: A Benchmark for Evaluating Vision-Language Generative Reward Models

The paper "VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models" addresses the critical gap in evaluating vision-language generative reward models (VL-GenRMs). With the increasing deployment of Large Vision-Language Models (LVLMs) as generative reward models to automatically assess model responses, reliable benchmarks become indispensable for ensuring model alignment and advancing multimodal AI systems. This paper introduces VL-RewardBench as a comprehensive evaluation tool designed to challenge existing VL-GenRMs and expose shortcomings in their capability to process and align multimodal information.

Core Contributions and Findings

VL-RewardBench is structured around the need to provide a rigorous test for VL-GenRMs. It comprises 1,250 curated examples that aim to circumvent limitations within existing assessments, which often rely on AI-annotated labels that are prone to bias. The authors highlight the shortcomings of current models by evaluating them across a range of tasks divided into general multimodal queries, visual hallucination detection, and complex reasoning challenges.
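
To make this protocol concrete, the sketch below shows how a VL-GenRM's pairwise judgments could be scored against the benchmark's human-verified preference labels. The `query_vl_genrm` callable and the example field names are illustrative assumptions rather than the benchmark's actual API or data schema.

```python
# Minimal sketch of pairwise preference judging, assuming a hypothetical
# query_vl_genrm(image=..., prompt=...) -> str wrapper around an LVLM API.
# Field names on `example` are illustrative, not the benchmark's real schema.

def judge_pair(query_vl_genrm, example) -> bool:
    """Ask the VL-GenRM which candidate response is better and check whether
    its choice matches the human-verified preference label ("A" or "B")."""
    prompt = (
        "Given the image and question, decide which answer is better.\n"
        f"Question: {example['question']}\n"
        f"Answer A: {example['response_a']}\n"
        f"Answer B: {example['response_b']}\n"
        "Reply with a single letter: A or B."
    )
    choice = query_vl_genrm(image=example["image"], prompt=prompt).strip().upper()[:1]
    return choice == example["preferred"]


def benchmark_accuracy(query_vl_genrm, examples) -> float:
    """Fraction of curated pairs on which the model agrees with the human
    annotators; random guessing sits near 50% under this pairwise protocol."""
    correct = sum(judge_pair(query_vl_genrm, ex) for ex in examples)
    return correct / len(examples)
```

Under this pairwise protocol, chance performance is roughly 50%, which is the baseline against which the reported model scores should be read.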

The experimental results from the benchmark underscore several noteworthy findings:

  • Even advanced models face significant challenges with the benchmark, as exemplified by GPT-4o obtaining only 65.4% accuracy.
  • Open-source models, such as Qwen2-VL-72B, perform near chance levels, which indicates the demanding nature of the benchmark tasks.
  • A strong correlation (Pearson's r > 0.9) between VL-RewardBench accuracy and downstream MMMU-Pro accuracy, obtained when the same models serve as judges for Best-of-N sampling, suggests the benchmark is a practical proxy for real-world reward-model quality (see the sketch after this list).
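
The Best-of-N connection can be illustrated with a short, hedged sketch: the VL-GenRM scores N sampled answers from a policy model, the top-scored answer is kept, and the per-model downstream accuracy obtained this way is correlated with VL-RewardBench accuracy. Here `generate` and `score` are hypothetical callables, and the listed numbers are placeholders rather than the paper's measurements.

```python
from statistics import correlation  # Pearson's r (Python 3.10+)

def best_of_n(generate, score, image, question, n=8):
    """Sample n candidate answers from a policy model and keep the one the
    VL-GenRM judge scores highest. `generate` and `score` are assumed callables."""
    candidates = [generate(image, question) for _ in range(n)]
    return max(candidates, key=lambda ans: score(image, question, ans))

# Placeholder per-model numbers (not the paper's data): x = VL-RewardBench
# accuracy, y = MMMU-Pro accuracy when that model selects the Best-of-N answer.
x = [0.43, 0.51, 0.58, 0.65]
y = [0.39, 0.46, 0.55, 0.61]
print(f"Pearson's r = {correlation(x, y):.2f}")
```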

Implications and Future Directions

The implications of VL-RewardBench extend into both the practical and theoretical domains of AI development. Practically, it provides a reliable yardstick for assessing VL-GenRMs across varied challenges before they are deployed as judges or reward signals. Theoretically, it pinpoints where VL-GenRMs most need to improve: failures stem chiefly from basic visual perception rather than from reasoning.

The research suggests several future pathways:

  • Enhancing models' capacity to judge attribute and object-existence claims through improved training methodology and a focus on bias reduction.
  • Investigating model architectures that can leverage inference-time scaling more effectively, since its benefits currently vary sharply with model capacity.
  • Leveraging critic training to boost judgment accuracy, as shown by the +14.7% gain exhibited by LLaVA-OneVision-7B with such training (a data-construction sketch follows this list).
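
As a rough illustration of the "learning to judge" direction, preference pairs can be converted into instruction-style training instances whose target is the preferred answer's label; the schema below is an assumption made for illustration, not the paper's actual training format.

```python
# Sketch of building "learn to judge" training instances from preference pairs.
# The dict schema is an illustrative assumption, not the paper's data format.

def to_judge_instance(pair: dict) -> dict:
    prompt = (
        "Compare the two answers to the question about the image and state "
        "which one is better.\n"
        f"Question: {pair['question']}\n"
        f"Answer A: {pair['response_a']}\n"
        f"Answer B: {pair['response_b']}\n"
        "Reply with A or B."
    )
    # Supervision signal: the human-verified preferred answer ("A" or "B").
    return {"image": pair["image"], "prompt": prompt, "target": pair["preferred"]}
```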

Conclusion

VL-RewardBench stands as a valuable tool for the vision-language research community, providing a rigorous environment to vet current VL-GenRMs and paving the way for the development of more resilient models. The benchmark emphasizes the complex interplay between visual perception and reasoning tasks, pushing the boundary for what is considered adequate model performance. Continued refinement and utilization of comprehensive benchmarks like VL-RewardBench will prove crucial in developing the next generation of vision-language models capable of sophisticated understanding and reasoning across multimodal inputs.