Evaluating Vision-LLMs in Real-World Scenarios: WildVision
The paper, "WildVision: Evaluating Vision-LLMs in the Wild with Human Preferences," introduces a comprehensive framework designed to assess the performance of Vision-LLMs (VLMs) through real-world interactions, reflecting human preferences and challenges. The authors propose two main components—WildVision-Arena (WV-Arena) and WildVision-Bench (WV-Bench)—to facilitate this evaluation. This essay provides an in-depth analysis of these contributions and their implications, backed by quantitative and qualitative findings from the paper.
Framework Overview
WildVision-Arena
WildVision-Arena is an interactive platform where users engage with more than 20 VLMs through multimodal conversations. The environment uses a chatbot-style interface in which users upload images, ask questions, and receive responses from different models. User preferences are captured as pairwise votes, which feed into an Elo rating system that ranks the models dynamically. The platform has amassed over 20,000 multi-round human-AI interactions and more than 8,000 votes, yielding a robust dataset for analysis.
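At its core, the arena turns pairwise votes into ratings. The short Python sketch below illustrates one way such an online Elo update could work; the starting rating, K-factor, and model names are illustrative assumptions, not details taken from the paper's implementation.

```python
# Minimal sketch of an online Elo update driven by pairwise votes.
# The K-factor, starting rating, and model names are illustrative assumptions.
from collections import defaultdict

K = 32  # update step size; leaderboards often tune this for stability

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, model_a: str, model_b: str, outcome: float) -> None:
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

ratings = defaultdict(lambda: 1000.0)  # every model starts from the same baseline
votes = [("model_a", "model_b", 1.0), ("model_b", "model_c", 0.5)]  # hypothetical votes
for a, b, result in votes:
    update_elo(ratings, a, b, result)
print(dict(ratings))
```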
WildVision-Bench
To supplement the dynamic evaluations from WV-Arena, the authors curate a static benchmark, WildVision-Bench, comprising 500 high-quality samples from the arena. The benchmark uses GPT-4o as an automatic judge, comparing each model's responses against Claude-3-Sonnet as the reference. The resulting rankings show a high Spearman correlation (0.94) with the arena's Elo ratings, validating the benchmark's alignment with human preferences.
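To make this validation step concrete, the sketch below computes a Spearman rank correlation between arena Elo ratings and benchmark win rates with SciPy; the scores are hypothetical placeholders, not numbers from the paper.

```python
# Sketch of checking rank agreement between a static benchmark and arena Elo.
# All scores below are hypothetical placeholders, not the paper's results.
from scipy.stats import spearmanr

arena_elo      = {"model_a": 1250, "model_b": 1180, "model_c": 1105, "model_d": 1040}
bench_win_rate = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.62, "model_d": 0.55}

models = list(arena_elo)
rho, p_value = spearmanr([arena_elo[m] for m in models],
                         [bench_win_rate[m] for m in models])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")  # rho near 1.0 means the rankings agree
```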
Detailed Analysis
Human Preferences and Model Performance
The paper thoroughly examines the collected interactions, surfacing critical insights into the performance and limitations of current VLMs. Notably, it highlights that while models like GPT-4V excel at basic visual recognition and reasoning tasks, they often struggle with contextual subtleties, spatial reasoning, and domain-specific knowledge. Hallucination and safety failures, particularly when models are intentionally provoked, are also prevalent.
Model Ranking and Elo System
The ranking system in WV-Arena adopts the Elo rating system, which is well suited to continuous, comparative evaluation. Statistical estimation with the Bradley–Terry model provides stable rankings despite the fluctuating nature of user interactions. The results show GPT-4o leading the rankings by a significant margin, followed by GPT-4V and other models such as Reka-Flash and Claude-3-Opus.
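For concreteness, the following sketch fits Bradley–Terry strengths from a toy matrix of pairwise win counts using the standard iterative (minorization-maximization) update and maps them onto an Elo-like scale; the win counts and the rescaling constants are illustrative assumptions, not the paper's data or code.

```python
# Minimal Bradley-Terry fit from pairwise win counts (minorization-maximization).
# The win matrix and the Elo-like rescaling are illustrative assumptions.
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 1000, tol: float = 1e-8) -> np.ndarray:
    """wins[i, j] = number of times model i beat model j; returns normalized strengths."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T  # total comparisons for each pair
    for _ in range(iters):
        new_p = np.empty(n)
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            new_p[i] = wins[i].sum() / denom
        new_p /= new_p.sum()  # strengths are only identified up to scale
        if np.max(np.abs(new_p - p)) < tol:
            return new_p
        p = new_p
    return p

wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])  # hypothetical head-to-head win counts for three models
strengths = fit_bradley_terry(wins)
elo_like = 400 * np.log10(strengths / strengths.mean()) + 1000  # readability rescaling only
print(np.round(elo_like))
```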
Evaluation Metrics and Alignment
The authors employ automatic evaluations using GPT-4o as the judge on WV-Bench to ensure fast and consistent assessments. These evaluations align closely with human preferences, as evidenced by the high Spearman correlation. The analysis also breaks down model performance across question categories and image domains, providing granular insight into model strengths and weaknesses.
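A minimal sketch of such a pairwise judging call is given below, assuming the OpenAI Python client; the prompt wording, function name, and verdict format are illustrative, not the paper's exact judge template.

```python
# Hedged sketch of pairwise LLM-as-a-judge scoring for image-grounded questions.
# The prompt and verdict format are illustrative, not the paper's judge template.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are comparing two assistant answers to the same image-grounded question.\n"
    "Question: {question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
    "Reply with 'A', 'B', or 'tie' for whichever answer is more helpful and accurate."
)

def judge_pair(question: str, image_url: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4o to pick the better of two answers; returns its raw verdict string."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic judging for consistency
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": JUDGE_PROMPT.format(
                    question=question, answer_a=answer_a, answer_b=answer_b)},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```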
Practical and Theoretical Implications
Real-World Applicability
The framework's real-world applicability is a significant stride towards understanding how VLMs perform outside controlled environments. By using a diverse range of user inputs and real-world images, the paper provides a more realistic evaluation of model capabilities. This approach bridges the gap between laboratory benchmarks and everyday use cases, offering valuable insights for both development and deployment.
Future Directions
Looking ahead, the research emphasizes improving model robustness in handling complex visual and contextual information. Given the frequent failures in expert-domain knowledge and the prevalence of hallucinations, future work may focus on integrating more sophisticated reasoning and safety mechanisms into VLMs. Expanding the evaluations to high-resolution, multi-image, and extended-context scenarios can further enrich the assessment framework.
Conclusion
The paper presents a comprehensive methodology for evaluating VLMs through real-world scenarios and human preferences. WildVision-Arena and WildVision-Bench together offer a dynamic and static evaluation environment, respectively, ensuring robust and human-aligned performance assessments. The extensive data analysis and transparent reporting of model limitations provide actionable insights for future research and development in the field of vision-language processing. As these evaluation frameworks evolve, they promise to significantly advance the understanding and improvement of VLMs, aligning them more closely with real-world applications and human expectations.