VHELM: A Holistic Evaluation of Vision Language Models (2410.07112v2)

Published 9 Oct 2024 in cs.CV and cs.AI

Abstract: Current benchmarks for assessing vision-language models (VLMs) often focus on their perception or problem-solving capabilities and neglect other critical aspects such as fairness, multilinguality, or toxicity. Furthermore, they differ in their evaluation procedures and the scope of the evaluation, making it difficult to compare models. To address these issues, we extend the HELM framework to VLMs to present the Holistic Evaluation of Vision Language Models (VHELM). VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. In doing so, we produce a comprehensive, multi-dimensional view of the capabilities of the VLMs across these important factors. In addition, we standardize the inference parameters, methods of prompting, and evaluation metrics to enable fair comparisons across models. Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast. Our initial run evaluates 22 VLMs on 21 existing datasets to provide a holistic snapshot of the models. We uncover new key findings, such as the fact that efficiency-focused models (e.g., Claude 3 Haiku or Gemini 1.5 Flash) perform significantly worse than their full models (e.g., Claude 3 Opus or Gemini 1.5 Pro) on the bias benchmark but not when evaluated on the other aspects. For transparency, we release the raw model generations and complete results on our website (https://crfm.stanford.edu/helm/vhelm/v2.0.1). VHELM is intended to be a living benchmark, and we hope to continue adding new datasets and models over time.

Authors (11)
  1. Tony Lee
  2. Haoqin Tu
  3. Chi Heem Wong
  4. Wenhao Zheng
  5. Yiyang Zhou
  6. Yifan Mai
  7. Josselin Somerville Roberts
  8. Michihiro Yasunaga
  9. Huaxiu Yao
  10. Cihang Xie
  11. Percy Liang

Summary

A Holistic Examination of Vision-Language Model Evaluation through VHELM

The paper presents VHELM (Holistic Evaluation of Vision Language Models), a comprehensive framework for evaluating vision-language models (VLMs) across multiple dimensions. VHELM extends the HELM framework, originally developed for language models, to address the challenges specific to evaluating VLMs.
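To make the aggregation concrete: each dataset is tagged with one or more aspects, and a model's holistic profile is its mean score per aspect. The following Python sketch illustrates this shape only; the dataset-to-aspect mapping, names, and scores below are placeholders, not VHELM's actual code or full mapping.

```python
from collections import defaultdict
from statistics import mean

# Illustrative mapping from datasets to the aspects they probe.
# Placeholder entries only; VHELM's full mapping covers 21 datasets
# across the 9 aspects.
DATASET_ASPECTS = {
    "flickr30k": ["visual_perception"],
    "a_okvqa": ["knowledge"],
    "gqa": ["reasoning"],
    "bias_benchmark": ["bias", "fairness"],
}

def aspect_profile(dataset_scores: dict[str, float]) -> dict[str, float]:
    """Aggregate per-dataset scores into a per-aspect mean profile."""
    by_aspect = defaultdict(list)
    for dataset, score in dataset_scores.items():
        for aspect in DATASET_ASPECTS.get(dataset, []):
            by_aspect[aspect].append(score)
    return {aspect: mean(vals) for aspect, vals in by_aspect.items()}

# Made-up scores for a single model:
print(aspect_profile({"flickr30k": 0.81, "a_okvqa": 0.74, "gqa": 0.69}))
# {'visual_perception': 0.81, 'knowledge': 0.74, 'reasoning': 0.69}
```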

Core Contributions

  1. Multidimensional Evaluation Aspects: VHELM defines nine critical aspects for evaluating VLMs: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety. This approach ensures a nuanced assessment covering both technical and societal considerations.
  2. Standardized Evaluation Procedures: Through the aggregation of 21 datasets, VHELM standardizes the evaluation procedure: consistent inference parameters, prompting methods, and metrics enable fair comparisons across the 22 evaluated VLMs (a minimal sketch of this standardization follows this list).
  3. Findings and Benchmarking: Initial evaluations reveal key insights such as the inadequate performance of efficiency-focused models on bias benchmarks relative to their full counterparts, as well as discrepancies between closed-API and open-weight models.
  4. Transparency and Accessibility: The framework outputs, including model generations and detailed results, are publicly accessible, fostering an environment of transparency and reproducibility.
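As referenced in item 2, here is a minimal sketch of what standardizing inference parameters and prompting could look like. The parameter values, the prompt template, and the `model.generate` interface are assumptions made for illustration; they are not VHELM's actual configuration or API.

```python
from dataclasses import dataclass

# One fixed decoding configuration applied to every model under test.
# Values are illustrative assumptions, not VHELM's actual settings.
@dataclass(frozen=True)
class InferenceConfig:
    temperature: float = 0.0   # deterministic decoding for reproducibility
    max_tokens: int = 300
    num_completions: int = 1

# One prompt template shared by all models (hypothetical wording).
PROMPT_TEMPLATE = (
    "Answer the question about the image.\n"
    "Question: {question}\nAnswer:"
)

def evaluate(models, instances, config=InferenceConfig()):
    """Query every model with identical prompts and decoding parameters."""
    results = {}
    for model in models:
        results[model.name] = [
            model.generate(  # hypothetical VLM interface
                prompt=PROMPT_TEMPLATE.format(question=inst["question"]),
                image=inst["image"],
                temperature=config.temperature,
                max_tokens=config.max_tokens,
            )
            for inst in instances
        ]
    return results
```

Holding prompts and decoding parameters fixed is what makes the resulting scores attributable to the models rather than to prompt engineering or sampling differences.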

Strong Numerical Results and Findings

  • Visual Perception: Models demonstrated varying strengths, with some achieving strong scores on image captioning tasks on datasets like Flickr30k. However, disparities remain, particularly for less common or out-of-distribution images.
  • Knowledge and Reasoning: Evaluation on knowledge benchmarks such as A-OKVQA and reasoning tasks like GQA highlighted GPT-4o's lead in terms of win rates (see the sketch after this list), although it still trails in the bias assessment.
  • Bias and Fairness: The marked underperformance of efficiency-focused models on bias benchmarks underscores a critical gap in current VLM capabilities and points to the need for dedicated bias mitigation efforts.
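Win rates of the kind reported above are typically computed HELM-style: for each benchmark, the fraction of other models a given model outperforms, averaged across benchmarks. The sketch below is an illustrative implementation, not the paper's code; in particular, the tie-handling rule (counting a tie as half a win) is an assumption.

```python
from statistics import mean

def mean_win_rate(scores: dict[str, dict[str, float]], model: str) -> float:
    """
    Mean win rate: for each benchmark, the fraction of other models
    this model outperforms, averaged over benchmarks.
    `scores[benchmark][model]` holds a higher-is-better score.
    """
    win_rates = []
    for per_model in scores.values():
        others = [m for m in per_model if m != model]
        if not others:
            continue
        wins = sum(
            1.0 if per_model[model] > per_model[o]
            else 0.5 if per_model[model] == per_model[o]  # tie: half a win
            else 0.0
            for o in others
        )
        win_rates.append(wins / len(others))
    return mean(win_rates)

# Illustrative (made-up) scores on two benchmarks:
scores = {
    "a_okvqa": {"model_a": 0.81, "model_b": 0.74, "model_c": 0.69},
    "gqa":     {"model_a": 0.70, "model_b": 0.72, "model_c": 0.65},
}
print(mean_win_rate(scores, "model_a"))  # 1.0 on a_okvqa, 0.5 on gqa -> 0.75
```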

Implications and Future Directions

The adoption of VHELM not only sets a precedent for comprehensive VLM assessment but also directs future research towards addressing identified weaknesses such as model biases and lack of robustness. The findings point to potential improvements in instruction fine-tuning, particularly for open-weight models, to enhance their efficacy and alignment with intended behaviors.

From a broader perspective, VHELM contributes to the ongoing dialogue on the ethical deployment of AI models. By emphasizing aspects such as fairness and toxicity, it articulates the societal impact of AI technologies, prompting developers and policymakers to consider these dimensions in the deployment of VLMs.

Conclusion

VHELM stands out as a rigorous framework that systematically evaluates VLMs across multiple dimensions, ensuring a balanced assessment of their capabilities and limitations. Despite the effective benchmarking it already provides, continuous iteration and augmentation of the framework will be necessary to address evolving challenges in AI model evaluation, particularly multilingual support and safety in adversarial settings. Such advances will be pivotal in fostering the responsible development and deployment of vision-language models in real-world applications.