Overview of NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
This paper introduces NaturalBench, a benchmark for evaluating vision-language models (VLMs) on what the authors term natural adversarial samples. The work is motivated by the observation that even state-of-the-art VLMs struggle with basic visio-linguistic tasks when confronted with these natural adversarial instances, prompting a critical reevaluation of what these models can actually do.
Key Contributions
The paper's central contribution is NaturalBench itself, a benchmark built with a semi-automated pipeline that reduces human curation effort. It contains 10,000 human-verified VQA samples. NaturalBench challenges models by pairing each question with two images that yield different answers, compelling models to actually use the visual input rather than rely on language biases or shortcuts; a sketch of this paired-scoring idea follows below.
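To make the pairing requirement concrete, here is a minimal sketch of how such paired samples could be represented and scored so that an image-blind answer can never earn credit. The sample schema, the `predict` interface, and the scoring rule are illustrative assumptions, not the authors' exact data format or reported metrics.

```python
from dataclasses import dataclass


@dataclass
class PairedVQASample:
    # One benchmark entry: a single question attached to two images whose
    # ground-truth answers differ (illustrative format, not the authors' schema).
    question: str
    image_paths: tuple[str, str]   # two images for the same question
    answers: tuple[str, str]       # ground-truth answers, one per image (they differ)


def paired_accuracy(samples, predict):
    """Credit a sample only when the model answers correctly for *both* images,
    so an image-blind reply (the same answer for either image) can never score.

    `predict(image_path, question) -> str` is any VLM wrapper supplied by the
    caller (a hypothetical interface)."""
    solved = 0
    for s in samples:
        correct = [
            predict(img, s.question).strip().lower() == ans.strip().lower()
            for img, ans in zip(s.image_paths, s.answers)
        ]
        solved += all(correct)
    return solved / len(samples) if samples else 0.0
```

Because the two ground-truth answers differ, a model that ignores the image and answers from the question alone is wrong on at least one of the two cases and earns no credit under this scoring rule.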
Methodology
The collection pipeline integrates off-the-shelf models such as CLIP and ChatGPT to generate VQA samples from existing image-text datasets like Flickr30K and DOCCI. Crucially, images are paired by semi-automatically identifying confounding image-text pairs that current VLMs find challenging, which keeps the benchmark's difficulty aligned with the models' present limitations.
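As a rough illustration of the confounding-pair idea, the sketch below uses an off-the-shelf CLIP checkpoint (via Hugging Face `transformers`) to flag a pair of image-text samples that CLIP fails to match correctly. The checkpoint name, the pairwise interface, and the exact failure criterion are assumptions for illustration; the paper's actual pipeline additionally uses ChatGPT to generate the questions and relies on human verification.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any off-the-shelf CLIP checkpoint works for this sketch.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def is_confounding_pair(image_a, caption_a, image_b, caption_b):
    """Flag two image-text samples as 'confounding' when CLIP matches at least
    one image to the wrong caption (a simplified stand-in for the paper's
    retrieval-failure criterion)."""
    inputs = processor(
        text=[caption_a, caption_b],
        images=[image_a, image_b],
        return_tensors="pt",
        padding=True,
    )
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (2 images, 2 captions)
    # Image i should score highest with caption i; any mismatch means CLIP is confused.
    return bool((logits.argmax(dim=1) != torch.arange(2)).any())


# Example usage with placeholder file names and captions (replace with real data):
# confusing = is_confounding_pair(Image.open("a.jpg"), "a red car on a street",
#                                 Image.open("b.jpg"), "a blue car in a garage")
```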
Evaluation and Findings
The authors evaluate 53 state-of-the-art VLMs on NaturalBench and find a significant gap between model and human performance, with models trailing human accuracy by 50%-70%. Their analysis attributes much of this gap to biases toward particular answers and to difficulties with compositionality, including skills such as attribute binding and complex reasoning.
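One simple way to probe the answer-bias issue is to measure how often a model returns the same answer for a question regardless of which of its two paired images it sees; since the ground-truth answers differ by construction, an image-invariant answer guarantees at least one error. The sketch below is a hypothetical diagnostic in that spirit, not the paper's own analysis, and the `predictions` input format is an assumption.

```python
from collections import Counter


def answer_bias_report(predictions):
    """Summarize reliance on language priors.

    `predictions` maps question -> (answer_on_image_1, answer_on_image_2);
    this input format is assumed for illustration. Because the two ground-truth
    answers differ by construction, an identical prediction for both images
    guarantees at least one error on that question."""
    invariant = sum(
        1 for a1, a2 in predictions.values()
        if a1.strip().lower() == a2.strip().lower()
    )
    top_answers = Counter(
        a.strip().lower() for pair in predictions.values() for a in pair
    ).most_common(3)
    return {
        "fraction_image_invariant": invariant / len(predictions) if predictions else 0.0,
        "top_answers": top_answers,  # a heavy skew toward e.g. "yes" suggests answer bias
    }
```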
Implications and Future Developments
NaturalBench has both practical and theoretical implications. Practically, it serves as a testbed for improving VLMs by exposing specific shortcomings. Theoretically, it raises the question of how much these models genuinely understand versus how much they exploit language priors.
The benchmark sets the stage for models that rely less on language shortcuts and are more robustly grounded in visual context. Future work might adopt dynamic benchmarking protocols that evolve alongside the models, keeping the evaluation challenging and relevant.
Conclusion
NaturalBench marks a significant stride in vision-language evaluation. By emphasizing a balanced, compositionally challenging dataset, it provides a more rigorous and nuanced measure of VLM capabilities. The benchmark not only exposes current challenges in VLM development but also points to future research directions, particularly mitigating answer biases and strengthening compositional reasoning. The scalability of the collection method further opens the door to dynamic evaluation, driving VLMs toward more authentic visio-linguistic comprehension.