An Overview of UniBench: A Unified Framework for Vision-Language Model Evaluation
The paper "UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling," authored by Haider Al-Tahan et al., addresses critical issues in the evaluation of vision-language models (VLMs). With the rapid development of VLMs capable of performing a wide array of multimodal tasks, the landscape for evaluating these models has become fragmented and computationally burdensome. The authors introduce UniBench, a unified implementation of over 50 benchmarks that aims to streamline evaluation across a broad range of model capabilities.
Key Contributions of the Paper
Unified Benchmarking Framework
The primary contribution of the paper is the introduction of UniBench, a comprehensive benchmark suite that encapsulates more than 50 VLM benchmarks. These benchmarks span a wide range of capabilities, including object recognition, spatial awareness, and reasoning. The benchmarks are grouped into seven types and further broken down into seventeen finer-grained capabilities, a systematic categorization that allows researchers to pinpoint model strengths and weaknesses efficiently; a sketch of how such a taxonomy can be represented is given below.
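To make the categorization concrete, here is a minimal Python sketch of a two-level benchmark taxonomy that can aggregate scores either by coarse benchmark type or by finer-grained capability. The category and capability labels below are illustrative placeholders drawn from the summary, not the paper's exact taxonomy.

```python
# Illustrative two-level taxonomy: each benchmark carries a coarse type and a
# finer-grained capability, so per-benchmark scores can be rolled up at either level.
# Labels here are placeholders, not UniBench's exact names.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkInfo:
    name: str
    benchmark_type: str   # one of the seven coarse types
    capability: str       # one of the seventeen finer-grained capabilities

BENCHMARKS = [
    BenchmarkInfo("imagenet", "object recognition", "classification"),
    BenchmarkInfo("mnist", "non-natural images", "digit recognition"),
    BenchmarkInfo("countbench", "reasoning", "counting"),
    BenchmarkInfo("vg_relation", "relation", "relational understanding"),
]

def aggregate_by(level: str, scores: dict[str, float]) -> dict[str, float]:
    """Average per-benchmark scores up to the requested taxonomy level."""
    buckets = defaultdict(list)
    for b in BENCHMARKS:
        if b.name in scores:
            buckets[getattr(b, level)].append(scores[b.name])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

# Example: aggregate hypothetical per-benchmark accuracies by coarse benchmark type.
print(aggregate_by("benchmark_type", {"imagenet": 0.75, "mnist": 0.40, "countbench": 0.32}))
```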
Evaluation of Nearly 60 VLMs
The authors evaluated 59 publicly available VLMs, covering various architectures, model sizes, and training dataset scales. The evaluation highlights that while scaling models and datasets can improve many aspects of VLM performance, it shows limited benefits for reasoning and relational tasks. Notably, contemporary VLMs struggle with basic tasks like digit recognition and counting, which simpler networks can solve easily.
Practical Implications of Scaling
One of the significant findings from the evaluations is that scaling either the model size or the training dataset offers diminishing returns for certain capabilities, particularly reasoning and relational understanding. For instance, scaling training data size up to 12.8B samples or model size up to 1B parameters does not substantially enhance performance on visual relations and reasoning benchmarks.
Recommendations for Practitioners
The paper offers practical recommendations for selecting suitable VLMs depending on the application. For general-purpose tasks, large models like EVA ViT-E/14 are suggested, whereas models like NegCLIP, which use tailored learning objectives involving hard negatives, excel at relational understanding tasks.
Numerical Results and Observations
- The evaluation reveals that some top-performing models trained on 2 billion samples offer better performance than models trained on much larger datasets of up to 12.8 billion samples.
- Despite the availability of substantial training data, many VLMs perform poorly on benchmarks like MNIST and SVHN. For example, simple 2-layer MLPs achieve 99% accuracy on MNIST, outperforming sophisticated VLMs (see the sketch after this list).
- Correlation analysis shows that ImageNet performance does not universally translate to proficiency on other tasks; for 18 of the 53 evaluated benchmarks, ImageNet accuracy is only weakly predictive of performance.
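The MNIST comparison above can be reproduced in spirit with a few lines of PyTorch. The sketch below trains a small two-layer MLP on raw pixels; the exact architecture and hyperparameters are assumptions rather than the paper's baseline, and with a few epochs it typically lands in the high-90s of test accuracy, approaching the figure cited above.

```python
# Minimal two-layer MLP baseline for MNIST (sketch; hyperparameters are illustrative).
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_ds = datasets.MNIST("data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST("data", train=False, download=True, transform=transform)
train_dl = DataLoader(train_ds, batch_size=128, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=256)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):  # a handful of epochs is enough to approach ceiling accuracy
    for x, y in train_dl:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

with torch.no_grad():
    correct = sum((model(x).argmax(1) == y).sum().item() for x, y in test_dl)
print(f"test accuracy: {correct / len(test_ds):.4f}")
```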
Implications and Future Directions
Data Quality Over Data Quantity
The paper suggests that data quality plays a more crucial role than sheer quantity in model performance. The highest-performing models were often those trained on datasets with more stringent filtering criteria, highlighting the importance of data curation.
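The summary does not specify which filtering criteria the best-curated training sets used. As one common example of quality-based curation (in the spirit of CLIP-score filtering used by efforts such as DataComp, swapped in here purely for illustration), the sketch below keeps only image-text pairs whose embedding similarity clears a threshold; the model choice and threshold are assumptions.

```python
# Illustrative quality filter: keep image-text pairs whose CLIP similarity exceeds a
# threshold. This is one common curation heuristic, not the paper's actual recipe.
import torch
import open_clip  # assumes the open_clip_torch package is installed

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def keep_pair(image, caption, threshold=0.28):  # threshold is an illustrative value
    """Return True if the image-caption pair looks well aligned."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0))
        txt = model.encode_text(tokenizer([caption]))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        return (img @ txt.T).item() > threshold
```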
Tailored Learning Objectives
Learning objectives tailored to specific capabilities can offer significant improvements. For instance, NegCLIP's hard-negative objective enables it to outperform much larger models on relational tasks, which argues for rethinking model training around tailored objectives rather than scale alone. A sketch of such an objective is given below.
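To illustrate what a hard-negative objective can look like, the sketch below extends a standard CLIP-style contrastive loss with extra "hard negative" captions (e.g., captions with swapped word order) appended to the text batch. This follows the general idea behind NegCLIP, but the implementation details here are assumptions, not the authors' exact code.

```python
# Sketch of a CLIP-style contrastive loss with hard-negative captions appended to the
# text batch. Each image has one positive caption plus K hard negatives that are
# fluent text but describe the wrong relation (e.g., shuffled word order).
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(img_emb, txt_emb, hard_neg_emb, temperature=0.07):
    """
    img_emb:      (N, D) image embeddings
    txt_emb:      (N, D) matching caption embeddings (positives on the diagonal)
    hard_neg_emb: (N*K, D) hard-negative caption embeddings (never positives)
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(torch.cat([txt_emb, hard_neg_emb], dim=0), dim=-1)

    logits = img @ txt.T / temperature           # (N, N + N*K)
    targets = torch.arange(img.size(0))          # caption i is the positive for image i

    # Image-to-text loss sees the hard negatives; text-to-image uses only real captions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits[:, : img.size(0)].T, targets)
    return (loss_i2t + loss_t2i) / 2

# Example with random embeddings (2 hard negatives per image).
N, D, K = 8, 512, 2
loss = contrastive_loss_with_hard_negatives(
    torch.randn(N, D), torch.randn(N, D), torch.randn(N * K, D)
)
```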
Efficient and Comprehensive Evaluation
By distilling the full suite into a representative set of benchmarks that runs in under 5 minutes on a single A100 GPU, UniBench makes comprehensive VLM evaluation accessible to a much wider range of researchers. This practical approach allows for faster and more meaningful assessments of VLMs.
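As a hypothetical sketch of what running such a distilled evaluation might look like in code, the loop below scores one model on a small set of benchmark handlers. The class and method names are illustrative assumptions, not UniBench's confirmed API.

```python
# Hypothetical sketch of evaluating one model on a distilled benchmark subset.
# Names like bench.dataloader and model.predict are illustrative, not UniBench's API.
def evaluate_model(model, benchmarks, device="cuda"):
    """Run a small, representative benchmark set and return per-benchmark accuracies."""
    scores = {}
    for bench in benchmarks:                           # e.g., one benchmark per capability
        correct, total = 0, 0
        for images, labels in bench.dataloader(batch_size=256):
            preds = model.predict(images.to(device))   # zero-shot classification, etc.
            correct += (preds.cpu() == labels).sum().item()
            total += labels.numel()
        scores[bench.name] = correct / total
    return scores
```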
Conclusion
The paper by Al-Tahan et al. makes a significant contribution to the field of VLM evaluation by introducing UniBench, a unified benchmarking framework that addresses the fragmented and computationally intensive landscape of current evaluations. The comprehensive analysis of nearly 60 VLMs sheds light on the limitations of scaling and underlines the importance of data quality and tailored learning objectives. UniBench stands as a practical tool for researchers to systematically and efficiently gauge the progress of VLMs, uncovering both strengths and areas needing improvement. Moving forward, the insights from this paper can propel the development of more robust and diverse VLMs, better equipped to handle a broader range of real-world tasks.