- The paper introduces VLMEvalKit, a unified toolkit that automates data preparation, distributed inference, and metric calculation for evaluating LMMs.
- The toolkit supports over 70 models and 20 benchmarks, revealing that commercial APIs often outperform open-source models in complex, knowledge-intensive tasks.
- VLMEvalKit provides a modular, reproducible framework that simplifies integrating new benchmarks and modalities, advancing multi-modal AI research.
The paper "VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models" introduces VLMEvalKit, a comprehensive and user-friendly toolkit for assessing the performance of Large Multi-Modality Models (LMMs). The toolkit is based on PyTorch and is freely available on GitHub under the Apache 2.0 License. The authors, affiliated with various renowned institutions including the Shanghai AI Laboratory and CUHK, aim to aid researchers and developers in evaluating and publishing reproducible results for LMMs.
Objectives and Contributions
The primary objective of VLMEvalKit is to streamline the evaluation process for LMMs across a broad range of benchmarks and tasks. The toolkit currently supports over 70 different models—comprising both proprietary APIs and open-source models—and more than 20 multi-modal benchmarks. This extensive support is facilitated through a straightforward interface that simplifies the addition of new models and benchmarks.
Key contributions of VLMEvalKit include:
- Providing a unified interface for model integration.
- Automating tasks such as data preparation, distributed inference, prediction post-processing, and metric calculation (a simplified end-to-end sketch follows this list).
- Maintaining a comprehensive leaderboard, OpenVLM Leaderboard, to track progress in the field.
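To make the automated workflow concrete, the sketch below walks through the loop the toolkit manages: loading benchmark samples from a `.tsv` file, constructing prompts, running inference, and scoring predictions. This is a minimal illustration only; the function names, the column names, and the naive answer matching are our assumptions for readability, not VLMEvalKit's actual internals, and the only interface assumed from the toolkit is a model object exposing a `generate()` method.

```python
import csv

def load_benchmark_tsv(path):
    """Read benchmark samples from a .tsv file (hypothetical loader, not the toolkit's own)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

def evaluate(model, samples):
    """Sketch of the automated loop: prompt construction -> inference -> matching -> metric."""
    correct = 0
    for sample in samples:
        # Column names ("question", "image_path", "answer") are illustrative;
        # real benchmarks define their own schemas.
        prompt = f"{sample['question']}\nAnswer with the option letter."
        prediction = model.generate([sample["image_path"], prompt])
        # Naive prefix matching stands in here for the exact-matching and
        # LLM-augmented extraction step described later in this summary.
        if prediction.strip().upper().startswith(sample["answer"].strip().upper()):
            correct += 1
    return correct / len(samples)
```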
VLMEvalKit's design is modular, consisting of several key components:
- Benchmarks: Evaluation samples, categorized by task and scenario, are packaged as `.tsv` files, supporting straightforward prompt construction and interleaved multi-modal messages.
- LMMs: A unified `.generate()` interface is implemented for all LMMs, accommodating both single-image and interleaved multi-image inputs (see the usage sketch after this list).
- Multi-modal Inference: Inference is parallelized across multiple GPUs for open-source models and across worker processes for API models, improving efficiency and robustness (a data-sharding sketch follows below).
- Multi-modal Evaluation: Supported answer formats include multiple-choice questions (MCQ), yes-or-no questions (Y/N), and open-ended responses. VLMEvalKit combines exact matching with LLM-augmented answer extraction to improve evaluation accuracy and reduce variability (a simplified extraction sketch also follows below).
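The unified interface can be exercised roughly as follows. This is a hedged sketch based on the repository's documented quick-start; the model key "qwen_chat" and the asset paths are placeholders, and the exact registry name and message format should be verified against the current README.

```python
# Sketch of the unified .generate() interface (verify against the repository's README).
from vlmeval.config import supported_VLM

# Instantiate a supported open-source LMM by its registry key (placeholder key here).
model = supported_VLM["qwen_chat"]()

# Single-image query: an interleaved list of image paths and text strings.
print(model.generate(["assets/apple.jpg", "What is in this image?"]))

# Multi-image query uses the same interface with more interleaved entries.
print(model.generate([
    "assets/apple_1.jpg",
    "assets/apple_2.jpg",
    "How many apples are shown across the two images?",
]))
```

Because every model answers the same `.generate()` call, adding a new model only requires implementing this one method, and every supported benchmark becomes available to it immediately.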
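Data-parallel inference over a benchmark can be pictured as interleaved sharding: each rank processes every world_size-th sample and writes its own predictions, which are merged before metric calculation. The helper below is an illustrative simplification assuming a torchrun-style environment, not VLMEvalKit's actual scheduler.

```python
import os

def shard_for_rank(samples, rank=None, world_size=None):
    """Give each GPU/process an interleaved slice of the benchmark samples.

    Illustrative helper only. Rank and world size fall back to the RANK and
    WORLD_SIZE environment variables set by launchers such as torchrun.
    """
    rank = int(os.environ.get("RANK", 0)) if rank is None else rank
    world_size = int(os.environ.get("WORLD_SIZE", 1)) if world_size is None else world_size
    return samples[rank::world_size]

# Each rank runs inference on its shard and saves predictions to a per-rank
# file; one process later concatenates the files and computes metrics once.
```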
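For MCQ benchmarks, the extraction strategy can be summarized as: try cheap rule-based matching of the option letter first, and fall back to an LLM judge only when the free-form answer is ambiguous. The function below is a simplified illustration of that strategy, not the toolkit's actual code; `ask_judge_llm` is a hypothetical callable wrapping whichever judge model is configured.

```python
import re

def extract_choice(prediction, options, ask_judge_llm=None):
    """Map a free-form prediction onto an MCQ option letter.

    Step 1: exact / rule-based matching (cheap and deterministic).
    Step 2: optional LLM-augmented extraction for ambiguous answers.
    Simplified illustration of the strategy, not VLMEvalKit code.
    """
    text = prediction.strip()

    # Rule 1: the prediction starts with an option letter, e.g. "B", "(B)", "b.".
    match = re.match(r"^\(?([A-D])\b\)?", text, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()

    # Rule 2: the prediction repeats exactly one option's text verbatim.
    hits = [letter for letter, option in options.items() if option.lower() in text.lower()]
    if len(hits) == 1:
        return hits[0]

    # Fallback: ask a judge LLM (hypothetical helper) to pick the closest option.
    if ask_judge_llm is not None:
        listing = "\n".join(f"{k}. {v}" for k, v in options.items())
        reply = ask_judge_llm(
            f"Question options:\n{listing}\n"
            f"Model answer: {text}\n"
            "Reply with the single letter of the matching option, or Z if none."
        )
        letter = reply.strip().upper()[:1]
        return letter if letter in options else None
    return None
```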
Evaluation Results
The toolkit's utility is demonstrated through comprehensive benchmarking of LMMs, with results published on the OpenVLM Leaderboard. Evaluation is based on a suite of eight key benchmarks covering diverse aspects such as all-round capability, resistance to data contamination, multi-modal examination-style reasoning, and subjective evaluation. Notable findings include:
- Performance of Commercial APIs: Proprietary models exhibit a significant performance advantage, with top-performing APIs like GPT-4o achieving notably higher average scores compared to open-source counterparts. API models perform especially well in benchmarks requiring extensive knowledge and subjective evaluation.
- Performance of Open-source LMMs: While open-source models generally lag behind APIs, top models still demonstrate robust capabilities. InternVL-Chat-V1.5 and LLaVA-NeXT-Yi-34B stand out, excelling in comprehensive and knowledge-intensive benchmarks.
Implications and Future Work
VLMEvalKit offers both practical and theoretical benefits. Practically, it reduces the barrier to comprehensive LMM evaluation, aiding small research teams in conducting robust assessments. Theoretically, it provides a standardized framework that ensures reproducibility and comparability in LMM research. The toolkit's design is modular and extensible, ensuring its relevance as new modalities, such as audio and video, are integrated.
Future developments of VLMEvalKit will focus on expanding support for additional modalities, with particular emphasis on video understanding. The authors are committed to continuously updating the toolkit, integrating new models and benchmarks, and refining evaluation methodologies to keep pace with advances in multi-modal learning.
Conclusion
The release of VLMEvalKit marks a significant step forward in the evaluation of Large Multi-Modality Models. By offering an open-source, comprehensive framework, the toolkit stands to substantially enhance research capabilities in the field. As multi-modal learning continues to evolve, VLMEvalKit is well-positioned to serve as a pivotal resource for researchers and developers aiming to push the boundaries of AI.