- The paper introduces VLMEvalKit, a unified toolkit that automates data preparation, distributed inference, and metric calculation for evaluating LMMs.
- The toolkit supports over 70 models and 20 benchmarks, revealing that commercial APIs often outperform open-source models in complex, knowledge-intensive tasks.
- VLMEvalKit provides a modular, reproducible framework that simplifies integrating new benchmarks and modalities, advancing multi-modal AI research.
The paper "VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models" introduces VLMEvalKit, a comprehensive and user-friendly toolkit for assessing the performance of Large Multi-Modality Models (LMMs). The toolkit is based on PyTorch and is freely available on GitHub under the Apache 2.0 License. The authors, affiliated with various renowned institutions including the Shanghai AI Laboratory and CUHK, aim to aid researchers and developers in evaluating and publishing reproducible results for LMMs.
Objectives and Contributions
The primary objective of VLMEvalKit is to streamline the evaluation process for LMMs across a broad range of benchmarks and tasks. The toolkit currently supports over 70 different models—comprising both proprietary APIs and open-source models—and more than 20 multi-modal benchmarks. This extensive support is facilitated through a straightforward interface that simplifies the addition of new models and benchmarks.
Key contributions of VLMEvalKit include:
- Providing a unified interface for model integration.
- Automating tasks such as data preparation, distributed inference, prediction post-processing, and metric calculation (a simplified end-to-end sketch follows this list).
- Maintaining a comprehensive leaderboard, OpenVLM Leaderboard, to track progress in the field.
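To make the automated workflow concrete, the sketch below walks through the loop the toolkit manages: loading benchmark samples from a `.tsv` file, constructing prompts, running inference, and scoring predictions. This is a minimal illustration only; the function names, the column names, and the naive answer matching are our assumptions for readability, not VLMEvalKit's actual internals, and the only interface assumed from the toolkit is a model object exposing a `generate()` method.

```python
import csv

def load_benchmark_tsv(path):
    """Read benchmark samples from a .tsv file (hypothetical loader, not the toolkit's own)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t"))

def evaluate(model, samples):
    """Sketch of the automated loop: prompt construction -> inference -> matching -> metric."""
    correct = 0
    for sample in samples:
        # Column names ("question", "image_path", "answer") are illustrative;
        # real benchmarks define their own schemas.
        prompt = f"{sample['question']}\nAnswer with the option letter."
        prediction = model.generate([sample["image_path"], prompt])
        # Naive prefix matching stands in here for the exact-matching and
        # LLM-augmented extraction step described later in this summary.
        if prediction.strip().upper().startswith(sample["answer"].strip().upper()):
            correct += 1
    return correct / len(samples)
```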
VLMEvalKit's design is modular, consisting of several key components:
- Benchmarks: Evaluation samples, categorized by task and scenario, are packaged as `.tsv` files, supporting straightforward prompt construction and interleaved multi-modal messages.
- LMMs: A unified `.generate()` interface is implemented for all LMMs, accommodating both single-image and interleaved multi-image inputs (see the usage sketch after this list).
- Multi-modal Inference: Inference is parallelized across multiple GPUs for open-source models and across worker processes for API models, improving efficiency and robustness (a data-sharding sketch follows below).
- Multi-modal Evaluation: Supported answer formats include multiple-choice questions (MCQ), yes-or-no questions (Y/N), and open-ended responses. VLMEvalKit combines exact matching with LLM-augmented answer extraction to improve evaluation accuracy and reduce variability (a simplified extraction sketch also follows below).
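The unified interface can be exercised roughly as follows. This is a hedged sketch based on the repository's documented quick-start; the model key "qwen_chat" and the asset paths are placeholders, and the exact registry name and message format should be verified against the current README.

```python
# Sketch of the unified .generate() interface (verify against the repository's README).
from vlmeval.config import supported_VLM

# Instantiate a supported open-source LMM by its registry key (placeholder key here).
model = supported_VLM["qwen_chat"]()

# Single-image query: an interleaved list of image paths and text strings.
print(model.generate(["assets/apple.jpg", "What is in this image?"]))

# Multi-image query uses the same interface with more interleaved entries.
print(model.generate([
    "assets/apple_1.jpg",
    "assets/apple_2.jpg",
    "How many apples are shown across the two images?",
]))
```

Because every model answers the same `.generate()` call, adding a new model only requires implementing this one method, and every supported benchmark becomes available to it immediately.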
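Data-parallel inference over a benchmark can be pictured as interleaved sharding: each rank processes every world_size-th sample and writes its own predictions, which are merged before metric calculation. The helper below is an illustrative simplification assuming a torchrun-style environment, not VLMEvalKit's actual scheduler.

```python
import os

def shard_for_rank(samples, rank=None, world_size=None):
    """Give each GPU/process an interleaved slice of the benchmark samples.

    Illustrative helper only. Rank and world size fall back to the RANK and
    WORLD_SIZE environment variables set by launchers such as torchrun.
    """
    rank = int(os.environ.get("RANK", 0)) if rank is None else rank
    world_size = int(os.environ.get("WORLD_SIZE", 1)) if world_size is None else world_size
    return samples[rank::world_size]

# Each rank runs inference on its shard and saves predictions to a per-rank
# file; one process later concatenates the files and computes metrics once.
```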
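For MCQ benchmarks, the extraction strategy can be summarized as: try cheap rule-based matching of the option letter first, and fall back to an LLM judge only when the free-form answer is ambiguous. The function below is a simplified illustration of that strategy, not the toolkit's actual code; `ask_judge_llm` is a hypothetical callable wrapping whichever judge model is configured.

```python
import re

def extract_choice(prediction, options, ask_judge_llm=None):
    """Map a free-form prediction onto an MCQ option letter.

    Step 1: exact / rule-based matching (cheap and deterministic).
    Step 2: optional LLM-augmented extraction for ambiguous answers.
    Simplified illustration of the strategy, not VLMEvalKit code.
    """
    text = prediction.strip()

    # Rule 1: the prediction starts with an option letter, e.g. "B", "(B)", "b.".
    match = re.match(r"^\(?([A-D])\b\)?", text, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()

    # Rule 2: the prediction repeats exactly one option's text verbatim.
    hits = [letter for letter, option in options.items() if option.lower() in text.lower()]
    if len(hits) == 1:
        return hits[0]

    # Fallback: ask a judge LLM (hypothetical helper) to pick the closest option.
    if ask_judge_llm is not None:
        listing = "\n".join(f"{k}. {v}" for k, v in options.items())
        reply = ask_judge_llm(
            f"Question options:\n{listing}\n"
            f"Model answer: {text}\n"
            "Reply with the single letter of the matching option, or Z if none."
        )
        letter = reply.strip().upper()[:1]
        return letter if letter in options else None
    return None
```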
Evaluation Results
The toolkit's utility is demonstrated through comprehensive benchmarking of LMMs, with results published on the OpenVLM Leaderboard. Evaluation is based on a suite of eight key benchmarks covering diverse aspects such as all-round capability, resistance to data contamination, multi-modal examination-style reasoning, and subjective evaluation. Notable findings include:
- Performance of Commercial APIs: Proprietary models exhibit a significant performance advantage, with top-performing APIs like GPT-4o achieving notably higher average scores compared to open-source counterparts. API models perform especially well in benchmarks requiring extensive knowledge and subjective evaluation.
- Performance of Open-source LMMs: While open-source models generally lag behind APIs, top models still demonstrate robust capabilities. InternVL-Chat-V1.5 and LLaVA-NeXT-Yi-34B stand out, excelling in comprehensive and knowledge-intensive benchmarks.
Implications and Future Work
VLMEvalKit offers both practical and theoretical benefits. Practically, it reduces the barrier to comprehensive LMM evaluation, aiding small research teams in conducting robust assessments. Theoretically, it provides a standardized framework that ensures reproducibility and comparability in LMM research. The toolkit's design is modular and extensible, ensuring its relevance as new modalities, such as audio and video, are integrated.
Future developments of VLMEvalKit will focus on expanding support for additional modalities, with particular emphasis on video understanding. The authors are committed to continuously updating the toolkit, integrating new models and benchmarks, and refining evaluation methodologies to keep pace with advances in multi-modal learning.
Conclusion
The release of VLMEvalKit marks a significant step forward in the evaluation of Large Multi-Modality Models. By offering an open-source, comprehensive framework, the toolkit stands to substantially enhance research capabilities in the field. As multi-modal learning continues to evolve, VLMEvalKit is well-positioned to serve as a pivotal resource for researchers and developers aiming to push the boundaries of AI.