LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
The paper "LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models" by Zhang et al. introduces a benchmark framework designed to rigorously evaluate Large Multimodal Models (LMMs). Despite significant advances in the evaluation of large language models (LLMs), comprehensive benchmarking of LMMs remains underexplored, motivating a robust and standardized evaluation framework.
Key Contributions
The contributions of this work are threefold:
- LMMs-Eval: A unified and standardized benchmark suite that encompasses over 50 tasks across more than 10 models, enabling comprehensive, apples-to-apples comparisons among models.
- LMMs-Eval Lite: A pruned version of LMMs-Eval that reduces evaluation cost while preserving the coverage and reliability of the full suite.
- LiveBench: For evaluating models on real-world, dynamically updated data, LiveBench provides a novel low-cost and zero-contamination approach, leveraging continuously updating news and online forum data.
Standardized Framework
LMMs-Eval aims to reduce redundancy and ensure standardized comparisons. Current model evaluations suffer from inconsistent practices (e.g., differing data preparation and metric calculation methods). LMMs-Eval addresses this by providing one-command evaluation across multiple models and datasets, standardizing data preprocessing, inference, and metric computation. As the reported results show, a wide range of models including LLaVA-1.5, Qwen-VL, and InternVL have been evaluated on tasks ranging from captioning on COCO to compositional question answering on GQA.
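To make the idea concrete, here is a minimal sketch of a standardized evaluation loop: every model exposes the same generation interface, and every task bundles its examples with its metric, so preprocessing, inference, and scoring follow one code path. The names (`Model`, `Task`, `evaluate`) are illustrative and do not correspond to the actual lmms-eval API.

```python
# Minimal sketch of a unified evaluation loop, assuming a common model
# interface and per-task metric functions. Names are illustrative only.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    examples: List[dict]  # each: {"image": ..., "question": ..., "answer": ...}
    metric: Callable[[List[str], List[str]], float]  # (predictions, references) -> score


class Model:
    """Common interface every LMM wrapper must implement."""
    def generate(self, image, prompt: str) -> str:
        raise NotImplementedError


def evaluate(model: Model, tasks: List[Task]) -> Dict[str, float]:
    """Run the same preprocessing -> inference -> scoring path for every task."""
    results = {}
    for task in tasks:
        preds = [model.generate(ex["image"], ex["question"]) for ex in task.examples]
        refs = [ex["answer"] for ex in task.examples]
        results[task.name] = task.metric(preds, refs)
    return results
```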
Table \ref{tab:fair-comparison} in the paper shows the efficacy of LMMs-Eval in providing standardized comparisons. For instance, LLaVA-NeXT-34B achieves an 80.4% score on RealWorldQA, showcasing competitive performance relative to other models. The framework scales evaluations effectively, reducing manual overhead and ensuring comparability.
Addressing the Evaluation Trilemma
The paper identifies an evaluation trilemma: achieving wide coverage, low cost, and zero contamination at the same time is difficult. While frameworks like Hugging Face’s Open LLM Leaderboard provide broad coverage, they risk contamination and overfitting. Conversely, real-user evaluation platforms like AI2 WildVision are costly because they require extensive human interaction data.
LMMs-Eval Lite navigates these trade-offs by focusing on a smaller yet representative subset of the full benchmark. Using a k-center selection algorithm to prune data instances, the Lite version retains a high correlation with full-benchmark results, maintaining reliability while reducing evaluation cost. Figure \ref{fig:cost} illustrates a significant reduction in evaluation time with comparable performance metrics, making the Lite suite a practical tool for iterative model refinement.
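A minimal sketch of greedy k-center selection over instance embeddings is shown below; the embedding source and the Euclidean distance used here are assumptions for illustration, not the paper's exact configuration.

```python
# Greedy k-center selection: pick k points that approximately minimize the
# maximum distance from any instance to its nearest selected center.
import numpy as np


def k_center_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # arbitrary first center
    # distance from every point to its nearest selected center so far
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < k:
        next_idx = int(np.argmax(dists))  # farthest point becomes the next center
        selected.append(next_idx)
        new_d = np.linalg.norm(embeddings - embeddings[next_idx], axis=1)
        dists = np.minimum(dists, new_d)
    return selected


# Usage: keep, say, 500 representative instances out of a much larger pool.
# subset_ids = k_center_greedy(instance_embeddings, k=500)
```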
LiveBench: Enhancing Real-World Evaluation
Addressing contamination issues involves not only benchmarking with clean data but also ensuring relevance and timeliness. LiveBench transitions from static to dynamic evaluation sources, gathering data from frequently updated websites like news outlets and forums. This setup tests models on their generalization capabilities more rigorously.
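A schematic version of such a refresh cycle might look like the following: pull recent pages from a list of frequently updated sources, then have a capable model draft question-answer pairs grounded in that fresh content. The source URLs and the `draft_qa_pairs` helper are placeholders for illustration, not the paper's actual pipeline.

```python
# Schematic LiveBench-style refresh cycle (all names and URLs are placeholders).
import datetime
import requests

SOURCES = [
    "https://example-news-site.com",
    "https://example-forum.com",
]


def fetch_recent_pages(urls: list[str]) -> list[dict]:
    pages = []
    for url in urls:
        resp = requests.get(url, timeout=10)
        pages.append({"url": url, "html": resp.text,
                      "retrieved": datetime.date.today().isoformat()})
    return pages


def draft_qa_pairs(page: dict) -> list[dict]:
    """Placeholder: in practice, a strong model would write questions
    grounded in the page's fresh content."""
    raise NotImplementedError


def build_monthly_set() -> list[dict]:
    qa = []
    for page in fetch_recent_pages(SOURCES):
        qa.extend(draft_qa_pairs(page))
    return qa
```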
Their analysis reveals high contamination risks in existing benchmarks; for example, datasets like ChartQA show considerable overlap with pretraining data. LiveBench mitigates such risks by dynamically generating up-to-date evaluation datasets and focusing on zero-shot generalization. Monthly curated problem sets ensure continuous relevance.
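As a simple illustration of the kind of heuristic one can use to flag such overlap, the snippet below computes the fraction of benchmark questions that share a long n-gram with a sample of pretraining text. This is a generic contamination check, not the paper's detection method.

```python
# Generic n-gram overlap heuristic for flagging potential contamination.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_questions: list[str],
                       pretraining_sample: list[str],
                       n: int = 13) -> float:
    """Fraction of benchmark questions sharing at least one n-gram
    with the pretraining sample."""
    corpus_grams = set()
    for doc in pretraining_sample:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for q in benchmark_questions if ngrams(q, n) & corpus_grams)
    return flagged / max(len(benchmark_questions), 1)
```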
The evaluation on LiveBench (Table \ref{tab:livebench_results}) reveals that while open-source models have made strides, commercial models such as GPT-4V and Claude-3 still outperform in dynamically updated, real-world scenarios. This underscores the gap between academic benchmarks and practical model utility.
Implications and Future Directions
The implications of LMMs-Eval and its derivatives are significant. For researchers, it provides a robust tool for evaluating model performance across a broad spectrum of tasks under consistent conditions. Practically, it helps pinpoint model weaknesses and areas for improvement. Theoretically, the articulation of the evaluation trilemma frames ongoing methodological refinements.
Future developments could focus on improving the balance among the trilemma’s three facets or on finding ways to resolve it outright. Refining contamination detection methods that do not require access to training data is another avenue worth exploring.
In conclusion, LMMs-Eval, along with its Lite and LiveBench variants, offers a comprehensive benchmarking solution for evaluating the growing suite of Large Multimodal Models. As the field continues to evolve, such frameworks will be essential for fostering advancements and ensuring the reliability and robustness of AI systems in diverse real-world applications.