LMMs-Eval Lite: Efficient Multimodal Benchmarking
- LMMs-Eval Lite is a pruned evaluation toolkit that benchmarks large multimodal models using a balanced, low-cost approach with broad coverage across vision-language tasks.
- It employs k-center clustering in a joint image-text embedding space, using a greedy algorithm for representative subset selection that approximates full dataset scores with high fidelity.
- The toolkit supports rapid, scalable evaluations while preserving diagnostic integrity, and mitigates contamination through complementary, continuously updated evaluation via LiveBench.
LMMs-Eval Lite is a pruned evaluation toolkit and methodology designed for efficient, comprehensive, and contamination-resistant benchmarking of large multimodal models (LMMs) across vision-language tasks. Developed as a component of the LMMS-EVAL suite, LMMs-Eval Lite combines broad evaluation coverage with reduced computational cost and minimized data contamination, providing a practical solution to the “evaluation trilemma”—balancing wide coverage, low cost, and zero contamination (Zhang et al., 17 Jul 2024).
1. Motivation and Core Principles
LMMs-Eval Lite was introduced to address three interconnected requirements that are universally challenging in LMM evaluation:
- Wide Coverage: Ensuring the benchmark includes diverse tasks and domains (e.g., document/infographic understanding, captioning, VQA, math, multi-discipline settings).
- Low Evaluation Cost: Achieving rapid, affordable evaluation by drastically reducing the number of evaluation samples required per task.
- Zero Contamination: Mitigating the risk that evaluation data overlap with LMMs’ training corpora—an especially acute problem in large, static, widely reused datasets.
The methodology underpinning LMMs-Eval Lite formalizes the following problem. For a model $M$ and a dataset $D$ with scoring function $s(M, \cdot)$, the objective is to select a subset $D_{\text{Lite}} \subset D$ of budget size $k$ whose score closely tracks the full-set score, i.e. $s(M, D_{\text{Lite}}) \approx s(M, D)$. Since this cannot be verified directly for every candidate subset and every model, the selection is cast as a coverage problem in a joint image-text embedding space with distance $d(\cdot, \cdot)$:

$$\min_{D_{\text{Lite}} \subset D,\; |D_{\text{Lite}}| = k} \;\; \max_{x \in D} \; \min_{c \in D_{\text{Lite}}} d(x, c)$$

Solving this problem exactly is NP-hard (it is the $k$-Center problem); LMMs-Eval Lite employs a greedy farthest-point algorithm that yields a 2-optimal approximation.
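For context, the 2-optimality claim is the standard $k$-Center guarantee for the greedy farthest-point heuristic, stated here for clarity: if $\mathrm{OPT}$ denotes the best achievable coverage radius for budget $k$, the greedy subset $S$ satisfies

$$\max_{x \in D} \min_{c \in S} d(x, c) \;\leq\; 2 \cdot \mathrm{OPT},$$

i.e., the selected Lite subset covers the full dataset within at most twice the optimal radius.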
2. Subset Selection Algorithm: k-Center Clustering
LMMs-Eval Lite’s data reduction relies on k-center clustering in a joint image-text embedding space:
- Embedding Construction: Each sample is embedded by concatenating its image features (e.g., CLIP) and textual features (e.g., BGE-M3).
- Clustering Criterion: The k-center criterion seeks a representative, maximally diverse subset such that every sample in the full dataset lies close to some selected center in the embedding space.
- Algorithm: A greedy selection procedure iteratively adds the sample with the greatest minimal distance to existing centers, yielding a subset that covers the distributional space of the full dataset with high fidelity.
This method effectively preserves per-task and model-level rank correlations between full and Lite scores (empirically exceeding 0.9 in many settings).
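A minimal Python sketch of this greedy selection, assuming image and text embeddings have already been computed (for example with CLIP and BGE-M3) and concatenated into one matrix; the function name `greedy_k_center` and the dimensions shown are illustrative, not the toolkit's actual API:

```python
import numpy as np

def greedy_k_center(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Greedy farthest-point selection of k representative sample indices.

    embeddings: (N, d) array, e.g. concatenated image (CLIP) and text
    (BGE-M3) features, one row per evaluation sample.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    # Start from an arbitrary sample; track each point's distance
    # to its nearest already-selected center.
    first = int(rng.integers(n))
    centers = [first]
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)
    for _ in range(k - 1):
        # Add the sample with the greatest minimal distance to existing centers.
        nxt = int(np.argmax(min_dist))
        centers.append(nxt)
        new_dist = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return centers

# Illustrative usage (feature dimensions are assumptions):
# joint = np.concatenate([image_feats, text_feats], axis=1)  # (N, 512) CLIP + (N, 1024) BGE-M3
# lite_indices = greedy_k_center(joint, k=500)
```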
3. Benchmark Structure and Aggregated Scoring
LMMs-Eval Lite is built as a slice of the full LMMS-EVAL suite, encompassing:
- Over 50 tasks spanning document understanding, image captioning (Flickr30K, NoCaps, TextCaps, RefCOCO), visual question answering (TextVQA), math and science reasoning (MathVista, AI2D), and multi-discipline benchmarks (MME, MMMU, Seed-Bench).
- Pruned subsets of benchmark data (hundreds of instances per dataset), maintaining distributional representation and score reliability.
Each constituent dataset may use a different metric (e.g., CIDEr for captioning, accuracy for VQA, ANLS for DocVQA). LMMs-Eval Lite normalizes all metrics to a [0, 100] scale and computes a weighted average as the aggregate score, allowing direct comparison and facilitating model development and ablation studies.
| Dataset Type | Metric | Normalization |
|---|---|---|
| Image Captioning | CIDEr | rescaled to [0, 100] |
| Visual QA | Accuracy | percentage, scaled to [0, 100] |
| Document QA | ANLS | rescaled to [0, 100] |
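A simplified sketch of this normalization and weighted aggregation, assuming accuracy and ANLS arrive in [0, 1] and treating the CIDEr rescaling as a simple clip; the metric ranges, weights, and function names are illustrative assumptions, not the toolkit's exact scheme:

```python
def normalize_score(raw: float, metric: str) -> float:
    """Map a per-task metric onto a common [0, 100] scale (illustrative ranges)."""
    if metric in {"accuracy", "anls"}:
        return 100.0 * raw                    # assumed raw range [0, 1]
    if metric == "cider":
        return max(0.0, min(raw, 100.0))      # stand-in for the CIDEr rescaling
    raise ValueError(f"unknown metric: {metric}")

def aggregate(task_results: dict[str, tuple[float, str, float]]) -> float:
    """Weighted average of normalized per-task scores.

    task_results maps task name -> (raw score, metric name, weight);
    the per-task weights here are hypothetical.
    """
    total_weight = sum(w for _, _, w in task_results.values())
    return sum(
        normalize_score(raw, metric) * w
        for raw, metric, w in task_results.values()
    ) / total_weight

# Illustrative usage:
# aggregate({"textvqa": (0.61, "accuracy", 1.0),
#            "nocaps":  (88.4, "cider",    1.0),
#            "docvqa":  (0.72, "anls",     1.0)})
```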
4. Mitigating Contamination and Representativeness
Full-scale benchmarks risk contamination, especially when datasets are static and widely available. LMMs-Eval Lite addresses this by:
- Representative Subset Selection: Only a minimal, diverse sample of each data source is used.
- Algorithmic Guidance: The selection algorithm is designed to maximize the fidelity of Lite scores as a proxy for full-set scores.
- Complementation by LiveBench: Multimodal LiveBench, a component of LMMS-EVAL, assesses zero-shot generalization using continuously updated data (news, forums). This nearly eliminates contamination, as the content is acquired and annotated in real time.
The LiveBench process includes:
- Automated data pipeline: Captures screenshots from trusted news sites and forums, then pre-processes them.
- Autonomous question generation: LMM-based quiz models (e.g., GPT-4-Vision) create multidimensional queries.
- Human verification + LMM-based grading: Human annotators verify the generated questions, and LMM judge models score each sample on a [0, 100] scale.
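The shape of the resulting evaluation items and the final grading scale can be sketched as follows; this is a schematic sketch under the assumptions above (the `LiveBenchItem` class, its fields, and the [0, 1] judge score are illustrative, not LMMS-EVAL's actual schema):

```python
from dataclasses import dataclass

@dataclass
class LiveBenchItem:
    screenshot_path: str   # captured from a trusted news/forum page
    question: str          # generated by the quiz LMM (e.g., GPT-4-Vision)
    reference_answer: str  # checked by a human annotator

def scale_judge_score(judge_score: float) -> float:
    """Scale a judge-model score (assumed to lie in [0, 1]) to the [0, 100] reporting range."""
    return 100.0 * max(0.0, min(judge_score, 1.0))

# Illustrative usage:
# item = LiveBenchItem("shots/news_frontpage.png", "What event does the headline describe?", "...")
# final = scale_judge_score(0.8)   # -> 80.0
```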
5. Evaluation Efficiency and Scalability
LMMs-Eval Lite is engineered for practical, rapid use throughout model development cycles:
- Cost Reduction: Subset selection reduces the size and computational footprint of each evaluation run.
- Ranking Robustness: Empirical results confirm that relative model orderings, aggregated scores, and per-task diagnostics from Lite closely mirror those obtained from full-scale benchmarks.
- Scalability: The modular, per-benchmark structure and subset selection support scalable, parallel evaluations across many tasks and models.
This design enables frequent, low-cost assessments without sacrificing diagnostic granularity or inviting contamination artifacts.
6. Significance and Implications
LMMs-Eval Lite enables researchers to:
- Obtain wide-coverage, statistically representative benchmarking of LMMs at a fraction of the computational cost.
- Avoid misleading results due to training-test data contamination.
- Rely on robust, standardized score aggregation across heterogeneous tasks and data types.
- Incorporate dynamic, real-world generalization testing via LiveBench to augment static evaluations.
These features directly support transparent leaderboard maintenance, robust model comparison, and iterative multimodal model development.
7. Future Directions and Limitations
Persistent contamination challenges mean that LMMs-Eval Lite, while substantially reducing the risk, cannot guarantee absolutely zero contamination. Further improvements may include:
- Algorithmic advances in contamination detection and dataset provenance.
- Integration of adaptive evaluation protocols that automatically update benchmarks as model training corpora evolve.
- Expansion to additional modalities and real-world, in-the-wild scenarios via continual augmentation of LiveBench and similar protocols.
In sum, LMMs-Eval Lite constitutes an efficient, coverage-preserving, and contamination-resistant evaluation strategy tailored for the demands of large multimodal model development and transparent benchmarking (Zhang et al., 17 Jul 2024).