MMBench: Multimodal Evaluation Benchmark
- MMBench is a family of multimodal evaluation benchmarks that objectively assess fine-grained perception and reasoning in vision-language models.
- It employs a bilingual, multiple-choice framework with strict circular evaluation to minimize biases and ensure reproducible results.
- Extensions like MMBench-Video and GMAI-MMBench target domain-specific tasks, incorporating specialized metrics and protocols for diverse applications.
MMBench
MMBench is a family of multimodal evaluation benchmarks and protocols targeting fine-grained, reproducible, and scalable assessment of vision-language models (VLMs), large vision-language models (LVLMs), multimodal large language models (MLLMs), and multimodal neural network systems across a spectrum of tasks. It is widely referenced as a standard for ability-centric evaluation of perception and reasoning in models handling images, video, language, and domain-specific content. While several benchmarks share the MMBench name, the foundational benchmark is “MMBench: Is Your Multi-modal Model an All-around Player?” (Liu et al., 2023), which established a rigorous, large-scale, multiple-choice bilingual protocol. The MMBench paradigm has since been extended to specialized domains, including medical imaging (GMAI-MMBench (Chen et al., 2024)), human-centric video (HV-MMBench (Cai et al., 7 Jul 2025)), creative intelligence (Creation-MMBench (Fang et al., 18 Mar 2025)), GUI automation (MMBench-GUI (Wang et al., 25 Jul 2025)), and long-form video (MMBench-Video (Fang et al., 2024)). It also serves as a canonical benchmark for evaluating model compression approaches (Zhang et al., 9 Dec 2025), locality-preserving architectures (Cha et al., 2023), and system-level profiling (Xu et al., 2022).
1. Core MMBench: General Benchmark for Image-Based Multimodal Reasoning
The canonical MMBench (Liu et al., 2023) provides a comprehensive, objective, and bilingual (English, Chinese) multiple-choice framework for evaluating large-scale VLMs. The benchmark addresses limitations of prior quantitative tasks (e.g., VQAv2, COCO Caption) and subjective small-scale human-in-the-loop protocols (e.g., OwlEval) by introducing the following structure:
- Scope and Coverage: 3,217 questions, balanced across 20 Level-3 (L3) abilities spanning two Level-1 domains (Perception, Reasoning) and six Level-2 skills (e.g., Coarse Perception, Attribute Reasoning, Logical Reasoning).
- Bilinguality: All prompts, choices, and instructions are verified in both English and Chinese to facilitate direct evaluation of language-dependent reasoning.
- Task Structure: Each item presents an image, a multimodal prompt, and four candidate answers; only one is correct. Question categories probe abilities including object localization, fine-grained attribute identification, relation reasoning, OCR, and knowledge-intensive tasks.
- Quality Control: Strict filtering against “text-only-answerable” questions via majority voting among GPT-4, Gemini-Pro, and Qwen-Max with text-only inputs; explicit elimination of ambiguous or broken items through cross-model validation with top-5 VLMs.
- Evaluation Protocol: Introduction of “CircularEval”: the choice labels of each question are circularly shifted to produce one variant per shift, and a model must answer correctly under every shift. This counters label-position bias and random guessing; final accuracy is the strict fraction of items answered correctly under all rotations.
- LLM-Based Choice Extraction: Model outputs are mapped to choice labels via regex heuristics, and when ambiguous, an LLM (GPT-4 or equivalent) is tasked with aligning free-form answers to the closest candidate response with ≥91% agreement to human annotation.
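The CircularEval protocol above can be sketched in a few lines. This is an illustrative reimplementation, not the official evaluation code; `ask_model` is a hypothetical callback that queries a VLM and returns the index of the choice it selected.

```python
from typing import Callable, Sequence

def circular_eval_item(
    question: str,
    choices: Sequence[str],  # convention here: choices[0] is the ground truth
    ask_model: Callable[[str, Sequence[str]], int],  # returns chosen index
) -> bool:
    """True only if the model picks the ground truth under every circular shift."""
    n = len(choices)
    for shift in range(n):
        rotated = [choices[(i + shift) % n] for i in range(n)]
        truth_pos = (n - shift) % n  # where choices[0] lands after this shift
        if ask_model(question, rotated) != truth_pos:
            return False  # one wrong rotation fails the whole item
    return True

def circular_accuracy(items, ask_model) -> float:
    """Strict accuracy: fraction of items answered correctly under all rotations."""
    return sum(circular_eval_item(q, c, ask_model) for q, c in items) / len(items)
```

A positionally biased model that always answers “A” passes at most one of the four rotations, so it scores zero under CircularEval while scoring roughly 25% under a single-pass protocol.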
2. Extensions and Domain-Specific Benchmarks
MMBench methodology has seeded a series of domain-specific and task-specialized evaluation suites:
- MMBench-Video (Fang et al., 2024): Extends the core protocol to long-form video understanding, utilizing 609 YouTube videos (30 s–6 min, avg. 165 s) and ≈2,000 human-authored, ability-annotated QA pairs. Capabilities are expanded from core perception/reasoning to include video-specific leaf skills (e.g., temporal reasoning, hallucination detection, commonsense) under a 26-leaf taxonomy. Scoring uses GPT-4-based semantic grading (0–3), yielding human-aligned, robust measurement of both open-source video LLMs and static-image LVLMs on tasks that genuinely require temporal understanding.
- GMAI-MMBench (Chen et al., 2024): Comprehensive VQA-style medical benchmark spanning 285 datasets, 39 image modalities, 18 tasks, 18 clinical departments, and 4 perceptual granularities (image, box, mask, contour). Incorporates a lexical tree structure for customizable targeted evaluation. Covers single- and multi-choice medical QA, probing both perceptual and reasoning deficiencies in state-of-the-art LVLMs and medical-specialist models.
- HV-MMBench (Cai et al., 7 Jul 2025): Human-centric video suite comprising 1,200 videos and 8,700 QA pairs over 15 tasks (e.g., age, gender, group size, emotion, action, temporal and causal reasoning). Integrates MC, TF, fill-in-blank, and open-ended causal questions, with task-specific scoring such as F1@1 (generation) and composite causal metrics based on fuzzy F1, order, and LLM semantic judgment.
- Creation-MMBench (Fang et al., 18 Mar 2025): Targets creative multimodal intelligence, presenting 765 instances over 51 image-driven creative tasks grouped into literary, functional, professional, and creative multimodal understanding categories. Scoring combines unitary factuality from GPT-4o (1–10) and pairwise reward against a baseline MLLM, allowing fine-grained analysis of both creative response quality and factual consistency.
- MMBench-GUI (Wang et al., 25 Jul 2025): Evaluates GUI automation agents in four ascending levels (content perception, element grounding, task automation, task collaboration) across Windows, macOS, Linux, Android, iOS, and Web. Proposes the Efficiency-Quality Area (EQA) metric, jointly rewarding task success and efficiency (early completion) with clear normalization and derived metrics. Integrates modular frameworks for visual grounding, planning, action abstraction, and cross-platform consistency.
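The intuition behind MMBench-GUI's EQA metric, crediting both task success and early completion, can be illustrated with a simple normalized-credit sketch. The paper's exact EQA normalization is not reproduced here; in this hypothetical formulation, a task finished at step t within a budget of T steps contributes the remaining fraction of the budget, and a failed task contributes zero.

```python
from typing import Optional, Sequence

def eqa_sketch(completion_steps: Sequence[Optional[int]], max_steps: int) -> float:
    """Illustrative efficiency-quality score in [0, 1].

    completion_steps: per-task step count at success, or None for failure.
    A task finished at step t earns (max_steps - t + 1) / max_steps, so
    succeeding on step 1 earns full credit and later success earns less.
    """
    credit = 0.0
    for t in completion_steps:
        if t is not None and 1 <= t <= max_steps:
            credit += (max_steps - t + 1) / max_steps
    return credit / len(completion_steps)
```

Under this sketch, an agent that solves every task immediately scores 1.0, one that solves nothing scores 0.0, and slow-but-successful agents fall in between, which is the qualitative behavior the EQA metric is designed to reward.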
3. Evaluation Methodologies and Metrics
MMBench benchmarks are characterized by several methodological innovations and rigorous evaluation metrics:
- Multiple-Choice (MC) with Automated Label Extraction: All core and extension tasks employ MC QA; for open-ended VLM outputs, a robust mapping process involving LLMs standardizes evaluation.
- CircularEval Protocol: Models must answer correctly under all permutations of choice order, reducing spurious score inflation from positional or random guess bias.
- Ability Taxonomies: Fine-grained, hierarchical breakdown (L1/L2/L3) of abilities allows for analysis of specific model strengths and weaknesses (e.g., logic reasoning, physical relation, object localization).
- Semantic Grading: In video and creative branches, non-binary semantic similarity scores are employed (e.g., [0–3] for MMBench-Video) to better reflect answer quality and reasoning alignment.
- Derived and Composite Metrics: In GUI and medical benchmarks, additional metrics assess efficiency (EQA), multi-choice precision/recall, beyond-accuracy objectives (coverage, cold-start), and composite scores (e.g., weighted causal reasoning in HV-MMBench).
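For the semantic-grading branches, per-question judge scores must be aggregated into a benchmark-level number. A minimal sketch, assuming MMBench-Video's 0–3 grades and a simple mean normalized to [0, 1] (the paper's exact aggregation may weight abilities differently):

```python
from typing import Sequence

def normalized_semantic_score(grades: Sequence[int], max_grade: int = 3) -> float:
    """Mean of LLM-judge grades in [0, max_grade], rescaled to [0, 1]."""
    if not grades:
        raise ValueError("empty grade list")
    if any(g < 0 or g > max_grade for g in grades):
        raise ValueError(f"grades must lie in [0, {max_grade}]")
    return sum(grades) / (max_grade * len(grades))
```

Unlike binary MC accuracy, this graded scheme lets a partially correct open-ended answer (e.g., grade 2 of 3) contribute proportional credit rather than counting as a miss.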
4. Key Empirical Insights and Model Analyses
Extensive evaluation across >50 models in diverse domains has yielded several consensus findings:
- Generalization and Task-Specific Weaknesses: Models show high accuracy on coarse perception and attribute reasoning (image-level, common concepts) but markedly lower performance on logical/structural (e.g., diagram/code, relation reasoning, temporal tasks) and fine-grained spatial abilities (sub-50% for 3D position, box-level in GMAI-MMBench).
- Instruction Following and Data Biases: Strong performance correlates with robust instruction following in LLM backbones; the choice of LLM answer extractor (e.g., GPT-4) can lift VLMs with weak output formatting by up to 8 absolute points.
- Domain-Specific Gaps: In medical and professional creative tasks, both open-source and specialized models underperform, with accuracy at or below random baselines in some complex reasoning categories (e.g., Severity Grading, novel diagram analysis).
- Compression and Efficiency: Token-compressed models (HybridToken-VLM (Zhang et al., 9 Dec 2025)) achieve near-full retention on MMBench under extreme compression (e.g., HTC-VLM: 90.4% retention at 580:1) when hybrid discrete-continuous tokenization is applied.
- Locality and Abstraction: Preservation of local visual context by projectors (Honeybee (Cha et al., 2023)) directly increases spatial reasoning accuracy on MMBench, especially on finely differentiated sub-tasks.
5. Software, Integration, and Reproducibility
Broad adoption is facilitated by publicly released, open-source implementations and configuration protocols:
- VLMEvalKit Integration: Standard evaluation pipelines for both the core and several extended MMBench versions are available in VLMEvalKit (github.com/open-compass/VLMEvalKit). YAML-based configuration files specify models, backbones, evaluation strategies (e.g., circular), and LLM extractors; batch and distributed evaluation are supported.
- Declarative Workflows: ViLLA-MMBench (Nazary et al., 6 Aug 2025) introduces single-file, declarative pipelines for large-scale recommendation experiments, spanning feature extraction, fusion, and metric-computation.
- Customizable Lexical Trees: GMAI-MMBench supports surgical selection of evaluation subsets (by department, modality, task, granularity) via lexical tree traversal, enabling targeted clinical performance audits.
- Dataset and Code Availability: All major branches publicly host datasets, QA prompts, and full evaluation code (see respective GitHub repositories).
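As an illustration of the VLMEvalKit entry point mentioned above, a typical invocation looks like the following; flag names and dataset/model identifiers follow the VLMEvalKit README at the time of writing and may differ across versions:

```shell
# Clone the toolkit and install it in editable mode
git clone https://github.com/open-compass/VLMEvalKit.git
cd VLMEvalKit
pip install -e .

# Evaluate a supported model on the English MMBench dev split;
# CircularEval and LLM-based choice extraction are handled internally.
python run.py --data MMBench_DEV_EN --model qwen_chat --verbose
```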
6. Impact, Limitations, and Future Directions
MMBench benchmarks have catalyzed systematic progress in multimodal model development, fine-grained diagnosis, and hardware-software co-design (Xu et al., 2022). They expose persistent gaps in specific domains (e.g., medical, creative, temporal) and abilities, guide the design of new architectural components (e.g., hybrid token compressors, spatially aware projectors), and shape standard evaluation protocols (semantic scoring, circular consistency).
Identified limitations include: domain/data coverage gaps (e.g., more challenging or rare modality combinations), persistent weaknesses in open-ended and generative responses outside MC/TF/FIB format, and ongoing domain-specialist model bottlenecks. Proposed improvements focus on expanding human-expert-labeled gold standards, incorporating multi-hop, multi-modal reasoning protocols, and deeper integration with real-world application workflows (interactive agents, adaptive tutoring, longitudinal medical prediction).
Selected References:
- MMBench: Is Your Multi-modal Model an All-around Player? (Liu et al., 2023)
- MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding (Fang et al., 2024)
- Honeybee: Locality-enhanced Projector for Multimodal LLM (Cha et al., 2023)
- HybridToken-VLM: Hybrid Token Compression for Vision-LLMs (Zhang et al., 9 Dec 2025)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI (Chen et al., 2024)
- HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding (Cai et al., 7 Jul 2025)
- Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM (Fang et al., 18 Mar 2025)
- MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents (Wang et al., 25 Jul 2025)
- MMBench: Benchmarking End-to-End Multi-modal DNNs and Understanding Their Hardware-Software Implications (Xu et al., 2022)
- ViLLA-MMBench: A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation (Nazary et al., 6 Aug 2025)