MME Benchmark Evaluation
- MME Benchmark is a standardized evaluation framework for multimodal large language models, ensuring leakage-free testing and broad subtask coverage.
- It uses manually authored prompts across 14 subtasks (e.g., OCR, numeric reasoning) to provide fair and scalable model assessments.
- Extensions like MME-Emotion and MME-Finance highlight its adaptability, addressing domain-specific challenges and multimodal fusion limitations.
An MME benchmark is a standardized evaluation framework for assessing the capabilities of multimodal LLMs (MLLMs), with “MME” standing for “Multimodal Model Evaluation.” The term refers to a family of rigorous, leakage-avoiding, task-diverse benchmarks that began with the original MME benchmark for MLLMs and has since spawned a large set of domain- and capability-focused extensions. These benchmarks are cited extensively in technical evaluations of MLLMs and cover a wide range of multimodal tasks, metrics, settings, and practical domains.
1. Origins and Evolution of MME Benchmarks
The original MME benchmark (“MME: A Comprehensive Evaluation Benchmark for Multimodal LLMs,” Fu et al., 2023) was introduced to address the lack of unified, fair, and leakage-free yardsticks for evaluating the emergent abilities of MLLMs. It covers both low-level perception and high-level cognition across 14 subtasks (e.g., object existence, counting, color, OCR, numerical calculation, and code reasoning), using only manually authored question–answer pairs to avoid overlap with model pretraining data. Each subtask adopts a rigid, concise yes/no prompt format, ensuring uniformity and minimizing prompt-engineering bias.
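As a concrete illustration of this scoring convention, the sketch below computes per-subtask accuracy along with the stricter accuracy+ reported for MME, which counts an image as correct only if both of its paired yes/no questions are answered correctly. The record keys (`image_id`, `answer`, `prediction`) and the answer normalization are illustrative assumptions, not the benchmark's released tooling.

```python
from collections import defaultdict

def score_subtask(records):
    """Score one MME-style subtask of paired yes/no questions.

    Each record is a dict with illustrative keys: image_id, answer
    ("yes"/"no"), and prediction (the raw model output). Returns
    (accuracy, accuracy_plus), where accuracy_plus requires both
    questions about an image to be answered correctly.
    """
    def normalize(text):
        # Keep only a leading yes/no token; anything else counts as wrong.
        text = text.strip().lower()
        if text.startswith("yes"):
            return "yes"
        if text.startswith("no"):
            return "no"
        return "unknown"

    per_image = defaultdict(list)
    correct = 0
    for r in records:
        hit = normalize(r["prediction"]) == r["answer"]
        correct += hit
        per_image[r["image_id"]].append(hit)

    accuracy = correct / len(records)
    accuracy_plus = sum(all(hits) for hits in per_image.values()) / len(per_image)
    return accuracy, accuracy_plus

# Example: two manually authored questions about the same image.
demo = [
    {"image_id": "img_001", "answer": "yes", "prediction": "Yes, there is a dog."},
    {"image_id": "img_001", "answer": "no",  "prediction": "yes"},
]
print(score_subtask(demo))  # (0.5, 0.0)
```

In the original MME, the per-subtask score is typically reported as the sum of these two quantities expressed as percentages, giving a maximum of 200 per subtask.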
The original MME framework was rapidly embraced as a standard in the field, and has since catalyzed a wave of domain- and skill-specific MME benchmarks. Examples include:
- MME-Emotion: systematic evaluation of emotional intelligence and reasoning in MLLMs over realistic video scenarios (Zhang et al., 11 Aug 2025)
- MME-SCI: comprehensive diagnosis of MLLMs’ science reasoning across five languages, three input modes, and 63 fine-grained knowledge points (Ruan et al., 19 Aug 2025)
- MME-Industry: cross-industry multimodal tasks with expert validation, measuring domain transfer in technical and practical settings (Yi et al., 28 Jan 2025)
- Video-MME: quantitative benchmarking for video understanding, integrating audio, subtitle, and long-term temporal reasoning (Fu et al., 2024)
- MME-Finance: expert-level open-ended VQA in financial visual scenarios with bilingual, real-world chart/statement inputs (Gan et al., 2024)
- Human-MME: holistic assessment of human-centric understanding from fine-grained perception to high-level causal inference in images (Liu et al., 30 Sep 2025)
- MME-RealWorld: large-scale, high-resolution challenge set focused on difficult real-world scenarios (Zhang et al., 2024)
This proliferation demonstrates the MME methodology’s adaptability and centrality as an evaluation paradigm.
2. Benchmark Design Principles and Scope
Core principles of MME benchmarks, as derived from the original blueprint and its leading descendants, are:
- Broad Subtask Coverage: Each benchmark targets a comprehensive set of abilities (e.g., recognition, reasoning, grounding, chain-of-thought generation) with minimal dependence on domain knowledge or OCR shortcuts (Fu et al., 2023, Fu et al., 2024, Yi et al., 28 Jan 2025).
- Manual Design and Data Integrity: All prompts and answers are manually crafted and validated, with data leakage prevention procedures such as using only raw images from public sets but authoring novel questions (Fu et al., 2023, Yi et al., 28 Jan 2025).
- Uniform Prompt and Output Protocols: Strictly formatted instructions ensure fair comparisons and ease of metric collection (e.g., yes/no outputs or closed-set answer tags) (Fu et al., 2023); a prompt-template sketch follows this list.
- Scalability and Diversity: The latest MME benchmarks reach thousands to tens of thousands of samples (e.g., MME-RealWorld: 29,429 QAs; MME-Emotion: 6,500 QA pairs), often spanning dozens of domains, modalities (image, audio, video, text), and languages (Zhang et al., 2024, Zhang et al., 11 Aug 2025, Ruan et al., 19 Aug 2025).
- Progressive Difficulty and Annotation Granularity: Tasks range from single-step perception to multi-step reasoning and holistic causal inference, often with layered annotations (e.g., knowledge-point tagging, fine-grained evidence, spatial/temporal structure) (Ruan et al., 19 Aug 2025, Zhang et al., 11 Aug 2025, Liu et al., 30 Sep 2025).
- Leakage Control & Fairness: By eschewing repurposed test questions from popular datasets, MME benchmarks avoid overestimation due to pretraining exposure (Fu et al., 2023).
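To make the uniform-prompt and strict-output principles concrete, the sketch below shows one way a rigidly formatted closed-set prompt and a strict answer extractor could look. The instruction suffixes and the regex are illustrative assumptions, not the prompts actually shipped with any MME benchmark.

```python
import re

# Fixed instruction suffixes so every question is rendered the same way
# and outputs stay machine-parseable (illustrative wording).
YES_NO_SUFFIX = " Please answer yes or no."
MCQ_SUFFIX = "Answer with the option's letter from the given choices directly."

def build_prompt(question, options=None):
    """Render a question in one rigid format per answer type."""
    if options is None:
        return question.rstrip() + YES_NO_SUFFIX
    lettered = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"{question.rstrip()}\n{lettered}\n{MCQ_SUFFIX}"

def extract_choice(output, num_options):
    """Pull the first standalone option letter from a model response."""
    valid = "".join(chr(65 + i) for i in range(num_options))
    match = re.search(rf"\b([{valid}])\b", output.upper())
    return match.group(1) if match else None

print(build_prompt("Is there a red car in the image?"))
print(extract_choice("The answer is (C).", 4))  # "C"
```

Rigid templates of this kind are what allow purely rule-based scoring without per-model prompt tuning.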
3. Task Structure and Subdomain Specialization
Each member of the MME benchmark family is characterized by its selection of modalities, evaluation axes, and domain focus. For example:
- MME-Emotion: Eight emotion-centric video tasks covering controlled, wild, noisy conditions, fine- or multi-label recognition, sentiment analysis, and intent (Zhang et al., 11 Aug 2025).
- MME-SCI: Science-knowledge benchmarks in five languages, leveraging text-only, image-only, and hybrid modes, with 63 knowledge-point granularity (Ruan et al., 19 Aug 2025).
- MME-Industry: 21 sectors from electronics to medical, with 50 hand-crafted visual multiple-choice questions per domain (Yi et al., 28 Jan 2025).
- MME-RealWorld: Five high-difficulty scenarios (OCR in the wild, diagrams and tables, remote sensing, autonomous driving, video monitoring), >13,000 unique high-resolution images (Zhang et al., 2024).
- Human-MME: Eight progressive dimensions from pose and attribute grounding to intention/causal/emotion discrimination, using ~20,000 curated QA pairs (Liu et al., 30 Sep 2025).
- MME-Finance: Bilingual, expert-curated, open-ended evaluation on financial charts, tables, and photos, with multi-level reasoning from OCR to subjective investment advice (Gan et al., 2024).
- Video-MME: 900 videos (254 hours), 2,700 multi-step video QA pairs, 12 tasks including temporal, spatial, and reasoning questions (Fu et al., 2024).
The benchmarks employ multiple input types (text, image, audio, arrays, temporal sequences), closed- or open-ended answers, and hierarchical or compositional question formats.
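One way to see what the variants share is a common item schema spanning modality, subtask, and answer format. The record type below is a hypothetical unification for illustration; none of the benchmarks publishes data in exactly this form.

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

# Input modalities observed across the MME family (illustrative enumeration).
Modality = Literal["text", "image", "audio", "video", "subtitle"]

@dataclass
class MMEItem:
    """Hypothetical unified record for an MME-family evaluation item."""
    benchmark: str                       # e.g. "MME-Finance", "Video-MME"
    subtask: str                         # e.g. "OCR", "temporal reasoning"
    modalities: list[Modality]           # inputs attached to the question
    question: str
    answer_type: Literal["yes_no", "multiple_choice", "open_ended", "ranking"]
    answer: str                          # gold label or reference answer
    options: Optional[list[str]] = None  # closed-set choices, if any
    language: str = "en"                 # e.g. "en" or "zh" for bilingual sets
    annotations: dict = field(default_factory=dict)  # knowledge points, evidence, etc.
```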
4. Evaluation Protocols and Metrics
The evaluation philosophy prioritizes trustworthy, quantitative metrics that are robust to guessing, prompt/format variance, and annotation leakage. Common metrics across MME variants include the following (illustrative implementations are sketched after the list):
- Accuracy: Standard for closed-set (e.g., yes/no, multiple-choice) or containment match (for free-form, single-answer tasks) (Fu et al., 2023, Ruan et al., 19 Aug 2025, Zhang et al., 2024).
- Chain-of-Thought (CoT) Scores: For reasoning-heavy tasks, e.g., in MME-Emotion, the CoT-S metric integrates stepwise reasoning judgment (Rea-S) with final label recognition (Rec-S) via a weighted sum (Zhang et al., 11 Aug 2025).
- F1, BERT F1, Cosine Similarity: Employed for free text or ranking, e.g., in Human-MME’s short-answer and ranking tasks (Liu et al., 30 Sep 2025).
- Specialized Measures: Scene/trajectory alignment (e.g., nDTW, SPL for navigation), intersection-over-union (grounding), macro-F1 (partial matches in multi-label/“unanswerable” QA) (Zhao et al., 31 Dec 2025, Zhang et al., 25 Jul 2025, Liu et al., 30 Sep 2025).
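The closed-form metrics above are simple to compute once outputs are normalized. The sketch below illustrates containment-match accuracy, a CoT-style weighted combination of a reasoning score (Rea-S) and a recognition score (Rec-S) with an assumed equal weighting (MME-Emotion's actual weights are not specified here), cosine similarity between embedding vectors, and box intersection-over-union for grounding; all function names and signatures are illustrative.

```python
import math

def containment_accuracy(predictions, answers):
    """Fraction of items whose gold answer appears verbatim
    (case-insensitively) inside the model's free-form output."""
    hits = sum(a.lower() in p.lower() for p, a in zip(predictions, answers))
    return hits / len(answers)

def cot_score(reasoning_score, recognition_score, w_reasoning=0.5):
    """Weighted combination of a stepwise-reasoning judgment (Rea-S) and
    the final-label recognition score (Rec-S); equal weights are assumed."""
    return w_reasoning * reasoning_score + (1 - w_reasoning) * recognition_score

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```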
Protocols mandate rigid output formatting to minimize ambiguity and enable fair, automatic scoring. Many include human or LLM-based verification for open-ended outputs or subjective tasks (Zhang et al., 11 Aug 2025, Gan et al., 2024).
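For open-ended or subjective outputs, the verification step usually amounts to handing a rubric to a judge, whether human or LLM. The template and the `query_judge` callable below are hypothetical placeholders sketching that flow, not any benchmark's actual judging pipeline.

```python
# Hypothetical rubric for grading a free-form answer against a reference.
JUDGE_RUBRIC = """You are grading a model's answer against a reference.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Score the model answer from 0 to 5 for agreement with the reference,
then output only the integer score on the final line."""

def judge_open_ended(question, reference, candidate, query_judge):
    """Format the rubric, send it to a judge (human interface or LLM call
    supplied by the caller), and parse the trailing 0-5 score."""
    prompt = JUDGE_RUBRIC.format(question=question, reference=reference,
                                 candidate=candidate)
    reply = query_judge(prompt)
    last_line = reply.strip().splitlines()[-1]
    digits = "".join(ch for ch in last_line if ch.isdigit())
    return min(int(digits), 5) if digits else 0
```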
5. Empirical Insights and Model Comparisons
The MME family enables systematic, cross-model, and cross-task comparisons under zero-shot or unified-prompt conditions. Large-scale studies reveal persistent gaps:
- Generalization Limits: Even state-of-the-art MLLMs remain below roughly 60% accuracy on high-difficulty, high-resolution, real-world MME benchmarks (e.g., MME-RealWorld, MME-SCI, MME-Emotion) (Ruan et al., 19 Aug 2025, Zhang et al., 2024, Zhang et al., 11 Aug 2025).
- Task-Specific Weaknesses: Strong open-source models may still lag closed-source models by 13–20 points, especially on image-only or fine-grained reasoning tasks (Ruan et al., 19 Aug 2025, Zhang et al., 25 Jul 2025).
- Multimodal Fusion Challenges: Models that fuse many modalities (e.g., audio-visual or omnimodal inputs) often underperform bimodal or text-visual models, suggesting fusion remains an unsolved challenge (Zhang et al., 11 Aug 2025).
- Chain-of-Thought Paradox: While stepwise reasoning can boost performance on complex reasoning, it may degrade performance on pure perception or simple tasks due to “overthinking” or output drift (Jiang et al., 13 Feb 2025).
- Domain-Specific Headroom: Specialist-trained models approach, but rarely surpass, “generalists” on their own domains; general LLMs do not transfer robustly to highly technical or specialty MME settings (e.g., finance, industry, ESG) (Gan et al., 2024, Yi et al., 28 Jan 2025, Zhang et al., 25 Jul 2025).
- Benchmark-Driven Progress: Ablations and leaderboard comparisons are used to inform architectural/training innovations and prompt targeted data curation (Fu et al., 2024, Zhang et al., 11 Aug 2025, Liu et al., 30 Sep 2025).
6. Impact, Limitations, and Open Problems
The MME benchmark family is a critical infrastructure for characterizing, comparing, and diagnosing MLLMs at scale. Its influence extends to leaderboards, model-card reporting, and targeted ablation studies across academic and industrial AI labs.
Key open problems surfaced by MME benchmarks include:
- Calibration of Task Difficulty: Most MME benchmarks lack explicit difficulty annotations (e.g., easy/medium/hard), complicating error and progress analysis (Zhang et al., 11 Aug 2025).
- Hard Multimodal Fusion: Existing fusion modules and objectives insufficiently capture inter-modal alignment, particularly with longer, noisy, or real-world inputs (Zhang et al., 11 Aug 2025, Zhang et al., 2024).
- Multilingual Gaps: Even strong models show sharp performance drops in non-English scenarios, as explicitly measured in MME-SCI (Ruan et al., 19 Aug 2025).
- Reasoning and Hallucination: Chain-of-thought and judgment-style questions expose persistent failure modes (hallucination, refusal precision, over-refusal) (Jiang et al., 13 Feb 2025, Liu et al., 30 Sep 2025).
- Human-Like Mutual Reasoning: Mutual-understanding tasks spanning multiple persons or multiple images remain particularly challenging, with state-of-the-art models substantially below human parity (Liu et al., 30 Sep 2025).
Suggested benchmark extensions include stratified difficulty calibration, multi-turn interactions, multilingual/cultural stratification, real-time and continual learning tracks, and integration of human-judged or adversarial probing (Zhang et al., 11 Aug 2025, Fu et al., 2024, Liu et al., 30 Sep 2025).
7. Comparative Table of Selected MME Benchmarks
| Benchmark | Domain(s) | Scale | Modalities | Notable Metrics/Tasks |
|---|---|---|---|---|
| MME (Fu et al., 2023) | General | 14 subtasks | Images/Text | Yes/no accuracy per subtask; perception + cognition |
| MME-Emotion (Zhang et al., 11 Aug 2025) | Affective | >6,500 | Video/Audio/Text | Rec-S/Rea-S/CoT-S, reasoning |
| MME-SCI (Ruan et al., 19 Aug 2025) | Science, Multiling. | 1,019 × 5 | Img/Text/Hybrid | Knowledge-point accuracy |
| Video-MME (Fu et al., 2024) | Video Analysis | 2,700 | Video/Audio/Subs | MCQ (accuracy), temporal |
| MME-Industry (Yi et al., 28 Jan 2025) | Industrial | 1,050 | Images/Text | MCQ accuracy, 21 sectors |
| Human-MME (Liu et al., 30 Sep 2025) | Human-Centric | 19,945 | Image | Grounding, SA, ranking, causality |
| MME-RealWorld (Zhang et al., 2024) | Real-World | 29,429 | Hi-res Image | Perception+reasoning, MCQ |
| MME-Finance (Gan et al., 2024) | Finance | 1,171/1,103 | Image (bilingual) | 3-level reasoning, open-ended |
All sources are released for reproducibility and further innovation (Fu et al., 2023, Zhang et al., 11 Aug 2025, Ruan et al., 19 Aug 2025, Yi et al., 28 Jan 2025, Liu et al., 30 Sep 2025, Zhang et al., 2024, Gan et al., 2024, Fu et al., 2024).
References:
- “MME: A Comprehensive Evaluation Benchmark for Multimodal LLMs” (Fu et al., 2023)
- “MME-Emotion: A Holistic Evaluation Benchmark for Emotional Intelligence in Multimodal LLMs” (Zhang et al., 11 Aug 2025)
- “MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal LLMs” (Ruan et al., 19 Aug 2025)
- “Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis” (Fu et al., 2024)
- “MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark” (Yi et al., 28 Jan 2025)
- “MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning” (Gan et al., 2024)
- “MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?” (Zhang et al., 2024)
- “Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal LLMs” (Liu et al., 30 Sep 2025)
- Additional: (Jiang et al., 13 Feb 2025); (Yuan et al., 27 May 2025).