MME Benchmark Evaluation

Updated 9 February 2026
  • MME Benchmark is a standardized evaluation framework for multimodal large language models, ensuring leakage-free testing and broad subtask coverage.
  • It uses manually authored prompts across 14 subtasks (e.g., OCR, numeric reasoning) to provide fair and scalable model assessments.
  • Extensions like MME-Emotion and MME-Finance highlight its adaptability, addressing domain-specific challenges and multimodal fusion limitations.

An “MME Benchmark” is a standardized evaluation framework designed to assess the capabilities of multimodal LLMs (MLLMs), with “MME” standing for “Multimodal Model Evaluation.” The term refers to a series of rigorous, leakage-avoiding, and task-diverse benchmarks that began with the original MME benchmark for MLLMs and has since spawned a large family of domain-focused and capability-focused extensions. These benchmarks are cited extensively in technical evaluations of MLLMs and cover a highly diverse range of multimodal tasks, metrics, settings, and practical domains.

1. Origins and Evolution of MME Benchmarks

The original “MME” benchmark (“A Comprehensive Evaluation Benchmark for Multimodal LLMs” (Fu et al., 2023)) was introduced to address the lack of unified, fair, and leakage-free yardsticks for evaluating the emergent abilities of MLLMs. It covers both low-level perception and high-level cognition across 14 subtasks, such as object existence, counting, color, OCR, and numeric/coding reasoning, using only manually authored question–answer pairs to avoid overlap with model pretraining data. Each subtask adopts a rigid, concise (yes/no) prompt format, ensuring uniformity and minimizing prompt-engineering bias.
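
As a concrete illustration of this protocol, the minimal sketch below shows how yes/no answers might be scored per subtask. The sample field names and the `model_answer` callable are assumptions for illustration, not the official MME evaluation code.

```python
# Minimal sketch of yes/no scoring for MME-style subtasks.
# The sample fields ("image", "question", "subtask", "answer") and the
# model_answer() callable are illustrative assumptions, not the
# benchmark's released tooling.
from collections import defaultdict

def normalize(text: str) -> str:
    """Map a free-form model reply to 'yes' or 'no' when possible."""
    reply = text.strip().lower()
    if reply.startswith("yes"):
        return "yes"
    if reply.startswith("no"):
        return "no"
    return "unknown"  # counted as incorrect below

def score_subtasks(samples, model_answer):
    """Compute per-subtask accuracy over manually authored yes/no pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for sample in samples:
        prediction = normalize(model_answer(sample["image"], sample["question"]))
        total[sample["subtask"]] += 1
        if prediction == sample["answer"]:  # ground-truth label is "yes" or "no"
            correct[sample["subtask"]] += 1
    return {task: correct[task] / total[task] for task in total}
```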

The original MME framework was rapidly embraced as a standard in the field, and has since catalyzed a wave of domain- and skill-specific MME benchmarks. Examples include:

  • MME-Emotion: systematic evaluation of emotional intelligence and reasoning in MLLMs over realistic video scenarios (Zhang et al., 11 Aug 2025)
  • MME-SCI: comprehensive diagnosis of MLLMs’ science reasoning across five languages, three input modes, and 63 fine-grained knowledge points (Ruan et al., 19 Aug 2025)
  • MME-Industry: cross-industry multimodal tasks with expert validation, measuring domain transfer in technical and practical settings (Yi et al., 28 Jan 2025)
  • Video-MME: quantitative benchmarking for video understanding, integrating audio, subtitle, and long-term temporal reasoning (Fu et al., 2024)
  • MME-Finance: expert-level open-ended VQA in financial visual scenarios with bilingual, real-world chart/statement inputs (Gan et al., 2024)
  • Human-MME: holistic assessment of human-centric understanding from fine-grained perception to high-level causal inference in images (Liu et al., 30 Sep 2025)
  • MME-RealWorld: large-scale, high-resolution challenge set focused on difficult real-world scenarios (Zhang et al., 2024)

This proliferation demonstrates the MME methodology’s adaptability and centrality as an evaluation paradigm.

2. Benchmark Design Principles and Scope

Core principles of MME benchmarks, as derived from the original blueprint and its leading descendants, are:

  • Broad Subtask Coverage: Each benchmark targets a comprehensive set of abilities (e.g., recognition, reasoning, grounding, chain-of-thought generation) with minimal dependence on domain-knowledge or OCR shortcuts (Fu et al., 2023, Fu et al., 2024, Yi et al., 28 Jan 2025).
  • Manual Design and Data Integrity: All prompts and answers are manually crafted and validated, with data leakage prevention procedures such as using only raw images from public sets but authoring novel questions (Fu et al., 2023, Yi et al., 28 Jan 2025).
  • Uniform Prompt and Output Protocols: Strictly formatted instructions ensure fair comparisons and ease of metric collection (e.g., yes/no outputs or closed-set answer tags) (Fu et al., 2023); see the prompt-template sketch after this list.
  • Scalability and Diversity: The latest MME benchmarks reach thousands to tens of thousands of samples (e.g., MME-RealWorld: 29,429 QAs; MME-Emotion: 6,500 QA pairs), often spanning dozens of domains, modalities (image, audio, video, text), and languages (Zhang et al., 2024, Zhang et al., 11 Aug 2025, Ruan et al., 19 Aug 2025).
  • Progressive Difficulty and Annotation Granularity: Tasks range from single-step perception to multi-step reasoning and holistic causal inference, often with layered annotations (e.g., knowledge-point tagging, fine-grained evidence, spatial/temporal structure) (Ruan et al., 19 Aug 2025, Zhang et al., 11 Aug 2025, Liu et al., 30 Sep 2025).
  • Leakage Control & Fairness: By eschewing repurposed test questions from popular datasets, MME benchmarks avoid overestimation due to pretraining exposure (Fu et al., 2023).

3. Task Structure and Subdomain Specialization

Each member of the MME benchmark family is characterized by its selection of modalities, evaluation axes, and domain focus. For example:

  • MME-Emotion: Eight emotion-centric video tasks covering controlled, wild, noisy conditions, fine- or multi-label recognition, sentiment analysis, and intent (Zhang et al., 11 Aug 2025).
  • MME-SCI: Science-knowledge benchmarks in five languages, leveraging text-only, image-only, and hybrid modes, with 63 knowledge-point granularity (Ruan et al., 19 Aug 2025).
  • MME-Industry: 21 sectors from electronics to medical, with 50 hand-crafted visual multiple-choice questions per domain (Yi et al., 28 Jan 2025).
  • MME-RealWorld: Five high-difficulty scenarios (OCR in the wild, diagrams and tables, remote sensing, autonomous driving, video monitoring), >13,000 unique high-resolution images (Zhang et al., 2024).
  • Human-MME: Eight progressive dimensions from pose and attribute grounding to intention/causal/emotion discrimination, using ~20,000 curated QA pairs (Liu et al., 30 Sep 2025).
  • MME-Finance: Bilingual, expert-curated, open-ended evaluation on financial charts, tables, and photos, with multi-level reasoning from OCR to subjective investment advice (Gan et al., 2024).
  • Video-MME: 900 videos (254 hours), 2,700 multi-step video QA pairs, 12 tasks including temporal, spatial, and reasoning questions (Fu et al., 2024).

The benchmarks employ multiple input types (text, image, audio, arrays, temporal sequences), closed- or open-ended answers, and hierarchical or compositional question formats.
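
To illustrate how such heterogeneous inputs could be handled uniformly in an evaluation harness, the sketch below defines a hypothetical sample record; every field name is an assumption for illustration and does not correspond to any benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MMESample:
    """Hypothetical unified record for loading samples across MME-family
    benchmarks; field names are illustrative, not an official schema."""
    benchmark: str                      # e.g., "MME-Finance", "Video-MME"
    subtask: str                        # e.g., "OCR", "temporal reasoning"
    question: str
    answer: str                         # "yes"/"no", option letter, or free text
    answer_type: str                    # "yes_no", "multiple_choice", "open_ended"
    image_paths: List[str] = field(default_factory=list)
    video_path: Optional[str] = None
    audio_path: Optional[str] = None
    options: List[str] = field(default_factory=list)   # choices for MCQ-style tasks
    language: str = "en"                # e.g., MME-SCI spans five languages
    knowledge_points: List[str] = field(default_factory=list)  # fine-grained tags
```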

4. Evaluation Protocols and Metrics

The evaluation philosophy is to prioritize trustworthy, quantitative metrics that are robust to guessing, prompt/format variance, and annotation leakage. Common metrics across MME variants include accuracy on closed-form (yes/no or multiple-choice) answers, together with benchmark-specific scores such as knowledge-point accuracy in MME-SCI and the recognition/reasoning/chain-of-thought scores (Rec-S/Rea-S/CoT-S) of MME-Emotion.

Protocols mandate rigid output formatting to minimize ambiguity and enable fair, automatic scoring. Many include human or LLM-based verification for open-ended outputs or subjective tasks (Zhang et al., 11 Aug 2025, Gan et al., 2024).
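
As one example of such a protocol, the sketch below shows strict option extraction and accuracy computation for MCQ-style variants (e.g., Video-MME, MME-RealWorld, MME-Industry); the regular expression and the rule that unparseable replies count as wrong are assumptions, not any benchmark's official script.

```python
import re
from typing import List, Optional

# Illustrative strict-extraction scoring for MCQ-style MME variants.
# Pattern and penalty rule are assumptions, not an official evaluator.
OPTION_PATTERN = re.compile(r"\b([A-E])\b")

def extract_option(model_output: str) -> Optional[str]:
    """Return the first standalone option letter (A-E) found, else None."""
    match = OPTION_PATTERN.search(model_output.strip().upper())
    return match.group(1) if match else None

def mcq_accuracy(predictions: List[str], labels: List[str]) -> float:
    """Fraction of questions whose extracted option matches the label;
    format violations count as wrong rather than triggering a re-prompt."""
    assert len(predictions) == len(labels)
    correct = sum(
        extract_option(pred) == gold.strip().upper()
        for pred, gold in zip(predictions, labels)
    )
    return correct / len(labels) if labels else 0.0
```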

5. Empirical Insights and Model Comparisons

The MME family enables systematic, cross-model, and cross-task comparisons under zero-shot or unified-prompt conditions. Large-scale studies reveal persistent gaps between current MLLMs and human-level performance, most visibly in multilingual settings, multi-step reasoning, and real-world high-resolution inputs; the resulting open problems are summarized in Section 6.

6. Impact, Limitations, and Open Problems

The MME benchmark family is a critical infrastructure for characterizing, comparing, and diagnosing MLLMs at scale. Its influence extends to leaderboards, model-card reporting, and targeted ablation studies across academic and industrial AI labs.

Key open problems surfaced by MME benchmarks include:

  • Calibration of Task Difficulty: Most lack explicit difficulty annotation (“easy/medium/hard”), complicating error and progress analysis (Zhang et al., 11 Aug 2025).
  • Hard Multimodal Fusion: Existing fusion modules and objectives insufficiently capture inter-modal alignment, particularly with longer, noisy, or real-world inputs (Zhang et al., 11 Aug 2025, Zhang et al., 2024).
  • Multilingual Gaps: Even strong models show sharp performance drops in non-English scenarios, as explicitly measured in MME-SCI (Ruan et al., 19 Aug 2025).
  • Reasoning and Hallucination: Chain-of-thought and judgment-style questions expose persistent failure modes (hallucination, refusal precision, over-refusal) (Jiang et al., 13 Feb 2025, Liu et al., 30 Sep 2025).
  • Human-Like Mutual Reasoning: Mutual, multi-person or multi-image understanding tasks remain particularly challenging, with state-of-the-art models substantially below human parity (Liu et al., 30 Sep 2025).

Suggested benchmark extensions include stratified difficulty calibration, multi-turn interactions, multilingual/cultural stratification, real-time and continual learning tracks, and integration of human-judged or adversarial probing (Zhang et al., 11 Aug 2025, Fu et al., 2024, Liu et al., 30 Sep 2025).

7. Comparative Table of Selected MME Benchmarks

| Benchmark | Domain(s) | Size (QA Pairs) | Modalities | Notable Metrics/Tasks |
|---|---|---|---|---|
| MME (Fu et al., 2023) | General | 14 subtasks | Images/Text | Yes–no per task, cognition |
| MME-Emotion (Zhang et al., 11 Aug 2025) | Affective | >6,500 | Video/Audio/Text | Rec-S/Rea-S/CoT-S, reasoning |
| MME-SCI (Ruan et al., 19 Aug 2025) | Science, multilingual | 1,019 × 5 | Image/Text/Hybrid | Knowledge-point accuracy |
| Video-MME (Fu et al., 2024) | Video analysis | 2,700 | Video/Audio/Subtitles | MCQ (accuracy), temporal |
| MME-Industry (Yi et al., 28 Jan 2025) | Industrial | 1,050 | Images/Text | MCQ accuracy, 21 sectors |
| Human-MME (Liu et al., 30 Sep 2025) | Human-centric | 19,945 | Image | Grounding, SA, ranking, causality |
| MME-RealWorld (Zhang et al., 2024) | Real-world | 29,429 | High-res image | Perception+reasoning, MCQ |
| MME-Finance (Gan et al., 2024) | Finance | 1,171/1,103 | Image (bilingual) | 3-level reasoning, open-ended |

All of these benchmarks are publicly released for reproducibility and further innovation (Fu et al., 2023, Zhang et al., 11 Aug 2025, Ruan et al., 19 Aug 2025, Yi et al., 28 Jan 2025, Liu et al., 30 Sep 2025, Zhang et al., 2024, Gan et al., 2024, Fu et al., 2024).

