Massive Multimodal Embedding Benchmark
- Massive Multimodal Embedding Benchmark (MMEB) is a unified evaluation suite that tests universal embeddings for image and text across 36 datasets, with extensions (MMEB-V2) adding video and visual documents.
- It standardizes tasks such as classification, visual question answering, retrieval, and grounding, enabling both in-distribution and out-of-distribution analysis.
- MMEB’s evolving protocols, including extensions for video and document retrieval, drive innovations in contrastive learning and scalable multimodal representation.
The Massive Multimodal Embedding Benchmark (MMEB) is a comprehensive evaluation suite designed to assess universal embedding models across a broad spectrum of multimodal data and task types. Designed to address limitations of earlier benchmarks, MMEB and its successors systematically quantify model performance in classification, retrieval, visual question answering, and grounding across a diverse range of domains and modalities. MMEB provides both a standardized evaluation protocol for competitive baselines and a platform for measuring progress in transfer learning, compositional reasoning, and robustness in large multimodal models.
1. Definition and Scope
MMEB defines a unified framework to evaluate “universal” multimodal embedding models—models that produce fixed-dimensional embeddings for any query-candidate input pair containing images and/or text, guided by task instructions. All tasks are consistently cast as ranking or retrieval problems: For each query (which may be image, text, or both), a large candidate pool (often 1,000+) is presented, and the model must identify the correct match. MMEB encompasses 36 datasets spanning four meta-tasks:
- Classification: General and fine-grained image classification, multimodal attribute classification
- Visual Question Answering (VQA): Including both generic and knowledge-intensive VQA, with multi-modal queries and supporting documents
- Information Retrieval: Image-to-text, text-to-image, and more general cross-modal retrieval tasks
- Visual Grounding: Phrase localization, cross-modal object matching, and fine-level grounding in images
Datasets are drawn from varied domains (news, Wikipedia, daily life, fine-grained scenes, and fashion), and are partitioned into 20 in-distribution (IND) and 16 out-of-distribution (OOD) sets, rigorously balancing generalization and robustness challenges (Jiang et al., 7 Oct 2024).
2. Evaluation Protocol and Metrics
Every dataset in MMEB is reformulated so that the core metric is Precision@1, i.e., the proportion of test queries for which the true matching candidate is ranked first out of all candidates. All queries, candidates, and task definitions are pre-processed into a unified template:
- Query: Can be any combination of image and text, guided by an explicit task instruction.
- Candidate Pool: Includes the correct item and a set of hard negatives.
- Model Output: Embedding vectors for each query and candidate; final ranking is computed by cosine similarity or another suitable scoring function.
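As a concrete illustration of this protocol, below is a minimal sketch of Precision@1 scoring over precomputed embeddings. The array layout (one candidate pool per query) and the function name are illustrative assumptions, not part of the benchmark tooling.

```python
import numpy as np

def precision_at_1(query_embs: np.ndarray, cand_embs: np.ndarray, gold_idx: np.ndarray) -> float:
    """Fraction of queries whose gold candidate is ranked first by cosine similarity.

    query_embs: (num_queries, dim) query embeddings.
    cand_embs:  (num_queries, pool_size, dim) per-query candidate pools
                (the gold item plus hard negatives).
    gold_idx:   (num_queries,) index of the correct candidate in each pool.
    """
    q = query_embs / np.linalg.norm(query_embs, axis=-1, keepdims=True)
    c = cand_embs / np.linalg.norm(cand_embs, axis=-1, keepdims=True)
    # Cosine similarity of each query against its own candidate pool.
    scores = np.einsum("qd,qpd->qp", q, c)   # (num_queries, pool_size)
    top1 = scores.argmax(axis=-1)            # highest-scoring candidate per query
    return float((top1 == gold_idx).mean())
```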
Variants of contrastive losses, including standard InfoNCE and its hardness-weighted extensions, are used for model training and evaluation in associated works (Jiang et al., 7 Oct 2024, Lan et al., 4 Mar 2025). Baseline models are trained with both full fine-tuning and parameter-efficient adaptation (e.g., LoRA), and recent MMEB extensions accommodate scalable multi-vector retrieval (Xiao et al., 22 Sep 2025).
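For the parameter-efficient route, the following is a minimal sketch of LoRA adaptation using the Hugging Face `peft` library; the backbone identifier and target module names are placeholders that depend on the actual MLLM being adapted.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

# Backbone name and target module names are placeholders; adjust them
# for the actual multimodal model being adapted.
backbone = AutoModel.from_pretrained("some/multimodal-backbone")
lora_cfg = LoraConfig(
    r=16,                                  # low-rank update dimension
    lora_alpha=32,                         # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, backbone-specific
)
model = get_peft_model(backbone, lora_cfg)
model.print_trainable_parameters()         # only the LoRA adapters are trainable
```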
3. Benchmark Evolution and Expansions
MMEB-V2: Video and Visual Document Extension
MMEB-V2 expands the benchmark to 78 tasks by including five new meta-tasks encompassing video and structured document analysis (Meng et al., 7 Jul 2025):
- Visual Document Retrieval (PDF, slides)
- Video Retrieval
- Temporal (Moment) Retrieval
- Video Classification
- Video Question Answering
This extension introduces both static and temporal modalities, assessing the flexibility of embedding models across images, videos, text, and structured documents. Evaluation tasks require models to handle variable-length video, high-resolution PDF layouts, and multimodal document queries.
Related Benchmarks
- MMT-Bench (Ying et al., 24 Apr 2024): Focuses on meta-task diversity and AGI-oriented evaluation over 32 meta-tasks and 162 subtasks (multi-choice format).
- MIBench (Liu et al., 21 Jul 2024): Specializes in multi-image input scenarios (fine-grained comparison, temporal/logical reasoning, in-context learning).
- MEGA-Bench (Chen et al., 14 Oct 2024): Scales to 505 tasks with diverse output formats (numbers, phrases, code, LaTeX, JSON), emphasizing real-world applicability and reporting.
- MMMU-Pro (Yue et al., 4 Sep 2024): Introduces robustness via filtering out text-only answerable items, augmented multiple-choice fields, and realistic vision-only input settings.
4. Modeling and Training Strategies
Contrastive Training With Task Instructions
MMEB and its leading baselines employ instruction-guided contrastive learning: for each (query, candidate) pair, both inputs are augmented with explicit task instruction text, then embedded and scored in a joint space. The dominant loss is the InfoNCE:

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp\left(\cos(q, c^{+})/\tau\right)}{\exp\left(\cos(q, c^{+})/\tau\right) + \sum_{c^{-} \in \mathcal{N}} \exp\left(\cos(q, c^{-})/\tau\right)}$$

where $q$ and $c^{+}$ are the query and positive-candidate embeddings, $\tau$ is a learned or fixed temperature, and the negatives $c^{-} \in \mathcal{N}$ are sampled from the batch (possibly via hard negative mining) (Jiang et al., 7 Oct 2024, Lan et al., 4 Mar 2025).
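A minimal PyTorch sketch of this objective, assuming one positive candidate per query and in-batch negatives; tensor names and the default temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(query_embs: torch.Tensor, cand_embs: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: candidate i is the positive for query i; all other
    candidates in the batch act as negatives."""
    q = F.normalize(query_embs, dim=-1)
    c = F.normalize(cand_embs, dim=-1)
    logits = q @ c.T / temperature                       # temperature-scaled cosine similarities
    targets = torch.arange(q.size(0), device=q.device)   # positive sits on the diagonal
    return F.cross_entropy(logits, targets)
```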
Advanced Loss and Negative Sampling
To address the overlap between positive/negative similarity distributions, LLaVE introduces a hardness-weighted contrastive loss (Lan et al., 4 Mar 2025):

$$\mathcal{L}_{\text{hard}} = -\log \frac{\exp\left(\cos(q, c^{+})/\tau\right)}{\exp\left(\cos(q, c^{+})/\tau\right) + \sum_{c^{-} \in \mathcal{N}} w(q, c^{-})\,\exp\left(\cos(q, c^{-})/\tau\right)}$$

where the weighting function $w(q, c^{-})$ assigns higher weights to harder negatives, focusing the learning signal on cases with more ambiguity.
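The sketch below illustrates the general idea of hardness weighting in PyTorch; the exponential weighting function used here is an illustrative choice, not the exact formulation from the LLaVE paper.

```python
import torch
import torch.nn.functional as F

def hardness_weighted_info_nce(query_embs, cand_embs, temperature=0.05, alpha=1.0):
    """In-batch contrastive loss where harder (higher-similarity) negatives
    receive larger weights. The weighting scheme is illustrative only."""
    q = F.normalize(query_embs, dim=-1)
    c = F.normalize(cand_embs, dim=-1)
    sim = q @ c.T / temperature                  # (B, B) temperature-scaled similarities
    batch = sim.size(0)
    eye = torch.eye(batch, dtype=torch.bool, device=sim.device)

    pos = sim.diagonal()                         # positive similarity per query
    with torch.no_grad():
        # Negatives closer to (or above) the positive score get larger weights.
        w = torch.exp(alpha * (sim - pos.unsqueeze(1)))
        w = w.masked_fill(eye, 1.0)              # positives keep weight 1

    denom = (torch.exp(sim) * w).sum(dim=1)
    return (-pos + torch.log(denom)).mean()      # -log of the weighted softmax for the positive
```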
Universal Model Adaptation and Instruction-Prompt Techniques
Multiple frameworks adapt LLMs or MLLMs for embedding via hierarchical prompt injection (Ju et al., 1 Aug 2025), teacher-student knowledge distillation and hard negative filtering (Gu et al., 24 Apr 2025), or bidirectional continual pre-training with joint masked reconstruction (Chen et al., 29 Jun 2025). These approaches are used to overcome the limitations of standard causal attention or insufficient cross-modal interaction.
Multi-Vector Embeddings and Late Interaction
MetaEmbed introduces a scalable alternative to single-vector embeddings, appending learnable "Meta Tokens" to the input, whose contextualized hidden states serve as multi-vector representations. Retrieval similarity is computed via late interaction:

$$s(q, c) = \sum_{i} \max_{j} \left\langle q_i, c_j \right\rangle$$

where $q_i$ and $c_j$ are the query-side and candidate-side meta vectors.
"Matryoshka" nested training ensures that varying prefix lengths of meta vectors remain discriminative, allowing dynamic efficiency/quality trade-offs at retrieval (Xiao et al., 22 Sep 2025).
5. Data Synthesis and Multilingual Evaluation
To address supervision bottlenecks, high-quality synthetic data generation frameworks such as MegaPairs (Zhou et al., 19 Dec 2024) and mmE5 (Chen et al., 12 Feb 2025) are widely adopted. MegaPairs, for example, leverages VLMs and MLLMs in a two-step process: correlated image pairs are first mined via CLIP/DINOv2 similarity, and detailed instructions describing their semantic relationships are then generated using MLLMs and LLMs. mmE5 ensures coverage of 93 languages, robust cross-modal alignment, and high fidelity via a process of multi-aspect MLLM generation, self-evaluation, and revision within a single forward pass.
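As a conceptual sketch of the pair-mining step (not the released MegaPairs pipeline), correlated image pairs can be selected from precomputed vision-encoder embeddings by keeping nearest neighbours that are similar but not near-duplicates; the thresholds and names below are illustrative.

```python
import numpy as np

def mine_correlated_pairs(image_embs: np.ndarray, k: int = 5, low: float = 0.5, high: float = 0.95):
    """Mine correlated image pairs from precomputed CLIP/DINOv2-style embeddings:
    keep top-k neighbours whose similarity is high but below a near-duplicate cutoff."""
    x = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -1.0)                 # exclude self-matches
    pairs = []
    for i in range(len(x)):
        for j in np.argsort(-sims[i])[:k]:       # top-k neighbours of image i
            if low < sims[i, j] < high:          # correlated, but not a near-duplicate
                pairs.append((i, int(j)))
    return pairs
```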
Performance on both MMEB and XTD (multilingual text-to-image retrieval) demonstrates that models trained on strongly aligned, diverse synthetic datasets outperform baselines even when trained on less data, underscoring the importance of data quality and alignment.
6. Transfer, Scalability, and Future Directions
Scaling analyses reveal consistent performance improvement with increased model and dataset size (Chen et al., 29 Jun 2025, Xiao et al., 22 Sep 2025, Meng et al., 7 Jul 2025). Methods such as bidirectional pre-training, heterogeneous contrastive fine-tuning, and instruction-guided task unification improve robustness to OOD domains and enable zero-shot generalization to video, document, and cross-lingual retrieval. As models like MetaEmbed scale to 32B parameters, flexible late-interaction multi-vector methods further improve both retrieval quality and efficiency.
Recent trends emphasize:
- Modal-universal architectures: Unified modeling across images, videos, documents, text.
- Zero-shot and efficient adaptation: Strong performance without large-scale contrastive pre-training, via prompt engineering and hard negative self-mining (Ju et al., 1 Aug 2025); see the sketch after this list.
- Community-standardized evaluations: MMEB and related benchmarks (MMT-Bench, MEGA-Bench, MIBench) drive method-agnostic, reproducible comparative analysis.
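A minimal sketch of hard negative self-mining of the kind referenced above: for each query, the highest-scoring non-gold candidates are retained as negatives for subsequent contrastive training. The function name and the default value of top_k are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mine_hard_negatives(query_embs, cand_embs, gold_idx, top_k=8):
    """For each query, return indices of the top_k highest-scoring candidates
    that are not the gold item (gold_idx is a LongTensor of gold positions)."""
    q = F.normalize(query_embs, dim=-1)
    c = F.normalize(cand_embs, dim=-1)
    scores = q @ c.T                                          # (num_queries, num_candidates)
    scores[torch.arange(len(q)), gold_idx] = float("-inf")    # mask out the positives
    return scores.topk(top_k, dim=1).indices                  # hard-negative indices per query
```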
A plausible implication is continued expansion in benchmark tasks (especially video, audio, structured documents), refined negative selection and supervision strategies, and increased focus on real-world, multimodal, and multilingual scenarios.
7. Summary Table: MMEB and Related Benchmarks
| Benchmark | Scope / Modality | Key Innovations |
| --- | --- | --- |
| MMEB (Jiang et al., 7 Oct 2024) | 36 datasets, 4 meta-tasks | Instruction-guided ranking, OOD splits |
| MMEB-V2 (Meng et al., 7 Jul 2025) | +Video, +Documents (78 tasks) | Unified video/document/image/text support |
| MMT-Bench (Ying et al., 24 Apr 2024) | 32 meta-tasks, 162 subtasks, MCQ | Taskonomy; meta-task navigation |
| MEGA-Bench (Chen et al., 14 Oct 2024) | 505 diverse tasks, free-form answers | 40+ custom metrics; multidimensional reporting |
| MIBench (Liu et al., 21 Jul 2024) | Multi-image (13K samples, 13 tasks) | Multi-image reasoning, knowledge, ICL |
| MMMU-Pro (Yue et al., 4 Sep 2024) | Robust MCQ, vision-only, augmentation | Shortcut filtering, simulated real-world inputs |
| EvalMi-50K (Wang et al., 11 Apr 2025) | T2I generation QA and MOS | LMM-based evaluation metric, 20 tasks |
Each benchmark provides unique coverage and evaluation methodology, reflecting the increasing breadth and real-world relevance of universal multimodal embedding research.
In summary, MMEB and its extensions represent a pivotal resource for the measurement and advancement of universal multimodal embedding models. Through comprehensive task diversity, unified evaluation, and integration of cutting-edge model and data generation techniques, MMEB drives progress in semantic representation, cross-modal retrieval, and real-world deployment of multimodal machine learning systems.