
MMEB-V2: Multimodal Embedding Benchmark

Updated 8 March 2026
  • MMEB-V2 is a unified benchmark evaluating multimodal embeddings across text, image, video, and document tasks with clear, instruction-conditioned queries.
  • It integrates 78 tasks spanning retrieval, classification, grounding, and reasoning, and reports metrics such as Precision@1 and NDCG@5 for clear performance comparisons.
  • The design employs instruction-guided contrastive learning and strategic sub-batching to enhance model generalization on diverse, real-world queries.

MMEB-V2 (Massive Multimodal Embedding Benchmark, Version 2) is an expanded and systematically engineered benchmark suite evaluating multimodal embedding models on retrieval, classification, grounding, and reasoning tasks spanning text, images, videos, and visual documents. It was introduced as both a quantitative barometer for generalist multimodal models and as a platform to catalyze research into unified embedding architectures that generalize beyond natural images to videos and structured visual documents (Meng et al., 7 Jul 2025).

1. Scope and Rationale

MMEB-V2 extends the original MMEB, which focused on image-text retrieval and associated tasks, by integrating video and visual document modalities to reflect real-world search, recommendation, and retrieval-augmented generation scenarios. The stated rationale is the documented degradation of past embedding models’ performance when confronted with complex or cross-modal queries outside conventional image–text pairs. MMEB-V2 explicitly evaluates models over four input modalities: text (T), image (I), video (V), and visual document (D), thus enforcing handling of static scenes, temporal sequences, and multi-page structured visuals (Meng et al., 7 Jul 2025).

MMEB-V2 increases the breadth from MMEB’s 36 datasets to a total of 78 tasks, incorporating new meta-task categories designed to expose unique modality-specific and cross-modal challenges.

2. Task Structure and Datasets

All MMEB-V2 tasks are structured around a “query–candidate set” paradigm, with a single correct answer per candidate pool. The new meta-task groups introduced in MMEB-V2 are:

| Meta-Task Category | Example Datasets | # Queries | # Candidates |
|---|---|---|---|
| Video Retrieval (V-RET) | DiDeMo, MSR-VTT | 1,004–4,468 | same as # queries |
| Moment Retrieval (M-RET) | QVHighlights, Charades-STA | 727–1,800 | 10 |
| Video Classification (V-CLS) | Kinetics-700, Something-Something V2 | 433–1,000 | 51–700 |
| Video QA (V-QA) | MVBench, Video-MME | 500–8,564 | 2–5 |
| Visual Document Retrieval | ViDoRe-V1, VisRAG, ViDoSeek | 52–1,646 | 452–9,590 |

Many datasets are downsampled to promote benchmarking speed and reproducibility. Video inputs are uniformly subsampled to 8 frames/clip; visual document pages are rasterized.
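As an illustration of the uniform frame-subsampling convention, the following sketch (using NumPy; the function name and rounding policy are illustrative assumptions, not the benchmark's reference preprocessing) selects 8 evenly spaced frame indices from a clip:

```python
import numpy as np

def uniform_frame_indices(num_frames: int, num_samples: int = 8) -> np.ndarray:
    """Pick `num_samples` evenly spaced frame indices from a clip of `num_frames` frames."""
    if num_frames <= num_samples:
        # Short clips: keep every frame (any padding policy is left to the loader).
        return np.arange(num_frames)
    # Evenly spaced positions over [0, num_frames - 1], rounded to integer indices.
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

# Example: a 120-frame clip is reduced to 8 frames.
print(uniform_frame_indices(120))  # [  0  17  34  51  68  85 102 119]
```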

3. Input–Output Formalism and Evaluation Protocols

Queries are composed as multimodal tuples with instruction conditioning, typically following:

q_inst = [VISUAL_TOKEN]
         "Instruct: {task_instruction}"
         "Query: {q}"

[VISUAL_TOKEN] distinguishes between input types (e.g., <|video_pad|> for video inputs).
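A minimal Python sketch of assembling such an instruction-conditioned query string (the helper name and default placeholder token are illustrative assumptions; the benchmark's own preprocessing may differ):

```python
def build_query(task_instruction: str, query_text: str,
                visual_token: str = "<|video_pad|>") -> str:
    """Compose an instruction-conditioned multimodal query string.

    `visual_token` is a placeholder the model's processor later expands
    into the actual visual (image/video/document) features.
    """
    return f"{visual_token}\nInstruct: {task_instruction}\nQuery: {query_text}"

q_inst = build_query(
    task_instruction="Retrieve the video clip that matches the description.",
    query_text="A dog catches a frisbee on the beach.",
)
print(q_inst)
```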

Candidates are single-modality items (labels, clips, doc pages). The model’s objective is always top-1 retrieval/classification:

  • Retrieval and Grounding: $\mathrm{Recall}@K = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{r_i \le K\}$, where $r_i$ is the rank assigned to the correct candidate for query $i$
  • Document Ranking: NDCG@5, to reflect graded relevance
  • Classification and QA: Hit@1 accuracy
  • Temporal Grounding: Hit@1 (no IoU threshold)
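The following sketch computes these metrics from the rank of the correct candidate per query, matching the formulas above (function names are assumptions, not the benchmark's evaluation code):

```python
import numpy as np

def recall_at_k(ranks: np.ndarray, k: int = 1) -> float:
    """Fraction of queries whose correct candidate is ranked within the top k.
    `ranks` holds the 1-indexed rank r_i of the gold candidate for each query."""
    return float(np.mean(ranks <= k))

def hit_at_1(ranks: np.ndarray) -> float:
    """Hit@1 is Recall@1: the gold candidate must be ranked first."""
    return recall_at_k(ranks, k=1)

def ndcg_at_k(relevances_per_query, k: int = 5) -> float:
    """Mean NDCG@k; each entry gives the graded relevance of every candidate
    for one query, listed in the order the model ranked them."""
    scores = []
    for rel in relevances_per_query:
        rel = np.asarray(rel, dtype=float)
        discounts = 1.0 / np.log2(np.arange(2, k + 2))
        dcg = float(np.sum(rel[:k] * discounts[: rel[:k].size]))
        ideal = np.sort(rel)[::-1][:k]
        idcg = float(np.sum(ideal * discounts[: ideal.size]))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return float(np.mean(scores))

ranks = np.array([1, 3, 1, 2])      # gold-candidate ranks for four queries
print(recall_at_k(ranks, k=1))      # 0.5
print(recall_at_k(ranks, k=3))      # 1.0
```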

The benchmark enforces a contrastive ranking training protocol, using InfoNCE loss:

$$\mathcal{L} = -\log \frac{\exp\!\left(\tfrac{1}{\tau} \cos(h_q, h_{t^+})\right)}{\exp\!\left(\tfrac{1}{\tau} \cos(h_q, h_{t^+})\right) + \sum_{t^-} \exp\!\left(\tfrac{1}{\tau} \cos(h_q, h_{t^-})\right)}$$

where $h_q$ and $h_{t^{\pm}}$ are query and target embeddings and $\tau$ is a fixed temperature.
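A minimal PyTorch sketch of this loss, using the other targets in the batch as in-batch negatives (an illustrative implementation with an assumed temperature value, not the authors' training code):

```python
import torch
import torch.nn.functional as F

def infonce_loss(h_q: torch.Tensor, h_t: torch.Tensor, tau: float = 0.02) -> torch.Tensor:
    """InfoNCE over cosine similarities.

    h_q: (B, D) query embeddings; h_t: (B, D) target embeddings, where h_t[i]
    is the positive for h_q[i] and all other rows serve as in-batch negatives.
    """
    h_q = F.normalize(h_q, dim=-1)
    h_t = F.normalize(h_t, dim=-1)
    logits = (h_q @ h_t.T) / tau            # (B, B) cosine similarities scaled by 1/tau
    labels = torch.arange(h_q.size(0), device=h_q.device)
    return F.cross_entropy(logits, labels)  # -log softmax at the positive (diagonal) index

# Example with random embeddings (batch of 4, dimension 16).
loss = infonce_loss(torch.randn(4, 16), torch.randn(4, 16))
```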

In all reported MMEB-V2 retrieval experiments, the primary metric is Precision@1 (the special case of Recall@1 where candidates are unique) (Zhu et al., 29 Sep 2025).

4. Quantitative and Comparative Results

Experiments with models such as VLM2Vec-V2, GME, and FreeRet demonstrate MMEB-V2’s utility:

  • Image: VLM2Vec-V2 (2B) achieves 64.9 Hit@1, outperforming GME (7B, 56.0).
  • Video: GME (7B) leads with 38.6 Hit@1; VLM2Vec-V2 (2B) achieves 34.9 with substantially less video data (Meng et al., 7 Jul 2025).
  • Visual Document: Best NDCG@5 (GME: 75.2, VLM2Vec-V2: 65.4) on scanned document retrieval tasks.

Ablation studies suggest that training on all modalities (image, video, document) yields the highest aggregate performance, and that interleaved sub-batching during training is crucial for stable convergence and balanced cross-modality transfer.

FreeRet establishes a strong baseline for training-free retrieval by leveraging off-the-shelf MLLMs. On MMEB-V2's video classification tasks, FreeRet (Qwen2-VL 7B) attains 63.2 Precision@1 versus 37.4 for GME, despite GME's additional contrastive training on roughly 8M pairs, and performs similarly on video retrieval (Zhu et al., 29 Sep 2025). This suggests properly harnessed pretrained MLLMs can outperform specialist contrastive encoders even without in-domain training.

5. Design Principles and Best Practices

MMEB-V2 enforces unified multimodal formatting, with instruction-conditioned queries that serve to clarify task objectives for the model. Key architecture and training design decisions identified by VLM2Vec-V2 include:

  • Use of a single backbone (e.g., Qwen2-VL with M-RoPE) for all four modalities to support variable-length/structure input.
  • Strategic sampling over sub-batches to mix “hard” in-domain and “diverse” cross-domain negatives.
  • Instruction-guided contrastive learning, where task instructions embedded in the data significantly improve transfer across modalities.
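A sketch of one way such interleaved, sub-batch-level sampling could be organized (the grouping key and sub-batch sizes are assumptions): each training batch is composed of homogeneous sub-batches drawn from different source datasets, so in-batch negatives within a sub-batch stay "hard" and in-domain while the full batch remains modality-diverse.

```python
import random
from collections import defaultdict

def interleaved_batches(examples, sub_batch_size=16, sub_batches_per_batch=4, seed=0):
    """Yield batches built from several homogeneous sub-batches.

    `examples` is a list of dicts with at least a 'dataset' key; each sub-batch
    is drawn from a single dataset so its in-batch negatives are in-domain,
    while different sub-batches in the same batch come from different datasets.
    """
    rng = random.Random(seed)
    by_dataset = defaultdict(list)
    for ex in examples:
        by_dataset[ex["dataset"]].append(ex)
    for pool in by_dataset.values():
        rng.shuffle(pool)

    # Cut each dataset's pool into fixed-size sub-batches.
    sub_batches = []
    for pool in by_dataset.values():
        for i in range(0, len(pool) - sub_batch_size + 1, sub_batch_size):
            sub_batches.append(pool[i : i + sub_batch_size])
    rng.shuffle(sub_batches)

    # Interleave: each full batch concatenates sub-batches from (usually) different datasets.
    for i in range(0, len(sub_batches) - sub_batches_per_batch + 1, sub_batches_per_batch):
        yield [ex for sb in sub_batches[i : i + sub_batches_per_batch] for ex in sb]
```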

For retrieval-centric models like FreeRet, specific ablations demonstrate:

  • Discarding the final MLP head (lexicalization bypass) significantly increases retriever performance (by 5–6 Precision@1 points).
  • Controlled prompts with semantic and noise priors further increase embedding fidelity.
  • MCQ-style reranking robustly mitigates framing bias compared to Yes/No or True/False prompt formats (Zhu et al., 29 Sep 2025).
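One plausible reading of the "lexicalization bypass" is to take embeddings directly from the model's final hidden states rather than from any vocabulary-projection output. The sketch below illustrates that reading using Hugging Face transformers with a generic causal LM; the model choice, final-token pooling, and prompts are assumptions for illustration, not FreeRet's actual pipeline (which operates on an MLLM such as Qwen2-VL).

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed model; a text-only LM stands in here purely to show hidden-state pooling.
model_name = "Qwen/Qwen2-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Use the last-layer hidden state of the final token as the embedding,
    never passing through the LM (vocabulary projection) head."""
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    last_hidden = outputs.hidden_states[-1]   # (1, seq_len, hidden_dim)
    emb = last_hidden[0, -1]                  # final-token pooling
    return torch.nn.functional.normalize(emb, dim=-1)

# Cosine similarity between two normalized embeddings.
sim = embed("a dog catching a frisbee") @ embed("a puppy leaps for a flying disc")
```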

6. Extensions and Outlook

MMEB-V2 underpins newer specialized evaluation suites. For example, the MMEVerse benchmark for multimodal emotion understanding leverages MMEB-V2 infrastructure for emotion recognition, reasoning, and sentiment classification across twelve aggregated datasets, using similar multimodal input conventions and metrics (e.g., accuracy, macro-F1, mAP, and reasoning quality via LLM-based overlap) (Peng et al., 23 Jan 2026). This underscores MMEB-V2's flexibility as a substrate for both general and specialized multimodal research.

A plausible implication is that MMEB-V2's breadth and rigor will continue to drive development of both generalist and domain-optimized multimodal embedding models, especially as tasks move toward high-fidelity, instruction-based reasoning and explanation.

7. Practical Use and Benchmarking Considerations

MMEB-V2 is designed to be computationally tractable for academic use, with held-out test splits and manageable candidate set sizes, enabling end-to-end benchmarking on current GPU infrastructure (e.g., 2 hours for all 78 tasks on 8 GPUs). Test splits derive from official dataset splits where available; video/document preprocessing is standardized to facilitate model comparability.

Reported limitations include the current metric focus (Precision@1/Recall@1) and the challenge posed by visually similar negatives in scene-centric or template-heavy retrieval settings. Nevertheless, MMEB-V2 provides a rigorous and extensible basis for measuring unified multimodal understanding in large-scale neural models (Meng et al., 7 Jul 2025, Zhu et al., 29 Sep 2025).
