MIEB: Massive Image Embedding Benchmark (2504.10471v1)

Published 14 Apr 2025 in cs.CV and cs.CL

Abstract: Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal LLMs. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.

This paper introduces the Massive Image Embedding Benchmark (MIEB), a comprehensive benchmark for evaluating image and image-text embedding models. Current evaluation protocols are fragmented, often focusing on narrow, task-specific settings such as image-text retrieval or zero-shot classification; MIEB aims to provide a unified framework for assessing a broader spectrum of model capabilities.

MIEB comprises 130 individual tasks spanning 38 languages and is organized into 8 high-level categories:

  • Retrieval: Evaluates semantic similarity matching between images, texts, or interleaved image-text content, covering cross-modal, cross-lingual, and instruction-aware settings.
  • Document Understanding: Focuses on evaluating the visual understanding of documents, including dense text and complex layouts, often requiring OCR abilities. Implemented using the retrieval framework.
  • Linear Probing (Classification): Assesses the information encoded in image embeddings by training a linear classifier on frozen embeddings for various classification tasks (using a few-shot approach for efficiency).
  • Zero-shot Classification: Evaluates the ability to classify images by matching image embeddings directly to text embeddings representing class labels, without training a task-specific classifier.
  • Compositionality Evaluation (PairClassification): Measures models' understanding of compositional aspects of vision-language inputs, such as object relationships and attributes, often requiring models to distinguish correct captions from subtly altered hard negatives.
  • Vision-centric QA: Tests specific visual understanding skills like object counting, spatial relationship identification, and perceiving artistic styles, framed as a retrieval task over a small set of options.
  • Clustering: Evaluates if image embeddings form meaningful clusters corresponding to labels using metrics like Normalized Mutual Information (NMI).
  • Visual STS (Semantic Textual Similarity): A novel protocol where text pairs from traditional STS tasks are rendered into images to assess how well vision encoders understand the relative semantics of texts visually, using Spearman correlation (a sketch follows this list).
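As an illustration of the Visual STS protocol, the sketch below renders a sentence pair as images, embeds the renderings with a hypothetical encode_image function, and scores the encoder via the Spearman correlation between cosine similarities and human similarity ratings. The rendering details and encoder interface are assumptions, not the paper's exact setup.

    # Minimal Visual STS sketch (rendering parameters and encoder API are assumed)
    import numpy as np
    from PIL import Image, ImageDraw
    from scipy.stats import spearmanr

    def render_text(text, size=(448, 224)):
        # Render a sentence on a white canvas using PIL's default font
        img = Image.new("RGB", size, "white")
        ImageDraw.Draw(img).text((10, 10), text, fill="black")
        return img

    def visual_sts_score(encode_image, text_pairs, gold_scores):
        # encode_image: hypothetical callable mapping a list of PIL images to an (n, d) array
        sims = []
        for s1, s2 in text_pairs:
            e1, e2 = encode_image([render_text(s1), render_text(s2)])
            sims.append(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
        # Spearman correlation between model similarities and human ratings
        return spearmanr(sims, gold_scores).correlation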

The benchmark emphasizes zero-shot evaluation, meaning models are evaluated on frozen embeddings without task-specific fine-tuning, except for the linear probing setup where a linear model is trained on fixed embeddings.
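To make the linear probing protocol concrete, the following is a minimal sketch that fits a logistic-regression probe on frozen, precomputed image embeddings with a per-class few-shot sample. The probe settings and shot count are illustrative assumptions; MIEB's exact configuration may differ.

    # Sketch of few-shot linear probing on frozen image embeddings (settings are illustrative)
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def linear_probe_accuracy(train_embs, train_labels, test_embs, test_labels, shots=16, seed=0):
        rng = np.random.default_rng(seed)
        # Sample up to `shots` training examples per class (few-shot setting)
        idx = []
        for label in np.unique(train_labels):
            label_idx = np.flatnonzero(train_labels == label)
            idx.extend(rng.choice(label_idx, size=min(shots, len(label_idx)), replace=False))
        clf = LogisticRegression(max_iter=1000)       # linear classifier on frozen features
        clf.fit(train_embs[idx], train_labels[idx])
        return clf.score(test_embs, test_labels)      # classification accuracy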

MIEB is built on principles of usability, aiming for simplicity (benchmarking new models in minimal code), extensibility (easy addition of new datasets), reproducibility (versioning of models and datasets), and diversity (covering a wide range of tasks and capabilities). It extends the codebase and leaderboard of the Massive Text Embedding Benchmark (MTEB) (Muennighoff et al., 2022).

The paper benchmarks 50 models, including vision-only models (e.g., DINOv2; Oquab et al., 2024), CLIP-style models (e.g., CLIP, Radford et al., 2021; SigLIP, Zhai et al., 2023), and MLLM-based models (e.g., E5-V, Jiang et al., 2024; VLM2Vec, Jiang et al., 2024; and the Voyage multimodal embedding API).

Key findings from the extensive evaluation include:

  • No single embedding model dominates across all task categories.
  • MLLM-based models generally achieve the highest overall performance, particularly excelling in visual text understanding (Document Understanding, Visual STS) and multilingual tasks. Their natural ability to handle interleaved inputs is beneficial.
  • However, MLLM-based models often perform worse than CLIP-style models in traditional computer vision tasks like linear probing and zero-shot classification, especially on fine-grained visual categories. This suggests a potential trade-off or loss of precision in purely visual representations compared to models trained primarily for visual tasks or standard image-text alignment.
  • CLIP-style models are strong in traditional tasks but show limitations in interleaved retrieval, visual text tasks, and multilingual settings unless specifically designed for them (like multilingual SigLIP). Scaling factors like model size and dataset quality positively impact performance in clustering, classification, and retrieval for these models.
  • The Visual STS task reveals that models trained with language supervision (like CLIP variants) possess inherent visual text reading capabilities, unlike purely visually-supervised models.
  • Performance on MIEB tasks, particularly Visual STS, shows a strong correlation with the performance of MLLMs using the same vision encoder on generative VQA and OCR benchmarks, suggesting MIEB can serve as a practical tool for selecting visual encoders for MLLMs.
  • Compositionality tasks remain highly challenging for current models, pointing to areas for future improvement in reasoning capabilities.

To address the computational cost of evaluating 130 tasks, the authors introduce MIEB-lite, a lightweight version with 51 representative tasks. MIEB-lite is designed to preserve task category coverage and inter-task correlations while significantly reducing evaluation time (e.g., 82.4% reduction for an 8B model). Evaluation results show a high correlation (Spearman 0.992, Pearson 0.986) between overall performance on MIEB and MIEB-lite.
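The reported agreement between the full and lite benchmarks is a straightforward rank/linear correlation over per-model aggregate scores; a minimal sketch, assuming two aligned arrays of scores:

    # Sketch: agreement between per-model aggregate scores on MIEB and MIEB-lite
    from scipy.stats import pearsonr, spearmanr

    def benchmark_agreement(full_scores, lite_scores):
        # full_scores and lite_scores are aligned per-model aggregate scores
        return {
            "spearman": spearmanr(full_scores, lite_scores).correlation,
            "pearson": pearsonr(full_scores, lite_scores)[0],
        }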

MIEB serves as a valuable resource for the community by providing a standardized, broad, and challenging benchmark to drive the development of more universal and capable image and image-text embedding models. The benchmark, code, dataset, and leaderboard are publicly available.

Implementation Considerations and Applications:

  • Model Integration: The benchmark is designed to be easily extensible. To evaluate a new model, developers typically create a wrapper class that loads the model and provides methods for generating image and text embeddings. Models that handle interleaved inputs should expose that capability; otherwise, the benchmark falls back to a default sum-pooling of the separately encoded embeddings, though model-specific interleaved methods are preferred when available (a sketch of this fallback follows the wrapper code below).
    # Pseudocode for a simple model wrapper. Method and metadata field names are
    # illustrative; consult the mteb/MIEB documentation for the exact interface.
    from mteb.model_meta import ModelMeta

    class NewImageEmbeddingModel:
        # Model metadata registered with the benchmark (fields shown are examples)
        metadata = ModelMeta(
            name="NewModel",
            # ... other metadata such as languages, modalities, revision, etc.
        )

        def __init__(self, model_name="new_model_checkpoint", **kwargs):
            # Load the model checkpoint
            self.model = self._load_model(model_name)

        def _load_model(self, model_name):
            # Implementation to load your specific model
            pass

        def encode_image(self, images, **kwargs):
            # Return one embedding vector per image
            pass

        def encode_text(self, texts, **kwargs):
            # Return one embedding vector per text
            pass

        # Optional: implement only if the model handles interleaved image-text
        # inputs natively; otherwise the benchmark falls back to sum-pooling the
        # separately encoded image and text embeddings.
        def encode_interleaved(self, inputs, **kwargs):
            pass

    # To run the benchmark on your model (sketch; task names are placeholders):
    # import mteb
    # model = NewImageEmbeddingModel("path/to/checkpoint")
    # tasks = mteb.get_tasks(tasks=["..."])
    # evaluation = mteb.MTEB(tasks=tasks)
    # results = evaluation.run(model)
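    As noted above, when a model does not expose native interleaved encoding, a simple fallback is to encode each modality separately and sum-pool the results. A minimal sketch of that idea (assuming NumPy arrays and the illustrative wrapper above; MIEB's exact fallback implementation may differ):

    # Sketch of a sum-pooling fallback for interleaved inputs (illustrative, not MIEB's exact code)
    import numpy as np

    def sum_pooled_interleaved_embedding(model, images, texts):
        # Encode images and texts separately, then sum-pool into a single vector
        image_embs = np.asarray(model.encode_image(images))    # shape (n_images, d)
        text_embs = np.asarray(model.encode_text(texts))       # shape (n_texts, d)
        return image_embs.sum(axis=0) + text_embs.sum(axis=0)  # shape (d,)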
  • Computational Requirements: Evaluating on the full MIEB (130 tasks) is computationally intensive. The paper provides runtime examples (e.g., 264 GPU hours on an H100 for E5-V). Developers should leverage MIEB-lite (51 tasks) for quicker iteration and initial assessment, reserving the full benchmark for final evaluation (a run sketch follows this list). The benchmark is designed for GPU acceleration.
  • Model Selection: The detailed results across 8 categories provide insights into model strengths and weaknesses. Developers can use these results to select models based on the specific requirements of their application. For instance, if the application heavily involves understanding text within images or requires multilingual support, MLLM-based models like E5-V might be preferred despite their potential lower performance on fine-grained visual classification. For traditional image/image-text retrieval on English data, optimized CLIP variants might be better.
  • Benchmarking Novel Capabilities: MIEB highlights under-explored areas like compositionality and visual text understanding. Developers working on applications requiring these capabilities (e.g., sophisticated visual search, document AI, visually-grounded reasoning) can use the benchmark to identify models with promising performance and guide future research directions.
  • Limitations: The benchmark evaluates models on frozen embeddings. While this assesses the quality of the learned representation, it doesn't capture performance gains achievable through task-specific fine-tuning. However, for many real-world applications, using off-the-shelf, frozen embeddings is practical due to computational constraints and the need for a single general-purpose model. The benchmark also focuses on specific task types (retrieval, classification, etc.) and doesn't cover generative tasks directly, although the reported correlation analysis suggests MIEB performance can be predictive of generative MLLM performance.
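As a concrete starting point for the quick-iteration workflow recommended under Computational Requirements, the sketch below selects a benchmark by name through the mteb framework and runs it on a wrapped model. The benchmark name string is an assumption and should be checked against the current mteb registry.

    # Sketch: running a named benchmark through the mteb framework
    # (the name "MIEB(lite)" is an assumed identifier; check mteb's benchmark registry)
    import mteb

    model = NewImageEmbeddingModel("path/to/checkpoint")   # wrapper sketched earlier
    benchmark = mteb.get_benchmark("MIEB(lite)")           # hypothetical benchmark identifier
    evaluation = mteb.MTEB(tasks=benchmark)
    results = evaluation.run(model, output_folder="results/new_model")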
Authors (10)
  1. Chenghao Xiao
  2. Isaac Chung
  3. Imene Kerboua
  4. Jamie Stirling
  5. Xin Zhang
  6. Márton Kardos
  7. Roman Solomatin
  8. Noura Al Moubayed
  9. Kenneth Enevoldsen
  10. Niklas Muennighoff