RSMEB: Remote Sensing Multimodal Benchmark
- RSMEB is a unified, large-scale evaluation suite that assesses remote sensing multimodal embeddings by integrating semantic content with spatial/geographic reasoning.
- It employs a ranking-based protocol across 21 tasks—including classification, retrieval, VQA, and geo-localization—and evaluates embedding models zero-shot, without any task-specific post-training.
- Baseline comparisons show large P@1 gains for the single-encoder VLM2GeoVec, highlighting both progress and remaining challenges in precise spatial localization.
Remote-Sensing Multimodal Embedding Benchmark (RSMEB) is a unified, large-scale evaluation suite specifically designed to quantify the capabilities of multimodal embedding models on a diverse range of remote sensing (RS) tasks. It was introduced to address the absence of a single benchmark that jointly evaluates “what” (semantic content) and “where” (spatial/geographical reasoning) performance in remote sensing, integrating both region-level spatial cues and holistic scene understanding (Aimar et al., 12 Dec 2025).
1. Motivation and Design Principles
RSMEB fills a critical evaluation gap. Traditional RS benchmarks are narrowly specialized (scene classification, retrieval, VQA, or object detection), while generic vision–language benchmarks lack RS-relevant modality cues such as geographic coordinates and bounding boxes. RSMEB adopts a ranking-based protocol for all tasks, enabling single-encoder models to be evaluated across classification, retrieval, grounding, visual question answering (VQA), and geo-localization. The benchmark explicitly incorporates both region-level (bounding box) and geospatial (latitude/longitude) reasoning to test fine-grained “where” understanding. All evaluations are zero-shot—models are not trained on any RSMEB subset post-pretraining.
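As a minimal sketch of this ranking protocol, the snippet below scores a pool of candidate embeddings against a query embedding by cosine similarity and returns them best-first; `embed_query` and `embed_text` in the usage comment are hypothetical placeholders, not the benchmark's API.

```python
import numpy as np

def rank_candidates(query_emb: np.ndarray, cand_embs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by descending cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ q                 # shape: (num_candidates,)
    return np.argsort(-sims)     # best match first

# Example: scene classification as ranking over class-name embeddings
# (embed_query / embed_text are stand-ins for the model's encoders):
# order = rank_candidates(embed_query(image, instruction), embed_text(class_names))
# predicted_class = class_names[order[0]]
```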
2. Task Structure and Modalities
RSMEB spans 21 individual tasks grouped into six meta-tasks:
- Scene Classification: Rank N class-name candidates for a given satellite image and instruction.
- Cross-Modal Retrieval: Image-to-text (I→T) and text-to-image (T→I) retrieval tasks across datasets such as RSITMD, RSICD, and UCM-caption.
- Compositional Retrieval (rCIR): Retrieve a full-scene image conditioned on a region crop and a text modifier.
- Visual Question Answering: Multiple-choice presence, comparison, and rural/urban queries over images.
- Visual Grounding & Region Reasoning: Includes three sub-tasks:
  - Referring-Expression Retrieval (RefExp): rank image-crop candidates given an image and a referring expression.
  - Region-Caption Retrieval (RegCap): rank text-caption candidates for a given image and bounding box.
  - Grounded T2I (GrT2I): rank full-scene images given text and bounding boxes.
- Semantic Geospatial Retrieval (GeoT2I): Retrieve the matching image given text and (latitude, longitude) input.
All task queries are interleaved token streams of images, text, bounding boxes (encoded as four normalized scalars), and geo-coordinates (serialized as a text tuple "(lat, lon)"). The target pool consists of class labels, captions, image crops, or full-scene images, depending on the task.
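To make the query format concrete, the sketch below shows one plausible serialization of such an interleaved query; the `<image>` marker, field ordering, and numeric precision are assumptions for illustration, not the benchmark's exact tokenization.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Query:
    image_path: str
    instruction: str
    bbox: Optional[Tuple[float, float, float, float]] = None  # normalized (x_min, y_min, x_max, y_max)
    latlon: Optional[Tuple[float, float]] = None               # (latitude, longitude) in degrees

def serialize(q: Query) -> str:
    # Interleave an image placeholder, the instruction, and any spatial cues.
    parts = ["<image>", q.instruction]
    if q.bbox is not None:
        parts.append("[{:.3f}, {:.3f}, {:.3f}, {:.3f}]".format(*q.bbox))  # four normalized scalars
    if q.latlon is not None:
        parts.append("({:.4f}, {:.4f})".format(*q.latlon))                # "(lat, lon)" text tuple
    return " ".join(parts)

# e.g., a GeoT2I-style query:
print(serialize(Query("scene.tif", "Retrieve the image matching this description and location.",
                      latlon=(48.8566, 2.3522))))
```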
3. Datasets and Preprocessing
RSMEB aggregates public RS datasets—AID, Million-AID, RSI-CB, EuroSAT, UCMerced, PatternNet (scene classification); RSITMD, RSICD, UCM-caption (retrieval); LRBEN/HRBEN (VQA); and proprietary splits for rCIR, RefExp, RegCap, GrT2I, and GeoT2I—using them purely for evaluation, with no model retraining. Test set sizes include:
| Dataset/Task | # Queries | # Candidates/Classes |
|---|---|---|
| AID | 2,000 | 30 classes |
| Million-AID | 10,000 | 51 classes |
| RSI-CB | 24,747 | 35 classes |
| RSITMD | variable | variable |
| rCIR | 1,818 | 1,115 full images |
| RefExp, GeoT2I | 2,000 | 2,000 candidates |
All full-scene images are resampled to a fixed resolution (e.g., 336×336) and tokenized by a ViT-L/14 vision encoder (14×14-pixel patches); region crops are extracted at native resolution and then rescaled to the same input size.
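A minimal preprocessing sketch under the stated assumptions (fixed 336×336 input, normalized bounding boxes, crops taken at native resolution before rescaling); the function names and resampling defaults are illustrative rather than the benchmark's actual pipeline.

```python
from PIL import Image

TARGET = 336  # 336 / 14 = 24 x 24 patch tokens for a ViT-L/14 encoder

def preprocess_scene(path: str) -> Image.Image:
    """Resample a full scene to the encoder's fixed input size."""
    return Image.open(path).convert("RGB").resize((TARGET, TARGET))

def preprocess_region(path: str, bbox_norm: tuple) -> Image.Image:
    """Crop at native resolution using a normalized (x_min, y_min, x_max, y_max) box,
    then rescale the crop to the same input size."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    x0, y0, x1, y1 = bbox_norm
    crop = img.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    return crop.resize((TARGET, TARGET))
```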
4. Evaluation Protocols and Metrics
Every task is cast as a ranking or retrieval problem. Let $N$ be the number of queries.
- Classification Accuracy: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i)$, where $\hat{y}_i$ is the top-1 predicted class for query $i$.
- Retrieval Recall@K: $\mathrm{Recall@}K = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\mathrm{gt}_i \in \mathrm{top\text{-}K}_i)$; averages are reported over R@1, R@5, and R@10 for cross-modal retrieval.
- Precision@1 (P@1) for single-correct-target tasks (e.g., RegCap, RefExp, rCIR, GrT2I, GeoT2I, VQA): $\mathrm{P@1} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{c}_i = \mathrm{gt}_i)$, where $\hat{c}_i$ is the top-1 ranked candidate for query $i$.
Contrastive pretraining is measured by InfoNCE loss.
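The metrics above can be computed directly from a query-by-candidate similarity matrix. The sketch below is a minimal illustration assuming embeddings have already been produced and each query has exactly one correct candidate (so P@1 coincides with R@1); the InfoNCE temperature is an assumed value, not one specified by the paper.

```python
import numpy as np

def ranking_metrics(sims: np.ndarray, gt: np.ndarray, ks=(1, 5, 10)) -> dict:
    """sims: (N, M) query-candidate similarities; gt: (N,) index of the correct candidate."""
    order = np.argsort(-sims, axis=1)                # candidates sorted best-first per query
    ranks = np.argmax(order == gt[:, None], axis=1)  # 0-based rank of the ground-truth candidate
    out = {f"R@{k}": float(np.mean(ranks < k)) for k in ks}
    out["P@1"] = out["R@1"]                          # identical when each query has one correct target
    return out

def info_nce(sims: np.ndarray, gt: np.ndarray, tau: float = 0.07) -> float:
    """InfoNCE loss over the same similarity matrix (temperature tau is an assumption)."""
    logits = sims / tau
    z = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(gt)), gt].mean())
```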
5. Baseline Comparisons and Results
RSMEB evaluations compare general-purpose VLMs (CLIP ViT-L/14, VLM2Vec) and remote-sensing-specialized dual encoders (RemoteCLIP, SkyCLIP, GeoRSCLIP) against the single-encoder VLM2GeoVec. Key findings include:
| Meta-task/Task | Baseline Best (%) | VLM2GeoVec (%) | Δ (pp) / Note |
|---|---|---|---|
| Region-Caption Retrieval (RegCap) | 1.25 (dual encoder) | 26.56 | +25.31 |
| Referring-Expression (RefExp) | 13.15 (CLIP) | 32.50 | +19.35 |
| Semantic Geo-T2I | 5.10 (SkyCLIP) | 17.80 | +12.70 |
| Region-CIR (rCIR) | 3.96 (SkyCLIP) | 22.99 | +19.03 |
| Grounded T2I (GrT2I) | 0.98 (CLIP) | 13.70 | +12.72 |
| Cross-modal Retrieval (avg. R@K) | 45.2 (RemoteCLIP) | 37.2 | −8.0 (2nd place) |
| Scene Classification (avg. acc.) | — | 66.1 | Top performer |
| VQA (avg. P@1, LRBEN/HRBEN) | 83.4 (GeoChat) | 83.1 | −0.3 (top embedding model) |
Precision@1 improvements on region-level and geo-localization tasks are +12–25 percentage points over the prior state of the art, with VLM2GeoVec obtaining the leading Friedman ranking score (1.93) among all tested models (Aimar et al., 12 Dec 2025).
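If the Friedman ranking score reported above corresponds to a mean per-task rank (an assumption here), it can be computed from a models-by-tasks score table as in the sketch below; the commented numbers are placeholders, not benchmark results.

```python
import numpy as np
from scipy.stats import rankdata

def friedman_mean_ranks(scores: np.ndarray) -> np.ndarray:
    """scores: (num_models, num_tasks), higher is better.
    Returns each model's average rank across tasks (1.0 = best on every task)."""
    per_task_ranks = np.column_stack(
        [rankdata(-scores[:, t]) for t in range(scores.shape[1])]
    )                                    # shape: (num_models, num_tasks), rank 1 = best
    return per_task_ranks.mean(axis=1)

# Placeholder usage (rows = models, columns = tasks):
# scores = np.array([[0.9, 0.8, 0.7],
#                    [0.4, 0.6, 0.5]])
# print(friedman_mean_ranks(scores))     # -> [1. 2.]
```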
6. Challenges, Insights, and Future Directions
- Spatial Localization Bottleneck: Region-caption, referring-expression, and grounded T2I tasks remain the most challenging, with P@1 reaching only 26.6% (RegCap), 32.5% (RefExp), and 13.7% (GrT2I) even for VLM2GeoVec; semantic geo-retrieval (GeoT2I) similarly tops out at 17.8%. Mapping fine-grained spatial inputs (bounding boxes, coordinates) into a single vector embedding remains difficult.
- Compositional and Contextual Reasoning: Tasks such as rCIR (region-based compositional retrieval) are challenging, requiring simultaneous spatial grounding and contextual editing.
- Dual Encoder Limitations: Late-fusion dual-encoder architectures perform substantially worse than interleaved single-encoder models on region-level tasks.
- Geo-coordinate Handling: Current tokenization of coordinates as text tuples is suboptimal; future research should develop continuous geodesic embeddings for improved geo-localization (one possible encoding is sketched after this list).
- Extensibility: Suggestions include richer spatial encodings (learned map projections, topographic embeddings), integration of multi-sensor/temporal data, and unified pretraining that jointly optimizes contrastive and generative objectives.
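As one illustration of such a direction, the sketch below replaces the "(lat, lon)" text tuple with a multi-scale sinusoidal encoding of the coordinates in radians; this is a generic technique chosen here for illustration, not a method used or proposed by the benchmark.

```python
import numpy as np

def latlon_encoding(lat_deg: float, lon_deg: float, num_freqs: int = 8) -> np.ndarray:
    """Multi-scale sinusoidal features for a (latitude, longitude) pair."""
    lat, lon = np.radians([lat_deg, lon_deg])
    freqs = 2.0 ** np.arange(num_freqs)       # geometric frequency ladder
    feats = []
    for angle in (lat, lon):
        feats.append(np.sin(freqs * angle))
        feats.append(np.cos(freqs * angle))
    return np.concatenate(feats)              # shape: (4 * num_freqs,)

# print(latlon_encoding(48.8566, 2.3522).shape)   # -> (32,)
```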
RSMEB establishes a unified, ranking-based protocol for evaluating the ability of embeddings to capture both semantic and spatial/geographic relationships in remote sensing, enabling consistent evaluation across 21 diverse tasks and setting the stage for next-generation RS multimodal models (Aimar et al., 12 Dec 2025).