
RSMEB: Remote Sensing Multimodal Benchmark

Updated 19 December 2025
  • RSMEB is a unified, large-scale evaluation suite that assesses remote sensing multimodal embeddings by integrating semantic content with spatial/geographic reasoning.
  • It employs a ranking-based protocol across 21 tasks—including classification, retrieval, VQA, and geo-localization—to benchmark single-encoder models without post-training.
  • Baseline comparisons show significant P@1 improvements with models like VLM2GeoVec, highlighting both achievements and challenges in precise spatial localization.

Remote-Sensing Multimodal Embedding Benchmark (RSMEB) is a unified, large-scale evaluation suite specifically designed to quantify the capabilities of multimodal embedding models on a diverse range of remote sensing (RS) tasks. It was introduced to address the absence of a single benchmark that jointly evaluates “what” (semantic content) and “where” (spatial/geographical reasoning) performance in remote sensing, integrating both region-level spatial cues and holistic scene understanding (Aimar et al., 12 Dec 2025).

1. Motivation and Design Principles

RSMEB fills a critical evaluation gap. Traditional RS benchmarks are narrowly specialized (scene classification, retrieval, VQA, or object detection), while generic vision–language benchmarks lack RS-relevant modality cues such as geographic coordinates and bounding boxes. RSMEB adopts a ranking-based protocol for all tasks, enabling single-encoder models to be evaluated across classification, retrieval, grounding, visual question answering (VQA), and geo-localization. The benchmark explicitly incorporates both region-level (bounding box) and geospatial (latitude/longitude) reasoning to test fine-grained “where” understanding. All evaluations are zero-shot—models are not trained on any RSMEB subset post-pretraining.
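
A minimal sketch of this ranking-based, zero-shot protocol is given below. The `embed` stub, the prompt wording, and the five-class pool are illustrative assumptions; the benchmark fixes only the ranking setup, not a particular encoder interface.

```python
# Minimal sketch of RSMEB's ranking-based, zero-shot protocol.
# `embed` is a random stand-in for a single-encoder multimodal embedding model
# (e.g. VLM2GeoVec); no component below is taken from the benchmark's codebase.
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # embedding width; illustrative only

def embed(query_parts) -> np.ndarray:
    """Placeholder encoder: a real model would map the interleaved inputs
    (image, text, boxes, coordinates) to one L2-normalized vector."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

# Scene classification as ranking: one query (image + instruction) vs. class-name candidates.
query_vec = embed(["<image: AID sample>", "Which scene class does this image show?"])
class_names = ["airport", "beach", "farmland", "forest", "harbor"]  # truncated pool
cand_vecs = np.stack([embed([f"a satellite image of a {c}"]) for c in class_names])

scores = cand_vecs @ query_vec   # cosine similarity (vectors are normalized)
ranking = np.argsort(-scores)    # best candidate first
print("top-1 class:", class_names[ranking[0]])
```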

2. Task Structure and Modalities

RSMEB spans 21 individual tasks grouped into six meta-tasks:

  1. Scene Classification: Rank N class-name candidates for a given satellite image and instruction.
  2. Cross-Modal Retrieval: Image-to-text (I→T) and text-to-image (T→I) retrieval tasks across datasets such as RSITMD, RSICD, and UCM-caption.
  3. Compositional Retrieval (rCIR): Retrieve a full-scene image conditioned on a region crop and a text modifier.
  4. Visual Question Answering: Multiple-choice presence, comparison, and rural/urban queries over images.
  5. Visual Grounding & Region Reasoning: Includes
    • Referring-Expression Retrieval (RefExp): rank image-crop candidates given an image and a referring expression.
    • Region-Caption Retrieval (RegCap): rank text-caption candidates for a given image and bounding box.
    • Grounded T2I (GrT2I): rank full-scene images given text and bounding boxes.
  6. Semantic Geospatial Retrieval (GeoT2I): Retrieve the matching image given text and (latitude, longitude) input.

All task queries are interleaved token streams of images, text, bounding boxes (as four normalized scalars), and geo-coordinates (as a serialized text tuple "(lat, lon)"). The target pool consists of class labels, captions, image crops, or full-scene images, depending on the task.
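
A sketch of how such an interleaved query could be assembled is shown below. The `make_query` helper and the `<image:>`/`<box>` markers are hypothetical; RSMEB specifies the box format (four normalized scalars) and the "(lat, lon)" text tuple, but not this concrete serialization.

```python
# Illustrative assembly of an interleaved RSMEB-style query.
def make_query(instruction, image_path=None, box=None, latlon=None):
    parts = []
    if image_path is not None:
        parts.append(f"<image:{image_path}>")        # placeholder for image tokens
    parts.append(instruction)
    if box is not None:
        x1, y1, x2, y2 = box                          # box normalized to [0, 1]
        parts.append(f"<box> {x1:.3f} {y1:.3f} {x2:.3f} {y2:.3f}")
    if latlon is not None:
        lat, lon = latlon
        parts.append(f"({lat:.4f}, {lon:.4f})")       # geo-coordinates as text tuple
    return " ".join(parts)

# Region-Caption Retrieval (RegCap): image + bounding box + instruction.
print(make_query("Describe the object in the marked region.",
                 image_path="scene_0421.tif", box=(0.12, 0.30, 0.45, 0.62)))
# Semantic geospatial retrieval (GeoT2I): text + coordinates.
print(make_query("A coastal harbor with container terminals.", latlon=(37.80, -122.27)))
```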

3. Datasets and Preprocessing

RSMEB aggregates public RS datasets—AID, Million-AID, RSI-CB, EuroSAT, UCMerced, PatternNet (scene classification); RSITMD, RSICD, UCM-caption (retrieval); LRBEN/HRBEN (VQA); and proprietary splits for rCIR, RefExp, RegCap, GrT2I, and GeoT2I—without retraining. Test set sizes include:

Dataset / Task     # Queries   # Candidates / Classes
AID                2,000       30 classes
Million-AID        10,000      51 classes
RSI-CB             24,747      35 classes
RSITMD             variable    variable
rCIR               1,818       1,115 full images
RefExp, GeoT2I     2,000       2,000 candidates

All images are resampled (e.g., to 336×336) and tokenized with a Vision Transformer (ViT-L/14, i.e., 14×14-pixel patches); region crops are extracted at native resolution and then rescaled to the model input size.
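
The sketch below mirrors this preprocessing on a synthetic image so it runs stand-alone; the crop coordinates and the bicubic resampling choice are illustrative assumptions, not taken from the benchmark's loaders.

```python
# Sketch of the preprocessing described above, on a synthetic image.
from PIL import Image

IMG_SIZE, PATCH = 336, 14                      # ViT-L/14: 14x14-pixel patches
native = Image.new("RGB", (1024, 1024))        # stand-in for a native-resolution scene

scene = native.resize((IMG_SIZE, IMG_SIZE), Image.BICUBIC)
grid = IMG_SIZE // PATCH
print(f"scene tokens: {grid}x{grid} = {grid * grid} patches")   # 24x24 = 576

# Region crops are taken at native resolution first, then rescaled to the model input.
x1, y1, x2, y2 = 120, 300, 460, 620            # illustrative pixel box in the native image
region = native.crop((x1, y1, x2, y2)).resize((IMG_SIZE, IMG_SIZE), Image.BICUBIC)
print("region input size:", region.size)
```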

4. Evaluation Protocols and Metrics

Every task is cast as a ranking or retrieval problem. Let $N$ be the number of queries.

  • Classification Accuracy:

$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\hat c_i = c_i)$

where $\hat c_i$ is the top-1 predicted class for query $i$.

  • Retrieval Recall@K:

$\mathrm{Recall@}K = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\mathrm{gt}_i \in \text{top-}K)$

Averages are reported over R@1, R@5, and R@10 for cross-modal retrieval.

  • Precision@1 (P@1) for single-correct target tasks (e.g., RegCap, RefExp, rCIR, GrT2I, GeoT2I, VQA):

$\mathrm{P@1} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\hat t_i = t_i^\ast)$

where $\hat t_i$ is the top-1 retrieved target and $t_i^\ast$ the ground-truth target.

Contrastive pretraining is measured by InfoNCE loss.
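
These metrics reduce to simple operations on a query-candidate score matrix; the sketch below computes P@1 (equivalently, top-1 accuracy) and Recall@K from synthetic scores and labels. The InfoNCE pretraining loss is not shown.

```python
# Metric computation from a query-candidate similarity matrix.
# sims[i, j] is the score of candidate j for query i; gt[i] is the correct index.
# All values below are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 30                      # queries, candidates (e.g. 30 AID classes)
sims = rng.standard_normal((N, M))
gt = rng.integers(0, M, size=N)

def precision_at_1(sims, gt):
    """Accuracy / P@1: fraction of queries whose top-ranked candidate is correct."""
    return float(np.mean(np.argmax(sims, axis=1) == gt))

def recall_at_k(sims, gt, k):
    """Recall@K: fraction of queries whose ground truth appears in the top-K."""
    topk = np.argsort(-sims, axis=1)[:, :k]
    return float(np.mean([gt[i] in topk[i] for i in range(len(gt))]))

print("P@1     :", precision_at_1(sims, gt))
print("Recall@5:", recall_at_k(sims, gt, 5))
```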

5. Baseline Comparisons and Results

RSMEB evaluations compare general VLMs (CLIP ViT-L/14, VLM2Vec) and remote sensing-specialized dual encoders (RemoteCLIP, SkyCLIP, GeoRSCLIP) with the single-encoder VLM2GeoVec. Key findings include:

Meta-task (metric)                       Best baseline (score, %)   VLM2GeoVec (score, %)   Δ (pp) / ranking
Region-Caption Retrieval, RegCap (P@1)   1.25 (dual encoder)        26.56                   +25.31
Referring Expression, RefExp (P@1)       13.15 (CLIP)               32.50                   +19.35
Semantic Geo-T2I, GeoT2I (P@1)           5.10 (SkyCLIP)             17.80                   +12.70
Region-CIR, rCIR (P@1)                   3.96 (SkyCLIP)             22.99                   +19.03
Grounded T2I, GrT2I (P@1)                0.98 (CLIP)                13.70                   +12.72
Cross-modal Retrieval (avg. R@K)         45.2 (RemoteCLIP)          37.2                    2nd place
Scene Classification (avg. accuracy)     66.1 (VLM2GeoVec)          66.1                    top performer
VQA, LRBEN/HRBEN (avg. P@1)              83.4 (GeoChat)             83.1                    top embedding model

Precision@1 improvements on region-level and geo-localization tasks are +12 to +25 percentage points over the prior state of the art, with VLM2GeoVec obtaining the best Friedman ranking score (1.93) among all tested models (Aimar et al., 12 Dec 2025).

6. Challenges, Insights, and Future Directions

  • Spatial Localization Bottleneck: Region-level and geo-localization tasks (region-caption retrieval, referring expressions, grounded T2I, and semantic geospatial retrieval) remain the most challenging, with even VLM2GeoVec reaching only 26.6% P@1 on RegCap and 17.8% on GeoT2I. Mapping fine-grained spatial inputs (bounding boxes, geo-coordinates) into a single embedding vector remains difficult.
  • Compositional and Contextual Reasoning: Tasks such as rCIR (region-based compositional retrieval) are challenging, requiring simultaneous spatial grounding and contextual editing.
  • Dual Encoder Limitations: Late-fusion dual-encoder architectures perform substantially worse than interleaved single-encoder models on region-level tasks.
  • Geo-coordinate Handling: Tokenizing coordinates as serialized text tuples is suboptimal; future research should develop continuous geodesic embeddings for improved geo-localization (a minimal sketch of this idea follows this list).
  • Extensibility: Suggestions include richer spatial encodings (learned map projections, topographic embeddings), integration of multi-sensor/temporal data, and unified pretraining that jointly optimizes contrastive and generative objectives.
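
To make the geo-coordinate point concrete, the sketch below contrasts the current text-tuple serialization with one generic continuous alternative (multi-frequency sinusoidal features of latitude and longitude in radians). The continuous encoding is a common illustrative choice, not a representation prescribed by RSMEB or VLM2GeoVec.

```python
# Text-tuple coordinate input vs. one possible continuous encoding.
import numpy as np

def coords_as_text(lat, lon):
    """Current practice: serialize coordinates and let the tokenizer split the digits."""
    return f"({lat:.4f}, {lon:.4f})"

def coords_as_features(lat, lon, num_freqs=4):
    """Continuous encoding: multi-frequency sin/cos of latitude and longitude (radians),
    which preserves neighborhood structure instead of producing digit strings."""
    lat_r, lon_r = np.radians(lat), np.radians(lon)
    feats = []
    for k in range(num_freqs):
        f = 2.0 ** k
        feats += [np.sin(f * lat_r), np.cos(f * lat_r),
                  np.sin(f * lon_r), np.cos(f * lon_r)]
    return np.asarray(feats)

print(coords_as_text(37.7989, -122.2744))
print(coords_as_features(37.7989, -122.2744).round(3))
```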

RSMEB establishes a unified, ranking-based protocol for evaluating the ability of embeddings to capture both semantic and spatial/geographic relationships in remote sensing, enabling consistent evaluation across 21 diverse tasks and setting the stage for next-generation RS multimodal models (Aimar et al., 12 Dec 2025).

References
  1. Aimar et al., 12 Dec 2025.
