RSMEB: Remote Sensing Multimodal Benchmark
- RSMEB is a unified, large-scale evaluation suite that assesses remote sensing multimodal embeddings by integrating semantic content with spatial/geographic reasoning.
- It employs a ranking-based protocol across 21 tasks—including classification, retrieval, VQA, and geo-localization—and evaluates embedding models zero-shot, without any task-specific post-training.
- Baseline comparisons show large P@1 gains for the single-encoder VLM2GeoVec, highlighting both progress and remaining challenges in precise spatial localization.
Remote-Sensing Multimodal Embedding Benchmark (RSMEB) is a unified, large-scale evaluation suite specifically designed to quantify the capabilities of multimodal embedding models on a diverse range of remote sensing (RS) tasks. It was introduced to address the absence of a single benchmark that jointly evaluates “what” (semantic content) and “where” (spatial/geographical reasoning) performance in remote sensing, integrating both region-level spatial cues and holistic scene understanding (Aimar et al., 12 Dec 2025).
1. Motivation and Design Principles
RSMEB fills a critical evaluation gap. Traditional RS benchmarks are narrowly specialized (scene classification, retrieval, VQA, or object detection), while generic vision–language benchmarks lack RS-relevant modality cues such as geographic coordinates and bounding boxes. RSMEB adopts a ranking-based protocol for all tasks, enabling single-encoder models to be evaluated across classification, retrieval, grounding, visual question answering (VQA), and geo-localization. The benchmark explicitly incorporates both region-level (bounding box) and geospatial (latitude/longitude) reasoning to test fine-grained “where” understanding. All evaluations are zero-shot—models are not trained on any RSMEB subset post-pretraining.
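As a minimal sketch of this ranking protocol, the snippet below scores a pool of candidate embeddings against a query embedding by cosine similarity and returns them best-first; `embed_query` and `embed_text` in the usage comment are hypothetical placeholders, not the benchmark's API.

```python
import numpy as np

def rank_candidates(query_emb: np.ndarray, cand_embs: np.ndarray) -> np.ndarray:
    """Return candidate indices sorted by descending cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ q                 # shape: (num_candidates,)
    return np.argsort(-sims)     # best match first

# Example: scene classification as ranking over class-name embeddings
# (embed_query / embed_text are stand-ins for the model's encoders):
# order = rank_candidates(embed_query(image, instruction), embed_text(class_names))
# predicted_class = class_names[order[0]]
```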
2. Task Structure and Modalities
RSMEB spans 21 individual tasks grouped into six meta-tasks:
- Scene Classification: Rank N class-name candidates for a given satellite image and instruction.
- Cross-Modal Retrieval: Image-to-text (I→T) and text-to-image (T→I) retrieval tasks across datasets such as RSITMD, RSICD, and UCM-caption.
- Compositional Retrieval (rCIR): Retrieve a full-scene image conditioned on a region crop and a text modifier.
- Visual Question Answering: Multiple-choice presence, comparison, and rural/urban queries over images.
- Visual Grounding & Region Reasoning: Includes three sub-tasks:
  - Referring-Expression Retrieval (RefExp): rank image-crop candidates given an image and a referring expression.
  - Region-Caption Retrieval (RegCap): rank text-caption candidates for a given image and bounding box.
  - Grounded T2I (GrT2I): rank full-scene images given text and bounding boxes.
- Semantic Geospatial Retrieval (GeoT2I): Retrieve the matching image given text and (latitude, longitude) input.
All task queries are interleaved token streams of images, text, bounding boxes (encoded as four normalized scalars), and geo-coordinates (serialized as a text tuple "(lat, lon)"). The target pool consists of class labels, captions, image crops, or full-scene images, depending on the task.
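To make the query format concrete, the sketch below shows one plausible serialization of such an interleaved query; the `<image>` marker, field ordering, and numeric precision are assumptions for illustration, not the benchmark's exact tokenization.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Query:
    image_path: str
    instruction: str
    bbox: Optional[Tuple[float, float, float, float]] = None  # normalized (x_min, y_min, x_max, y_max)
    latlon: Optional[Tuple[float, float]] = None               # (latitude, longitude) in degrees

def serialize(q: Query) -> str:
    # Interleave an image placeholder, the instruction, and any spatial cues.
    parts = ["<image>", q.instruction]
    if q.bbox is not None:
        parts.append("[{:.3f}, {:.3f}, {:.3f}, {:.3f}]".format(*q.bbox))  # four normalized scalars
    if q.latlon is not None:
        parts.append("({:.4f}, {:.4f})".format(*q.latlon))                # "(lat, lon)" text tuple
    return " ".join(parts)

# e.g., a GeoT2I-style query:
print(serialize(Query("scene.tif", "Retrieve the image matching this description and location.",
                      latlon=(48.8566, 2.3522))))
```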
3. Datasets and Preprocessing
RSMEB aggregates public RS datasets—AID, Million-AID, RSI-CB, EuroSAT, UCMerced, PatternNet (scene classification); RSITMD, RSICD, UCM-caption (retrieval); LRBEN/HRBEN (VQA); and proprietary splits for rCIR, RefExp, RegCap, GrT2I, and GeoT2I—using them purely for evaluation, with no model retraining. Test set sizes include:
| Dataset/Task | # Queries | # Candidates/Classes |
|---|---|---|
| AID | 2,000 | 30 classes |
| Million-AID | 10,000 | 51 classes |
| RSI-CB | 24,747 | 35 classes |
| RSITMD | variable | variable |
| rCIR | 1,818 | 1,115 full images |
| RefExp, GeoT2I | 2,000 | 2,000 candidates |
All full-scene images are resampled to a fixed resolution (e.g., 336×336) and tokenized by a ViT-L/14 vision encoder (14×14-pixel patches); region crops are extracted at native resolution and then rescaled to the same input size.
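A minimal preprocessing sketch under the stated assumptions (fixed 336×336 input, normalized bounding boxes, crops taken at native resolution before rescaling); the function names and resampling defaults are illustrative rather than the benchmark's actual pipeline.

```python
from PIL import Image

TARGET = 336  # 336 / 14 = 24 x 24 patch tokens for a ViT-L/14 encoder

def preprocess_scene(path: str) -> Image.Image:
    """Resample a full scene to the encoder's fixed input size."""
    return Image.open(path).convert("RGB").resize((TARGET, TARGET))

def preprocess_region(path: str, bbox_norm: tuple) -> Image.Image:
    """Crop at native resolution using a normalized (x_min, y_min, x_max, y_max) box,
    then rescale the crop to the same input size."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    x0, y0, x1, y1 = bbox_norm
    crop = img.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))
    return crop.resize((TARGET, TARGET))
```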
4. Evaluation Protocols and Metrics
Every task is cast as a ranking or retrieval problem. Let $N$ be the number of queries.
- Classification Accuracy: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i)$, where $\hat{y}_i$ is the top-1 predicted class for query $i$.
- Retrieval Recall@K: $\mathrm{Recall@}K = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\mathrm{gt}_i \in \mathrm{top\text{-}K}_i)$; averages are reported over R@1, R@5, and R@10 for cross-modal retrieval.
- Precision@1 (P@1) for single-correct-target tasks (e.g., RegCap, RefExp, rCIR, GrT2I, GeoT2I, VQA): $\mathrm{P@1} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{c}_i = \mathrm{gt}_i)$, where $\hat{c}_i$ is the top-1 ranked candidate for query $i$.
Contrastive pretraining is measured by InfoNCE loss.
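The metrics above can be computed directly from a query-by-candidate similarity matrix. The sketch below is a minimal illustration assuming embeddings have already been produced and each query has exactly one correct candidate (so P@1 coincides with R@1); the InfoNCE temperature is an assumed value, not one specified by the paper.

```python
import numpy as np

def ranking_metrics(sims: np.ndarray, gt: np.ndarray, ks=(1, 5, 10)) -> dict:
    """sims: (N, M) query-candidate similarities; gt: (N,) index of the correct candidate."""
    order = np.argsort(-sims, axis=1)                # candidates sorted best-first per query
    ranks = np.argmax(order == gt[:, None], axis=1)  # 0-based rank of the ground-truth candidate
    out = {f"R@{k}": float(np.mean(ranks < k)) for k in ks}
    out["P@1"] = out["R@1"]                          # identical when each query has one correct target
    return out

def info_nce(sims: np.ndarray, gt: np.ndarray, tau: float = 0.07) -> float:
    """InfoNCE loss over the same similarity matrix (temperature tau is an assumption)."""
    logits = sims / tau
    z = logits - logits.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(gt)), gt].mean())
```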
5. Baseline Comparisons and Results
RSMEB evaluations compare general-purpose VLMs (CLIP ViT-L/14, VLM2Vec) and remote-sensing-specialized dual encoders (RemoteCLIP, SkyCLIP, GeoRSCLIP) against the single-encoder VLM2GeoVec. Key findings include:
| Meta-task/Task | Baseline Best (%) | VLM2GeoVec (%) | Δ (pp) / Note |
|---|---|---|---|
| Region-Caption Retrieval (RegCap) | 1.25 (dual encoder) | 26.56 | +25.31 |
| Referring-Expression (RefExp) | 13.15 (CLIP) | 32.50 | +19.35 |
| Semantic Geo-T2I | 5.10 (SkyCLIP) | 17.80 | +12.70 |
| Region-CIR (rCIR) | 3.96 (SkyCLIP) | 22.99 | +19.03 |
| Grounded T2I (GrT2I) | 0.98 (CLIP) | 13.70 | +12.72 |
| Cross-modal Retrieval (avg. R@K) | 45.2 (RemoteCLIP) | 37.2 | −8.0 (2nd place) |
| Scene Classification (avg. acc.) | — | 66.1 | Top performer |
| VQA (avg. P@1, LRBEN/HRBEN) | 83.4 (GeoChat) | 83.1 | −0.3 (top embedding model) |
Precision@1 improvements on region-level and geo-localization tasks are +12–25 percentage points over the prior state of the art, with VLM2GeoVec obtaining the leading Friedman ranking score (1.93) among all tested models (Aimar et al., 12 Dec 2025).
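If the Friedman ranking score reported above corresponds to a mean per-task rank (an assumption here), it can be computed from a models-by-tasks score table as in the sketch below; the commented numbers are placeholders, not benchmark results.

```python
import numpy as np
from scipy.stats import rankdata

def friedman_mean_ranks(scores: np.ndarray) -> np.ndarray:
    """scores: (num_models, num_tasks), higher is better.
    Returns each model's average rank across tasks (1.0 = best on every task)."""
    per_task_ranks = np.column_stack(
        [rankdata(-scores[:, t]) for t in range(scores.shape[1])]
    )                                    # shape: (num_models, num_tasks), rank 1 = best
    return per_task_ranks.mean(axis=1)

# Placeholder usage (rows = models, columns = tasks):
# scores = np.array([[0.9, 0.8, 0.7],
#                    [0.4, 0.6, 0.5]])
# print(friedman_mean_ranks(scores))     # -> [1. 2.]
```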
6. Challenges, Insights, and Future Directions
- Spatial Localization Bottleneck: Region-caption, referring-expression, and grounded T2I tasks remain the most challenging, with P@1 reaching only 26.6% (RegCap), 32.5% (RefExp), and 13.7% (GrT2I) even for VLM2GeoVec; semantic geo-retrieval (GeoT2I) similarly tops out at 17.8%. Mapping fine-grained spatial inputs (bounding boxes, coordinates) into a single vector embedding remains difficult.
- Compositional and Contextual Reasoning: Tasks such as rCIR (region-based compositional retrieval) are challenging, requiring simultaneous spatial grounding and contextual editing.
- Dual Encoder Limitations: Late-fusion dual-encoder architectures perform substantially worse than interleaved single-encoder models on region-level tasks.
- Geo-coordinate Handling: Current tokenization of coordinates as text tuples is suboptimal; future research should develop continuous geodesic embeddings for improved geo-localization (one possible encoding is sketched after this list).
- Extensibility: Suggestions include richer spatial encodings (learned map projections, topographic embeddings), integration of multi-sensor/temporal data, and unified pretraining that jointly optimizes contrastive and generative objectives.
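As one illustration of such a direction, the sketch below replaces the "(lat, lon)" text tuple with a multi-scale sinusoidal encoding of the coordinates in radians; this is a generic technique chosen here for illustration, not a method used or proposed by the benchmark.

```python
import numpy as np

def latlon_encoding(lat_deg: float, lon_deg: float, num_freqs: int = 8) -> np.ndarray:
    """Multi-scale sinusoidal features for a (latitude, longitude) pair."""
    lat, lon = np.radians([lat_deg, lon_deg])
    freqs = 2.0 ** np.arange(num_freqs)       # geometric frequency ladder
    feats = []
    for angle in (lat, lon):
        feats.append(np.sin(freqs * angle))
        feats.append(np.cos(freqs * angle))
    return np.concatenate(feats)              # shape: (4 * num_freqs,)

# print(latlon_encoding(48.8566, 2.3522).shape)   # -> (32,)
```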
RSMEB establishes a unified, ranking-based protocol for evaluating the ability of embeddings to capture both semantic and spatial/geographic relationships in remote sensing, enabling consistent evaluation across 21 diverse tasks and setting the stage for next-generation RS multimodal models (Aimar et al., 12 Dec 2025).