GeoCrossBench: Cross-Sensor Benchmark

Updated 13 April 2026

GeoCrossBench is a cross-sensor remote sensing benchmark that systematically evaluates computer vision models' generalization across familiar and novel sensor bands.
It introduces three protocols (in-distribution, zero-shot transfer with non-overlapping bands, and superset bands) to test model robustness under varied spectral inputs.
Experimental results show that architectures like ChiViT are more robust across bands, with smaller performance drops than standard models.

GeoCrossBench is a cross-sensor remote sensing benchmark and evaluation framework designed to quantify and advance the cross-band generalization capabilities of computer vision models in Earth observation. It extends the established GeoBench suite by introducing protocols that systematically test the ability of models to handle both familiar and unfamiliar sensor bands, reflecting the growing diversity of satellite imagery sources and the critical need for foundation models that are future-proof against new satellite deployments [2511.02831].

1. Motivation and Scope

GeoCrossBench addresses a mismatch in Earth observation: labeled data predominantly originates from legacy satellites, while new missions may have different spectral coverage or additional bands. As foundation models for remote sensing scale, re-training and adaptation costs rise, so it becomes necessary to evaluate how well models transfer across sensors and how they perform when the input bands are mismatched with, or a superset of, the bands seen in training. GeoCrossBench thus establishes a standard for three critical settings: in-distribution performance (matched bands), zero-shot transfer to novel (non-overlapping) bands, and evaluation with additional (superset) bands at inference time.

2. Evaluation Protocol

GeoCrossBench comprises three primary evaluation settings to capture scenarios representative of real-world satellite variation:

In-Distribution (ID) Setting: Models are trained and evaluated on the same spectral bands, establishing an upper bound on performance. Two configurations are provided: fine-tuning on the RGB bands of Sentinel-2 (B4, B3, B2) and on all 10 Sentinel-2 bands at ≤20 m resolution (B2–B12).
No-Overlap Bands Setting: Models are fine-tuned on a set of source bands (e.g., S2-RGB or S2-10) and evaluated on completely non-overlapping target bands. This includes transfer from RGB to Sentinel-1 (dual-polarization SAR: VV, VH) and from RGB to non-overlapping Sentinel-2 IR bands (B8A, B11, B12). The setting quantifies cross-sensor generalization in the absence of shared spectral information.
Superset Bands Setting: Models are trained on a subset of bands and evaluated on a superset containing strictly more channels, such as fine-tuning on S2-RGB and evaluating on RGBN (adds NIR B8), or training on 10-band S2 and evaluating on fused 12-band S2+S1 data.
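
The three settings reduce to set relations between the training and test band lists. The following is a minimal Python sketch of that idea, assuming bands are identified by plain string names; the function name `classify_setting` and the returned labels are illustrative and are not taken from the benchmark's code.

```python
def classify_setting(train_bands, test_bands):
    """Label a (train, test) band pairing with the protocol setting it falls under."""
    train, test = set(train_bands), set(test_bands)
    if train == test:
        return "in-distribution"
    if train.isdisjoint(test):
        return "no-overlap"
    if train < test:  # proper subset: the test input adds channels
        return "superset"
    return "other"  # partial overlap, not one of the three settings


# Band lists named in the text above.
S2_RGB = ["B4", "B3", "B2"]
S2_10 = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12"]
S1 = ["VV", "VH"]

print(classify_setting(S2_RGB, S2_RGB))                 # in-distribution
print(classify_setting(S2_RGB, S1))                     # no-overlap
print(classify_setting(S2_RGB, ["B8A", "B11", "B12"]))  # no-overlap
print(classify_setting(S2_RGB, S2_RGB + ["B8"]))        # superset
print(classify_setting(S2_10, S2_10 + S1))              # superset
```

Treating band lists as sets ignores channel order, which is a simplification: a real data loader still has to present channels in the order the model expects.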

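For any metric $M$, the absolute gap between settings and the relative drop from in-distribution performance are defined as

\[
\Delta = M_{\text{train-setting}} - M_{\text{test-setting}}
\]
\[
\mathrm{Drop}\% = \frac{M_{\text{in\,dist}} - M_{\text{generalize}}}{M_{\text{in\,dist}}} \times 100\%
\]

where $M_{\text{in\,dist}}$ is the score under matched bands and $M_{\text{generalize}}$ is the score under a No-Overlap or Superset setting.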
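
The two quantities differ in units: $\Delta$ is expressed in metric points, while Drop% is relative to the in-distribution score. A small Python sketch makes the distinction concrete. The function names and the example scores (80 and 60) are hypothetical and chosen only for illustration; they are not benchmark results.

```python
def absolute_drop(m_in_dist, m_generalize):
    """Delta: difference in metric points between the two settings."""
    return m_in_dist - m_generalize


def relative_drop_percent(m_in_dist, m_generalize):
    """Drop%: the same difference, as a percentage of the in-distribution score."""
    if m_in_dist == 0:
        raise ValueError("in-distribution score must be non-zero")
    return (m_in_dist - m_generalize) / m_in_dist * 100.0


# Hypothetical scores, in percent.
print(absolute_drop(80.0, 60.0))          # 20.0 points
print(relative_drop_percent(80.0, 60.0))  # 25.0 percent
```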

3. Datasets and Task Splits

GeoCrossBench builds on the GeoBench core scenes, extending each sample with Sentinel-1 SAR bands, yielding fused 12-band stacks for every example. Datasets span land use classification, semantic segmentation, and change detection:

Task	Dataset	Classes	Input Size (px)	Train	Val	Test	Main Metric
Scene classification	x-bigearthnet	43	120×120	20k	1k	1k	F1 score
Scene classification	x-so2sat	17	32×32	19,992	986	986	Accuracy
Scene classification	x-brick-kiln	2	64×64	15,063	999	999	Accuracy
Scene classification	x-eurosat	10	64×64	2k	1k	1k	Accuracy
Semantic segmentation	x-cashew-plantation	7	256/512×256/512	1,350	400	50	mIoU
Semantic segmentation	x-SA-crop-type	10	256/512×256/512	3k	1k	1k	mIoU
Semantic segmentation	x-harvey-building	2	256/512×256/512	375	94	461	bIoU (minority class)
Semantic segmentation	x-sen1floods11	2	512×512	252	89	90	mIoU
Change detection	x-harvey-flood	2	256×256	375	94	461	bIoU (minority class)
Change detection	x-oscd	2	224×224 (per city)	24 cities	14 cities	10 cities	F1 score

Each split is fixed, and only the band composition is varied between evaluation settings.
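
Because only the band composition changes between settings, an evaluation run amounts to picking channel indices out of the fused 12-band stack. The sketch below shows one way to do this in Python. The channel order in `FUSED_ORDER` and the helper names `band_indices` and `select_bands` are assumptions made for illustration; the benchmark's actual storage order may differ.

```python
# Assumed channel order of the fused stack: 10 Sentinel-2 bands, then 2 Sentinel-1 bands.
# This order is illustrative only.
FUSED_ORDER = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12", "VV", "VH"]


def band_indices(bands, order=FUSED_ORDER):
    """Map band names to their channel positions in the fused stack."""
    missing = [b for b in bands if b not in order]
    if missing:
        raise KeyError(f"unknown bands: {missing}")
    return [order.index(b) for b in bands]


def select_bands(sample, bands, order=FUSED_ORDER):
    """Pick the requested channels from a sample indexed channel-first."""
    return [sample[i] for i in band_indices(bands, order)]


print(band_indices(["B4", "B3", "B2"]))  # [2, 1, 0]
print(band_indices(["VV", "VH"]))        # [10, 11]
```

`select_bands` only indexes along the first axis, so it works with a list of per-channel arrays as well as with a channel-first array.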

4. Metrics and Evaluation

GeoCrossBench applies canonical remote sensing metrics by task:

Top-1 Accuracy:
\[
\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}
\]
F1 Score (for classification/change detection):
\[
\mathrm{Precision} = \frac{TP}{TP + FP},\;\; \mathrm{Recall} = \frac{TP}{TP + FN},\;\; F1 = 2\,\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
Mean Intersection-over-Union (mIoU) for segmentation:
\[
\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c},\quad \mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c
\]
Binary IoU (bIoU): the IoU of the minority (change) class only, rather than the mean over all classes.
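
The metric definitions above translate directly into code. The following is a minimal pure-Python sketch, assuming integer class labels stored in flat lists (for segmentation, the flattened pixel labels). F1 is shown for a single positive class, since the averaging mode used for each dataset is not stated here; all function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def _counts(y_true, y_pred, cls):
    """True positives, false positives, and false negatives for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn


def f1_score(y_true, y_pred, positive=1):
    """F1 for one class; 2PR/(P+R) simplifies to 2TP/(2TP+FP+FN)."""
    tp, fp, fn = _counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def iou(y_true, y_pred, cls):
    """Intersection-over-union for one class."""
    tp, fp, fn = _counts(y_true, y_pred, cls)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def mean_iou(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class IoU values."""
    return sum(iou(y_true, y_pred, c) for c in range(num_classes)) / num_classes


# Toy labels; class 1 plays the role of the minority (change) class.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(mean_iou(y_true, y_pred, num_classes=2))  # 0.5
print(iou(y_true, y_pred, cls=1))               # bIoU for class 1: 0.5
```

Returning 0.0 when a class never appears in either labels or predictions is one convention among several; some implementations skip such classes when averaging.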

Performance is reported for each model, downstream task, and evaluation protocol (ID, No-Overlap, Superset).

5. Model Suite and ChiViT Architecture

GeoCrossBench evaluates:

Remote sensing foundation models: TerraFM, DOFA, SatlasNet, CROMA, AnySat, Prithvi (all ViT-B or Swin-B, <100M params).
General-purpose self-supervised vision models: iBOT, DINOv2-ViT-B, DINOv3-ViT-B.
Standard supervised baselines: ResNet-50, ViT-B.
ChiViT: A ChannelViT variant with iBOT-style multi-view self-distillation and masked image modeling (MIM) for channel-robust remote sensing.

ChiViT Details:
- Input $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$, split so each of $C$ channels independently forms Transformer tokens using a shared patch embedding, channel embedding $e_c^{\mathrm{chn}}$, and positional embedding $e_j^{\mathrm{pos}}$:
\[
\mathrm{token}_{c,j} = W x_{c,j} + e_j^{\mathrm{pos}} + e_c^{\mathrm{chn}}
\]
- Hierarchical channel sampling: the "teacher" attends to all channels; the "student" receives randomly masked channels, forcing robustness to missing or novel bands.
- Multi-view self-distillation across global (G$_1$, G$_2$) and local (L$_i$) crops, plus standard MIM loss:
\[
L = L_{\mathrm{distill}(\mathrm{CLS})} + L_{\mathrm{mim}}
\]
- Trained on $\sim$23M remote sensing images from multiple large-scale datasets, covering $\sim$400M samples.
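
The per-channel tokenization and the teacher/student channel split can be sketched with NumPy. This is an illustration under stated assumptions, not the ChiViT implementation: the toy sizes, the random weights, and the uniform random choice of student channels (a simplification of hierarchical channel sampling) are all chosen here for the example, and the names `patchify` and `tokenize` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 4, 8, 8   # channels, height, width (toy sizes)
P, D = 4, 16        # patch size and embedding dimension (illustrative)
n_patches = (H // P) * (W // P)

W_proj = rng.normal(size=(D, P * P))     # shared patch embedding W
e_pos = rng.normal(size=(n_patches, D))  # positional embeddings e_j^pos
e_chn = rng.normal(size=(C, D))          # channel embeddings e_c^chn


def patchify(channel, p):
    """Split one (H, W) channel into flattened, non-overlapping p x p patches."""
    h, w = channel.shape
    blocks = channel.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return blocks.reshape(-1, p * p)  # (n_patches, p * p)


def tokenize(x, channel_ids):
    """token[c, j] = W x[c, j] + e_pos[j] + e_chn[c] for each selected channel c."""
    tokens = []
    for c in channel_ids:
        patches = patchify(x[c], P)  # (n_patches, P * P)
        tokens.append(patches @ W_proj.T + e_pos + e_chn[c])
    return np.concatenate(tokens, axis=0)  # (len(channel_ids) * n_patches, D)


x = rng.normal(size=(C, H, W))

# Teacher sees every channel; student sees a random non-empty subset.
teacher_channels = np.arange(C)
k = rng.integers(1, C + 1)
student_channels = np.sort(rng.choice(C, size=k, replace=False))

teacher_tokens = tokenize(x, teacher_channels)
student_tokens = tokenize(x, student_channels)
print(teacher_tokens.shape, student_tokens.shape)
```

Because each channel contributes its own block of tokens, the sequence length scales with the number of channels, and dropping or adding a band changes only which blocks are present rather than the shape of any weight matrix.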

6. Key Experimental Findings

6.1 In-Distribution (ID)

Remote sensing–tailored models (DOFA, TerraFM) do not consistently outperform general-purpose DINOv3.
Classification accuracy/F1 (across all datasets): DINOv3 85–88%, TerraFM 83–86%, DOFA 82–85%, ChiViT 84–87%.

6.2 No-Overlap Bands

All models experience a 2–4$\times$ performance drop (e.g., 85% $\rightarrow$ 30–45%).
ChiViT outperforms DINOv3 by 5–8 percentage points on F1/Accuracy in average cross-band scenarios.

6.3 Superset Bands

Adding extra test bands leads to a 5–25% performance drop for all models, indicating overfitting to the training channel configuration.
ChiViT exhibits the smallest drop (5–10%), while other models degrade by up to 25%.

6.4 Fine-tuning and Linear Probes

Full fine-tuning of all weights improves accuracy over linear probes, except in some No-Overlap setups where frozen backbones and linear heads are competitive.
Linear probes trained on labels for all bands boost S1 test accuracy by 10–15 points (ChiViT, DINOv2, DINOv3, TerraFM) compared to training on RGB/S2 only, with only 1–3 point loss on RGB/S2. This demonstrates the value of "oracle" multi-band labels and indicates the benchmark is not yet saturated.

7. Implications and Future Prospects

Current remote sensing foundation models do not yet surpass large general-purpose vision models on in-distribution tasks. However, explicit architectural designs and training strategies that introduce band perturbation—such as channel sampling (ChiViT) and use of parallel, multi-sensor data—yield substantial improvements in cross-band generalization.

The consistent performance drop when facing additional or mismatched bands highlights the models' sensitivity to channel configurations, suggesting a need for architectural or loss modifications that promote invariance to input channel set.

Further, the ability to boost performance with lightweight linear probes on multi-band data implies that foundation backbones do encode substantial cross-sensor information, but extraction remains suboptimal. A plausible implication is that continued expansion of pretraining datasets (especially with multi-sensor, "parallel image" alignment), architectural scale-up, and knowledge transfer via distillation will advance the field toward robust, future-proof remote sensing models.

GeoCrossBench datasets and code are public, enabling the community to benchmark progress and develop new models under transparent, realistic generalization regimes [2511.02831].

References (1)

GeoCrossBench: Cross-Band Generalization for Remote Sensing (2025)
