GeoCrossBench: Cross-Sensor Benchmark
- GeoCrossBench is a cross-sensor remote sensing benchmark that systematically evaluates computer vision models' generalization across familiar and novel sensor bands.
- It introduces three protocols (in-distribution, zero-shot transfer with non-overlapping bands, and superset bands) to rigorously test model robustness under varied spectral inputs.
- Experimental results show that architectures like ChiViT are more robust across bands, with smaller performance drops than standard models.
GeoCrossBench is a cross-sensor remote sensing benchmark and evaluation framework designed to quantify and advance the cross-band generalization capabilities of computer vision models in Earth observation. It extends the established GeoBench suite by introducing protocols that systematically test the ability of models to handle both familiar and unfamiliar sensor bands, reflecting the growing diversity of satellite imagery sources and the critical need for foundation models that are future-proof against new satellite deployments [2511.02831].
1. Motivation and Scope
GeoCrossBench addresses a common mismatch: labeled Earth observation data predominantly originates from legacy satellites, while new satellite missions may have different spectral coverage, additional bands, or higher-dimensional inputs. As foundation models for remote sensing scale, re-training and adaptation costs rise, so it becomes necessary to evaluate how well models transfer across sensors and how they perform when the input bands are spectrally mismatched with, or a superset of, the training bands. GeoCrossBench thus establishes a standard for three critical settings: in-distribution performance (matched bands), zero-shot transfer to novel (non-overlapping) bands, and evaluation with additional (superset) bands at inference time.
2. Evaluation Protocol
GeoCrossBench comprises three primary evaluation settings to capture scenarios representative of real-world satellite variation:
In-Distribution (ID) Setting: Models are trained and evaluated on the same spectral bands, establishing an upper-bound performance. Specifically, two configurations are provided: fine-tuning on RGB bands of Sentinel-2 (B4, B3, B2) and on all 10 Sentinel-2 bands at ≤20 m resolution (B2–B12).
No-Overlap Bands Setting: Models are fine-tuned on a set of source bands (e.g., S2-RGB or S2-10) and evaluated on completely non-overlapping target bands—including transfer from RGB to Sentinel-1 (dual-polarization SAR: VV, VH) and from RGB to non-overlapping Sentinel-2 IR bands (B8A, B11, B12). This rigorously quantifies cross-sensor generalization in the absence of shared spectral information.
Superset Bands Setting: Models are trained on a subset of bands and evaluated on a superset containing strictly more channels, such as fine-tuning on S2-RGB and evaluating on RGBN (adds NIR B8), or training on 10-band S2 and evaluating on fused 12-band S2+S1 data.
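The three settings differ only in how the training and test band sets relate to each other. The following is a minimal sketch of that relationship, assuming bands are identified by their usual Sentinel names; the variable names and the `classify_setting` helper are illustrative and are not taken from the GeoCrossBench code.

```python
# Band lists follow the protocol descriptions; all names here are hypothetical.
S2_RGB = ["B4", "B3", "B2"]
S2_10 = ["B2", "B3", "B4", "B5", "B6", "B7", "B8", "B8A", "B11", "B12"]
S2_IR = ["B8A", "B11", "B12"]
S1 = ["VV", "VH"]


def classify_setting(train_bands, test_bands):
    """Name the evaluation setting implied by a (train, test) band pair."""
    train, test = set(train_bands), set(test_bands)
    if train == test:
        return "in-distribution"
    if train.isdisjoint(test):
        return "no-overlap"
    if train < test:  # strict subset: the test input has strictly more channels
        return "superset"
    return "other"  # partial overlap, not one of the three settings


if __name__ == "__main__":
    pairs = {
        "RGB -> RGB": (S2_RGB, S2_RGB),
        "RGB -> S1": (S2_RGB, S1),
        "RGB -> S2 IR": (S2_RGB, S2_IR),
        "RGB -> RGBN": (S2_RGB, S2_RGB + ["B8"]),
        "S2-10 -> S2+S1": (S2_10, S2_10 + S1),
    }
    for name, (train, test) in pairs.items():
        print(f"{name}: {classify_setting(train, test)}")
```

Expressing the settings as set relations makes it easy to check that a proposed band pair really belongs to the intended protocol before running an evaluation.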
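For any metric $M$, the generalization gap is reported both as an absolute difference between the score in the training band setting and the score in the test band setting, and as a relative drop with respect to the in-distribution score:
\[
\Delta = M_{\text{train-setting}} - M_{\text{test-setting}}
\]
\[
\mathrm{Drop}\% = \frac{M_{\text{in\,dist}} - M_{\text{generalize}}}{M_{\text{in\,dist}}} \times 100\%
\]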
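As a small worked example of the two quantities above, the sketch below computes the absolute gap and the relative drop for one pair of scores. It assumes that the training-setting score is the in-distribution score; the function name and the numbers are illustrative only and are not results from the benchmark.

```python
def generalization_gap(m_in_dist, m_generalize):
    """Return (absolute gap, relative drop in percent) for one metric."""
    if m_in_dist == 0:
        raise ValueError("in-distribution score must be non-zero")
    delta = m_in_dist - m_generalize
    drop_pct = delta / m_in_dist * 100.0
    return delta, drop_pct


if __name__ == "__main__":
    # Hypothetical scores: 85.0 in-distribution, 40.0 on unseen bands.
    delta, drop_pct = generalization_gap(85.0, 40.0)
    print(f"gap = {delta:.1f} points, relative drop = {drop_pct:.1f}%")
```

The absolute gap is in metric points, while the relative drop normalizes by the in-distribution score, which makes models with different baselines comparable.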
3. Datasets and Task Splits
GeoCrossBench builds on the GeoBench core scenes, extending each sample with Sentinel-1 SAR bands, yielding fused 12-band stacks for every example. Datasets span land use classification, semantic segmentation, and change detection:
| Task | Dataset | Classes | Input Size | Train | Val | Test | Main Metric |
|---|---|---|---|---|---|---|---|
| Scene classification | x-bigearthnet | 43 | 120×120 | 20k | 1k | 1k | F1 score |
| Scene classification | x-so2sat | 17 | 32×32 | 19,992 | 986 | 986 | Accuracy |
| Scene classification | x-brick-kiln | 2 | 64×64 | 15,063 | 999 | 999 | Accuracy |
| Scene classification | x-eurosat | 10 | 64×64 | 2k | 1k | 1k | Accuracy |
| Semantic segmentation | x-cashew-plantation | 7 | 256/512×256/512 | 1,350 | 400 | 50 | mIoU |
| Semantic segmentation | x-SA-crop-type | 10 | 256/512×256/512 | 3k | 1k | 1k | mIoU |
| Semantic segmentation | x-harvey-building | 2 | 256/512×256/512 | 375 | 94 | 461 | bIoU (minority) |
| Semantic segmentation | x-sen1floods11 | 2 | 512×512 | 252 | 89 | 90 | mIoU |
| Change detection | x-harvey-flood | 2 | 256×256 | 375 | 94 | 461 | bIoU (minority) |
| Change detection | x-oscd | 2 | 224×224 (city) | 24 cities | 14 cities | 10 cities | F1 score |
Each split is fixed, and only the band composition is varied between evaluation settings.
4. Metrics and Evaluation
GeoCrossBench applies canonical remote sensing metrics by task:
Top-1 Accuracy:
\[
\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}
\]
F1 Score (for classification/change detection):
\[
\mathrm{Precision} = \frac{TP}{TP + FP},\;\; \mathrm{Recall} = \frac{TP}{TP + FN},\;\; F1 = 2\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
Mean Intersection-over-Union (mIoU) for segmentation:
\[
\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c},\quad \mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c
\]
Binary IoU (bIoU): IoU computed for the minority (change) class only.
Performance is reported for each model, downstream task, and evaluation protocol (ID, No-Overlap, Superset).
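The metric definitions above can be sketched in plain Python over flat label sequences (for segmentation, flattened pixel labels). This is a minimal illustration rather than the benchmark's evaluation code: the function names are hypothetical, the F1 shown is the binary form, and how scores are averaged across classes for multi-class or multi-label datasets is not specified here.

```python
def accuracy(y_true, y_pred):
    """Top-1 accuracy: fraction of exact label matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred, cls):
    """True positives, false positives, false negatives for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive class."""
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def iou(y_true, y_pred, cls):
    """Intersection-over-union for one class."""
    tp, fp, fn = confusion_counts(y_true, y_pred, cls)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def mean_iou(y_true, y_pred, num_classes):
    """Unweighted mean of per-class IoU over classes 0..num_classes-1."""
    return sum(iou(y_true, y_pred, c) for c in range(num_classes)) / num_classes


if __name__ == "__main__":
    y_true = [0, 0, 1, 1, 1, 0]
    y_pred = [0, 1, 1, 1, 0, 0]
    print("Acc :", accuracy(y_true, y_pred))
    print("F1  :", f1_score(y_true, y_pred, positive=1))
    print("mIoU:", mean_iou(y_true, y_pred, num_classes=2))
    print("bIoU:", iou(y_true, y_pred, cls=1))  # minority class taken as 1
```

In this sketch bIoU is simply `iou` evaluated on the minority class, which matches the definition given above.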
5. Model Suite and ChiViT Architecture
GeoCrossBench evaluates:
- Remote sensing foundation models: TerraFM, DOFA, SatlasNet, CROMA, AnySat, Prithvi (all ViT-B or Swin-B, <100M params).
- General-purpose self-supervised vision models: iBOT, DINOv2-ViT-B, DINOv3-ViT-B.
- Standard supervised baselines: ResNet-50, ViT-B.
- ChiViT: A ChannelViT variant with iBOT-style multi-view self-distillation and masked image modeling (MIM) for channel-robust remote sensing.
ChiViT Details:
- Input $\mathbf{x} \in \mathbb{R}^{C \times H \times W}$ is split so that each of the $C$ channels independently forms Transformer tokens using a shared patch embedding $W$, a channel embedding $e_c^{\mathrm{chn}}$, and a positional embedding $e_j^{\mathrm{pos}}$:
\[
\mathrm{token}_{c,j} = W x_{c,j} + e_j^{\mathrm{pos}} + e_c^{\mathrm{chn}}
\]
- Hierarchical channel sampling: the "teacher" attends to all channels; the "student" receives randomly masked channels, forcing robustness to missing or novel bands.
- Multi-view self-distillation across global ($G_1$, $G_2$) and local ($L_i$) crops, plus a standard MIM loss:
\[
L = L_{\mathrm{distill}(\mathrm{CLS})} + L_{\mathrm{mim}}
\]
- Trained on $\sim$23M remote sensing images from multiple large-scale datasets, covering $\sim$400M samples.
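Two consequences of the per-channel tokenization and channel sampling described above can be sketched in a few lines of Python. The sketch makes assumptions that are not stated in this article: a 224×224 input with 16×16 patches, and a student channel subset drawn by first sampling how many channels to keep and then sampling that many channels without replacement. The function names are hypothetical and this is not the ChiViT implementation.

```python
import random


def num_tokens(channels, height, width, patch):
    """Sequence length when every channel is patchified separately."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return channels * (height // patch) * (width // patch)


def sample_student_channels(channel_ids, rng, min_keep=1):
    """Pick a random channel subset for the student (assumed scheme).

    First draw how many channels to keep, then draw that many channels
    without replacement. The teacher would see all of channel_ids.
    """
    k = rng.randint(min_keep, len(channel_ids))
    return sorted(rng.sample(channel_ids, k))


if __name__ == "__main__":
    # Token count grows linearly with the number of channels.
    print("RGB tokens    :", num_tokens(3, 224, 224, 16))
    print("10-band tokens:", num_tokens(10, 224, 224, 16))

    rng = random.Random(0)
    teacher_channels = list(range(10))
    student_channels = sample_student_channels(teacher_channels, rng)
    print("teacher sees:", teacher_channels)
    print("student sees:", student_channels)
```

Because each channel contributes its own tokens, dropping or adding channels only changes the sequence length, which is what allows the same backbone to accept band sets it was not fine-tuned on.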
6. Key Experimental Findings
6.1 In-Distribution (ID)
- Remote sensing–tailored models (DOFA, TerraFM) do not consistently outperform general-purpose DINOv3.
- Classification accuracy/F1 (across all datasets): DINOv3 85–88%, TerraFM 83–86%, DOFA 82–85%, ChiViT 84–87%.
6.2 No-Overlap Bands
- All models experience a 2–4$\times$ performance drop (e.g., 85% $\rightarrow$ 30–45%).
- ChiViT outperforms DINOv3 by 5–8 percentage points on F1/Accuracy in average cross-band scenarios.
6.3 Superset Bands
- Adding extra test bands leads to a 5–25% performance drop for all models, which suggests overfitting to the training channel configuration.
- ChiViT exhibits the smallest drop (5–10%), while other models degrade by up to 25%.
6.4 Fine-tuning and Linear Probes
- Full fine-tuning of all weights improves accuracy over linear probes, except in some No-Overlap setups where frozen backbones and linear heads are competitive.
- Linear probes trained on labels for all bands boost S1 test accuracy by 10–15 points (ChiViT, DINOv2, DINOv3, TerraFM) compared to training on RGB/S2 only, with only 1–3 point loss on RGB/S2. This demonstrates the value of "oracle" multi-band labels and indicates the benchmark is not yet saturated.
7. Implications and Future Prospects
Current remote sensing foundation models do not yet surpass large general-purpose vision models on in-distribution tasks. However, explicit architectural designs and training strategies that introduce band perturbation—such as channel sampling (ChiViT) and use of parallel, multi-sensor data—yield substantial improvements in cross-band generalization.
The consistent performance drop when facing additional or mismatched bands highlights the models' sensitivity to channel configurations, suggesting a need for architectural or loss modifications that promote invariance to input channel set.
Further, the ability to boost performance with lightweight linear probes on multi-band data implies that foundation backbones do encode substantial cross-sensor information, but extraction remains suboptimal. A plausible implication is that continued expansion of pretraining datasets (especially with multi-sensor, "parallel image" alignment), architectural scale-up, and knowledge transfer via distillation will advance the field toward robust, future-proof remote sensing models.
GeoCrossBench datasets and code are public, enabling the community to benchmark progress and develop new models under transparent, realistic generalization regimes [2511.02831].