
Geospatial Foundation Models Overview

Updated 27 September 2025
  • Geospatial foundation models are pre-trained, task-agnostic AI systems that extract universal representations from diverse, multi-modal geospatial data.
  • They use self-supervised objectives and advanced architectures, such as Vision Transformers, to enable efficient adaptation across Earth observation tasks.
  • Practical applications include change detection, land cover classification, and environmental monitoring, advancing sustainable development and resource management.

Geospatial foundation models are large-scale, pre-trained artificial intelligence systems that extract high-capacity, transferable representations from geospatial data—primarily satellite and aerial remote sensing imagery, but also extending to vector data, maps, structured tables, and text. By leveraging modern self-supervised and multimodal pretraining strategies on massive, diverse, and often multi-temporal datasets, these models enable scalable and data-efficient adaptation to a broad spectrum of Earth observation (EO) and geospatial analysis tasks. Their emergence marks a paradigm shift in geospatial artificial intelligence (GeoAI), analogous to the impact of foundation models in NLP and computer vision. The following sections address technical developments, benchmarking, architecture and pretraining, data composition, adaptation strategies, applications, and future challenges surrounding geospatial foundation models.

1. Definition, Rationale, and Core Principles

Geospatial foundation models (GFMs) are task-agnostic models pre-trained on large-scale, heterogeneous geospatial datasets with the goal of learning universal spatial, temporal, and semantic representations. They can be efficiently fine-tuned or adapted—sometimes with minimal labeled data—across a range of downstream tasks such as semantic segmentation, change detection, land cover classification, image retrieval, aboveground biomass estimation, and even geospatial reasoning over vector data (Mendieta et al., 2023, Jakubik et al., 2023, Han et al., 1 Apr 2024, Blumenstiel et al., 4 Mar 2024, Ghamisi et al., 30 May 2025, Hsu et al., 31 Aug 2024).

The design of GFMs is informed by several key requirements unique to geospatial data:

  • Heterogeneity of data: Inputs include multi-spectral (beyond RGB), multi-sensor (optical, SAR, DSM), variable-resolution, and often spatiotemporally indexed data (Han et al., 1 Apr 2024).
  • Scale and coverage: Modern GFMs are trained on terabyte- and petabyte-scale archives, often spanning the globe both geographically and temporally (Jakubik et al., 2023, Szwarcman et al., 3 Dec 2024).
  • Transferability and scalability: Foundation models allow "pretrain once, adapt many," supporting few-shot or zero-shot learning, cross-region transfer, and flexible deployment (Mendieta et al., 2023, Ghamisi et al., 30 May 2025).
  • Modal fusion and alignment: Advanced GFMs integrate multiple modalities (e.g., imagery, text, vector data, environmental measurements, behavioral signals) through appropriate architectural and loss design (Han et al., 1 Apr 2024, Agarwal et al., 11 Nov 2024).

GFMs are typically pre-trained via self-supervised objectives (e.g., masked autoencoders, contrastive learning) and often use transformer backbones with adaptations for the spatial, spectral, and temporal properties of remote sensing data.

2. Model Architectures, Pretraining Strategies, and Data Design

Model architectures for GFMs commonly use variants of the Vision Transformer (ViT) paradigm, extended to process multi-spectral, multi-temporal, and multi-modal data (Jakubik et al., 2023, Szwarcman et al., 3 Dec 2024, Han et al., 1 Apr 2024). Notable innovations include:

  • 3D Patch Embeddings: Prithvi-EO-2.0, for instance, replaces 2D spatial embeddings with 3D embeddings to accommodate spatiotemporal cubes; the model incorporates both explicit temporal and geolocation tokens that are weighted and summed into the main token stream:

E_{\text{final}} = E + w_{\text{time}} \cdot E_{\text{time}} + w_{\text{loc}} \cdot E_{\text{loc}}

(Szwarcman et al., 3 Dec 2024)
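
A minimal PyTorch sketch of this weighted-sum injection of metadata embeddings; the encoder choices, tensor shapes, and module name are illustrative assumptions rather than the Prithvi-EO-2.0 implementation:

```python
import torch
import torch.nn as nn

class SpatioTemporalEmbedding(nn.Module):
    """Adds weighted temporal and geolocation embeddings to patch tokens.

    Illustrative sketch only: shapes and metadata encoders are assumptions,
    not the Prithvi-EO-2.0 design.
    """
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        # Learnable scalar weights w_time and w_loc from the formula above.
        self.w_time = nn.Parameter(torch.tensor(1.0))
        self.w_loc = nn.Parameter(torch.tensor(1.0))
        # Simple MLP encoders for acquisition time and (lat, lon) metadata.
        self.time_encoder = nn.Sequential(
            nn.Linear(1, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )
        self.loc_encoder = nn.Sequential(
            nn.Linear(2, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, patch_tokens, timestamps, latlon):
        # patch_tokens: (B, N, D); timestamps: (B, 1) float; latlon: (B, 2) float
        e_time = self.time_encoder(timestamps).unsqueeze(1)  # (B, 1, D), broadcast over tokens
        e_loc = self.loc_encoder(latlon).unsqueeze(1)        # (B, 1, D)
        return patch_tokens + self.w_time * e_time + self.w_loc * e_loc
```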

  • Cross-Sensor Pretraining: msGFM learns joint representations across paired and unpaired data through masked image modeling and cross-sensor reconstruction losses, harmonizing embeddings across diverse sensor domains (e.g., optical, SAR, DSM) (Han et al., 1 Apr 2024).
  • Teacher–Student and Continual Pretraining: Hybrid approaches (e.g., continual pretraining from ImageNet via multi-objective teacher–student frameworks) accelerate adaptation of strong vision backbones to geospatial domains while minimizing computational cost (Mendieta et al., 2023).
  • Self-Supervised Objectives: Masked image modeling, typically via masked autoencoders (MAE), dominates, with a reconstruction loss of the form:

L = \frac{1}{N} \sum_{i=1}^{N} (x_i - \widehat{x}_i)^2

where N is the number of masked tokens (Jakubik et al., 2023, Szwarcman et al., 3 Dec 2024).
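
A minimal sketch of this masked-reconstruction objective, computing mean squared error only over masked tokens (tensor shapes are illustrative assumptions):

```python
import torch

def masked_reconstruction_loss(pred, target, mask):
    """MSE over masked tokens only.

    pred, target: (B, N, D) reconstructed and original patch values.
    mask: (B, N) with 1 for masked tokens, 0 for visible ones.
    """
    mask = mask.to(pred.dtype)
    per_token = ((pred - target) ** 2).mean(dim=-1)          # (B, N) per-token error
    return (per_token * mask).sum() / mask.sum().clamp(min=1)  # average over masked tokens
```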

  • Graph Neural Networks: For structural and multi-modal geospatial inference (e.g., population and environmental embedding), GNNs such as GraphSAGE are used to aggregate over spatial and feature similarity graphs (Agarwal et al., 11 Nov 2024).
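
As an illustration of such graph-based aggregation, a minimal two-layer GraphSAGE encoder using PyTorch Geometric; the feature dimensions and graph construction are assumptions, not the architecture of the cited system:

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class GeoGraphSAGE(torch.nn.Module):
    """Two-layer GraphSAGE encoder over a spatial/feature-similarity graph.

    Node features could be per-region multi-modal descriptors; edges connect
    spatially adjacent or feature-similar regions (graph construction not shown).
    """
    def __init__(self, in_dim: int, hidden_dim: int = 128, out_dim: int = 64):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)
        self.conv2 = SAGEConv(hidden_dim, out_dim)

    def forward(self, x, edge_index):
        # x: (num_nodes, in_dim); edge_index: (2, num_edges)
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)  # per-node geospatial embeddings
```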

Regarding data, the construction of globally diverse, high-entropy datasets is critical. Balanced sampling strategies (e.g., stratified by continent or biome) yield better generalization and robustness, as confirmed by empirical studies showing up to 2% higher F1 in few-shot downstream tasks compared to regionally clustered sampling (Purohit et al., 21 Jan 2025).
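
A minimal sketch of such stratified sampling over a tile catalog, grouping by a column such as biome or continent (the dataframe columns and group sizes are hypothetical):

```python
import pandas as pd

def stratified_sample(tiles: pd.DataFrame, group_col: str = "biome",
                      n_per_group: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Draw an equal number of pretraining tiles from each stratum."""
    return (
        tiles.groupby(group_col, group_keys=False)
             .apply(lambda g: g.sample(n=min(n_per_group, len(g)), random_state=seed))
    )
```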

3. Benchmarking, Evaluation, and Comparative Analysis

Comprehensive, standardized benchmarks are pivotal for rigorous assessment and robust comparison of GFMs:

  • Task and Domain Diversity: Benchmarks such as PANGAEA (Marsocci et al., 5 Dec 2024) and GEO-Bench (Szwarcman et al., 3 Dec 2024) span multiple spatial resolutions (0.1–30 m/pixel), sensor types (optical, SAR, multi-spectral), domains (urban, marine, agriculture, disaster), and tasks (segmentation, regression, change detection).
  • Supervised Baseline Comparison: Evaluation frameworks systematically compare foundation models to UNet and vanilla ViT baselines across dense-label and sparse-label tasks. In some cases, fully supervised baselines still outperform GFMs on simpler or low-resolution tasks, indicating that model superiority is context-dependent (Marsocci et al., 5 Dec 2024, Ghamisi et al., 30 May 2025).
  • Limited Label and Zero-Shot Settings: GFMs exhibit pronounced advantages under data-scarcity scenarios. For example, in the SustainFM benchmark, FMs match or exceed task-specific models across 16 SDG-relevant tasks, particularly excelling in transferability, generalization, and adaptation efficiency (Ghamisi et al., 30 May 2025).
  • Broader Metrics Beyond Accuracy: Modern evaluations emphasize transferability, data efficiency, convergence speed, robustness to domain shifts, and energy/carbon metrics (e.g., relative CO₂ emissions per training regime), rather than only accuracy or mean IoU (Szwarcman et al., 3 Dec 2024, Ghamisi et al., 30 May 2025, Mendieta et al., 2023).
  • Extensibility and Reproducibility: Codebases, detailed protocols, and plug-and-play evaluation strategies are increasingly released with major benchmarks to enable systematic progress and fair comparison (Marsocci et al., 5 Dec 2024, Szwarcman et al., 3 Dec 2024, Blumenstiel et al., 4 Mar 2024).

4. Adaptation, Domain Generalization, and Parameter-Efficient Fine-Tuning

Adapting GFMs to domain shifts (e.g., new geographies, novel sensors, multispectral or hyperspectral variation) and enabling efficient fine-tuning under resource constraints are central technical challenges:

  • Parameter-Efficient Fine-Tuning (PEFT): Methods such as low-rank adaptation (LoRA) keep a pretrained weight matrix W frozen and learn only a low-rank residual,

W' = W + AB,\quad A \in \mathbb{R}^{d \times r},\ B \in \mathbb{R}^{r \times d},\ r \ll d,

significantly reducing the number of newly trained parameters.
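
A minimal PyTorch sketch of such a low-rank adapter wrapped around a frozen linear layer; the rank, scaling, and initialization choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W' = W + (alpha/r) * A @ B."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze pretrained weights and bias
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)  # (d_in, r)
        self.B = nn.Parameter(torch.zeros(r, d_out))        # (r, d_out), zero-init so W' = W initially
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A @ self.B)
```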

  • Embedding Deflection and Inductive Biases: DEFLECT exploits model/data-specific inductive bias to efficiently adapt RGB-pretrained models to multispectral inputs by “untangling” radiometric from spatial patch components and introducing specialized untangled attention modules (Thoreau et al., 12 Mar 2025).
  • Domain Generalization via Adapter Modules: Adapter layers, entropy-minimized soft pseudo-labeling, and masked autoencoding (from source to target) jointly enforce domain-invariant representation learning in scenarios with limited or absent labels for the target domain, as demonstrated for adaptation between multispectral and hyperspectral remote sensing data (Yaghmour et al., 2 May 2025).
  • Band Adaptation and Multi-Scale Feature Handling: Practical pipelines introduce strategies such as retrained patch embedding for band-mismatched input or auxiliary feature modules to overcome challenges of variable spectral channels and scale variance (Hsu et al., 31 Aug 2024).
  • Decoupling Encoder and Decoder: Freezing the encoder (foundation backbone) while only updating lightweight decoders enables adaptation with greatly reduced optimization overhead and carbon impact, with little to no loss in target performance (Muszynski et al., 28 Jun 2024, Jakubik et al., 2023).
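
A minimal sketch of this encoder-freezing strategy, in which only the lightweight decoder's parameters receive gradients and enter the optimizer (module names are placeholders, not a specific GFM implementation):

```python
import torch
import torch.nn as nn

def freeze_encoder_finetune(encoder: nn.Module, decoder: nn.Module, lr: float = 1e-4):
    """Freeze the foundation-model backbone; optimize only the lightweight decoder."""
    for p in encoder.parameters():
        p.requires_grad = False              # backbone stays fixed during fine-tuning
    encoder.eval()                           # keep normalization/dropout in inference mode
    return torch.optim.AdamW(decoder.parameters(), lr=lr)
```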

5. Applications and Real-World Impact

GFMs are being applied to, and evaluated on, a spectrum of critical geospatial tasks:

  • Environmental Monitoring and Hazard Detection: High-precision flood mapping, wildfire scar segmentation, cloud gap imputation, and burn intensity mapping (Jakubik et al., 2023, Li et al., 2023, Szwarcman et al., 3 Dec 2024).
  • Agricultural and Land Use Analysis: Multi-temporal crop segmentation, land cover classification, asset wealth prediction, and cropland change detection (Szwarcman et al., 3 Dec 2024, Ghamisi et al., 30 May 2025).
  • Ecosystem and Carbon Cycle Monitoring: Biomass estimation via transfer learning from foundation models, achieving RMSE competitive with state-of-the-art U-Net baselines while using ~13× fewer tunable parameters (Muszynski et al., 28 Jun 2024).
  • Population, Socioeconomic, and Health Mapping: Foundation models aggregating multi-modal (search trends, maps, environmental) data via GNNs for health, poverty, and social variable estimation, with state-of-the-art interpolation and extrapolation accuracy (Agarwal et al., 11 Nov 2024).
  • Image Retrieval and Policy Support: Highly accurate multi-spectral image retrieval for disaster response and environmental monitoring, leveraging foundation model embeddings with extreme compression (32×) for rapid search applications (Blumenstiel et al., 4 Mar 2024).
  • Geospatial Reasoning and Semantic Inference: Application of LLMs to vector geometry (WKT) reasoning, mapping informal spatial language to formal topological relations, and inferring geometric predicates with accuracy >0.66 in topological Q&A (Ji et al., 22 May 2025).
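
As an illustration of the formal topological predicates that such informal spatial language maps onto, a short example using the Shapely library over WKT geometries; this is a toy illustration, not the evaluation pipeline of the cited work:

```python
from shapely import wkt

# Two toy geometries in WKT, the vector format referenced above.
a = wkt.loads("POLYGON ((0 0, 0 2, 2 2, 2 0, 0 0))")
b = wkt.loads("POLYGON ((1 1, 1 3, 3 3, 3 1, 1 1))")

# Formal topological predicates that phrases like "overlaps" or "inside" map onto.
print(a.intersects(b))  # True: the geometries share at least one point
print(a.overlaps(b))    # True: interiors intersect, neither contains the other
print(a.contains(b))    # False
```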

SustainFM directly links benchmark tasks with 16 SDGs, signaling the growing influence of geospatial AI on sustainable development, including assessment of asset wealth, children’s health, renewable energy presence, and environmental hazards (Ghamisi et al., 30 May 2025).

6. Limitations, Benchmarks, and Future Directions

Several limitations and open challenges are identified:

  • Inconsistent Superiority: GFMs do not universally outperform supervised or task-specific models across all domains, especially when downstream data is low-complexity or low-resolution; domain adaptation and task-pipeline co-design remain open areas (Marsocci et al., 5 Dec 2024, Hsu et al., 31 Aug 2024).
  • Pre-Training Data Bias and Distribution: Geographic and environmental diversity in pre-training data is essential for robust, global performance and minimizing regional biases; the impact of data sampling strategies is architecture-dependent and remains an active area of investigation (Purohit et al., 21 Jan 2025).
  • Computational and Environmental Cost: Billion-scale and larger models require leadership-class HPC resources for training (e.g., the Frontier supercomputer), but fine-tuning strategies that update only decoders or use PEFT greatly reduce the energy and CO₂ footprint (more than 8× reduction in case studies) (Tsaris et al., 17 Apr 2024, Mendieta et al., 2023, Ghamisi et al., 30 May 2025).
  • Privacy and Security: Model training and deployment introduce risks related to memorization of sensitive geographical information, necessitating research in differential privacy, federated learning, and secure tooling protocols (Rao et al., 2023).
  • Synthetic Data and Modalities: Data scarcity can be mitigated via synthetic generation, multimodal alignment, and cross-sensor pretraining; handling complex data such as gridded temporal climate arrays or text-geospatial fusion remains challenging (Mai et al., 2023, Han et al., 1 Apr 2024, Jiang et al., 15 May 2025).
  • Evaluation and Open Science: Standardized benchmarks (PANGAEA, SustainFM) and open-source codebases are critical for transparent, reproducible, and globally relevant evaluation. The Trusted Open Science approach is emerging as a standard in leading GFMs (Marsocci et al., 5 Dec 2024, Szwarcman et al., 3 Dec 2024).

7. Outlook: Ethical, Societal, and Methodological Implications

GFMs have significant promise for supporting societal, environmental, and economic decision-making on a planetary scale. Their development and deployment merit focus on the following:

  • Energy Efficiency and Transparency: Reporting of energy and CO₂ metrics, adoption of energy-efficient adaptation methods, and minimizing unnecessary model retraining are recommended best practices (Ghamisi et al., 30 May 2025, Mendieta et al., 2023).
  • Ethical and Bias Reduction: Data rebalancing, careful curation, and proactive bias assessment are required to address geographic, demographic, and sensor-induced disparities (Ghamisi et al., 30 May 2025).
  • Impact-Driven Deployment: The field is shifting toward aligning model development with SDGs and real-world utility, emphasizing robustness to domain shift, adaptability under label scarcity, and tangible societal benefits.
  • Collaborative and Modular Research: Close involvement of subject matter experts, modular evaluation frameworks (able to ingest new models/tasks), and continuous open access to datasets and models are essential for advancing the field.

Geospatial foundation models represent a convergence of large-scale multimodal modeling, self-supervised learning, adaptive fine-tuning, and open science. Their continued evolution is set to drive advances in Earth observation and spatial artificial intelligence, with direct consequences for sustainable development, scientific research, and global resilience.
