
Prithvi Geospatial Foundation Model

Updated 13 October 2025
  • Prithvi Geospatial Foundation Model is a transformer-based AI model pre-trained using masked autoencoding on large-scale remote sensing data, enabling efficient geospatial analysis.
  • It employs multi-objective continual pretraining with a teacher–student structure and integrates spatiotemporal encoding to capture seasonal, phenological, and spatial patterns.
  • The model demonstrates high performance across diverse tasks, including flood mapping and urban heat prediction, while emphasizing energy efficiency and ethical deployment.

The Prithvi Geospatial Foundation Model is a class of transformer-based artificial intelligence models designed specifically for large-scale, multi-modal earth observation and geospatial analysis. The Prithvi framework and its successors address the challenge of harnessing diverse remote sensing sources for downstream tasks ranging from environmental monitoring and disaster response to urban planning and climate modeling. Prithvi models are foundational in the sense that they are pre-trained on massive volumes of unlabelled remote sensing or reanalysis data using self-supervised objectives, yielding transferable representations that can be efficiently adapted with minimal labelled data for various geoscientific tasks.

1. Foundational Architecture and Pretraining Paradigms

The core Prithvi model architecture builds on a Vision Transformer (ViT) backbone extended for spatiotemporal data. Initial Prithvi models (e.g., Prithvi-EO-1.0) were trained with a masked autoencoding (MAE) objective on the Harmonized Landsat and Sentinel-2 (HLS) multispectral dataset (>1 TB), partitioning each input sequence (e.g., 224×224 pixels, 6–8 bands, multiple timesteps) into non-overlapping 3D patches ("tubelets" that span space and time) and masking a high fraction of them (e.g., 75%). The ViT encoder processes only the unmasked patches, while a lightweight decoder reconstructs the masked inputs by minimizing the pixel-wise mean squared error:

\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{x}_i)^2

This spatiotemporal MAE framework was extended for global pretraining by incorporating 3D patch embeddings and 3D sine–cosine positional encoding across height, width, and time, enhancing the model's ability to capture seasonal, phenological, and spatial correlation patterns (Jakubik et al., 2023, Szwarcman et al., 3 Dec 2024).
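The tubelet partitioning and masking step described above can be sketched in a few lines. This is a toy numpy sketch: the array shapes, the zero-filled stand-in for the decoder output, and the helper `tubelets` are illustrative assumptions, not the actual Prithvi implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an HLS input sequence: (time, bands, height, width).
# All shapes here are illustrative, not the exact Prithvi configuration.
x = rng.normal(size=(3, 6, 224, 224)).astype(np.float32)

def tubelets(x, t=1, p=16):
    """Partition a (T, C, H, W) cube into non-overlapping 3D patches."""
    T, C, H, W = x.shape
    x = x.reshape(T // t, t, C, H // p, p, W // p, p)
    x = x.transpose(0, 3, 5, 1, 2, 4, 6)           # (nT, nH, nW, t, C, p, p)
    return x.reshape(-1, t * C * p * p)            # (num_patches, patch_dim)

patches = tubelets(x)                              # (588, 1536)
num_masked = int(0.75 * len(patches))              # 75% masking ratio
masked_idx = rng.choice(len(patches), num_masked, replace=False)

# A real model encodes the visible patches with the ViT and reconstructs the
# masked ones with a light decoder; a zero prediction stands in for that here.
x_hat = np.zeros_like(patches[masked_idx])
mse = float(np.mean((patches[masked_idx] - x_hat) ** 2))
```

The pixel-wise MSE is computed only over the masked patches, matching the reconstruction objective above.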

Notably, a multi-objective continual pretraining paradigm was introduced, employing a teacher–student structure. The teacher (initialized with ImageNet-22k weights and frozen) processes full images, supplying intermediate features for distillation, while the student (randomly initialized) is jointly trained with:

  • a masked image modeling (MIM) loss, the mean L1 distance between the student's reconstructions Oκ and the targets Gκ over the N masked patches:

\mathcal{L}_{\text{MIM}} = \frac{\|\mathcal{O}_\kappa - \mathcal{G}_\kappa\|_1}{N}

  • a feature distillation loss (cosine similarity between teacher and student features):

\mathcal{L}_{\text{feat}} = -\frac{P(f_l^S)}{\|P(f_l^S)\|_2} \cdot \frac{f_l^T}{\|f_l^T\|_2}

The total loss is \mathcal{L} = \mathcal{L}_{\text{MIM}} + \mathcal{L}_{\text{feat}}, allowing knowledge transfer from large natural-image models while promoting acquisition of robust remote-sensing-specific features (Mendieta et al., 2023).
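These two objectives combine by simple addition. A minimal numpy sketch follows; random arrays stand in for real teacher/student features and reconstructions, and the projection P applied to student features is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random arrays stand in for intermediate features of the frozen teacher and
# the student (the projection P of student features is omitted here).
f_student = rng.normal(size=(197, 768))
f_teacher = rng.normal(size=(197, 768))

def l2norm(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Feature distillation: negative cosine similarity of normalized features,
# averaged over tokens.
loss_feat = float(-np.mean(np.sum(l2norm(f_student) * l2norm(f_teacher), axis=-1)))

# Masked image modeling: mean L1 distance between reconstructions and
# targets over the masked patches.
recon = rng.normal(size=(441, 1536))
target = rng.normal(size=(441, 1536))
loss_mim = float(np.mean(np.abs(recon - target)))

loss_total = loss_mim + loss_feat
```

In training, gradients flow only through the student; the frozen teacher supplies `f_teacher` as a fixed distillation target.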

Later Prithvi-EO-2.0 models (300M and 600M parameters) further integrated temporal and geolocation embeddings through metadata (center latitude, longitude, acquisition date) fused into token embeddings using 2D sin/cos functions and learnable balancing weights. A dropout mechanism was applied during pretraining to promote robustness to missing metadata (Szwarcman et al., 3 Dec 2024).
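The metadata fusion just described can be illustrated with a hedged sketch of sine–cosine embeddings. The embedding dimensions, the scalar weights, and the 10% dropout rate below are assumptions for illustration; the actual fusion in Prithvi-EO-2.0 may differ in detail.

```python
import numpy as np

def sincos_embed(value, dim, max_period=10000.0):
    """1D sine–cosine embedding of a scalar such as latitude or day-of-year."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    ang = value * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

# Hypothetical metadata for one tile: center lat/lon and acquisition day-of-year.
lat, lon, doy = 45.1, -93.2, 172
loc_emb = np.concatenate([sincos_embed(lat, 32), sincos_embed(lon, 32)])  # (64,)
time_emb = sincos_embed(doy, 64)                                          # (64,)

# Learnable scalar weights (constants here) balance the metadata against the
# patch tokens; dropping the whole embedding at random during pretraining
# makes the model robust to missing metadata at inference time.
w_loc, w_time = 0.1, 0.1
tokens = np.zeros((196, 64))                       # toy patch-token embeddings
if np.random.default_rng(1).random() > 0.1:        # 10% metadata dropout
    tokens = tokens + w_loc * loc_emb + w_time * time_emb
```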

2. Dataset Construction and Diversity

Recognizing that limited spatial and feature diversity impairs representation learning, Prithvi models utilize composite, high-entropy datasets. The original GeoPile pooled ~600,000 samples (NAIP, RSD46-WHU, MLRSNet, RESISC45, PatternNet), spanning resolutions from 0.1 m to 30 m and covering both labeled and unlabeled settings to encourage visual heterogeneity (Mendieta et al., 2023). For Prithvi-EO-2.0, 4.2M global time-series samples were extracted from the HLS archive, using stratified sampling with upsampling to ensure coverage of heterogeneous ecoregions, together with aggressive filtering (cloud masking, spatial subdivision) to maximize signal content (Szwarcman et al., 3 Dec 2024).

For weather and climate, Prithvi WxC was constructed using 160 MERRA-2 variables: dynamic inputs (20 surface variables plus 10 vertical variables at each of 14 pressure levels), static channels (e.g., topography, land cover), and climatology channels, each normalized per parameter to enable pretraining on gridded 4D atmospheric volumes (Schmude et al., 20 Sep 2024).
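The per-parameter normalization of such gridded volumes can be sketched as follows. The channel count and grid sizes are tiny illustrative stand-ins for the 160-channel global MERRA-2 input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy gridded volume: (channels, levels, lat, lon). The real Prithvi WxC
# input has 160 MERRA-2 channels on a global grid; shapes here are tiny.
x = rng.normal(loc=5.0, scale=3.0, size=(8, 14, 32, 64))

# Per-parameter normalization: every channel is standardized with its own
# mean and standard deviation before pretraining.
mu = x.mean(axis=(1, 2, 3), keepdims=True)
sd = x.std(axis=(1, 2, 3), keepdims=True)
x_norm = (x - mu) / sd
```

After this step each channel has zero mean and unit variance, so variables with very different physical scales (e.g., pressure vs. temperature) contribute comparably during pretraining.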

3. Downstream Tasks and Performance

Prithvi models have demonstrated generalist capabilities across a suite of geospatial tasks using fine-tuning:

| Task Type | Example Datasets | Metrics | Notable Results |
|---|---|---|---|
| Flood Mapping | Sen1Floods11, OSCD | IoU, mAcc, mF1 | Prithvi-EO-2.0 outperforms U-Net and Segformer-B5 on unseen data |
| Cloud Gap Imputation | HLS subtiles | SSIM, MAE | +5 pp SSIM over CGAN (Jakubik et al., 2023; Godwin et al., 30 Apr 2024) |
| Crop Segmentation | CDL, Sen4Map, BigEarthNet | mIoU, acc | 8% gain over previous Prithvi; robust in low-label regimes |
| Wildfire Scar Mapping | BurnScars, wildfire scar datasets | mIoU | Top scores; +3.5–5.6 IoU over prior Prithvi (Szwarcman et al., 3 Dec 2024) |
| Urban Heat | Landsat + ERA5-Land | MAE, RMSE | MAE ⩽ 1.74 °C; up to 3.62 °C under extrapolation (Kreismann, 20 Sep 2025) |
| Locust Breeding | HLS, UN FAO | acc, F1, ROC-AUC | F1 = 81.53; ROC-AUC = 87.69 (Yusuf et al., 11 Mar 2024) |
| Ocean Colour / Primary Production | Sentinel-3 OLCI | RMSE, SSIM | +11.8% RMSE improvement for primary production (Dawson et al., 25 Sep 2025) |
| Weather / Climate | MERRA-2, ERA5-Land | RMSE, correlation | 4× downscaling RMSE reduction; strong hurricane-track predictions |

Results consistently demonstrate that pretraining on diverse, multi-temporal data with domain-specific self-supervised objectives confers improved performance, transferability, and faster convergence relative to both from-scratch and ImageNet-initialized models. In flood mapping, for example, Prithvi achieves the highest mIoU and mAcc (∼4% better on unseen regions) and outperforms SatMAE and transformer baselines (Li et al., 2023; Jakubik et al., 2023; Szwarcman et al., 3 Dec 2024). Data efficiency is another key finding: with pretraining, reducing labeled samples by 90% yields only marginal performance drops (Jakubik et al., 2023).

4. Domain Adaptation, Composition, and Efficiency

Adaptability across sensor modalities, spatial bands, and geographic scales is a critical research axis:

  • Band Adaptation: Three main techniques allow conversion from 6-band to 3-band (RGB) inputs: zero-padding, channel duplication, and retrained patch embedding (preferred for cross-dataset adaptation), with retrained patch embedding reducing parameters by ∼590k (Hsu et al., 31 Aug 2024).
  • Multi-scale Features: Pyramid-style and dedicated multi-scale feature modules are integrated atop transformer backbones to improve object detection/segmentation, especially for small or multi-scale structures (Hsu et al., 31 Aug 2024).
  • Parameter-efficient adaptation: DEFLECT (“Deflecting Embeddings for Finetuning Latent representations for Earth and Climate Tasks”) extends patch embeddings and attention heads to separate geometric (RGB) and novel spectral components, preserving the norm of the pretrained latent space while enabling high-accuracy multispectral adaptation with <1% additional parameters (Thoreau et al., 12 Mar 2025). This strategy matches or exceeds LoRA and full fine-tuning performance with much lower cost.
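The three band-adaptation options in the first bullet above can be sketched concretely. The arrays are toy numpy stand-ins and the helper names are invented for illustration, though the ~590k parameter saving corresponds exactly to dropping a 768-dimensional, 16×16-patch embedding for three channels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretrained 6-band patch embedding weights: (embed_dim, C_in, p, p).
p, embed_dim = 16, 768
w6 = rng.normal(size=(embed_dim, 6, p, p)).astype(np.float32)

def pad_rgb(x_rgb):
    """Zero-padding: fill the three missing input channels with zeros."""
    return np.concatenate([x_rgb, np.zeros_like(x_rgb)], axis=0)

def dup_rgb(x_rgb):
    """Channel duplication: repeat the RGB bands to fill all six channels."""
    return np.concatenate([x_rgb, x_rgb], axis=0)

# Retrained patch embedding: a fresh 3-channel projection trained on the
# target data, dropping 768 * 3 * 16 * 16 = 589,824 (~590k) parameters.
w3 = rng.normal(size=(embed_dim, 3, p, p)).astype(np.float32)
saved = w6.size - w3.size

x_rgb = np.ones((3, 224, 224), dtype=np.float32)
x_pad, x_dup = pad_rgb(x_rgb), dup_rgb(x_rgb)
```

Zero-padding and duplication reuse the pretrained embedding unchanged, while the retrained embedding trades a small amount of target-domain training for a smaller, better-matched input projection.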

Feature-level ensembling is also explored: concatenating embeddings from complementary models (e.g., Prithvi_300M + Hiera_200M) enhances generalization across datasets while enabling knowledge distillation into smaller models for resource-constrained scenarios. For example, on GEO-Bench, feature-level ensembles rival the largest single models with lower computational cost (Chuc, 25 Jun 2025).
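Feature-level ensembling as described above amounts to channel-wise concatenation followed by a lightweight head. The embedding dimensions below are illustrative assumptions, labeled after the models named in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-patch embeddings from two complementary backbones.
emb_prithvi = rng.normal(size=(196, 1024))         # e.g. Prithvi_300M features
emb_hiera = rng.normal(size=(196, 768))            # e.g. Hiera_200M features

# Feature-level ensemble: concatenate along the channel axis and train only
# a lightweight task head on the joint representation.
ensemble = np.concatenate([emb_prithvi, emb_hiera], axis=-1)   # (196, 1792)

w_head = rng.normal(size=(1792, 2)) * 0.01         # toy linear head
logits = ensemble @ w_head                         # per-patch class logits
```

The concatenated representation can also serve as a distillation target for a smaller student model in resource-constrained deployments.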

Prithvi models emphasize energy and resource efficiency. For instance, Prithvi's continual pretraining scheme reduces GPU hours (93.3 h vs. 768 h for SatMAE) and carbon footprint (13.3 vs. 109.44 kg CO₂) by leveraging off-the-shelf pretrained weights and efficient teacher–student training (Mendieta et al., 2023). Fine-tuning only the decoder head yields further large energy savings, reported as up to 168% relative to full fine-tuning (Ghamisi et al., 30 May 2025).

5. Open Science, Community Tools, and Evaluation

Prithvi and related models are notable for their open-source releases and extensive community tooling.

Community workflow supports real-world application development through efficient adaptation for SME-driven projects (e.g., disaster response, crop/land use mapping), with detailed model cards and example code lowering the barrier to domain adoption (Szwarcman et al., 3 Dec 2024).

6. Ethical, Privacy, and Practical Considerations

The advance of geospatial AI systems (GeoAI) like Prithvi raises challenges for privacy, security, and responsible deployment:

  • Risks of memorization and leakage of sensitive geospatial data are identified at all stages: pretraining, fine-tuning, deployment, and feedback cycles (Rao et al., 2023).
  • Control strategies include geomasking, differential privacy, distributed/federated learning, robust geospatial API protocols, malicious prompt detection, and secure feedback monitoring.
  • Transparent reporting of energy and carbon metrics, as well as explicit mitigation for geographic, socioeconomic, or sensor biases, are integral to responsible FM deployment, especially in SDG and policy contexts (Ghamisi et al., 30 May 2025).
  • The paradigm shift from a purely model-centric view to impact-driven deployment is advocated, ensuring alignment with practical outcomes and ethical standards.
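Of the control strategies listed above, geomasking is the simplest to illustrate. The sketch below uses arbitrary "donut" radii in degrees; real schemes account for latitude-dependent distances and population density.

```python
import numpy as np

rng = np.random.default_rng(0)

def geomask(lat, lon, r_min=0.01, r_max=0.05):
    """Donut geomasking: displace a point by a random offset whose radius
    (in degrees) lies between r_min and r_max, hiding the true location."""
    r = rng.uniform(r_min, r_max)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    return lat + r * np.sin(theta), lon + r * np.cos(theta)

lat_m, lon_m = geomask(44.98, -93.27)
shift = float(np.hypot(lat_m - 44.98, lon_m + 93.27))   # equals the radius r
```

The minimum radius guarantees the published point is never the true location, while the maximum radius bounds the loss of analytical utility.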

7. Implications and Future Directions

Prithvi exemplifies a new generation of geospatial foundation models characterized by:

  • High generalization across tasks, spatial/temporal domains, spectral inputs, and downstream architectures.
  • Robust data and energy efficiency via strategic pretraining, parameter-efficient adaptation, and efficient fine-tuning workflows.
  • Openness and extensibility, with integration into community toolkits, benchmarks, and trusted open science frameworks.
  • Applicability to a broad spectrum of scientific, policy, and societal challenges—from climate adaptation and ecosystem monitoring to urban resilience, early warning systems, and sustainable development.

Research trends point toward even more universal models leveraging multi-modal (optical, radar, climate, text, etc.) input, joint spatial–temporal–semantic alignment, composition of specialized models via ensembling/distillation, and rigorous ethical protocols. Recent embedding field models (e.g., AlphaEarth Foundations (Brown et al., 29 Jul 2025)) aggregate multi-sensor, multi-temporal, and multi-source information in unified bottlenecks, outlining further paths for global, scalable, and data-efficient mapping even in scarce-label regimes.

Collectively, Prithvi and its successors provide a scalable, efficient, and flexible foundation for geospatial AI, supporting a range of analytical workflows in Earth system science and sustainable development.
