TerraMind: EO Multimodal Foundation Model

Updated 13 October 2025
  • TerraMind is a multimodal foundation model for Earth observation that fuses token-level and pixel-level representations to enable any-to-any modality synthesis.
  • It integrates nine distinct geospatial modalities using a dual-scale architecture, achieving state-of-the-art performance on benchmarks like EuroSAT and PANGAEA.
  • The model leverages the open-source TerraMesh dataset and Thinking-in-Modalities approach to improve segmentation, mapping, and anomaly detection in EO tasks.

TerraMind is a multimodal foundation model for Earth observation (EO), distinguished by its dual-scale generative architecture producing any-to-any modality output. Developed under the auspices of the European Space Agency (ESA), TerraMind applies advances in transformer-based learning and multimodal fusion to a large corpus of globally distributed geospatial data. It exemplifies recent trends in foundation model design, incorporating both pixel-level and token-level representations to learn cross-modal relationships and fine-grained spatial detail. Through open-sourced weights and data, TerraMind has established benchmarks in EO tasks, and its general methodology serves as a reference for extending foundation models to other scientific communities such as astrobiology.

1. Dual-Scale Multimodal Model Architecture

The chief architectural innovation in TerraMind is its dual-scale representation, fusing both token-level and pixel-level features across nine geospatial modalities. Image-like inputs (Sentinel-2 multispectral imagery, Sentinel-1 SAR, and land-use/land-cover (LULC) maps) are discretized into tokens via autoencoders with finite-scalar quantization (FSQ), where each 16×16 pixel patch is compressed to a discrete token value. In parallel, raw pixels are projected into a latent representation using learnable linear mappings, analogous to patch embedding in Vision Transformers. These representations undergo early fusion in the encoder to jointly encode coarse semantic context and spatial nuance.
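
A minimal sketch of how such dual-scale early fusion can be wired, assuming a sum-based fusion of pixel patch embeddings and FSQ token embeddings; this is not the released TerraMind implementation, and the module names, band count, vocabulary size, and dimensions are illustrative:

```python
# Toy sketch of dual-scale early fusion: pixel-level patch embeddings and
# token-level embeddings for one image-like modality are fused before a
# shared transformer encoder. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DualScaleEmbedding(nn.Module):
    def __init__(self, in_channels=12, patch=16, vocab_size=16384, dim=768):
        super().__init__()
        # Pixel path: linear projection of 16x16 patches (ViT-style patch embedding).
        self.pixel_proj = nn.Conv2d(in_channels, dim, kernel_size=patch, stride=patch)
        # Token path: embeddings for discrete FSQ tokens (one token per 16x16 patch).
        self.token_embed = nn.Embedding(vocab_size, dim)

    def forward(self, pixels, tokens):
        # pixels: (B, C, H, W); tokens: (B, H//16 * W//16) discrete token ids.
        pixel_feats = self.pixel_proj(pixels).flatten(2).transpose(1, 2)  # (B, N, dim)
        token_feats = self.token_embed(tokens)                            # (B, N, dim)
        # Early fusion: sum the two scales so each patch position carries both
        # fine-grained pixel detail and coarse semantic token context.
        return pixel_feats + token_feats

# Example: a 224x224 crop with 12 bands yields 196 fused patch embeddings.
embed = DualScaleEmbedding()
fused = embed(torch.randn(2, 12, 224, 224), torch.randint(0, 16384, (2, 196)))
print(fused.shape)  # torch.Size([2, 196, 768])
```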

Training employs a masked token reconstruction task, with the per-modality shares of input and target tokens sampled from a Dirichlet distribution:

f(p \mid \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{M} p_i^{\alpha_i - 1},

where p is the vector of sampling probabilities and B(\alpha) is the multivariate beta function. The main training objective is the cross-entropy loss:

L_{CE} = -\sum_i y_i \log(p_i)

where the loss reaches its upper bound \log N for uniformly random predictions (N is the vocabulary size). This design affords the model "any-to-any" generative capacity: any subset of input modalities can be mapped to any other targeted modality.
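
A hedged toy sketch of this objective, assuming the Dirichlet draw allocates a visible-token budget across modalities; the helper function, shapes, and vocabulary size are illustrative, not the official training code:

```python
# Illustrative sketch of the masking scheme: per-modality token budgets are
# drawn from a Dirichlet distribution, and masked tokens are reconstructed
# under a cross-entropy objective whose value for random predictions is log(N).
import math
import torch
import torch.nn.functional as F

def sample_token_budget(alpha, total_tokens):
    """Split a total token budget (approximately) across M modalities via Dirichlet weights."""
    weights = torch.distributions.Dirichlet(alpha).sample()
    return (weights * total_tokens).round().long()

vocab_size = 16384
alpha = torch.ones(9)                     # nine modalities, symmetric prior
budget = sample_token_budget(alpha, 256)  # e.g. ~256 visible input tokens
print(budget)

# Cross-entropy on reconstructed tokens; random logits approach log(N).
logits = torch.randn(32, vocab_size)      # predictions for 32 masked tokens
targets = torch.randint(0, vocab_size, (32,))
loss = F.cross_entropy(logits, targets)
print(float(loss), math.log(vocab_size))  # loss close to log(16384) ≈ 9.70
```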

A significant feature is "Thinking-in-Modalities" (TiM): the model generates auxiliary artificial data during finetuning and inference—such as conditional LULC maps from optical/radar input—to aid downstream performance in segmentation or mapping tasks. This approach is inspired by chain-of-thought reasoning in LLMs but adapted for multimodal EO data.
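
The two-step TiM pattern can be sketched with a toy stand-in model; the class, heads, and method names below are hypothetical placeholders rather than the TerraMind API:

```python
# Toy sketch of Thinking-in-Modalities (TiM) inference: first generate an
# intermediate modality (a pseudo-LULC map), then condition the downstream
# prediction on the original input plus the generated artifact.
import torch
import torch.nn as nn

class ToyAnyToAny(nn.Module):
    """Stand-in for an any-to-any EO model (not the real TerraMind interface)."""
    def __init__(self, n_lulc_classes=10, n_task_classes=2):
        super().__init__()
        self.lulc_head = nn.Conv2d(12, n_lulc_classes, 1)
        self.task_head = nn.Conv2d(12 + n_lulc_classes, n_task_classes, 1)

    def generate(self, s2):
        # Step 1 ("thinking"): synthesize a pseudo-LULC map from optical input.
        return self.lulc_head(s2).softmax(dim=1)

    def predict_with_tim(self, s2):
        pseudo_lulc = self.generate(s2)
        # Step 2: append the generated modality to the input stack, mimicking
        # chain-of-thought with modalities instead of text.
        return self.task_head(torch.cat([s2, pseudo_lulc], dim=1))

model = ToyAnyToAny()
out = model.predict_with_tim(torch.randn(1, 12, 64, 64))
print(out.shape)  # torch.Size([1, 2, 64, 64])
```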

2. Pretraining Dataset: TerraMesh

TerraMind is pretrained on TerraMesh, a global, large-scale EO dataset encompassing over 9 million spatiotemporally aligned samples across nine modalities:

Modality | Description | Resolution
Sentinel-2 Optical | L1C & L2A reflectance images | 10 m
Sentinel-1 SAR | Ground Range Detected (GRD) & Radiometrically Terrain Corrected (RTC) backscatter | 10 m
NDVI Maps | Normalized Difference Vegetation Index derived from Sentinel-2 | 10 m
DEM | Copernicus Digital Elevation Model for topography | 10 m
LULC | ESRI land-use/land-cover maps, cloud-masked via SEnSeI v2 | 10 m
Automated Captions | Generated via LLaVA-Next with Overture Maps context | —
Geolocation Tokens | Discrete sequence tokens encoding grid position | —

Each sample is carefully co-registered, and the dataset is subsampled for balanced representation of ecoregions and LULC classes. The model weights, full pretraining corpus, and implementation code are publicly released under a permissive license, enabling reproducibility and community experimentation.
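
As an illustration, one aligned sample can be pictured as a dictionary keyed by modality; the keys, band counts, and patch size below are assumptions made for this sketch, not the published TerraMesh schema:

```python
# Hypothetical structure of one spatiotemporally aligned TerraMesh sample.
# All field names and array shapes are illustrative; each sample pairs the
# nine modalities over the same 10 m grid footprint.
import numpy as np

sample = {
    "S2L1C": np.zeros((13, 256, 256), dtype=np.float32),  # Sentinel-2 L1C reflectance
    "S2L2A": np.zeros((12, 256, 256), dtype=np.float32),  # Sentinel-2 L2A reflectance
    "S1GRD": np.zeros((2, 256, 256), dtype=np.float32),   # Sentinel-1 GRD backscatter
    "S1RTC": np.zeros((2, 256, 256), dtype=np.float32),   # Sentinel-1 RTC backscatter
    "NDVI":  np.zeros((1, 256, 256), dtype=np.float32),   # derived vegetation index
    "DEM":   np.zeros((1, 256, 256), dtype=np.float32),   # Copernicus DEM elevation
    "LULC":  np.zeros((256, 256), dtype=np.int64),        # land-use/land-cover class map
    "caption": "An agricultural area with scattered forest patches.",  # automated caption
    "geolocation": (52.1, 4.4),                           # lat/lon, tokenized during training
}
```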

3. Task Performance and Benchmarks

TerraMind achieves state-of-the-art results on multiple EO benchmarks. On EuroSAT, the base variant (v1-B) records mean accuracies of approximately 70–88% in 1-shot and 5-shot settings, surpassing CLIP vision encoders and geospatial-specific models. On the PANGAEA community-standard segmentation benchmark (nine datasets), TerraMind v1-B obtains a mean intersection-over-union (mIoU) of 58.35% and an average rank under 4, exceeding U-Net and ViT baselines by 1–4 percentage points.

Zero-shot capabilities are demonstrated by direct generation of water body maps and geolocalization predictions from optical or SAR input, without task-specific finetuning. When employing TiM-generated artificial modalities (e.g., pseudo-LULC maps), crop mapping tasks gain up to one percentage point in mIoU over standard finetuning.

Ablation studies validate dual-scale fusion: models trained with both token-level and pixel-level signals outperform single-scale alternatives, highlighting the necessity of hybrid spatial-semantic representation for multimodal EO data.

4. Any-to-Any Generative Multimodality

A defining trait of TerraMind is complete flexibility in modality mapping. During inference and finetuning, arbitrary combinations of input EO data (optical, SAR, DEM, LULC, NDVI, automated captions, geolocation tokens) can be synthesized into targeted output modalities. This enables, for example, filling missing channels (e.g., predicting LULC under cloud cover using SAR), or generating enhanced semantic information (segmentation, anomaly detection) using available sensor combinations.
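
A toy sketch of what such an any-to-any interface looks like, assuming a dictionary-in/dictionary-out call; the class, encoders, and mean-fusion rule are placeholders rather than the released TerraMind code:

```python
# Illustrative any-to-any interface: any subset of input modalities is encoded,
# fused, and decoded into any requested output modalities. All names are
# hypothetical stand-ins for the real model.
from typing import Dict, List
import torch
import torch.nn as nn

class AnyToAnyModel(nn.Module):
    """Toy stand-in: routes every requested target through a shared latent space."""
    def __init__(self, channels: Dict[str, int], dim: int = 32):
        super().__init__()
        self.encoders = nn.ModuleDict({k: nn.Conv2d(c, dim, 1) for k, c in channels.items()})
        self.decoders = nn.ModuleDict({k: nn.Conv2d(dim, c, 1) for k, c in channels.items()})

    def forward(self, inputs: Dict[str, torch.Tensor], targets: List[str]):
        # Fuse whichever modalities are present by averaging their latent maps,
        # then decode only the requested output modalities.
        latent = torch.stack([self.encoders[k](v) for k, v in inputs.items()]).mean(0)
        return {t: self.decoders[t](latent) for t in targets}

channels = {"S1GRD": 2, "S2L2A": 12, "LULC": 10}
model = AnyToAnyModel(channels)
# Example: predict a LULC map under cloud cover from SAR input alone.
out = model({"S1GRD": torch.randn(1, 2, 64, 64)}, targets=["LULC"])
print(out["LULC"].shape)  # torch.Size([1, 10, 64, 64])
```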

TiM (Thinking-in-Modalities) acts as an auxiliary generator, producing synthetic side information dynamically conditioned on the available input stack. The resultant artificial data offers improvements in supervised and semi-supervised tasks, especially when real data channels are corrupted or absent—a common real-world scenario in EO workflows.

5. Implications for Geospatial and Scientific Research

TerraMind operationalizes foundation model principles for structured geospatial data, bridging the gap between limited unimodal predictors and versatile multimodal models previously confined to NLP and general-purpose vision. The ability to rapidly adapt (few-shot) or generalize (zero-shot) to novel downstream EO tasks facilitates robust deployment in diverse settings: disaster response, agricultural monitoring, climate modeling, and environmental resource management.

The open-source release of TerraMind and TerraMesh substantially lowers the entry barrier for research groups, encouraging community-led investigation of nonlinear calibration, self-supervised feature extraction, and advanced sensor fusion for spatial inference. A plausible implication is increased interoperability between EO research domains and adjacent areas—such as plant phenotyping (cf. TERRA-REF dataset (LeBauer et al., 2021)) and astrobiology foundation modeling (Felton et al., 8 Oct 2025).

TerraMind serves as a methodological precursor for multimodal foundation models in other scientific sectors. In astrobiology, similar transformer-based architectures—exploiting latent-space modeling and cross-modal reasoning—are envisaged for biosignature detection, mission optimization, and literature synthesis (Felton et al., 8 Oct 2025). The harmonized data fusion and generative capacity of TerraMind, especially its ability to integrate and regenerate missing modalities, is aligned with the needs of autonomous scientific exploration on planetary missions, where "life as we don't know it" must be inferred from sparse, heterogeneous data.

Future work is directed toward multi-temporal dynamics, integration of hyperspectral and higher-resolution data streams, and adaptation of model structure for urgent applications in complex geospatial environments. The continued evolution of TerraMind is expected to enhance not only EO foundation modeling but also interdisciplinary and operational capacities in domains where multimodality and generative synthesis are critical.
