This paper introduces the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT), a novel architecture designed specifically for multi-modal and hyperspectral geospatial raster data. It addresses the scalability and efficiency limitations that existing methods face when dealing with numerous spectral channels and sensing modalities.
The key challenge highlighted is that standard Vision Transformers (ViTs) and their adaptations for geospatial data are often computationally expensive (e.g., $O(N^2C^2)$ complexity for spatial-spectral attention) and don't fully exploit the unique properties of geospatial data, such as spatial autocorrelation and the physical meaning of spectral channels.
To overcome these limitations, LESS ViT incorporates three main innovations:
- Hyperspectral Patch Embedding:
- Tied Patch Embedding Layer: Uses a shared projection matrix across all channels to embed $P \times P$ patches into $D$-dimensional tokens. This keeps the embedding channel-independent and adaptable to varying numbers of channels.
- Continuous Positional-Channel Embedding: Instead of standard grid-based positional embeddings, it uses sinusoidal embeddings modified to incorporate the physical spatial resolution $r$ and patch size $p$, so that positions reflect actual geographic distances. It also encodes spectral information using the central wavelength $\lambda$ of each channel, allowing generalization across different sensors and band configurations. The spatial and spectral embeddings are summed (see the patch-embedding sketch after this list).
- Specialized [CLS] Tokens: Includes spatial [CLS] tokens (summarizing each spatial location across channels), spectral [CLS] tokens (summarizing each channel across space), and a global [CLS] token to capture different levels of context.
- LESS Attention Block:
- Addresses the $O(N^2C^2)$ complexity of naive spatial-spectral attention.
- It decomposes the input tokens $X \in \mathbb{R}^{N \times C \times D}$ into spatial-only tokens $X_S \in \mathbb{R}^{N \times d_1}$ and spectral-only tokens $X_C \in \mathbb{R}^{C \times d_2}$ using an attention pooling mechanism (AttenPool), where $d_1 d_2 = D$.
- Computes spatial attention $Y_S = A_S V_S$ and spectral attention $Y_C = A_C V_C$ separately.
- Approximates the full spatial-spectral attention output $Y$ by the Kronecker product of the spatial and spectral outputs, $Y \approx Y_C \otimes Y_S$. This reduces the complexity to $O(N^2 d_1 + C^2 d_2 + NCD)$, so the joint spatial-spectral interaction scales linearly in $NC$ rather than quadratically. The rank $r$ of the approximation can be controlled (see the attention sketch after this list, which also applies the perception field mask).
- Perception Field Mask:
- Applies Tobler's first law of geography ("near things are more related") by constraining the spatial attention mechanism.
- Each token only attends to other tokens within a specified physical distance threshold (in meters), rather than a fixed grid neighborhood.
- This makes the attention mechanism inherently resolution-invariant, allowing the model to process images of varying sizes without resizing, as the spatial area covered by the attention field remains consistent.
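To make the embedding concrete, here is a minimal PyTorch sketch of a tied patch embedding combined with a continuous positional-channel embedding. It assumes a single linear projection over flattened patches and a standard sinusoidal encoding of patch-center coordinates (in meters, derived from the resolution $r$ and patch size $p$) and central wavelengths; module names and shapes are illustrative, and the specialized [CLS] tokens are omitted.

```python
# Sketch only: not the authors' reference implementation.
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(values: torch.Tensor, dim: int) -> torch.Tensor:
    """Encode continuous scalars (positions in meters, wavelengths in nm) into `dim` features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = values.float().unsqueeze(-1) * freqs                       # (..., half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # (..., dim)


class TiedHyperspectralPatchEmbed(nn.Module):
    """One P x P patch projection, tied (shared) across all spectral channels."""

    def __init__(self, patch_size: int = 16, embed_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size, embed_dim)       # shared projection matrix

    def forward(self, x: torch.Tensor, resolution_m: float, wavelengths_nm: torch.Tensor):
        # x: (B, C, H, W) raster with C spectral channels.
        B, C, H, W = x.shape
        p = self.patch_size
        # Cut each channel into non-overlapping P x P patches -> (B, C, N, P*P).
        patches = x.unfold(2, p, p).unfold(3, p, p)                     # (B, C, H/p, W/p, p, p)
        n_h, n_w = patches.shape[2], patches.shape[3]
        patches = patches.reshape(B, C, n_h * n_w, p * p)
        tokens = self.proj(patches)                                     # (B, C, N, D), tied weights

        D = tokens.shape[-1]
        # Continuous spatial embedding: patch centers in meters (resolution r times patch size p).
        ys = (torch.arange(n_h) + 0.5) * p * resolution_m
        xs = (torch.arange(n_w) + 0.5) * p * resolution_m
        centers = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).reshape(-1, 2)
        spatial = torch.cat(
            [sinusoidal_embedding(centers[:, 0], D // 2),
             sinusoidal_embedding(centers[:, 1], D // 2)], dim=-1)      # (N, D)
        # Spectral embedding from each channel's central wavelength (lambda).
        spectral = sinusoidal_embedding(wavelengths_nm, D)              # (C, D)
        # Sum the spatial and spectral embeddings and add them to the patch tokens.
        return tokens + spatial[None, None] + spectral[None, :, None]


# Usage: a 4-channel 10 m/pixel image with placeholder Sentinel-2-like central wavelengths.
embed = TiedHyperspectralPatchEmbed(patch_size=16, embed_dim=768)
out = embed(torch.randn(2, 4, 128, 128), resolution_m=10.0,
            wavelengths_nm=torch.tensor([490.0, 560.0, 665.0, 842.0]))
print(out.shape)  # torch.Size([2, 4, 64, 768])
```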
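The next sketch shows, under similar assumptions, how the LESS attention block and the perception field mask can work together: the (B, C, N, D) tokens are attention-pooled into spatial-only and spectral-only streams, spatial attention is restricted to patches whose centers lie within a metric radius, and the two attention outputs are recombined with a Kronecker product. The single-head attention, the simple pooling layers, and the dimension split ($d_1 = 192$, $d_2 = 4$) are simplifications rather than the paper's exact AttenPool or multi-rank formulation.

```python
# Sketch only: simplified single-head LESS attention with a distance-based perception field mask.
import torch
import torch.nn as nn


def perception_field_mask(n_h: int, n_w: int, patch_size: int, resolution_m: float,
                          radius_m: float) -> torch.Tensor:
    """Boolean (N, N) mask: True where two patch centers lie within `radius_m` meters."""
    ys = (torch.arange(n_h) + 0.5) * patch_size * resolution_m
    xs = (torch.arange(n_w) + 0.5) * patch_size * resolution_m
    centers = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).reshape(-1, 2)
    return torch.cdist(centers, centers) <= radius_m    # pairwise geographic distances


class LESSAttention(nn.Module):
    """Separate spatial and spectral attention, recombined via a Kronecker product."""

    def __init__(self, embed_dim: int = 768, d_spatial: int = 192, d_spectral: int = 4):
        super().__init__()
        assert d_spatial * d_spectral == embed_dim       # d1 * d2 = D
        self.d1, self.d2 = d_spatial, d_spectral
        self.pool_s = nn.Linear(embed_dim, 1)            # scores channels for each patch
        self.pool_c = nn.Linear(embed_dim, 1)            # scores patches for each channel
        self.qkv_s = nn.Linear(embed_dim, 3 * d_spatial)
        self.qkv_c = nn.Linear(embed_dim, 3 * d_spectral)

    def forward(self, x: torch.Tensor, field_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, N, D); field_mask: (N, N) boolean perception field mask.
        B, C, N, D = x.shape
        # Attention pooling over channels -> spatial-only tokens X_S (B, N, D).
        x_s = (self.pool_s(x).softmax(dim=1) * x).sum(dim=1)
        # Attention pooling over patches -> spectral-only tokens X_C (B, C, D).
        x_c = (self.pool_c(x).softmax(dim=2) * x).sum(dim=2)

        # Spatial attention Y_S = A_S V_S, restricted by the perception field mask.
        q_s, k_s, v_s = self.qkv_s(x_s).chunk(3, dim=-1)             # each (B, N, d1)
        a_s = (q_s @ k_s.transpose(-2, -1)) / self.d1 ** 0.5
        a_s = a_s.masked_fill(~field_mask, float("-inf")).softmax(dim=-1)
        y_s = a_s @ v_s                                              # (B, N, d1)

        # Spectral attention Y_C = A_C V_C over the C channels.
        q_c, k_c, v_c = self.qkv_c(x_c).chunk(3, dim=-1)             # each (B, C, d2)
        y_c = ((q_c @ k_c.transpose(-2, -1)) / self.d2 ** 0.5).softmax(dim=-1) @ v_c

        # Kronecker-product combination: Y ≈ Y_C ⊗ Y_S, reshaped back to (B, C, N, D).
        y = torch.einsum("bcj,bni->bcnji", y_c, y_s)                 # (B, C, N, d2, d1)
        return y.reshape(B, C, N, self.d2 * self.d1)


# Usage on tokens shaped like the previous sketch's output (2 x 4 x 64 x 768).
mask = perception_field_mask(n_h=8, n_w=8, patch_size=16, resolution_m=10.0, radius_m=320.0)
print(LESSAttention()(torch.randn(2, 4, 64, 768), mask).shape)  # torch.Size([2, 4, 64, 768])
```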
For pretraining, the paper proposes the Hyperspectral Masked Autoencoder (Hyper-MAE):
- It extends the Masked Autoencoder (MAE) framework using the LESS ViT architecture for both encoder and decoder.
- It employs decoupled spatial and spectral masking: randomly masking a high percentage of spatial patches (e.g., 75%) and a significant percentage of spectral channels (e.g., 50%).
- Crucially, it applies identical spatial masks across all unmasked channels (similar to tube masking in video), forcing the model to learn spatial structures without relying on information from the same location in other channels.
- The decoder reconstructs the full hyperspectral image (pixels in masked patches and channels).
- The loss function is decomposed into spatial ($\mathcal{L}_{\text{spatial}}$) and spectral ($\mathcal{L}_{\text{spectral}}$) MSE components, computed only over masked pixels. Pixels masked in both dimensions contribute to both losses, which emphasizes the most challenging reconstruction targets (a sketch of the masking and loss follows this list).
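Below is a minimal sketch of the decoupled masking and the decomposed loss, assuming the reconstruction targets are flattened patch pixels of shape (B, C, N, P*P); the sampling scheme and the equal weighting of the two terms are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch only: tube-style spatial masking, independent channel masking, and a decomposed MSE loss.
import torch


def decoupled_masks(B: int, C: int, N: int, spatial_ratio: float = 0.75,
                    spectral_ratio: float = 0.50):
    """Return boolean (B, N) spatial and (B, C) spectral masks (True = masked)."""
    # One spatial mask per image, shared by every channel (tube masking).
    idx = torch.rand(B, N).argsort(dim=1)[:, :int(N * spatial_ratio)]
    spatial = torch.zeros(B, N).scatter_(1, idx, 1.0).bool()
    cidx = torch.rand(B, C).argsort(dim=1)[:, :int(C * spectral_ratio)]
    spectral = torch.zeros(B, C).scatter_(1, cidx, 1.0).bool()
    return spatial, spectral


def hyper_mae_loss(pred, target, spatial_mask, spectral_mask):
    """Spatial + spectral MSE over masked pixels; pred/target are (B, C, N, P*P)."""
    err = (pred - target) ** 2
    m_s = spatial_mask[:, None, :, None].float()         # (B, 1, N, 1)
    m_c = spectral_mask[:, :, None, None].float()        # (B, C, 1, 1)
    # L_spatial: error averaged over spatially masked patches (all channels).
    l_spatial = (err * m_s).sum() / m_s.expand_as(err).sum().clamp(min=1)
    # L_spectral: error averaged over masked channels (all patches).
    l_spectral = (err * m_c).sum() / m_c.expand_as(err).sum().clamp(min=1)
    # Pixels masked along both dimensions contribute to both terms.
    return l_spatial + l_spectral


# Usage with random tensors standing in for decoder output and ground-truth patches.
B, C, N, P2 = 2, 12, 64, 256
sm, cm = decoupled_masks(B, C, N)
print(hyper_mae_loss(torch.randn(B, C, N, P2), torch.randn(B, C, N, P2), sm, cm).item())
```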
To standardize evaluation, the paper introduces GFM-Bench, a benchmark built using HuggingFace datasets. It includes established geospatial datasets (BigEarthNet, So2Sat, EuroSAT, SegMunich, DFC2020, MARIDA, NLCD-L) with proper validation splits and consistent metrics for classification and segmentation tasks.
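As a purely hypothetical illustration of the intended workflow, a GFM-Bench task could be consumed through the HuggingFace `datasets` and `evaluate` libraries roughly as below; the repository id, split names, and column names are placeholders, not confirmed identifiers from the benchmark release.

```python
# Hypothetical usage pattern; "GFM-Bench/EuroSAT" and the column names are placeholders.
from datasets import load_dataset
import evaluate

ds = load_dataset("GFM-Bench/EuroSAT")          # placeholder repository id
train, val, test = ds["train"], ds["validation"], ds["test"]

metric = evaluate.load("accuracy")              # classification tasks report Accuracy / mAP
# ... fine-tune the pretrained LESS ViT on `train`, select checkpoints on `val`, then:
# metric.compute(predictions=model_predictions, references=test["label"])
```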
Experiments and Results:
- LESS ViT-Base was pretrained on SSL4EO-S12 (Sentinel-1 SAR and Sentinel-2 MSI data) using Hyper-MAE.
- Evaluated on GFM-Bench against baselines like SatMAE, CROMA, and SpectralGPT (all ViT-Base), LESS ViT achieved competitive or superior performance on average across classification (Accuracy, mAP) and segmentation (mIoU) tasks, often with fewer parameters and lower computational cost (FLOPs, time).
- Demonstrated strong cross-satellite generalization by fine-tuning the Sentinel-pretrained model on the 20-channel Landsat NLCD-L dataset, outperforming SatMAE and the computationally expensive Channel-ViT. LESS ViT handled the increased channel count and different resolution without architectural changes.
- Qualitative results (UMAP embeddings, PCA on patch features) showed that LESS ViT learns discriminative and meaningful representations capturing class separation and land cover characteristics in both optical and radar data.
- Ablation studies confirmed the benefits of the LESS attention design (in particular, allocating part of the embedding dimension to spectral attention) and of deeper decoders in Hyper-MAE, as well as the potential of tuning the attention rank and of Mixture-of-Experts (MoE) classification heads that leverage the spatial [CLS] tokens.
- Multi-modal experiments on BigEarthNet showed comparable fine-tuning gains to CROMA but with significantly fewer parameters, although zero-shot multi-modal performance was weaker due to the single shared backbone design.
Conclusion:
LESS ViT provides a scalable, efficient, and physically grounded architecture for hyperspectral and multi-modal geospatial data. Combined with the Hyper-MAE pretraining strategy and evaluated on the standardized GFM-Bench, it represents a significant advancement for geospatial foundation models. Future work includes extending the model to temporal data, integrating vector data, and improving multi-modal fusion.