This paper introduces the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT), a novel architecture designed specifically for multi-modal and hyperspectral geospatial raster data. It addresses the scalability and efficiency limitations that existing methods face when dealing with numerous spectral channels and sensing modalities.
The key challenge highlighted is that standard Vision Transformers (ViTs) and their adaptations for geospatial data are often computationally expensive (e.g., $O(N^2C^2)$ complexity for spatial-spectral attention) and don't fully exploit the unique properties of geospatial data, such as spatial autocorrelation and the physical meaning of spectral channels.
To overcome these limitations, LESS ViT incorporates three main innovations:
- Hyperspectral Patch Embedding:
- Tied Patch Embedding Layer: Uses a shared projection matrix across all channels to embed $P \times P$ patches into $D$-dimensional tokens. This keeps the embedding channel-independent and adaptable to varying numbers of channels.
- Continuous Positional-Channel Embedding: Instead of standard grid-based positional embeddings, it uses sinusoidal embeddings modified to incorporate the physical spatial resolution $r$ and patch size $p$, so that positions reflect actual geographic distances. It also encodes spectral information using the central wavelength $\lambda$ of each channel, allowing generalization across different sensors and band configurations. The spatial and spectral embeddings are summed (see the patch-embedding sketch after this list).
- Specialized [CLS] Tokens: Includes spatial [CLS] tokens (summarizing each spatial location across channels), spectral [CLS] tokens (summarizing each channel across space), and a global [CLS] token to capture different levels of context.
- LESS Attention Block:
- Addresses the $O(N^2C^2)$ complexity of naive spatial-spectral attention.
- It decomposes the input tokens $X \in \mathbb{R}^{N \times C \times D}$ into spatial-only tokens $X_S \in \mathbb{R}^{N \times d_1}$ and spectral-only tokens $X_C \in \mathbb{R}^{C \times d_2}$ using an attention pooling mechanism (AttenPool), where $d_1 d_2 = D$.
- Computes spatial attention $Y_S = A_S V_S$ and spectral attention $Y_C = A_C V_C$ separately.
- Approximates the full spatial-spectral attention output $Y$ by the Kronecker product of the spatial and spectral outputs, $Y \approx Y_C \otimes Y_S$. This reduces the complexity to $O(N^2 d_1 + C^2 d_2 + NCD)$, so the joint spatial-spectral interaction scales linearly in $NC$ rather than quadratically. The rank $r$ of the approximation can be controlled (see the attention sketch after this list, which also applies the perception field mask).
- Perception Field Mask:
- Applies Tobler's first law of geography ("near things are more related") by constraining the spatial attention mechanism.
- Each token only attends to other tokens within a specified physical distance threshold (in meters), rather than a fixed grid neighborhood.
- This makes the attention mechanism inherently resolution-invariant, allowing the model to process images of varying sizes without resizing, as the spatial area covered by the attention field remains consistent.
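To make the embedding concrete, here is a minimal PyTorch sketch of a tied patch embedding combined with a continuous positional-channel embedding. It assumes a single linear projection over flattened patches and a standard sinusoidal encoding of patch-center coordinates (in meters, derived from the resolution $r$ and patch size $p$) and central wavelengths; module names and shapes are illustrative, and the specialized [CLS] tokens are omitted.

```python
# Sketch only: not the authors' reference implementation.
import math
import torch
import torch.nn as nn


def sinusoidal_embedding(values: torch.Tensor, dim: int) -> torch.Tensor:
    """Encode continuous scalars (positions in meters, wavelengths in nm) into `dim` features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    angles = values.float().unsqueeze(-1) * freqs                       # (..., half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # (..., dim)


class TiedHyperspectralPatchEmbed(nn.Module):
    """One P x P patch projection, tied (shared) across all spectral channels."""

    def __init__(self, patch_size: int = 16, embed_dim: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size, embed_dim)       # shared projection matrix

    def forward(self, x: torch.Tensor, resolution_m: float, wavelengths_nm: torch.Tensor):
        # x: (B, C, H, W) raster with C spectral channels.
        B, C, H, W = x.shape
        p = self.patch_size
        # Cut each channel into non-overlapping P x P patches -> (B, C, N, P*P).
        patches = x.unfold(2, p, p).unfold(3, p, p)                     # (B, C, H/p, W/p, p, p)
        n_h, n_w = patches.shape[2], patches.shape[3]
        patches = patches.reshape(B, C, n_h * n_w, p * p)
        tokens = self.proj(patches)                                     # (B, C, N, D), tied weights

        D = tokens.shape[-1]
        # Continuous spatial embedding: patch centers in meters (resolution r times patch size p).
        ys = (torch.arange(n_h) + 0.5) * p * resolution_m
        xs = (torch.arange(n_w) + 0.5) * p * resolution_m
        centers = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).reshape(-1, 2)
        spatial = torch.cat(
            [sinusoidal_embedding(centers[:, 0], D // 2),
             sinusoidal_embedding(centers[:, 1], D // 2)], dim=-1)      # (N, D)
        # Spectral embedding from each channel's central wavelength (lambda).
        spectral = sinusoidal_embedding(wavelengths_nm, D)              # (C, D)
        # Sum the spatial and spectral embeddings and add them to the patch tokens.
        return tokens + spatial[None, None] + spectral[None, :, None]


# Usage: a 4-channel 10 m/pixel image with placeholder Sentinel-2-like central wavelengths.
embed = TiedHyperspectralPatchEmbed(patch_size=16, embed_dim=768)
out = embed(torch.randn(2, 4, 128, 128), resolution_m=10.0,
            wavelengths_nm=torch.tensor([490.0, 560.0, 665.0, 842.0]))
print(out.shape)  # torch.Size([2, 4, 64, 768])
```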
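The next sketch shows, under similar assumptions, how the LESS attention block and the perception field mask can work together: the (B, C, N, D) tokens are attention-pooled into spatial-only and spectral-only streams, spatial attention is restricted to patches whose centers lie within a metric radius, and the two attention outputs are recombined with a Kronecker product. The single-head attention, the simple pooling layers, and the dimension split ($d_1 = 192$, $d_2 = 4$) are simplifications rather than the paper's exact AttenPool or multi-rank formulation.

```python
# Sketch only: simplified single-head LESS attention with a distance-based perception field mask.
import torch
import torch.nn as nn


def perception_field_mask(n_h: int, n_w: int, patch_size: int, resolution_m: float,
                          radius_m: float) -> torch.Tensor:
    """Boolean (N, N) mask: True where two patch centers lie within `radius_m` meters."""
    ys = (torch.arange(n_h) + 0.5) * patch_size * resolution_m
    xs = (torch.arange(n_w) + 0.5) * patch_size * resolution_m
    centers = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1).reshape(-1, 2)
    return torch.cdist(centers, centers) <= radius_m    # pairwise geographic distances


class LESSAttention(nn.Module):
    """Separate spatial and spectral attention, recombined via a Kronecker product."""

    def __init__(self, embed_dim: int = 768, d_spatial: int = 192, d_spectral: int = 4):
        super().__init__()
        assert d_spatial * d_spectral == embed_dim       # d1 * d2 = D
        self.d1, self.d2 = d_spatial, d_spectral
        self.pool_s = nn.Linear(embed_dim, 1)            # scores channels for each patch
        self.pool_c = nn.Linear(embed_dim, 1)            # scores patches for each channel
        self.qkv_s = nn.Linear(embed_dim, 3 * d_spatial)
        self.qkv_c = nn.Linear(embed_dim, 3 * d_spectral)

    def forward(self, x: torch.Tensor, field_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, N, D); field_mask: (N, N) boolean perception field mask.
        B, C, N, D = x.shape
        # Attention pooling over channels -> spatial-only tokens X_S (B, N, D).
        x_s = (self.pool_s(x).softmax(dim=1) * x).sum(dim=1)
        # Attention pooling over patches -> spectral-only tokens X_C (B, C, D).
        x_c = (self.pool_c(x).softmax(dim=2) * x).sum(dim=2)

        # Spatial attention Y_S = A_S V_S, restricted by the perception field mask.
        q_s, k_s, v_s = self.qkv_s(x_s).chunk(3, dim=-1)             # each (B, N, d1)
        a_s = (q_s @ k_s.transpose(-2, -1)) / self.d1 ** 0.5
        a_s = a_s.masked_fill(~field_mask, float("-inf")).softmax(dim=-1)
        y_s = a_s @ v_s                                              # (B, N, d1)

        # Spectral attention Y_C = A_C V_C over the C channels.
        q_c, k_c, v_c = self.qkv_c(x_c).chunk(3, dim=-1)             # each (B, C, d2)
        y_c = ((q_c @ k_c.transpose(-2, -1)) / self.d2 ** 0.5).softmax(dim=-1) @ v_c

        # Kronecker-product combination: Y ≈ Y_C ⊗ Y_S, reshaped back to (B, C, N, D).
        y = torch.einsum("bcj,bni->bcnji", y_c, y_s)                 # (B, C, N, d2, d1)
        return y.reshape(B, C, N, self.d2 * self.d1)


# Usage on tokens shaped like the previous sketch's output (2 x 4 x 64 x 768).
mask = perception_field_mask(n_h=8, n_w=8, patch_size=16, resolution_m=10.0, radius_m=320.0)
print(LESSAttention()(torch.randn(2, 4, 64, 768), mask).shape)  # torch.Size([2, 4, 64, 768])
```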
For pretraining, the paper proposes the Hyperspectral Masked Autoencoder (Hyper-MAE):
- It extends the Masked Autoencoder (MAE) framework using the LESS ViT architecture for both encoder and decoder.
- It employs decoupled spatial and spectral masking: randomly masking a high percentage of spatial patches (e.g., 75%) and a significant percentage of spectral channels (e.g., 50%).
- Crucially, it applies identical spatial masks across all unmasked channels (similar to tube masking in video), forcing the model to learn spatial structures without relying on information from the same location in other channels.
- The decoder reconstructs the full hyperspectral image (pixels in masked patches and channels).
- The loss function is decomposed into spatial ($\mathcal{L}_{\text{spatial}}$) and spectral ($\mathcal{L}_{\text{spectral}}$) MSE components, computed only over masked pixels. Pixels masked in both dimensions contribute to both losses, which emphasizes the most challenging reconstruction targets (a sketch of the masking and loss follows this list).
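Below is a minimal sketch of the decoupled masking and the decomposed loss, assuming the reconstruction targets are flattened patch pixels of shape (B, C, N, P*P); the sampling scheme and the equal weighting of the two terms are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch only: tube-style spatial masking, independent channel masking, and a decomposed MSE loss.
import torch


def decoupled_masks(B: int, C: int, N: int, spatial_ratio: float = 0.75,
                    spectral_ratio: float = 0.50):
    """Return boolean (B, N) spatial and (B, C) spectral masks (True = masked)."""
    # One spatial mask per image, shared by every channel (tube masking).
    idx = torch.rand(B, N).argsort(dim=1)[:, :int(N * spatial_ratio)]
    spatial = torch.zeros(B, N).scatter_(1, idx, 1.0).bool()
    cidx = torch.rand(B, C).argsort(dim=1)[:, :int(C * spectral_ratio)]
    spectral = torch.zeros(B, C).scatter_(1, cidx, 1.0).bool()
    return spatial, spectral


def hyper_mae_loss(pred, target, spatial_mask, spectral_mask):
    """Spatial + spectral MSE over masked pixels; pred/target are (B, C, N, P*P)."""
    err = (pred - target) ** 2
    m_s = spatial_mask[:, None, :, None].float()         # (B, 1, N, 1)
    m_c = spectral_mask[:, :, None, None].float()        # (B, C, 1, 1)
    # L_spatial: error averaged over spatially masked patches (all channels).
    l_spatial = (err * m_s).sum() / m_s.expand_as(err).sum().clamp(min=1)
    # L_spectral: error averaged over masked channels (all patches).
    l_spectral = (err * m_c).sum() / m_c.expand_as(err).sum().clamp(min=1)
    # Pixels masked along both dimensions contribute to both terms.
    return l_spatial + l_spectral


# Usage with random tensors standing in for decoder output and ground-truth patches.
B, C, N, P2 = 2, 12, 64, 256
sm, cm = decoupled_masks(B, C, N)
print(hyper_mae_loss(torch.randn(B, C, N, P2), torch.randn(B, C, N, P2), sm, cm).item())
```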
To standardize evaluation, the paper introduces GFM-Bench, a benchmark built using HuggingFace datasets. It includes established geospatial datasets (BigEarthNet, So2Sat, EuroSAT, SegMunich, DFC2020, MARIDA, NLCD-L) with proper validation splits and consistent metrics for classification and segmentation tasks.
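As a purely hypothetical illustration of the intended workflow, a GFM-Bench task could be consumed through the HuggingFace `datasets` and `evaluate` libraries roughly as below; the repository id, split names, and column names are placeholders, not confirmed identifiers from the benchmark release.

```python
# Hypothetical usage pattern; "GFM-Bench/EuroSAT" and the column names are placeholders.
from datasets import load_dataset
import evaluate

ds = load_dataset("GFM-Bench/EuroSAT")          # placeholder repository id
train, val, test = ds["train"], ds["validation"], ds["test"]

metric = evaluate.load("accuracy")              # classification tasks report Accuracy / mAP
# ... fine-tune the pretrained LESS ViT on `train`, select checkpoints on `val`, then:
# metric.compute(predictions=model_predictions, references=test["label"])
```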
Experiments and Results:
- LESS ViT-Base was pretrained on SSL4EO-S12 (Sentinel-1 SAR and Sentinel-2 MSI data) using Hyper-MAE.
- Evaluated on GFM-Bench against baselines like SatMAE, CROMA, and SpectralGPT (all ViT-Base), LESS ViT achieved competitive or superior performance on average across classification (Accuracy, mAP) and segmentation (mIoU) tasks, often with fewer parameters and lower computational cost (FLOPs, time).
- Demonstrated strong cross-satellite generalization by fine-tuning the Sentinel-pretrained model on the 20-channel Landsat NLCD-L dataset, outperforming SatMAE and the computationally expensive Channel-ViT. LESS ViT handled the increased channel count and different resolution without architectural changes.
- Qualitative results (UMAP embeddings, PCA on patch features) showed that LESS ViT learns discriminative and meaningful representations capturing class separation and land cover characteristics in both optical and radar data.
- Ablation studies confirmed the benefits of the LESS attention design (in particular, allocating part of the embedding dimension to spectral attention) and of deeper decoders in Hyper-MAE, as well as the potential of tuning the attention rank and of Mixture-of-Experts (MoE) classification heads that leverage the spatial [CLS] tokens.
- Multi-modal experiments on BigEarthNet showed comparable fine-tuning gains to CROMA but with significantly fewer parameters, although zero-shot multi-modal performance was weaker due to the single shared backbone design.
Conclusion:
LESS ViT provides a scalable, efficient, and physically grounded architecture for hyperspectral and multi-modal geospatial data. Combined with the Hyper-MAE pretraining strategy and evaluated on the standardized GFM-Bench, it represents a significant advancement for geospatial foundation models. Future work includes extending the model to temporal data, integrating vector data, and improving multi-modal fusion.