Semantic Segmentation of Multimodal Satellite Imagery through Position Prediction Self-Supervised Learning
This paper introduces a new approach to the semantic segmentation of satellite imagery, addressing the limitations of previous methods that focus primarily on reconstruction tasks. By leveraging self-supervised learning, the authors aim to overcome the scarcity of labeled training data in satellite imagery analysis. The proposed model, adapted from the LOCA (Location-aware) self-supervised learning method, reframes pretraining for semantic segmentation around spatial reasoning for localization rather than image reconstruction.
Methodology Overview
The authors extend self-supervised pretraining to multimodal satellite imagery by adapting the Masked Autoencoders (MAE) framework to accommodate this diversity. Their solution introduces channel grouping, extending SatMAE's multispectral grouping to handle the distinct modalities present in satellite imagery: multispectral imagery (MSI), synthetic aperture radar (SAR), and digital elevation models (DEM). A sketch of how such grouping might look follows.
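To make the channel-grouping idea concrete, here is a minimal PyTorch sketch of a grouped patch-embedding module. The group slices (10 MSI bands, 2 SAR bands, 1 DEM band), the embedding dimension, and the learned per-group embedding are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GroupedPatchEmbed(nn.Module):
    """Sketch: one patch-embedding projection per channel group (modality)."""
    def __init__(self, groups=((0, 10), (10, 12), (12, 13)),  # e.g. MSI / SAR / DEM slices
                 patch_size=16, embed_dim=768):
        super().__init__()
        self.groups = groups
        # A separate projection per group, since modalities have different statistics.
        self.projs = nn.ModuleList([
            nn.Conv2d(hi - lo, embed_dim, kernel_size=patch_size, stride=patch_size)
            for lo, hi in groups
        ])
        # A learned embedding that tags each token with its group (modality).
        self.group_embed = nn.Parameter(torch.zeros(len(groups), embed_dim))

    def forward(self, x):                       # x: (B, C, H, W) stacked modalities
        tokens = []
        for g, ((lo, hi), proj) in enumerate(zip(self.groups, self.projs)):
            t = proj(x[:, lo:hi])               # (B, D, H/p, W/p)
            t = t.flatten(2).transpose(1, 2)    # (B, N, D)
            tokens.append(t + self.group_embed[g])
        # Each spatial patch yields one token per group, concatenated along the token axis.
        return torch.cat(tokens, dim=1)         # (B, N * num_groups, D)
```

Keeping one projection per group lets each modality be embedded on its own terms while the shared token space leaves cross-modal interaction to the transformer.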
The method applies same-group attention masking to foster cross-modal interaction: by restricting attention among tokens of the same channel group, the model is encouraged to relate information across modalities rather than within a single one. In addition, it adopts relative patch position prediction as the primary learning task, moving beyond reconstruction objectives by emphasizing the spatial localization that segmentation ultimately requires. A hedged sketch of both ideas appears below.
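The following sketch illustrates both ideas under assumed shapes: a mask that blocks attention between tokens of the same channel group (one plausible reading of same-group masking, which pushes each token to draw on the other modalities), and a LOCA-style objective in which each query patch classifies which reference-grid cell it overlaps. All names, shapes, and the cross-entropy formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def same_group_attention_mask(group_ids: torch.Tensor) -> torch.Tensor:
    """group_ids: (N,) group index per token. Returns an (N, N) boolean mask
    that is True where attention should be blocked (same-group pairs)."""
    same = group_ids[:, None] == group_ids[None, :]
    same.fill_diagonal_(False)   # still let every token attend to itself
    return same

def relative_position_loss(query_feats, position_head, target_positions):
    """query_feats: (B, N, D) encoded query-view tokens.
    position_head: linear layer mapping D -> number of reference grid cells.
    target_positions: (B, N) long tensor, the reference cell each query
    patch overlaps, serving as the classification target."""
    logits = position_head(query_feats)                    # (B, N, num_cells)
    return F.cross_entropy(logits.flatten(0, 1), target_positions.flatten())
```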
Experimental Validation
Evaluation on the Sen1Floods11 flood-mapping dataset demonstrates the advantage of the position prediction approach on multimodal satellite data: it learns stronger representations than reconstruction-based self-supervised baselines on the downstream semantic segmentation task. Ablations over the reference masking ratio and sampling strategy further show how the pretraining task can be tuned for efficiency and accuracy; a rough sketch of reference masking follows.
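As an illustration of reference masking, the sketch below randomly drops a fraction of reference-view tokens before query patches localize against them, which makes the localization task both harder and cheaper. The 0.75 ratio and uniform random sampling here are assumptions, not the paper's reported settings.

```python
import torch

def mask_reference_tokens(ref_tokens: torch.Tensor, mask_ratio: float = 0.75):
    """ref_tokens: (B, N, D). Keeps a random (1 - mask_ratio) fraction of
    tokens per sample; returns the kept tokens and their indices."""
    B, N, D = ref_tokens.shape
    n_keep = max(1, int(N * (1.0 - mask_ratio)))
    noise = torch.rand(B, N, device=ref_tokens.device)     # random score per token
    keep = noise.argsort(dim=1)[:, :n_keep]                # indices of kept tokens
    kept = torch.gather(ref_tokens, 1, keep[..., None].expand(-1, -1, D))
    return kept, keep
```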
Implications and Future Directions
This research has promising implications for Earth observation applications, particularly flood mapping, crop analysis, and climate research. The localization-centric model is expected to transfer better, enabling more precise semantic segmentation across diverse satellite datasets.
Looking ahead, the authors suggest that future work could explore scale-invariance mechanisms, incorporate the temporal dimension of satellite acquisitions, and expand the modal diversity of pretraining datasets. These extensions would significantly broaden the applicability of self-supervised learning in remote sensing.
Conclusion
By recentering the semantic segmentation of satellite imagery on spatial reasoning, this paper offers a robust framework for future research in remote sensing. The proposed LOCA-based approach gives satellite imagery analysis a more comprehensive and efficient way to handle multimodal data, paving the way for advances across Earth observation applications.