A Detailed Overview of "CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding"
The paper "CMID: A Unified Self-Supervised Learning Framework for Remote Sensing Image Understanding" introduces a method for improving Self-Supervised Learning (SSL) in remote sensing (RS) image analysis. Dilxat Muhtar and colleagues propose Contrastive Mask Image Distillation (CMID), a framework that addresses a key limitation of traditional SSL methods: they typically capture either global semantic separability or local spatial perceptibility, but not both. This dual focus is particularly significant in remote sensing, where both attributes are needed for diverse downstream tasks such as scene classification, semantic segmentation, object detection, and change detection.
Key Features of CMID
CMID innovatively integrates Contrastive Learning (CL) with Masked Image Modeling (MIM), employing a teacher-student self-distillation architecture. This unified approach allows the model to extract valuable global and local semantic representations from RS images. The framework utilizes the following novel strategies:
- Mask Strategy and Frequency Domain Reconstruction: CMID mitigates the semantic loss typically induced by masking in MIM by employing a spectral mean value to replace masked patches, thus reducing semantic discrepancy. Furthermore, it incorporates focal frequency loss (FFL) to enforce consistency in the frequency domain, which aids in learning high-level semantics.
- Teacher-Student Architecture: The use of a teacher-student self-distillation model in CMID offers robust guidance for representation learning, making it architecture-agnostic and effective for both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).
- Global and Local Branches: The method leverages separate branches for global and local feature learning, balancing their contributions through weighted losses. The global branch employs an MoCo-style contrastive loss to ensure semantic separability, while the local branch aligns local semantics using prototype assignment consistency to maintain spatial perceptibility.
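The teacher-student self-distillation described above is commonly realized by updating the teacher as an exponential moving average (EMA) of the student's weights. The paper does not publish this snippet; the following is a minimal illustrative sketch of the EMA idea, with the function name, list-of-arrays parameter representation, and momentum value all chosen here for demonstration.

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.9):
    """Update teacher parameters as an exponential moving average of the student.

    A momentum close to 1 makes the teacher change slowly, providing a stable
    target for the student. The value here is illustrative, not from the paper.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# toy example: one "layer" of parameters for teacher and student
teacher = [np.zeros(3)]
student = [np.ones(3)]
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher[0])  # each entry moves 10% of the way toward the student
```

Because the teacher receives no gradients and only averages the student, it acts as a slowly evolving reference network, which is what makes the scheme architecture-agnostic.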
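To make the frequency-domain consistency idea concrete, here is a hedged sketch of a focal-frequency-style loss: both images are transformed with a 2-D FFT, and the per-frequency squared error is re-weighted by its own magnitude so that poorly reconstructed frequencies dominate. The function name, the `alpha` exponent, and the normalization scheme are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def focal_frequency_loss(pred, target, alpha=1.0):
    """Sketch of a focal frequency loss between two single-channel images.

    The squared spectral distance at each frequency is re-weighted by its own
    magnitude (raised to alpha), emphasizing hard-to-reconstruct frequencies.
    """
    f_pred = np.fft.fft2(pred)
    f_target = np.fft.fft2(target)
    dist = np.abs(f_pred - f_target) ** 2        # per-frequency squared error
    weight = np.abs(f_pred - f_target) ** alpha  # focal weighting
    weight = weight / (weight.max() + 1e-8)      # normalize weights to [0, 1]
    return np.mean(weight * dist)
```

For identical images the loss is zero; as the reconstruction drifts from the target in any frequency band, that band's weight grows and pulls the loss up, which is the "focal" behavior the paper leverages to preserve high-level semantics under masking.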
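The MoCo-style objective used by the global branch is an InfoNCE loss: a query embedding is scored against one positive key and a queue of negative keys, and the positive is treated as the correct class in a softmax cross-entropy. The sketch below shows that computation for a single query; the function name and temperature value are illustrative choices, not taken from the paper.

```python
import numpy as np

def info_nce_loss(query, positive_key, negative_keys, temperature=0.2):
    """MoCo-style contrastive (InfoNCE) loss for a single query vector.

    Embeddings are L2-normalized, similarities are scaled by a temperature,
    and the positive key is treated as class 0 in a softmax cross-entropy.
    """
    def normalize(v):
        return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)

    q = normalize(query)
    k_pos = normalize(positive_key)
    k_neg = normalize(negative_keys)

    # logit 0 is the positive similarity; the rest come from the negative queue
    logits = np.concatenate([[q @ k_pos], k_neg @ q]) / temperature
    logits -= logits.max()                       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                         # cross-entropy with label 0
```

Minimizing this loss pulls the query toward its positive (an augmented view of the same image) and pushes it away from the negatives, which is what produces the global semantic separability the branch is responsible for.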
Implications and Results
The comprehensive experiments demonstrate that CMID achieves superior results compared to state-of-the-art SSL methods across multiple RS downstream tasks. Notably, CMID pre-trained models show remarkable performance in environments with limited labeled data, which is a common scenario in remote sensing applications.
The results underscore CMID's potential in substantially reducing reliance on labeled datasets by extracting more generalizable, task-agnostic features that improve model performance in classification, segmentation, and detection tasks.
Future Directions
The paper opens several avenues for future research:
- Extension to Other Data Types: While CMID is evaluated on RS image datasets, its generalized approach could be adapted to other data types such as multi-temporal or multi-modal RS data.
- Exploration of Transformer Models: Given the promising results with the Swin Transformer, further investigation into how different architectures affect the SSL learning process in remote sensing may yield additional optimization strategies.
- Scalability and Longer Pre-training: Considering the substantial improvements with only 200 epochs, exploring the effects of longer pre-training durations or larger datasets might further enhance representational robustness and richness.
Conclusion
The introduction of CMID marks a significant advancement in SSL for RS image understanding. By addressing the need for a unified approach that merges the strengths of MIM and CL, the framework sets a new benchmark for model performance and applicability across diverse tasks under various conditions. This work not only contributes a novel methodological perspective but also provides practical insights that can expedite further developments in SSL frameworks tailored for RS images.