CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding (2310.00022v4)
Abstract: Learning representations through self-supervision on unlabeled data has proven highly effective for understanding diverse images. However, remote sensing images often have complex and densely populated scenes with multiple land objects and no clear foreground objects. This intrinsic property generates high object density, resulting in false positive pairs or missing contextual information in self-supervised learning. To address these problems, we propose a context-enhanced masked image modeling method (CtxMIM), a simple yet efficient MIM-based self-supervised learning method for remote sensing image understanding. CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches. A context-enhanced generative branch is introduced to provide contextual information through context consistency constraints in the reconstruction. With this simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset without specific temporal or geographical constraints. Finally, extensive experiments show that features learned by CtxMIM outperform fully supervised and state-of-the-art self-supervised learning methods on various downstream tasks, including land cover classification, semantic segmentation, object detection, and instance segmentation. These results demonstrate that CtxMIM learns impressive remote sensing representations with high generalization and transferability. Code and data will be made publicly available.
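The idea described in the abstract (a masked set of patches and the original set of patches passing through a weight-sharing Siamese model, with a reconstruction term plus a context consistency term) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract does not specify the architecture, masking ratio, or loss forms, so the linear encoder and decoder, the L1 reconstruction loss, the mean-squared feature consistency term, and all names (`patchify`, `random_mask`, `toy_losses`) are assumptions made purely for illustration.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def random_mask(num_patches, ratio, rng):
    """Boolean mask over patches; True marks a masked patch."""
    n_mask = int(round(num_patches * ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=n_mask, replace=False)] = True
    return mask


def toy_losses(image, patch=4, ratio=0.5, seed=0):
    """Return (reconstruction loss, consistency loss, mask) for one image."""
    rng = np.random.default_rng(seed)
    patches = patchify(image, patch)
    mask = random_mask(len(patches), ratio, rng)

    # Masked branch input: masked patches are zeroed (stand-in for a mask token).
    masked_in = np.where(mask[:, None], 0.0, patches)

    # Hypothetical shared linear encoder/decoder; both branches use the same
    # weights, which is what makes the setup Siamese in this sketch.
    dim = patches.shape[1]
    w_enc = rng.normal(scale=0.1, size=(dim, 8))
    w_dec = rng.normal(scale=0.1, size=(8, dim))

    feat_masked = masked_in @ w_enc  # branch that sees the masked patches
    feat_full = patches @ w_enc      # branch that sees the original patches
    recon = feat_masked @ w_dec

    # Assumed losses: L1 reconstruction on masked patches, with the original
    # patches as the target, plus an MSE feature-consistency term.
    rec_loss = np.abs(recon - patches)[mask].mean()
    ctx_loss = np.mean((feat_masked - feat_full) ** 2)
    return rec_loss, ctx_loss, mask


if __name__ == "__main__":
    img = np.random.default_rng(1).random((16, 16, 3))
    rec, ctx, m = toy_losses(img)
    print(f"masked patches: {m.sum()}, rec loss: {rec:.4f}, ctx loss: {ctx:.4f}")
```

In a real pre-training setup the encoder would be a vision transformer trained by gradient descent on the combined loss; here the weights are random and nothing is optimized, so the sketch only shows how the two branches and the two loss terms relate.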