C-SAW: Self-Supervised Prompt Learning for Image Generalization in Remote Sensing (2311.15812v1)
Abstract: We focus on domain and class generalization problems in analyzing optical remote sensing images, using the large-scale pre-trained vision-language model (VLM), CLIP. While contrastively trained VLMs show impressive zero-shot generalization performance, their effectiveness is limited when dealing with diverse domains during training and testing. Existing prompt learning techniques overlook the importance of incorporating domain and content information into the prompts, which results in a drop in performance when dealing with such multi-domain data. To address these challenges, we propose a solution that ensures domain-invariant prompt learning while enhancing the expressiveness of visual features. We observe that CLIP's vision encoder struggles to identify contextual image information, particularly when image patches are jumbled up. This issue is especially severe in optical remote sensing images, where land-cover classes exhibit well-defined contextual appearances. To this end, we introduce C-SAW, a method that complements CLIP with a self-supervised loss in the visual space and a novel prompt learning technique that emphasizes both visual domain and content-specific features. We keep the CLIP backbone frozen and introduce a small set of projectors for both CLIP encoders to train C-SAW contrastively. Experimental results demonstrate the superiority of C-SAW across multiple remote sensing benchmarks and different generalization tasks.
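To make the "jumbled patches" observation concrete, the following is a minimal sketch of how an image could be split into a grid of patches and reassembled in random order. It is an illustration only: the abstract does not specify the patch size, the shuffling procedure, or the form of the self-supervised loss, so the function name `jumble_patches`, the `grid` parameter, and the NumPy-based implementation are all assumptions made here.

```python
import numpy as np


def jumble_patches(image, grid=2, rng=None):
    """Shuffle the grid x grid patches of an H x W x C image.

    Hypothetical helper for illustration; not taken from the paper.
    Returns the jumbled image and the permutation that was applied.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    if h % grid or w % grid:
        raise ValueError("image height and width must be divisible by grid")
    ph, pw = h // grid, w // grid

    # Cut the image into patches: (grid*grid, ph, pw, c), row-major order.
    patches = (
        image.reshape(grid, ph, grid, pw, c)
        .swapaxes(1, 2)
        .reshape(grid * grid, ph, pw, c)
    )

    # Output position k receives the source patch with index perm[k].
    perm = rng.permutation(grid * grid)
    shuffled = patches[perm]

    # Stitch the shuffled patches back into a full image.
    jumbled = (
        shuffled.reshape(grid, grid, ph, pw, c)
        .swapaxes(1, 2)
        .reshape(h, w, c)
    )
    return jumbled, perm


if __name__ == "__main__":
    demo = np.arange(4 * 4 * 3).reshape(4, 4, 3)
    jumbled, perm = jumble_patches(demo, grid=2, rng=np.random.default_rng(0))
    print(jumbled.shape, perm)
```

A jumbled view like this keeps every pixel but destroys the spatial layout, which is the kind of contextual information the abstract says CLIP's vision encoder struggles to identify. How C-SAW actually uses such views in its loss is described in the full paper, not here.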
Authors: Avigyan Bhattacharya, Mainak Singha, Ankit Jha, Biplab Banerjee