RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
The paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" proposes an innovative vision-LLM designed to bridge the gap between image and text understanding within the remote sensing domain. Traditional self-supervised learning (SSL) methods, especially Masked Image Modeling (MIM), have not adequately addressed the integration of language understanding, primarily targeting low-level feature extraction. This limitation underscores the need for an advanced method capable of sophisticated semantic understanding and cross-modal applicability. RemoteCLIP seeks to fill this void by leveraging the capabilities of CLIP models, enhanced through domain-specific continuous pretraining and a novel data scaling approach.
Methodology and Implementation
The development of RemoteCLIP is rooted in a twofold approach: harnessing the proven effectiveness of CLIP models and addressing data scarcity issues inherent to remote sensing. Here's a detailed breakdown of the paper’s key methodological contributions:
- Contrastive Language-Image Pretraining (CLIP): The authors build on CLIP models, which optimize an InfoNCE loss that pulls paired image-text samples together and pushes mismatched pairs apart (a standard formulation of this objective is given after this list). This alignment yields robust visual and semantic features that transfer well to downstream remote sensing tasks.
- Data Scaling via Annotation Unification: They introduce Box-to-Caption (B2C) generation and Mask-to-Box (M2B) conversion, which transform heterogeneous annotations from existing detection and segmentation datasets into a unified image-caption format (a simplified sketch of both conversions also follows this list). This process dramatically increases the dataset scale, making it 12 times larger than the combination of all available remote sensing datasets.
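For reference, the symmetric InfoNCE objective that CLIP-style models optimize can be written as follows; the notation (batch size $N$, temperature $\tau$) is the standard one rather than taken verbatim from the paper:

$$
\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(\langle v_i, t_i\rangle/\tau)}{\sum_{j=1}^{N}\exp(\langle v_i, t_j\rangle/\tau)} + \log\frac{\exp(\langle v_i, t_i\rangle/\tau)}{\sum_{j=1}^{N}\exp(\langle v_j, t_i\rangle/\tau)}\right]
$$

where $v_i$ and $t_i$ are the L2-normalized image and text embeddings of the $i$-th pair, $\langle\cdot,\cdot\rangle$ is cosine similarity, and $\tau$ is a learnable temperature. The first term retrieves the correct caption for each image; the second retrieves the correct image for each caption.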
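The B2C and M2B procedures can be read, in simplified form, as follows: M2B reduces a segmentation mask to its tight bounding box, and B2C turns the class labels of an image's boxes into a counting-style caption. The sketch below is illustrative only; the function names and caption template are assumptions, not the authors' exact implementation.

```python
from collections import Counter
import numpy as np

def mask_to_box(mask: np.ndarray) -> tuple[int, int, int, int]:
    """M2B: reduce a binary segmentation mask (H, W) to the tight
    bounding box (x_min, y_min, x_max, y_max) around its foreground pixels."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def boxes_to_caption(box_labels: list[str]) -> str:
    """B2C: turn the class labels of an image's boxes into a simple
    counting caption (the real templates are richer than this one)."""
    counts = Counter(box_labels)
    parts = [f"{n} {label}{'s' if n > 1 else ''}" for label, n in counts.items()]
    return "There are " + " and ".join(parts) + " in the scene."

# Example: a detection image annotated with three boxes.
print(boxes_to_caption(["airplane", "airplane", "vehicle"]))
# -> "There are 2 airplanes and 1 vehicle in the scene."
```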
Strong Numerical Results
RemoteCLIP demonstrates significant performance improvements across various tasks, surpassing previous models by notable margins:
- Retrieval Performance: On the RSITMD and RSICD image-text retrieval benchmarks, RemoteCLIP consistently outperforms existing methods, improving mean recall over the best prior models by 9.14% and 8.92%, respectively (a sketch of how mean recall is computed follows this list).
- Zero-Shot Classification: The model also shows stronger zero-shot capabilities, with up to a 6.39% gain in average accuracy over baseline CLIP models across 12 downstream datasets (a minimal zero-shot classification sketch also appears below). This result underscores its suitability for applications that require rapid adaptation without task-specific retraining.
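For clarity on the retrieval metric: mean recall is conventionally the average of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image). The sketch below computes it from a precomputed image-text similarity matrix, assuming one ground-truth caption per image for simplicity; this is the common convention rather than the paper's exact evaluation code.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries (rows) whose ground-truth match (the diagonal
    element) appears among the top-k most similar candidates (columns)."""
    ranks = (-sim).argsort(axis=1)
    hits = [i in ranks[i, :k] for i in range(sim.shape[0])]
    return float(np.mean(hits))

def mean_recall(sim: np.ndarray) -> float:
    """Average of R@1/5/10 over image-to-text (rows = images) and
    text-to-image (rows = texts, i.e. the transposed matrix) retrieval."""
    scores = [recall_at_k(m, k) for m in (sim, sim.T) for k in (1, 5, 10)]
    return float(np.mean(scores))
```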
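For zero-shot classification, the standard CLIP recipe is to embed each class name inside a text prompt and predict the class whose embedding is most similar to the image embedding. The sketch below uses the open_clip library with a generic checkpoint; the prompt template, class names, image path, and checkpoint tag are illustrative placeholders, not the paper's evaluation setup.

```python
import torch
import open_clip
from PIL import Image

# Any CLIP-style checkpoint works here; RemoteCLIP weights would simply be
# loaded in place of the generic `pretrained` tag below.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

class_names = ["airport", "forest", "harbor", "residential area"]
text_tokens = tokenizer([f"a satellite photo of a {c}" for c in class_names])
image = preprocess(Image.open("scene.jpg")).unsqueeze(0)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text_tokens)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    # Cosine similarity between the image and every class prompt.
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print("predicted class:", class_names[probs.argmax().item()])
```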
Implications and Future Directions
The broader implications of the RemoteCLIP framework are multifaceted, cutting across theoretical and practical domains:
- Enabling Comprehensive GeoAI Tools: The integration of robust vision and language understanding positions RemoteCLIP as a pivotal tool in geospatial artificial intelligence. It can enhance applications like open-vocabulary object detection, image-text retrieval, and potentially inform the development of multimodal LLMs for remote sensing.
- Scalable and Versatile Framework: Through its scalable data framework and foundation model approach, RemoteCLIP lays the groundwork for future developments in AI that could address more complex tasks and embrace richer modalities beyond the visual spectrum.
- Future Development Potential: The authors suggest further scaling of both model capacity and data diversity, possibly integrating generative LLMs and exploring additional sensor modalities to enrich the learned representations, thereby extending applicability to more complex real-world scenarios.
In conclusion, the RemoteCLIP paper demonstrates substantial advances for the remote sensing field in both methodology and applicability. By combining continual pretraining of CLIP models with an effective data-scaling pipeline, it sets a new benchmark for future research and development within the field. As AI continues to evolve, models like RemoteCLIP offer an adaptable platform for increasingly nuanced and integrated geospatial intelligence solutions.