
Composed Image Retrieval for Remote Sensing (2405.15587v3)

Published 24 May 2024 in cs.CV

Abstract: This work introduces composed image retrieval to remote sensing. It allows querying a large image archive by an image example complemented by a textual description, enriching the descriptive power over unimodal queries, either visual or textual. Various attributes can be modified by the textual part, such as shape, color, or context. A novel method fusing image-to-image and text-to-image similarity is introduced. We demonstrate that a vision-language model possesses sufficient descriptive power and that no further learning step or training data are necessary. We present a new evaluation benchmark focused on color, context, density, existence, quantity, and shape modifications. Our work not only sets the state of the art for this task, but also serves as a foundational step in addressing a gap in the field of remote sensing image retrieval. Code at: https://github.com/billpsomas/rscir


Summary

  • The paper introduces FreeDom, an innovative method that flexibly integrates image and text inputs for improved remote sensing retrieval.
  • It leverages vision-language models like CLIP and RemoteCLIP with a tunable lambda to balance visual and textual query components.
  • Experiments on the PatternCom benchmark show FreeDom outperforms baselines by 8.50% to 11.66% mAP, marking a significant advance in the field.

Remote Sensing Composed Image Retrieval: Integrating Image and Text in Query Formulations

The paper presents a novel approach to remote sensing image retrieval (RSIR) that integrates both an image and a textual description in the query formulation, referred to as remote sensing composed image retrieval (RSCIR). This addresses a fundamental limitation of traditional RSIR systems, which search by unimodal queries, either visual or textual, and thus restrict users from fully expressing the complex, dynamic requirements associated with observed Earth phenomena.

Methodology

The introduced method leverages vision-language models (VLMs) and proposes FreeDom, a training-free approach that allows flexible weighting between the image and text components of a query. This modal control is parameterized by λ: the query can range from entirely image-based (λ = 0) to entirely text-based (λ = 1).

The VLMs employed are CLIP and RemoteCLIP, which map both image and text inputs into a shared embedding space. This dual-encoder architecture is central to FreeDom, ensuring that both modalities contribute effectively to the retrieval process. A similarity normalization step, which transforms similarity scores into a uniform distribution, plays a crucial role in balancing the two modalities: it prevents either unimodal similarity from dominating and makes retrieval more responsive to the nuances of combined queries.
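The fusion described above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: in particular, the use of rank-based normalization to map scores to a uniform distribution is an assumption about how the normalization step might be realized, and the toy similarity scores are invented.

```python
import numpy as np

def rank_normalize(scores):
    # Map raw similarity scores to a uniform distribution in [0, 1]
    # via their ranks, so neither modality dominates the fusion.
    # (Assumed realization of the paper's similarity normalization.)
    ranks = scores.argsort().argsort().astype(float)
    return ranks / (len(scores) - 1)

def compose_query(sim_image, sim_text, lam=0.5):
    # Convex combination of image-to-image and text-to-image
    # similarities; lam = 0 is purely visual, lam = 1 purely textual.
    s_img = rank_normalize(np.asarray(sim_image))
    s_txt = rank_normalize(np.asarray(sim_text))
    return (1 - lam) * s_img + lam * s_txt

# Toy archive of 5 database items: visual similarity favors item 0,
# textual similarity favors item 4; the fused score balances both.
fused = compose_query([0.9, 0.8, 0.2, 0.1, 0.3],
                      [0.3, 0.8, 0.4, 0.1, 0.9], lam=0.6)
best = int(fused.argmax())
```

With λ = 0.6 the text modality is weighted slightly more, so the item ranked highest by text (item 4) wins despite its middling visual score.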

Experimental Setup

The paper introduces PatternCom, a benchmark dataset derived from the PatternNet dataset and tailored specifically to evaluating composed image retrieval. PatternCom covers the attributes color, context, density, existence, quantity, and shape, pairing each with corresponding attribute values across various classes. This setup provides a comprehensive testbed for assessing the performance and versatility of the proposed retrieval method.

Results

Empirical results demonstrate that FreeDom significantly outperforms both unimodal and basic multimodal baselines. For instance, FreeDom surpasses the second-best baseline by 8.50% mean average precision (mAP) using CLIP and by 11.66% mAP using RemoteCLIP. These findings validate the enhanced retrieval capabilities afforded by integrating textual descriptions with visual queries.
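For reference, mean average precision over ranked retrieval results can be computed as below. This is a generic sketch of the standard metric; the paper's exact evaluation protocol may differ, and the toy relevance lists are invented.

```python
def average_precision(relevance):
    # AP over a ranked list of binary relevance labels:
    # the mean of precision@k taken at each relevant position k.
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(ranked_lists):
    # mAP: average the per-query AP over all queries.
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Two toy queries with relevant items at different ranks.
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]])
```

Here the first query has AP = (1/1 + 2/3)/2 and the second AP = (1/2 + 2/3)/2, so the mAP is their mean, about 0.708.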

Implications and Future Work

The implications of this research are multifaceted. Practically, the ability to query remote sensing archives using composed image-text queries enhances user expressiveness and retrieval accuracy, aligning closely with professional analysts' needs when dealing with complex geographical data. This holds substantial utility for tasks requiring precise image retrieval based on specific attributes, such as disaster response, urban planning, and environmental monitoring.

Theoretically, this approach highlights the potential of VLMs in remote sensing applications beyond traditional unimodal tasks. By demonstrating that training-free models can effectively handle composed queries, this work suggests promising directions for future research. One potential area is the exploration of fine-tuning VLMs specifically on remote sensing data to further enhance retrieval accuracy. Additionally, extending the benchmark dataset to include more diverse and complex attributes or incorporating temporal dimensions could provide further insights into the model's capabilities and limitations.

In conclusion, the introduced method and benchmark represent a significant step towards more expressive and powerful remote sensing image retrieval systems. The FreeDom method, with its flexible and training-free nature, sets a new state of the art for the task, showcasing the effectiveness of integrating vision and language models in remote sensing applications. The research paves the way for future enhancements and broader applications of composed image retrieval in the field.