
Composed Image Retrieval for Remote Sensing (2405.15587v3)

Published 24 May 2024 in cs.CV

Abstract: This work introduces composed image retrieval to remote sensing. It allows querying a large image archive by an image example complemented by a textual description, enriching the descriptive power over unimodal queries, either visual or textual. Various attributes can be modified by the textual part, such as shape, color, or context. A novel method fusing image-to-image and text-to-image similarity is introduced. We demonstrate that a vision-language model possesses sufficient descriptive power and that no further learning step or training data are necessary. We present a new evaluation benchmark focused on color, context, density, existence, quantity, and shape modifications. Our work not only sets the state of the art for this task, but also serves as a foundational step in addressing a gap in the field of remote sensing image retrieval. Code at: https://github.com/billpsomas/rscir


Summary

  • The paper introduces FreeDom, an innovative method that flexibly integrates image and text inputs for improved remote sensing retrieval.
  • It leverages vision-language models like CLIP and RemoteCLIP with a tunable lambda to balance visual and textual query components.
  • Experiments on the PatternCom benchmark show FreeDom outperforms baselines by 8.50% to 11.66% mAP, marking a significant advance in the field.

Remote Sensing Composed Image Retrieval: Integrating Image and Text in Query Formulations

The paper presents a novel approach to remote sensing image retrieval (RSIR) that integrates both an image and a textual description in the query formulation, referred to as remote sensing composed image retrieval (RSCIR). This addresses a fundamental limitation of traditional RSIR systems, which search by unimodal queries, either visual or textual, and thus restrict users from fully expressing the complex, dynamic requirements associated with observed Earth phenomena.

Methodology

The introduced method leverages vision-language models (VLMs) and proposes FreeDom, a training-free approach that allows flexible weighting between the image and text components of a query. This modal control is parameterized by λ: the query can range from entirely image-based (λ = 0) to entirely text-based (λ = 1).

The VLMs employed are CLIP and RemoteCLIP, which map both image and text inputs into a shared embedding space. This dual-encoder architecture is central to FreeDom, ensuring that both modalities contribute effectively to the retrieval process. A similarity normalization step, which transforms similarity scores into a uniform distribution, plays a crucial role in balancing the two modalities: it prevents either unimodal similarity from dominating and makes retrieval more responsive to the nuances of combined queries.
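The fusion described above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: in particular, the use of rank-based normalization to map scores to a uniform distribution is an assumption about how the normalization step might be realized, and the toy similarity scores are invented.

```python
import numpy as np

def rank_normalize(scores):
    # Map raw similarity scores to a uniform distribution in [0, 1]
    # via their ranks, so neither modality dominates the fusion.
    # (Assumed realization of the paper's similarity normalization.)
    ranks = scores.argsort().argsort().astype(float)
    return ranks / (len(scores) - 1)

def compose_query(sim_image, sim_text, lam=0.5):
    # Convex combination of image-to-image and text-to-image
    # similarities; lam = 0 is purely visual, lam = 1 purely textual.
    s_img = rank_normalize(np.asarray(sim_image))
    s_txt = rank_normalize(np.asarray(sim_text))
    return (1 - lam) * s_img + lam * s_txt

# Toy archive of 5 database items: visual similarity favors item 0,
# textual similarity favors item 4; the fused score balances both.
fused = compose_query([0.9, 0.8, 0.2, 0.1, 0.3],
                      [0.3, 0.8, 0.4, 0.1, 0.9], lam=0.6)
best = int(fused.argmax())
```

With λ = 0.6 the text modality is weighted slightly more, so the item ranked highest by text (item 4) wins despite its middling visual score.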

Experimental Setup

The paper introduces PatternCom, a benchmark dataset derived from the PatternNet dataset and tailored specifically to evaluating composed image retrieval. PatternCom covers the attributes color, context, density, existence, quantity, and shape, pairing each with corresponding attribute values across various classes. This setup provides a comprehensive testbed for assessing the performance and versatility of the proposed retrieval method.

Results

Empirical results demonstrate that FreeDom significantly outperforms both unimodal and basic multimodal baselines. For instance, FreeDom surpasses the second-best baseline by 8.50% mean average precision (mAP) using CLIP and by 11.66% mAP using RemoteCLIP. These findings validate the enhanced retrieval capabilities afforded by integrating textual descriptions with visual queries.
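For reference, mean average precision over ranked retrieval results can be computed as below. This is a generic sketch of the standard metric; the paper's exact evaluation protocol may differ, and the toy relevance lists are invented.

```python
def average_precision(relevance):
    # AP over a ranked list of binary relevance labels:
    # the mean of precision@k taken at each relevant position k.
    hits, precisions = 0, []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(ranked_lists):
    # mAP: average the per-query AP over all queries.
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Two toy queries with relevant items at different ranks.
map_score = mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 0]])
```

Here the first query has AP = (1/1 + 2/3)/2 and the second AP = (1/2 + 2/3)/2, so the mAP is their mean, about 0.708.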

Implications and Future Work

The implications of this research are multifaceted. Practically, the ability to query remote sensing archives using composed image-text queries enhances user expressiveness and retrieval accuracy, aligning closely with professional analysts' needs when dealing with complex geographical data. This holds substantial utility for tasks requiring precise image retrieval based on specific attributes, such as disaster response, urban planning, and environmental monitoring.

Theoretically, this approach highlights the potential of VLMs in remote sensing applications beyond traditional unimodal tasks. By demonstrating that training-free models can effectively handle composed queries, this work suggests promising directions for future research. One potential area is the exploration of fine-tuning VLMs specifically on remote sensing data to further enhance retrieval accuracy. Additionally, extending the benchmark dataset to include more diverse and complex attributes or incorporating temporal dimensions could provide further insights into the model's capabilities and limitations.

In conclusion, the introduced method and benchmark represent a significant step towards more expressive and powerful remote sensing image retrieval systems. The FreeDom method, with its flexible and training-free nature, sets a new state of the art for the task, showcasing the effectiveness of integrating vision and language models in remote sensing applications. The research paves the way for future enhancements and broader applications of composed image retrieval in the field.