Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval (2405.18959v1)

Published 29 May 2024 in cs.CV and cs.MM

Abstract: Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain. Accounting for multi-scale representations in image content and text vocabulary enables models to learn richer representations and enhances retrieval. Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at distinct scales separately. This oversight restricts their ability to learn joint representations suitable for effective retrieval. We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation. Our method comprises three key innovations: (1) a Multi-scale Cross-Modal Alignment Transformer (MSCMAT), which computes cross-attention between single-scale image features and localized text features, integrating global textual context to derive a matching score matrix within a mini-batch; (2) a multi-scale cross-modal semantic alignment loss that enforces semantic alignment across scales; and (3) a cross-scale multi-modal semantic consistency loss that uses the matching matrix from the largest scale to guide alignment at smaller scales. We evaluated our method across multiple datasets, demonstrating its efficacy with various visual backbones and establishing its superiority over existing state-of-the-art methods. Our project is available at https://github.com/yr666666/MSA
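The three components described above lend themselves to a compact illustration. Below is a minimal, hypothetical PyTorch sketch: a per-scale block that cross-attends single-scale image tokens to word-level text features and injects the global textual context, an in-batch matching score matrix per scale, a symmetric InfoNCE-style alignment loss applied at every scale, and a KL consistency loss that distils the detached largest-scale matching distribution into the smaller scales. All class and function names, tensor shapes, the temperature, and the pooled-embedding scoring (used here in place of the paper's pairwise cross-attention scores) are assumptions for illustration, not the authors' implementation; the actual MSA/MSCMAT code is in the linked repository.

```python
# Hypothetical sketch of the abstract's three ideas; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAligner(nn.Module):
    """Cross-attends one scale's image tokens to word-level text tokens and
    injects the global (mean-pooled) textual context, loosely following the
    abstract's description of the MSCMAT block."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N_s, D) tokens from a single visual scale
        # txt_tokens: (B, L, D) localized (word-level) text features
        attended, _ = self.attn(img_tokens, txt_tokens, txt_tokens)
        global_txt = txt_tokens.mean(dim=1, keepdim=True).expand_as(attended)
        fused = self.fuse(torch.cat([attended, global_txt], dim=-1))
        return F.normalize(fused.mean(dim=1), dim=-1)  # (B, D) image embedding

def matching_matrix(img_emb, txt_emb, tau: float = 0.07):
    # In-batch score matrix: entry (i, j) scores image i against caption j.
    return img_emb @ txt_emb.t() / tau

def alignment_loss(scores):
    # Symmetric InfoNCE over the mini-batch; matched pairs sit on the diagonal.
    labels = torch.arange(scores.size(0), device=scores.device)
    return 0.5 * (F.cross_entropy(scores, labels) +
                  F.cross_entropy(scores.t(), labels))

def consistency_loss(scores_small, scores_large):
    # Distil the detached largest-scale matching distribution into a smaller
    # scale, mirroring the cross-scale consistency loss in spirit.
    teacher = F.softmax(scores_large.detach(), dim=-1)
    return F.kl_div(F.log_softmax(scores_small, dim=-1), teacher,
                    reduction="batchmean")

if __name__ == "__main__":
    B, D, L = 8, 256, 20                                     # assumed sizes
    txt_tokens = F.normalize(torch.randn(B, L, D), dim=-1)
    txt_emb = F.normalize(txt_tokens.mean(dim=1), dim=-1)
    aligners = [ScaleAligner(D) for _ in range(3)]           # three scales
    scales = [torch.randn(B, n, D) for n in (49, 196, 784)]  # coarse -> fine
    scores = [matching_matrix(a(x, txt_tokens), txt_emb)
              for a, x in zip(aligners, scales)]
    loss = sum(alignment_loss(s) for s in scores)            # per-scale term
    # The finest scale plays the teacher role here (an assumption about what
    # "largest scale" denotes); its matrix guides the two smaller scales.
    loss = loss + sum(consistency_loss(s, scores[-1]) for s in scores[:-1])
    print(float(loss))
```

Because the teacher matrix is detached before distillation, the consistency term only shapes the smaller scales, which matches the abstract's description of the largest scale guiding alignment at the others.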

Authors (8)
  1. Rui Yang (221 papers)
  2. Shuang Wang (159 papers)
  3. Yingping Han (1 paper)
  4. Yuanheng Li (1 paper)
  5. Dong Zhao (50 papers)
  6. Dou Quan (5 papers)
  7. Yanhe Guo (1 paper)
  8. Licheng Jiao (109 papers)