PIR: Remote Sensing Image-Text Retrieval with Prior Instruction Representation Learning (2405.10160v2)
Abstract: Remote sensing image-text retrieval is a foundational remote sensing interpretation task that aligns vision and language representations. This paper introduces a prior instruction representation (PIR) learning paradigm that draws on prior knowledge to instruct adaptive learning of vision and text representations. Based on PIR, a domain-adapted remote sensing image-text retrieval framework, PIR-ITR, is designed to address semantic noise in vision-language understanding tasks. However, as vision-language foundation models are pre-trained on massive additional data, remote sensing image-text retrieval has further developed into an open-domain retrieval task. Building on this, we propose PIR-CLIP, a domain-specific CLIP-based framework for remote sensing image-text retrieval, to address semantic noise in remote sensing vision-language representations and to further improve open-domain retrieval performance. For vision representation, we use prior knowledge from remote sensing scene recognition to build a belief matrix that selects key features and reduces the impact of semantic noise. For text representation, the previous time step cyclically activates the current time step to strengthen the text representation. A cluster-wise Affiliation Loss (AL) is proposed to constrain inter-class relations and reduce semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that PIR enhances vision and text representations and outperforms state-of-the-art methods in both closed-domain and open-domain retrieval on two benchmark datasets, RSICD and RSITMD.
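To make the two mechanisms in the abstract more concrete, the sketch below is a minimal, hypothetical PyTorch illustration (not the authors' released implementation): it shows (1) pooling visual patch features with a scene-prior "belief" weighting, and (2) a cluster-wise affiliation-style loss that pulls embeddings toward their scene-cluster centroid while keeping different centroids apart. All names, shapes, and the exact loss form (`prior_weighted_pooling`, `affiliation_loss`, the cosine-margin term) are assumptions chosen for illustration, not the paper's formulas.

```python
# Hypothetical sketch of prior-guided feature selection and a cluster-wise
# affiliation loss, assumed forms only; the paper's exact definitions may differ.
import torch
import torch.nn.functional as F


def prior_weighted_pooling(patch_feats: torch.Tensor, belief: torch.Tensor) -> torch.Tensor:
    """Pool patch features using a belief score per patch.

    patch_feats: (B, N, D) visual patch embeddings.
    belief:      (B, N) unnormalized relevance scores, e.g. from a scene-recognition head.
    """
    weights = belief.softmax(dim=-1)                          # normalize belief over patches
    return (weights.unsqueeze(-1) * patch_feats).sum(dim=1)   # (B, D) belief-weighted pooling


def affiliation_loss(embeddings: torch.Tensor, scene_labels: torch.Tensor,
                     margin: float = 0.2) -> torch.Tensor:
    """Pull samples toward their own scene-cluster centroid and push centroids apart."""
    emb = F.normalize(embeddings, dim=-1)
    classes = scene_labels.unique()                           # sorted, present classes only
    centroids = torch.stack([emb[scene_labels == c].mean(dim=0) for c in classes])
    centroids = F.normalize(centroids, dim=-1)

    # Intra-cluster compactness: cosine distance of each sample to its own centroid.
    idx = torch.searchsorted(classes, scene_labels)
    intra = 1.0 - (emb * centroids[idx]).sum(dim=-1)

    # Inter-cluster separation: penalize centroid pairs with similarity above 1 - margin.
    sim = centroids @ centroids.t()
    off_diag = sim - torch.eye(len(classes), device=sim.device)
    inter = F.relu(off_diag - (1.0 - margin)).mean()

    return intra.mean() + inter


if __name__ == "__main__":
    feats = torch.randn(8, 49, 512)     # 8 images, 49 patches, 512-d features (toy sizes)
    belief = torch.randn(8, 49)         # hypothetical scene-prior scores per patch
    pooled = prior_weighted_pooling(feats, belief)
    labels = torch.randint(0, 4, (8,))  # hypothetical scene labels over 4 classes
    print(affiliation_loss(pooled, labels).item())
```

In this toy setup, the belief weighting plays the role of selecting scene-relevant patches before alignment, and the affiliation term illustrates how constraining cluster centroids could shrink semantic confusion zones in the shared subspace.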