Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning (2312.01191v1)
Abstract: Remote sensing image captioning has recently gained significant attention in the remote sensing community. Because of the large variations in spatial resolution across remote sensing images, existing methods have predominantly concentrated on fine-grained extraction of visual features, but they cannot effectively enforce semantic consistency between visual and textual features. To align image and text efficiently, we propose a novel two-stage vision-language pre-training approach that bootstraps interactive image-text alignment for remote sensing image captioning, called BITA, which relies on a lightweight interactive Fourier Transformer to better align remote sensing image-text features. The Fourier layer in the interactive Fourier Transformer extracts multi-scale features of remote sensing images in the frequency domain, thereby reducing the redundancy of the visual features. Specifically, the first stage performs preliminary alignment through image-text contrastive learning, which aligns the multi-scale remote sensing features learned by the interactive Fourier Transformer with textual features. In the second stage, the interactive Fourier Transformer connects the frozen image encoder to a large language model (LLM), and prefix causal language modeling is used to guide text generation with visual features. Experimental results on the UCM-caption, RSICD, and NWPU-caption datasets demonstrate that BITA outperforms other advanced comparative approaches. The code is available at https://github.com/yangcong356/BITA.
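The two ingredients the abstract describes can be illustrated with a minimal sketch: an FNet-style Fourier mixing layer that mixes tokens in the frequency domain instead of using self-attention, and a symmetric image-text contrastive loss of the kind used for the first-stage alignment. This is not the released BITA implementation; the module names, dimensions, and the temperature value are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code) of (1) Fourier-based
# token mixing and (2) a symmetric image-text contrastive (InfoNCE) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FourierMixingLayer(nn.Module):
    """FNet-style block: mix tokens with a 2-D FFT, then a feed-forward net,
    each followed by a residual connection and layer normalization."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim). Keeping the real part of the 2-D FFT mixes
        # information across tokens with no learned attention weights.
        mixed = torch.fft.fft2(x.float(), dim=(-2, -1)).real.to(x.dtype)
        x = self.norm1(x + mixed)
        return self.norm2(x + self.ffn(x))


def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    layer = FourierMixingLayer(dim=256, hidden_dim=1024)
    tokens = torch.randn(2, 196, 256)                     # e.g. ViT patch tokens
    print(layer(tokens).shape)                            # torch.Size([2, 196, 256])
    loss = image_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
    print(loss.item())
```

Replacing attention with an FFT keeps the mixing step parameter-free, which is consistent with the abstract's emphasis on a lightweight alignment module; the second-stage prefix causal language modeling with a frozen LLM is not sketched here.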
Authors: Cong Yang, Zuchao Li, Lefei Zhang