Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification (2403.08271v2)
Abstract: Fine-grained ship classification in remote sensing (RS-FGSC) poses a significant challenge due to the high similarity between classes and the limited availability of labeled data, which together limit the effectiveness of traditional supervised classification methods. Recent advances in large pre-trained Vision-Language Models (VLMs) have demonstrated impressive few-shot and zero-shot capabilities, particularly in understanding image content. This study explores how VLMs can be harnessed to improve classification accuracy on unseen ship categories, a setting of considerable practical importance when data are restricted by cost or privacy constraints. Directly fine-tuning VLMs for RS-FGSC often overfits the seen classes, yielding suboptimal generalization to unseen classes and highlighting the difficulty of differentiating complex backgrounds and capturing distinctive ship features. To address these issues, we introduce a novel prompt tuning technique that employs a hierarchical, multi-granularity prompt design. Our approach integrates remote sensing ship priors through bias terms learned by a small trainable network, enhancing the model's generalization while improving its ability to discern intricate backgrounds and learn discriminative ship features. Furthermore, we contribute a comprehensive dataset, FGSCM-52, which significantly expands existing datasets with more extensive data and detailed annotations for less common ship classes. Extensive experimental evaluations demonstrate the superiority of our proposed method over current state-of-the-art techniques. The source code will be made publicly available.
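The abstract does not include code, but the mechanism it describes (learnable prompt tokens at multiple granularity levels, plus image-conditioned bias terms produced by a small trainable network while the VLM backbone stays frozen) can be sketched as follows. This is a minimal illustration assuming a frozen CLIP-like backbone; the class name `BiasedPromptLearner`, the layer sizes, and the two-level hierarchy are hypothetical stand-ins, not taken from the paper. The bias-term conditioning follows the general pattern popularized by conditional prompt learning (CoCoOp), which the paper's design appears related to but may differ from in detail.

```python
import torch
import torch.nn as nn


class BiasedPromptLearner(nn.Module):
    """Minimal sketch of multi-granularity prompt tuning with
    image-conditioned bias terms. Shapes and names are illustrative,
    not the authors' implementation."""

    def __init__(self, n_ctx=4, ctx_dim=512, img_dim=512, n_levels=2):
        super().__init__()
        # One set of learnable context vectors per granularity level
        # (e.g., a coarse "ship" level and a fine subclass level).
        self.ctx = nn.ParameterList(
            [nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
             for _ in range(n_levels)]
        )
        # Small trainable network mapping an image feature to a bias
        # that is added to every context token, injecting image-specific
        # priors into the otherwise static prompts.
        self.meta_net = nn.Sequential(
            nn.Linear(img_dim, img_dim // 16),
            nn.ReLU(inplace=True),
            nn.Linear(img_dim // 16, ctx_dim),
        )

    def forward(self, image_features):
        # image_features: (batch, img_dim) from a frozen image encoder.
        bias = self.meta_net(image_features)      # (batch, ctx_dim)
        bias = bias.unsqueeze(1)                  # (batch, 1, ctx_dim)
        prompts = []
        for ctx in self.ctx:                      # per granularity level
            ctx = ctx.unsqueeze(0)                # (1, n_ctx, ctx_dim)
            prompts.append(ctx + bias)            # (batch, n_ctx, ctx_dim)
        return prompts                            # list over levels


# Usage: only the prompt learner is trained; the VLM stays frozen.
learner = BiasedPromptLearner()
img_feat = torch.randn(8, 512)                    # stand-in CLIP features
level_prompts = learner(img_feat)
print([tuple(p.shape) for p in level_prompts])    # [(8, 4, 512), (8, 4, 512)]
```

The returned per-level prompt tokens would then be concatenated with class-name token embeddings and passed through the frozen text encoder; since only the context vectors and the small meta-network receive gradients, the tunable parameter count stays tiny relative to the backbone, which is what makes this style of prompt tuning efficient.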