SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model (2401.09712v1)
Abstract: Large language models (LLMs) have recently been extended to the vision-language realm, obtaining impressive general multi-modal capabilities. However, the exploration of multi-modal LLMs (MLLMs) for remote sensing (RS) data is still in its infancy, and the performance is not satisfactory. In this work, we introduce SkyEyeGPT, a unified multi-modal LLM specifically designed for RS vision-language understanding. To this end, we meticulously curate an RS multi-modal instruction tuning dataset, including single-task and multi-task conversation instructions. After manual verification, we obtain a high-quality RS instruction-following dataset with 968k samples. Our research demonstrates that with a simple yet effective design, SkyEyeGPT works surprisingly well on considerably different tasks without the need for extra encoding modules. Specifically, after projecting RS visual features to the language domain via an alignment layer, they are fed jointly with task-specific instructions into an LLM-based RS decoder to predict answers for RS open-ended tasks. In addition, we design a two-stage tuning method to enhance instruction-following and multi-turn dialogue ability at different granularities. Experiments on 8 datasets for RS vision-language tasks demonstrate SkyEyeGPT's superiority in image-level and region-level tasks, such as captioning and visual grounding. In particular, SkyEyeGPT exhibits encouraging results compared to GPT-4V in some qualitative tests. The online demo, code, and dataset will be released at https://github.com/ZhanYang-nwpu/SkyEyeGPT.
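The abstract describes the core pipeline only at a high level: visual features from an RS image encoder are projected into the language domain by an alignment layer and then concatenated with embedded task instructions before being passed to the LLM decoder. The sketch below illustrates that idea in PyTorch; the class and function names (`VisualAlignmentLayer`, `build_multimodal_input`), the feature dimensions, and the use of a single linear projection are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class VisualAlignmentLayer(nn.Module):
    """Hypothetical sketch of the abstract's 'alignment layer': map visual
    encoder features into the LLM embedding space with a linear projection."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats: torch.Tensor) -> torch.Tensor:
        # vis_feats: (batch, num_patches, vis_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vis_feats)


def build_multimodal_input(vis_tokens: torch.Tensor,
                           instruction_embeds: torch.Tensor) -> torch.Tensor:
    """Concatenate projected visual tokens with embedded instruction tokens
    so the LLM-based decoder receives one joint input sequence."""
    return torch.cat([vis_tokens, instruction_embeds], dim=1)


if __name__ == "__main__":
    batch, patches, vis_dim, llm_dim, txt_len = 2, 256, 1024, 4096, 32
    align = VisualAlignmentLayer(vis_dim, llm_dim)
    vis_feats = torch.randn(batch, patches, vis_dim)      # stand-in for RS encoder output
    instr_embeds = torch.randn(batch, txt_len, llm_dim)   # stand-in for instruction embeddings
    joint = build_multimodal_input(align(vis_feats), instr_embeds)
    print(joint.shape)  # torch.Size([2, 288, 4096])
```

In this reading, the two-stage tuning the abstract mentions would train such an alignment layer (and optionally the decoder) first on single-task instructions and then on multi-task conversation instructions; the exact training recipe is not specified here.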
Authors: Yang Zhan, Zhitong Xiong, Yuan Yuan