
GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing (2410.19552v2)

Published 25 Oct 2024 in cs.CV

Abstract: Detecting temporal changes in geographical landscapes is critical for applications like environmental monitoring and urban planning. While remote sensing data is abundant, existing vision-language models (VLMs) often fail to capture temporal dynamics effectively. This paper addresses these limitations by introducing an annotated dataset of video frame pairs to track evolving geographical patterns over time. Using fine-tuning techniques such as Low-Rank Adaptation (LoRA), quantized LoRA (QLoRA), and model pruning on models such as Video-LLaVA and LLaVA-NeXT-Video, we significantly enhance VLM performance in processing remote sensing temporal changes. The best-performing configuration achieves a BERT score of 0.864 and a ROUGE-1 score of 0.576, demonstrating superior accuracy in describing land-use transformations.
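
For readers unfamiliar with the ROUGE-1 metric quoted in the abstract, the following is a minimal sketch of how a unigram-overlap score can be computed between a generated change description and a reference description. It is an illustration only: the function name `rouge1`, the lowercased whitespace tokenization, and the example sentences are assumptions made here, and the abstract does not state which ROUGE implementation or variant (precision, recall, or F1) the authors report.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall and F1.

    Assumption: tokens are lowercased and split on whitespace; real ROUGE
    packages may also apply stemming or different tokenization.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical descriptions, not taken from the paper's dataset.
scores = rouge1(
    "new buildings replaced the farmland",
    "farmland was replaced by new buildings",
)
print(scores)
```

In this example four unigrams are shared (new, buildings, replaced, farmland), so precision is 4/5 and recall is 4/6. Because ROUGE-1 ignores word order and meaning, it is commonly paired with an embedding-based metric, as the abstract does.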

