
GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots (2404.15500v1)

Published 23 Apr 2024 in cs.AI, cs.CL, and cs.LG

Abstract: Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completions. By alleviating overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks and across 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.

