GeoLLM-Engine: A Realistic Environment for Building Geospatial Copilots (2404.15500v1)
Abstract: Geospatial Copilots unlock unprecedented potential for performing Earth Observation (EO) applications through natural language instructions. However, existing agents rely on overly simplified single tasks and template-based prompts, creating a disconnect with real-world scenarios. In this work, we present GeoLLM-Engine, an environment for tool-augmented agents with intricate tasks routinely executed by analysts on remote sensing platforms. We enrich our environment with geospatial API tools, dynamic maps/UIs, and external multimodal knowledge bases to properly gauge an agent's proficiency in interpreting realistic high-level natural language commands and its functional correctness in task completions. By alleviating overheads typically associated with human-in-the-loop benchmark curation, we harness our massively parallel engine across 100 GPT-4-Turbo nodes, scaling to over half a million diverse multi-tool tasks spanning 1.1 million satellite images. By moving beyond traditional single-task image-caption paradigms, we investigate state-of-the-art agents and prompting techniques against long-horizon prompts.