
AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models (2408.15511v1)

Published 28 Aug 2024 in cs.RO and cs.AI

Abstract: Aerospace embodied intelligence aims to empower unmanned aerial vehicles (UAVs) and other aerospace platforms to achieve autonomous perception, cognition, and action, as well as egocentric active interaction with humans and the environment. The aerospace embodied world model is an effective means of realizing autonomous UAV intelligence and a necessary pathway toward aerospace embodied intelligence. However, existing embodied world models focus primarily on ground-level agents in indoor scenarios, while UAV agents remain largely unexplored. To address this gap, we construct the first large-scale real-world image-text pre-training dataset, AerialAgent-Ego10k, featuring first-person urban drone views. We also create a virtual image-text-pose alignment dataset, CyberAgent-Ego500k, to support pre-training of the aerospace embodied world model. For the first time, we clearly define 5 downstream tasks, i.e., aerospace embodied scene awareness, spatial reasoning, navigational exploration, task planning, and motion decision, and construct the corresponding instruction datasets, i.e., SkyAgent-Scene3k, SkyAgent-Reason3k, SkyAgent-Nav3k, SkyAgent-Plan3k, and SkyAgent-Act3k, for fine-tuning the aerospace embodied world model. In addition, we develop SkyAgentEval, a set of GPT-4-based evaluation metrics for the downstream tasks, to assess results comprehensively, flexibly, and objectively, revealing the potential and limitations of 2D/3D vision-language models in UAV-agent tasks. Furthermore, we integrate more than 10 2D/3D vision-language models, 2 pre-training datasets, 5 fine-tuning datasets, more than 10 evaluation metrics, and a simulator into the benchmark suite AeroVerse, which will be released to the community to promote exploration and development of aerospace embodied intelligence.
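
Since SkyAgentEval is described here only at a high level (GPT-4 acts as the scorer for the five downstream tasks), the sketch below illustrates how such an LLM-as-a-judge metric could be wired up. It is a minimal sketch under assumptions: the prompt, the 1-10 rubric, and the function names are illustrative and are not the paper's released implementation.

```python
# Illustrative sketch only: SkyAgentEval's actual prompt, rubric, and scoring
# scale are not given on this page, so everything below is an assumption.
import json
from openai import OpenAI  # official OpenAI Python client (>= 1.0)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating a UAV agent's answer for the task: {task}.
Ground-truth reference: {reference}
Model answer: {answer}
Rate the answer from 1 (unusable) to 10 (matches the reference) and briefly justify.
Respond as JSON: {{"score": <int>, "reason": "<short justification>"}}"""


def judge_response(task: str, reference: str, answer: str, model: str = "gpt-4") -> dict:
    """Ask a GPT-4 judge to score one model output against a reference answer."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(task=task, reference=reference, answer=answer),
        }],
    )
    return json.loads(completion.choices[0].message.content)


if __name__ == "__main__":
    result = judge_response(
        task="aerospace embodied scene awareness",
        reference="A four-lane road runs north-south with a construction site on its east side.",
        answer="The drone sees a wide road with construction equipment to its right.",
    )
    print(result)  # e.g. {"score": 7, "reason": "..."}
```

In a benchmark setting, scores from such a judge would typically be averaged per task over the corresponding instruction dataset and reported alongside conventional metrics such as BLEU or CIDEr.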

