
HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments (2401.12975v1)

Published 23 Jan 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Recent advances in high-fidelity virtual environments serve as one of the major driving forces for building intelligent embodied agents to perceive, reason and interact with the physical world. Typically, these environments remain unchanged unless agents interact with them. However, in real-world scenarios, agents might also face dynamically changing environments characterized by unexpected events and need to rapidly take action accordingly. To remedy this gap, we propose a new simulated embodied benchmark, called HAZARD, specifically designed to assess the decision-making abilities of embodied agents in dynamic situations. HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind, and specifically supports the utilization of LLMs to assist common sense reasoning and decision-making. This benchmark enables us to evaluate autonomous agents' decision-making capabilities across various pipelines, including reinforcement learning (RL), rule-based, and search-based methods in dynamically changing environments. As a first step toward addressing this challenge using LLMs, we further develop an LLM-based agent and perform an in-depth analysis of its promise and challenge of solving these challenging tasks. HAZARD is available at https://vis-www.cs.umass.edu/hazard/.

Authors (9)
  1. Qinhong Zhou
  2. Sunli Chen
  3. Yisong Wang
  4. Haozhe Xu
  5. Weihua Du
  6. Hongxin Zhang
  7. Yilun Du
  8. Joshua B. Tenenbaum
  9. Chuang Gan
Citations (8)

Summary

Introduction

Embodied agents, or robots, are increasingly becoming a part of our daily lives, assisting with tasks ranging from mundane chores to complex rescue operations. To function effectively, these agents must not only navigate but also make real-time decisions in response to dynamic changes in their environment. Evaluation of these capabilities has been largely neglected, as existing simulation platforms prioritize agent-driven interactions over the spontaneous, unpredictable changes that characterize real-world settings. The paper addresses this gap by proposing the HAZARD challenge, which evaluates an agent's ability to make decisions in the presence of unexpected disasters such as fire, flood, and wind.

Dynamic Environments in Embodied AI

The HAZARD challenge builds upon the ThreeDWorld (TDW) platform, augmenting it with physical simulation and visual effects that can handle intricate environmental changes, from flames engulfing a room to floods submerging objects. Agents are tasked with rescuing valuable items by accurately perceiving the situation, reasoning about the effects of dynamic changes like rising temperatures or water levels, and planning a rescue strategy. A key innovation of this work is the integration of LLMs to assist with the tasks by providing a semantic understanding of observations, thereby enhancing decision-making.
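To make the dynamics concrete, the following is a minimal, purely illustrative sketch of how a fire scenario might update per-object state each simulation step; the constants, field names, and spread model are assumptions for illustration, not the benchmark's actual TDW-based implementation.

```python
# Hypothetical sketch of a fire-scenario state update: heat sources warm
# nearby objects each step, and objects that cross an ignition threshold
# become new heat sources. All names and constants are illustrative.
AMBIENT = 20.0       # assumed baseline room temperature (Celsius)
IGNITION = 250.0     # assumed temperature at which an object catches fire

def step_fire(objects, heat_sources, spread_rate=0.1):
    """Advance one step: heating falls off with distance to the nearest flame."""
    for obj in objects:
        nearest = min(abs(obj["pos"] - s) for s in heat_sources)
        obj["temp"] += spread_rate * max(0.0, 500.0 - obj["temp"]) / (1.0 + nearest)
        if obj["temp"] >= IGNITION and not obj["burning"]:
            obj["burning"] = True
            heat_sources.append(obj["pos"])  # burning objects spread the fire
    return objects

# A 1-D toy scene: three objects at increasing distance from a flame at 0.0.
objects = [{"pos": p, "temp": AMBIENT, "burning": False} for p in (1.0, 3.0, 6.0)]
sources = [0.0]
for _ in range(50):
    step_fire(objects, sources)
```

The point of such a model is that the environment changes whether or not the agent acts, so rescue plans must account for which objects will be damaged first.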

Leveraging LLMs for Decision Making

To optimally leverage LLMs, the authors developed APIs that integrate visual and historical information into textual formats, reducing the frequency of queries needed to make decisions. This setup allowed for evaluating various decision-making pipelines, including LLM-based agents, against rule-based, search-based, and reinforcement learning approaches. Findings revealed that while LLMs could process basic environmental factors successfully, they faced challenges with more complex elements such as predicting dynamic changes or maintaining consistency between reasoning and prediction.
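A rough sketch of the kind of observation-to-text conversion described above is shown below. The function and field names are assumptions, not the paper's actual API; the idea is simply to pack visual and historical state into one compact prompt so the LLM is queried once per decision rather than per observation.

```python
# Illustrative (hypothetical) serialization of agent observations into a
# single text prompt for an LLM-based decision step.
def observations_to_prompt(objects, agent_state, history):
    lines = ["You are a rescue agent. Current observations:"]
    for obj in objects:
        lines.append(
            f"- {obj['name']}: value={obj['value']}, "
            f"distance={obj['distance']:.1f}m, status={obj['status']}"
        )
    lines.append(f"Agent holding: {agent_state.get('holding') or 'nothing'}")
    if history:
        lines.append("Recent actions: " + ", ".join(history[-3:]))
    lines.append("Which object should be rescued next? Answer with its name.")
    return "\n".join(lines)

prompt = observations_to_prompt(
    [{"name": "laptop", "value": 5, "distance": 2.3, "status": "dry"},
     {"name": "vase", "value": 3, "distance": 1.1, "status": "partly flooded"}],
    {"holding": None},
    ["walk_to vase", "explore"],
)
```

Batching state into text this way trades per-step reactivity for fewer, richer LLM queries, which matters when each query is slow relative to the simulation clock.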

Experiments and Results

Experimental setups for the HAZARD challenge involved an array of indoor and outdoor scenarios, with a procedural generation pipeline ensuring varied and unpredictable environments. Agents were equipped with a refined action space for task execution, and their performance was measured across metrics such as rescue value rate, rescue step, and damage rate. LLM pipelines, using backbones like Llama-13b, GPT-3.5, and GPT-4, demonstrated notable decision-making skills even in zero-shot settings, with GPT-4 performing best. However, when perception challenges were introduced, such as objects being obscured by environmental conditions, all methods suffered performance degradation, highlighting perception as a critical area for further research.
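The three metrics can be sketched as below, under assumed definitions (the paper's exact formulas may differ): rescue value rate as saved value over total value, rescue step as average steps per successful rescue, and damage rate as damaged value over total value.

```python
# Minimal sketch of the three evaluation metrics, with assumed definitions.
def evaluate(episodes):
    total_value = sum(e["total_value"] for e in episodes)
    saved = sum(e["saved_value"] for e in episodes)
    damaged = sum(e["damaged_value"] for e in episodes)
    rescues = sum(e["num_rescued"] for e in episodes)
    steps = sum(e["steps"] for e in episodes)
    return {
        "value_rate": saved / total_value,     # fraction of value rescued
        "rescue_step": steps / max(rescues, 1),  # avg steps per rescue
        "damage_rate": damaged / total_value,  # fraction of value damaged
    }

metrics = evaluate([
    {"total_value": 10, "saved_value": 6, "damaged_value": 2,
     "num_rescued": 3, "steps": 120},
])
# metrics["value_rate"] == 0.6
```

Note that value rate and damage rate need not sum to one, since some items may be neither rescued nor damaged when an episode ends.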

Conclusion

The HAZARD challenge represents a significant step forward in the domain of embodied AI, moving towards the assessment of agents' decision-making abilities in response to dynamic environmental changes. The work underlines the potential of incorporating LLMs into embodied AI tasks and opens up new avenues for future research, including action development to mitigate disaster effects and the integration of more complex decision-making capabilities. The findings from the HAZARD challenge contribute to our understanding of the intersection between embodied AI and disaster management, with implications for autonomous agents' real-world applicability in safety-critical applications.
