ReplanVLM: Replanning Robotic Tasks with Visual Language Models (2407.21762v1)
Abstract: Large language models (LLMs) have gained increasing popularity in robotic task planning due to their exceptional abilities in text analysis and generation, as well as their broad knowledge of the world. However, they fall short in decoding visual cues: LLMs have limited direct perception of the environment, which leads to a deficient grasp of its current state. By contrast, the emergence of visual language models (VLMs) fills this gap by integrating visual perception modules, which can enhance the autonomy of robotic task planning. Despite these advancements, VLMs still face challenges, such as task execution errors, even when provided with accurate instructions. To address such issues, this paper proposes the ReplanVLM framework for robotic task planning. In this study, we focus on error correction interventions. An internal error correction mechanism and an external error correction mechanism are presented to correct errors in their respective phases. A replan strategy is developed to replan tasks or correct erroneous code when task execution fails. Experimental results on real robots and in simulation environments demonstrate the superiority of the proposed framework, with higher success rates and robust error correction capabilities in open-world tasks. Videos of our experiments are available at https://youtu.be/NPk2pWKazJc.
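The abstract describes a two-stage error correction scheme wrapped in a replan loop: an internal check of the VLM-generated plan before execution, and an external check of the observed outcome after execution, with failed attempts triggering a replan. The sketch below illustrates one plausible shape of such a loop; it is not the authors' implementation, and every function name (`query_vlm`, `execute_on_robot`, `capture_scene`) is a hypothetical placeholder.

```python
# Minimal sketch of a replan loop with internal and external error
# correction, in the spirit of the ReplanVLM abstract. All helper
# functions below are hypothetical stubs, not the paper's API.

MAX_ATTEMPTS = 3

def query_vlm(prompt: str, image: bytes | None = None) -> str:
    """Placeholder for a call to a visual language model (e.g., GPT-4V)."""
    raise NotImplementedError

def execute_on_robot(code: str) -> tuple[bool, str]:
    """Placeholder: run generated code on the robot, return (success, log)."""
    raise NotImplementedError

def capture_scene() -> bytes:
    """Placeholder: capture an RGB image of the current workspace."""
    raise NotImplementedError

def replan_task(instruction: str) -> bool:
    for _ in range(MAX_ATTEMPTS):
        scene = capture_scene()
        plan_code = query_vlm(
            f"Plan and write robot code for: {instruction}", scene)

        # Internal error correction: ask the VLM to review its own plan
        # against the observed scene before anything is executed.
        review = query_vlm(
            f"Check this plan for errors given the scene:\n{plan_code}", scene)
        if "ERROR" in review.upper():
            continue  # discard the plan and replan on the next attempt

        # External error correction: execute, then verify the outcome
        # from a fresh observation of the workspace.
        success, log = execute_on_robot(plan_code)
        outcome = query_vlm(
            "Did the task succeed? Answer YES or NO.", capture_scene())
        if success and outcome.strip().upper().startswith("YES"):
            return True

        # Replan strategy: feed the failure log back so the next attempt
        # can correct the faulty code instead of repeating it verbatim.
        instruction = f"{instruction}\nPrevious attempt failed with log:\n{log}"
    return False
```

The split between the two checks is the key design point suggested by the abstract: the internal pass catches planning errors cheaply before the robot moves, while the external pass catches execution errors that only become visible in the scene afterwards.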
Authors: Aoran Mei, Guo-Niu Zhu, Huaxiang Zhang, Zhongxue Gan