REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction (2306.15724v4)
Abstract: The ability to automatically detect and analyze failed executions is crucial for an explainable and robust robotic system. Recently, large language models (LLMs) have demonstrated strong reasoning abilities on textual inputs. To leverage the power of LLMs for robot failure explanation, we introduce REFLECT, a framework that queries an LLM for failure reasoning based on a hierarchical summary of the robot's past experiences, generated from multisensory observations. The failure explanation can further guide a language-based planner to correct the failure and complete the task. To systematically evaluate the framework, we create the RoboFail dataset covering a variety of tasks and failure scenarios. We demonstrate that the LLM-based framework generates informative failure explanations that assist successful correction planning.
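To make the pipeline concrete, here is a minimal sketch of the three stages the abstract describes: summarizing multisensory observations into text, querying an LLM for a failure explanation, and feeding that explanation to a language-based planner. All class and function names below (`Observation`, `summarize_hierarchically`, `explain_failure`, `plan_correction`) are illustrative assumptions, not the authors' actual API, and the `llm` callable stands in for whatever model backend is used.

```python
# Hypothetical sketch of the REFLECT-style pipeline; names are assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Observation:
    timestamp: float
    rgb_caption: str    # e.g. output of a vision-language captioner
    audio_caption: str  # e.g. output of an audio event classifier
    robot_state: str    # e.g. "gripper closed, holding 'cup'"

def summarize_hierarchically(obs: list[Observation]) -> str:
    """Collapse raw multisensory observations into a compact,
    time-ordered textual summary of the robot's experience."""
    lines = [
        f"[t={o.timestamp:.1f}s] {o.robot_state}; "
        f"sees: {o.rgb_caption}; hears: {o.audio_caption}"
        for o in obs
    ]
    return "\n".join(lines)

def explain_failure(llm: Callable[[str], str], task: str, summary: str) -> str:
    """Query the LLM for a failure explanation grounded in the summary."""
    prompt = (
        f"Task: {task}\n"
        f"Execution summary:\n{summary}\n"
        "The task failed. Explain the most likely cause of failure."
    )
    return llm(prompt)

def plan_correction(llm: Callable[[str], str], task: str, explanation: str) -> str:
    """Ask a language-based planner for corrective steps, conditioned
    on the failure explanation."""
    prompt = (
        f"Task: {task}\n"
        f"Failure explanation: {explanation}\n"
        "Propose a short corrective plan to complete the task."
    )
    return llm(prompt)
```

In use, `llm` would wrap an API call to a model such as GPT-4; the key design choice the abstract highlights is that the LLM reasons over a hierarchical textual summary rather than raw sensor streams, and the resulting explanation is reused as context for correction planning.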