Human Demonstrations are Generalizable Knowledge for Robots (2312.02419v2)
Abstract: Learning from human demonstrations is an emerging trend in the design of intelligent robotic systems. However, previous methods typically treat videos as instructions, simply segmenting them into action sequences for the robot to repeat, which hinders generalization to diverse tasks or object instances. In this paper, we take a different perspective, treating human demonstration videos not as mere instructions but as a source of knowledge for robots. Motivated by this perspective and the strong comprehension and generalization capabilities of LLMs, we propose DigKnow, a method that DIstills Generalizable KNOWledge with a hierarchical structure. Specifically, DigKnow first converts human demonstration video frames into observation knowledge. This knowledge is then analyzed to extract human action knowledge, which is further distilled into pattern knowledge encompassing tasks and object instances, yielding generalizable knowledge organized in a hierarchy. When facing new tasks or object instances, DigKnow retrieves the knowledge relevant to the current task and objects. An LLM-based planner then plans on the basis of the retrieved knowledge, and the policy executes actions in line with the plan to accomplish the designated task. Using the retrieved knowledge, we also validate and correct planning and execution outcomes, which substantially improves the success rate. Experiments across a range of tasks and scenes demonstrate the effectiveness of this approach in enabling real-world robots to accomplish tasks with knowledge derived from human demonstrations.
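The hierarchical knowledge structure and retrieval step described in the abstract can be sketched in code. The following is a minimal illustration, not the paper's implementation: all class names, the overlap-based retrieval score, and the `llm_plan` callable are assumptions standing in for DigKnow's actual prompts, vision models, and policy interface.

```python
# Minimal sketch of a hierarchical knowledge store (observation -> action -> pattern)
# with retrieval and knowledge-based plan validation. Hypothetical structures only.
from dataclasses import dataclass, field


@dataclass
class ObservationKnowledge:
    """Per-frame scene description distilled from a demonstration video."""
    frame_id: int
    objects: list[str]
    relations: list[str]          # e.g. "cup on table"


@dataclass
class ActionKnowledge:
    """Human action inferred from consecutive observations."""
    action: str                   # e.g. "pick up cup"
    pre: list[str]                # relations holding before the action
    post: list[str]               # relations holding after the action


@dataclass
class PatternKnowledge:
    """Task- and object-level patterns distilled from many action sequences."""
    task: str                     # e.g. "tidy the table"
    object_types: set[str]
    action_template: list[str]    # ordered, instance-agnostic action steps


@dataclass
class KnowledgeBase:
    observations: list[ObservationKnowledge] = field(default_factory=list)
    actions: list[ActionKnowledge] = field(default_factory=list)
    patterns: list[PatternKnowledge] = field(default_factory=list)

    def retrieve(self, task: str, scene_objects: set[str]) -> list[PatternKnowledge]:
        """Return patterns whose task keywords and object types overlap the query."""
        def score(p: PatternKnowledge) -> int:
            task_overlap = len(set(task.split()) & set(p.task.split()))
            object_overlap = len(scene_objects & p.object_types)
            return task_overlap + object_overlap
        return sorted((p for p in self.patterns if score(p) > 0), key=score, reverse=True)


def plan_with_retrieved_knowledge(task: str, scene_objects: set[str],
                                  kb: KnowledgeBase, llm_plan) -> list[str]:
    """Plan with an LLM conditioned on retrieved patterns, then validate each step.

    `llm_plan` is a placeholder for an LLM-backed planner: it takes the task and the
    retrieved patterns and returns a list of action strings.
    """
    patterns = kb.retrieve(task, scene_objects)
    plan = llm_plan(task, patterns)
    # Toy validation step: every planned action should mention an object in the scene.
    for step in plan:
        if not any(obj in step for obj in scene_objects):
            raise ValueError(f"Step '{step}' references no object in the current scene")
    return plan


if __name__ == "__main__":
    kb = KnowledgeBase(patterns=[
        PatternKnowledge(task="tidy the table",
                         object_types={"cup", "plate"},
                         action_template=["pick up <obj>", "place <obj> in bin"]),
    ])

    # A trivial stand-in planner that instantiates the best-matching template.
    def toy_llm_plan(task, patterns):
        best = patterns[0]
        return [step.replace("<obj>", "cup") for step in best.action_template]

    print(plan_with_retrieved_knowledge("tidy the table", {"cup", "bowl"}, kb, toy_llm_plan))
```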
Authors: Guangyan Chen, Te Cui, Tianxing Zhou, Zicai Peng, Mengxiao Hu, Meiling Wang, Yi Yang, Yufeng Yue, Haoyang Lu, Haizhou Li