Natural Language as Policies: Reasoning for Coordinate-Level Embodied Control with LLMs (2403.13801v2)
Abstract: We present experimental results on using LLMs to solve robotics task-planning problems. LLMs have recently been applied to robotics task planning, most commonly through a code-generation approach that converts complex high-level instructions into mid-level policy code. In contrast, our approach takes text descriptions of the task and the scene objects, formulates the task plan through natural-language reasoning, and outputs coordinate-level control commands, removing the need for intermediate policy code built on pre-defined APIs. We evaluate our approach on a multi-modal prompt simulation benchmark and show that prompting with explicit natural-language reasoning significantly improves success rates over prompting without it. Furthermore, our approach demonstrates the potential of natural-language descriptions to transfer robotic skills from known tasks to previously unseen tasks. Project website: https://natural-language-as-policies.github.io/
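To make the pipeline concrete, below is a minimal sketch (not the authors' implementation) of the prompt-to-coordinate idea described in the abstract: the task and the scene objects are described in plain text, the LLM is asked to reason step by step in natural language, and a coordinate-level command is parsed from the end of its answer. The prompt wording, the `PICK ... PLACE ...` output format, and the helper names (`build_prompt`, `parse_command`, `plan`, `call_llm`) are illustrative assumptions, not the paper's exact prompts or parsers.

```python
# Sketch of natural-language reasoning to coordinate-level control (assumptions noted above).
from __future__ import annotations

import re
from typing import Callable


def build_prompt(instruction: str, objects: dict[str, tuple[float, float, float]]) -> str:
    """Describe the task and the scene objects (name -> x, y, z in metres) as text."""
    scene_lines = [f"- {name}: x={x:.3f}, y={y:.3f}, z={z:.3f}"
                   for name, (x, y, z) in objects.items()]
    return (
        "You control a robot arm. Scene objects and their positions:\n"
        + "\n".join(scene_lines)
        + f"\n\nTask: {instruction}\n"
        "First reason step by step in natural language about which object to move and where.\n"
        "Then output exactly one final line of the form:\n"
        "PICK x y z PLACE x y z\n"
    )


def parse_command(answer: str) -> tuple[tuple[float, ...], tuple[float, ...]]:
    """Extract the coordinate-level command from the LLM's natural-language answer."""
    match = re.search(
        r"PICK\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)\s+"
        r"PLACE\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)",
        answer,
    )
    if match is None:
        raise ValueError("No coordinate-level command found in the answer.")
    nums = tuple(float(g) for g in match.groups())
    return nums[:3], nums[3:]


def plan(instruction: str,
         objects: dict[str, tuple[float, float, float]],
         call_llm: Callable[[str], str]) -> tuple[tuple[float, ...], tuple[float, ...]]:
    """One planning step: text in, (pick_xyz, place_xyz) out. `call_llm` is any LLM backend."""
    return parse_command(call_llm(build_prompt(instruction, objects)))


# Example with a stubbed LLM; a real backend would return its reasoning followed by the final line.
fake_llm = lambda prompt: (
    "The red block should be moved onto the tray, so I pick at the block's position "
    "and place at the tray's position.\nPICK 0.10 0.20 0.02 PLACE 0.40 0.20 0.02"
)
pick, place = plan(
    "Put the red block on the tray",
    {"red block": (0.10, 0.20, 0.02), "tray": (0.40, 0.20, 0.02)},
    fake_llm,
)
```

The design point this illustrates is the one the abstract makes: the LLM's intermediate output is free-form reasoning rather than policy code against a pre-defined API, and only the final coordinates are consumed by the controller.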
Authors: Yusuke Mikami, Andrew Melnik, Jun Miura, Ville Hautamäki