Language Models as Zero-Shot Trajectory Generators (2310.11604v2)

Published 17 Oct 2023 in cs.RO, cs.AI, cs.CL, cs.HC, and cs.LG

Abstract: LLMs have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to generate the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate whether an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation tasks, when given access only to object detection and segmentation vision models. We designed a single, task-agnostic prompt, without any in-context examples, motion primitives, or external trajectory optimisers. We then studied how well it performs across 30 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and investigated which design choices in this prompt matter most. Our conclusions challenge the assumed limits of LLMs in robotics: we reveal for the first time that LLMs do indeed possess an understanding of low-level robot control sufficient for a range of common tasks, and that they can additionally detect failures and then re-plan trajectories accordingly. Videos, prompts, and code are available at: https://www.robot-learning.uk/language-models-trajectory-generators.
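To make the described pipeline concrete, here is a minimal Python sketch of the idea, under stated assumptions: the prompt wording, the pose format [x, y, z, roll, pitch, yaw, gripper], and the hard-coded object positions are illustrative stand-ins, not the authors' actual implementation (their real prompt and code are at the project page above). Only the OpenAI chat-completions call reflects a real API; everything else is hypothetical.

```python
"""Minimal sketch of zero-shot trajectory generation with an LLM.

Assumptions (not the authors' exact code): object positions come from
separate detection/segmentation models; the LLM is asked, zero-shot,
to emit a dense list of end-effector poses as JSON.
"""
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A single, task-agnostic prompt template (illustrative wording only;
# the paper's real prompt is linked from the project page).
PROMPT_TEMPLATE = """You control a robot arm.
Task: {task}
Detected objects (name -> [x, y, z] in metres): {objects}
Reply with only a JSON list of end-effector poses, one per timestep,
each of the form [x, y, z, roll, pitch, yaw, gripper]."""


def generate_trajectory(task: str, object_positions: dict) -> list:
    """Query the LLM once, with no in-context examples, and parse poses."""
    prompt = PROMPT_TEMPLATE.format(
        task=task, objects=json.dumps(object_positions)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model replies with bare JSON; production code would
    # need more robust parsing and validation of each pose.
    return json.loads(response.choices[0].message.content)


# Example usage, with positions that would come from a vision model:
poses = generate_trajectory(
    "wipe the plate with the sponge",
    {"plate": [0.45, 0.10, 0.02], "sponge": [0.30, -0.15, 0.03]},
)
for pose in poses:
    print(pose)  # each pose would then be sent to the robot controller
```

In the paper, this loop is additionally closed by feeding failure feedback back to the model so it can re-plan the trajectory; that step is omitted from the sketch for brevity.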

Authors (3)
  1. Teyun Kwon (3 papers)
  2. Norman Di Palo (15 papers)
  3. Edward Johns (49 papers)
Citations (28)