RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation (2402.14623v1)

Published 22 Feb 2024 in cs.RO, cs.AI, cs.CL, and cs.CV

Abstract: Embodied AI has seen rapid progress in high-level task planning and code generation for open-world robot manipulation. However, previous studies put much effort into the general common-sense reasoning and task-planning capabilities of large-scale language or multi-modal models, and relatively little into ensuring that the generated code is deployable on real robots or into the other fundamental components of autonomous robot systems, including robot perception, motion planning, and control. To bridge this "ideal-to-real" gap, this paper presents RobotScript, a platform providing 1) a deployable robot manipulation pipeline powered by code generation; and 2) a code generation benchmark for robot manipulation tasks specified in free-form natural language. The RobotScript platform addresses this gap by emphasizing a unified interface to both simulation and real robots, built on abstractions from the Robot Operating System (ROS), and by ensuring syntax compliance and simulation validation with Gazebo. We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms, and multiple grippers. Additionally, our benchmark assesses reasoning about physical space and constraints, highlighting the differences between GPT-3.5, GPT-4, and Gemini in handling complex physical interactions. Finally, we present a thorough evaluation of the whole system, exploring how each module in the pipeline (code generation, perception, and motion planning) and even object geometric properties impact the overall performance of the system.
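
The abstract describes a pipeline in which a language model emits executable manipulation code that is validated in Gazebo and deployed through ROS. As a rough illustration only, the sketch below shows what such generated code could look like when written against a small set of perception and motion primitives; the helper names (detect_objects, plan_grasp, move_to_pose, open_gripper, close_gripper, pose_above) and the example task are assumptions for illustration, not the paper's actual API.

```python
# Minimal sketch (assumed primitives, not RobotScript's real API): what
# LLM-generated code for a free-form instruction such as "put the apple
# into the bowl" might look like in a pipeline of this kind.

def put_apple_in_bowl(robot, perception):
    # Perception: query 6-DoF poses of the relevant objects from the scene
    # (e.g., an RGB-D detection and pose-estimation module).
    objects = perception.detect_objects(["apple", "bowl"])
    apple, bowl = objects["apple"], objects["bowl"]

    # Grasp planning: propose a feasible grasp pose on the target object.
    grasp_pose = robot.plan_grasp(apple)

    # Motion planning and execution (e.g., backed by MoveIt under ROS).
    robot.open_gripper()
    robot.move_to_pose(grasp_pose)
    robot.close_gripper()

    # Place: move above the container with some clearance and release.
    place_pose = bowl.pose_above(height=0.10)  # hypothetical helper, 10 cm above the rim
    robot.move_to_pose(place_pose)
    robot.open_gripper()
```

In such a setup, the benchmark would score whether code like this compiles, passes simulation checks, and accomplishes the instructed task, which is why perception and motion-planning quality matter as much as the code generation itself.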

Authors (14)
  1. Junting Chen (41 papers)
  2. Yao Mu (58 papers)
  3. Qiaojun Yu (15 papers)
  4. Tianming Wei (3 papers)
  5. Silang Wu (2 papers)
  6. Zhecheng Yuan (18 papers)
  7. Zhixuan Liang (14 papers)
  8. Chao Yang (333 papers)
  9. Kaipeng Zhang (73 papers)
  10. Wenqi Shao (89 papers)
  11. Yu Qiao (563 papers)
  12. Huazhe Xu (93 papers)
  13. Mingyu Ding (82 papers)
  14. Ping Luo (340 papers)
Citations (8)
