RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis (2402.16117v1)

Published 25 Feb 2024 in cs.RO, cs.AI, and cs.CV

Abstract: Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal LLMs for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints, and applies code generation to introduce generalization ability across various robotics platforms. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and real robots on four different kinds of manipulation tasks and one navigation task.
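The abstract's central mechanism — decomposing a high-level instruction into a tree of object-centric manipulation units that carry physical preferences (affordances, safety constraints), then emitting executable code from that tree — can be sketched roughly as below. This is a minimal illustration under assumed interfaces: the ManipulationUnit schema and the robot-API names in the generated strings (detect_grasp_pose, move_to, execute) are hypothetical stand-ins, not the paper's actual data structures or APIs.

```python
# Illustrative sketch only: dataclass fields and the robot-API calls emitted
# below are hypothetical; RoboCodeX's real unit schema and target APIs differ.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ManipulationUnit:
    """One object-centric node in the tree-structured decomposition."""
    target_object: str
    action: str                                   # e.g. "grasp", "pull", "place"
    affordance: str                               # preferred contact region, e.g. "handle"
    safety_constraints: List[str] = field(default_factory=list)
    children: List["ManipulationUnit"] = field(default_factory=list)


def synthesize_code(unit: ManipulationUnit, indent: int = 0) -> str:
    """Walk the unit tree depth-first and emit code against a generic robot API."""
    pad = " " * indent
    lines = [
        f"{pad}# {unit.action} {unit.target_object} via its {unit.affordance}",
        f"{pad}pose = detect_grasp_pose('{unit.target_object}', part='{unit.affordance}')",
        f"{pad}move_to(pose, constraints={unit.safety_constraints})",
        f"{pad}execute('{unit.action}')",
    ]
    for child in unit.children:
        lines.append(synthesize_code(child, indent))
    return "\n".join(lines)


# Example: "put the mug in the drawer" decomposed into two ordered units.
plan = ManipulationUnit(
    target_object="drawer", action="pull", affordance="handle",
    safety_constraints=["limit_force", "avoid_collision"],
    children=[
        ManipulationUnit(
            target_object="mug", action="grasp", affordance="handle",
            safety_constraints=["gentle_grip"],
        ),
    ],
)
print(synthesize_code(plan))
```

The point of the tree structure is that each unit localizes the reasoning to one object and its physical preferences, so the same decomposition can be re-targeted to different robot platforms by swapping the low-level API the code generator emits.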
