ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation (2312.16217v1)

Published 24 Dec 2023 in cs.CV and cs.RO

Abstract: Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However, learning-based robot manipulation, trained on a limited set of categories in simulation, often struggles to generalize, especially when confronted with a wide range of categories. We therefore introduce an approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning only the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLM while equipping it with the ability to manipulate. The fundamental insight lies in the introduced fine-tuning paradigm, which encompasses object category understanding, affordance prior reasoning, and object-centric pose prediction to stimulate the MLLM's reasoning ability for manipulation. During inference, our approach uses an RGB image and a text prompt to predict the end-effector's pose in a chain-of-thought manner. After initial contact is established, an active impedance adaptation policy plans the upcoming waypoints in a closed-loop manner. Moreover, for the real world, we design a test-time adaptation (TTA) strategy that enables the model to better adapt to the current real-world scene configuration. Experiments in simulation and the real world show the promising performance of ManipLLM. More details and demonstrations can be found at https://sites.google.com/view/manipLLM.
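The abstract outlines a pipeline that a short sketch can make concrete: the MLLM is prompted in stages (category, then affordance, then pose) on an RGB image, and after first contact a closed-loop policy refines the motion direction. Below is a minimal Python sketch under those assumptions; every interface it uses (mllm.ask, robot.move_to, robot.probe, robot.step, parse_pose) is a hypothetical stand-in for illustration, not the authors' released code.

```python
"""Hedged sketch of the inference flow described in the abstract:
chain-of-thought pose prediction from an RGB image and a text prompt,
then closed-loop waypoint planning via active impedance adaptation.
All interfaces here are assumed, not ManipLLM's actual API."""
import re
from dataclasses import dataclass

import numpy as np


@dataclass
class Pose:
    contact_xy: tuple        # pixel coordinates of the predicted contact point
    direction: np.ndarray    # unit 3-vector for the end-effector approach


def parse_pose(text: str) -> Pose:
    # Assumes the MLLM answers like "contact: (120, 85); direction: (0, 0, -1)".
    nums = [float(x) for x in re.findall(r"-?\d+\.?\d*", text)]
    d = np.asarray(nums[2:5])
    return Pose(contact_xy=(nums[0], nums[1]), direction=d / np.linalg.norm(d))


def rotate_z(v: np.ndarray, angle: float) -> np.ndarray:
    # Rotate a direction about the z-axis by `angle` radians (illustrative only).
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]) @ v


def chain_of_thought_pose(mllm, rgb, instruction: str) -> Pose:
    # Stage the queries the way the fine-tuning paradigm stages supervision:
    # object category understanding -> affordance prior -> object-centric pose.
    category = mllm.ask(rgb, "Which object category is shown?")
    region = mllm.ask(rgb, f"Which region of the {category} affords manipulation?")
    answer = mllm.ask(rgb, f"{instruction}. Given the {category} and the "
                           f"actionable region {region}, output the contact "
                           "pixel and the gripper direction.")
    return parse_pose(answer)


def closed_loop_manipulation(robot, mllm, rgb, instruction: str, n_steps: int = 20):
    # One-shot pose prediction from the MLLM, establish contact, then plan
    # the remaining waypoints in a closed loop.
    pose = chain_of_thought_pose(mllm, rgb, instruction)
    robot.move_to(pose.contact_xy, pose.direction)
    for _ in range(n_steps):
        # Active impedance adaptation (as sketched here): probe small
        # perturbations of the current direction and follow the most
        # compliant one, i.e. the one meeting the least resisting force.
        candidates = [rotate_z(pose.direction, a)
                      for a in np.linspace(-0.2, 0.2, 5)]
        forces = [robot.probe(d) for d in candidates]
        pose.direction = candidates[int(np.argmin(forces))]
        robot.step(pose.direction, length=0.01)
```

The test-time adaptation step mentioned in the abstract is omitted from this sketch; it would presumably update the lightweight adapter parameters on real-world observations before running inference.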

Authors (9)
  1. Xiaoqi Li (77 papers)
  2. Mingxu Zhang (3 papers)
  3. Yiran Geng (14 papers)
  4. Haoran Geng (30 papers)
  5. Yuxing Long (9 papers)
  6. Yan Shen (30 papers)
  7. Renrui Zhang (100 papers)
  8. Jiaming Liu (156 papers)
  9. Hao Dong (175 papers)
Citations (51)