Reasoning Grasping via Multimodal Large Language Model (2402.06798v3)
Abstract: Despite significant progress in robotic systems operating in human-centric environments, existing models still rely heavily on explicit human commands to identify and manipulate specific objects. This limits their effectiveness in settings where understanding and acting on implicit human intentions are crucial. In this study, we introduce a novel task, reasoning grasping, in which robots must generate grasp poses based on indirect verbal instructions or intentions. To accomplish this, we propose an end-to-end reasoning grasping model that integrates a multimodal large language model (LLM) with a vision-based robotic grasping framework. In addition, we present the first reasoning grasping benchmark dataset, generated from GraspNet-1Billion and incorporating implicit instructions for both object-level and part-level grasping. Our results show that directly integrating CLIP or LLaVA with a grasp detection model performs poorly on this challenging task, whereas our proposed model demonstrates significantly enhanced performance both on the reasoning grasping benchmark and in real-world experiments.
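To make the described pipeline concrete, below is a minimal PyTorch sketch of one way a multimodal fusion backbone could be coupled to a grasp detection head that regresses a rectangle grasp pose (x, y, width, height, rotation). All class names, layer choices, and dimensions here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: the paper integrates a pretrained multimodal LLM
# with a vision-based grasping framework; this stand-in uses a small
# transformer in place of the LLM and a patch-embedding conv as the vision
# encoder. All names and sizes below are assumptions, not the paper's design.

class ReasoningGraspModel(nn.Module):
    def __init__(self, d_model: int = 512):
        super().__init__()
        # Encode the RGB scene into a sequence of visual tokens
        # (a 16x16 patch embedding, CLIP-ViT style).
        self.vision_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        # Stand-in for the multimodal LLM: fuses instruction tokens with
        # visual tokens so the model can resolve the implicit reference.
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Grasp head regresses a planar rectangle grasp: (x, y, w, h, theta).
        self.grasp_head = nn.Linear(d_model, 5)

    def forward(self, image: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); text_emb: (B, T, d_model) embedded instruction.
        vis = self.vision_encoder(image).flatten(2).transpose(1, 2)  # (B, N, d)
        tokens = torch.cat([text_emb, vis], dim=1)                   # (B, T+N, d)
        fused = self.fusion(tokens)
        # Pool the fused sequence and predict one grasp pose per scene.
        return self.grasp_head(fused.mean(dim=1))                    # (B, 5)

# Usage: a 224x224 scene with an 8-token instruction embedding.
model = ReasoningGraspModel()
grasp = model(torch.randn(1, 3, 224, 224), torch.randn(1, 8, 512))
print(grasp.shape)  # torch.Size([1, 5]) -> (x, y, w, h, theta)
```

In the full model, the fusion module corresponds to a pretrained multimodal LLM rather than this small stand-in transformer, and the grasp head conditions on features tied to the referred object or part rather than a global pooled feature.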
- Interactive Text2Pickup networks for natural language-based human–robot collaboration. IEEE Robotics and Automation Letters, 3(4):3308–3315, 2018.
- Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from RGB. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13452–13458. IEEE, 2021.
- RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Learning 6-dof object poses to grasp category-level objects by language instructions. In 2022 International Conference on Robotics and Automation (ICRA), pages 8476–8482. IEEE, 2022.
- A joint network for grasp detection conditioned on natural language commands. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4576–4582. IEEE, 2021.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- Jacquard: A large scale dataset for robotic grasp detection. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3511–3516. IEEE, 2018.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Fast graspability evaluation on single depth maps for bin picking with general grippers. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 1997–2004. IEEE, 2014.
- PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- ACRONYM: A large-scale grasp dataset based on simulation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6222–6227. IEEE, 2021.
- GraspNet-1Billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11444–11453, 2020.
- MetaGraspNet: A large-scale benchmark dataset for scene-aware ambidextrous bin picking via physics-based metaverse synthesis. In 2022 IEEE 18th International Conference on Automation Science and Engineering (CASE), pages 220–227. IEEE, 2022.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
- VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
- Efficient grasping from RGBD images: Learning using a new rectangle representation. In 2011 IEEE International Conference on Robotics and Automation, pages 3304–3311. IEEE, 2011.
- AlphaBlock: Embodied finetuning for vision-language reasoning in robot manipulation. arXiv preprint arXiv:2305.18898, 2023.
- Antipodal robotic grasping using generative residual convolutional neural network. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 9626–9633. IEEE, 2020.
- LISA: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
- Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- LLaVA-1.6: Improved reasoning, OCR, and world knowledge, January 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-1-6/.
- VL-Grasp: A 6-DoF interactive grasp policy for language-oriented objects in cluttered indoor scenes. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 976–983. IEEE, 2023.
- Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.
- Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In 2010 IEEE International Conference on Robotics and Automation, pages 2308–2315. IEEE, 2010.
- LAN-grasp: Using large language models for semantic object grasping. arXiv preprint arXiv:2310.05239, 2023.
- DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv preprint arXiv:2301.11305, 2023.
- Learning robust, real-time, reactive robotic grasping. The International Journal of Robotics Research, 39(2-3):183–201, 2020.
- OpenAI. ChatGPT. URL https://chat.openai.com.
- Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3406–3413. IEEE, 2016.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Grasp quality measures: review and performance. Autonomous Robots, 38:65–88, 2015.
- ProgPrompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530. IEEE, 2023.
- Human-in-the-loop robotic grasping using BERT scene representation. arXiv preprint arXiv:2209.14026, 2022.
- Language-guided robotic grasping with fine-grained instructions. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1319–1326. IEEE, 2023.
- GraspGPT: Leveraging semantic knowledge from a large language model for task-oriented grasping. IEEE Robotics and Automation Letters, 2023a.
- Task-oriented grasp prediction with visual-language inputs. arXiv preprint arXiv:2302.14355, 2023b.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Language-guided robot grasping: CLIP-based referring grasp synthesis in clutter. arXiv preprint arXiv:2311.05779, 2023.
- ChatGPT for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res, 2:20, 2023.
- Grasp-Anything: Large-scale grasp dataset from foundation models. arXiv preprint arXiv:2309.09818, 2023.
- A joint modeling of vision-language-action for target-oriented grasping in clutter. arXiv preprint arXiv:2302.12610, 2023.
- Attribute-based robotic grasping with one-grasp adaptation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6357–6363. IEEE, 2021.
- Interactive robotic grasping with attribute-guided disambiguation. In 2022 International Conference on Robotics and Automation (ICRA), pages 8914–8920. IEEE, 2022.
- The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, 9(1):1, 2023.
- INVIGORATE: Interactive visual grounding and grasping in clutter. arXiv preprint arXiv:2108.11092, 2021.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- Shiyu Jin
- Jinxuan Xu
- Yutian Lei
- Liangjun Zhang