PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models (2402.16836v1)
Abstract: Robotic grasping is a fundamental aspect of robot functionality, defining how robots interact with objects. Despite substantial progress, its generalizability to counter-intuitive or long-tailed scenarios, such as objects with uncommon materials or shapes, remains a challenge. In contrast, humans can easily apply their intuitive physics to grasp skillfully and change grasps efficiently, even for objects they have never seen before. This work delves into infusing such physical commonsense reasoning into robotic manipulation. We introduce PhyGrasp, a large multimodal model that takes inputs from two modalities, natural language and 3D point clouds, integrated seamlessly through a bridge module. The language modality provides robust reasoning about how diverse physical properties affect grasping, while the 3D modality comprehends object shapes and parts. With these two capabilities, PhyGrasp can accurately assess the physical properties of object parts and determine optimal grasping poses. Additionally, its language comprehension enables it to interpret human instructions and generate grasping poses that align with human preferences. To train PhyGrasp, we construct a dataset, PhyPartNet, of 195K object instances with varying physical properties and human preferences, together with their corresponding language descriptions. Extensive experiments in simulation and on real robots demonstrate that PhyGrasp achieves state-of-the-art performance, particularly in long-tailed cases, e.g., about a 10% improvement in success rate over GraspNet. Project page: https://sites.google.com/view/phygrasp
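The abstract describes a two-branch design: a language branch that reasons about physical properties, a point-cloud branch that captures shape and parts, and a bridge module that fuses the two to score grasps. The PyTorch sketch below illustrates one plausible reading of that architecture; the module names, feature sizes, the linear text projection standing in for an LLM encoder, and the per-point affordance head are all assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a language + point-cloud model with a bridge module.
# Everything here is a hypothetical reading of the abstract, not PhyGrasp's code.
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    """PointNet-style encoder: shared per-point MLP followed by max pooling."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, dim), nn.ReLU(),
        )

    def forward(self, pts):                     # pts: (B, N, 3)
        feats = self.mlp(pts)                   # (B, N, D) per-point features
        return feats, feats.max(dim=1).values   # plus (B, D) global shape feature

class PhyGraspSketch(nn.Module):
    def __init__(self, text_dim: int = 768, dim: int = 256):
        super().__init__()
        self.points = PointEncoder(dim)
        self.text_proj = nn.Linear(text_dim, dim)      # stand-in for an LLM text encoder
        self.bridge = nn.Sequential(                   # fuses language and geometry
            nn.Linear(2 * dim, dim), nn.ReLU(),
        )
        self.affordance_head = nn.Linear(2 * dim, 1)   # per-point grasp score

    def forward(self, pts, text_emb):
        point_feats, global_feat = self.points(pts)            # (B, N, D), (B, D)
        lang_feat = self.text_proj(text_emb)                   # (B, D)
        fused = self.bridge(torch.cat([global_feat, lang_feat], dim=-1))  # (B, D)
        fused = fused.unsqueeze(1).expand_as(point_feats)      # broadcast to every point
        scores = self.affordance_head(torch.cat([point_feats, fused], dim=-1))
        return scores.squeeze(-1)                              # (B, N) graspability map

# Usage: score each point of a cloud, conditioned on an embedding of a
# physical description such as "the handle is fragile glass".
model = PhyGraspSketch()
scores = model(torch.randn(2, 1024, 3), torch.randn(2, 768))
print(scores.shape)  # torch.Size([2, 1024])
```

In the paper's pipeline, the language embedding would come from a large language model and the affordance map would inform grasp-pose selection; this sketch only shows how a bridge module can condition per-point scores on language.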
References:
- Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- Physion: Evaluating physical prediction from vision in humans and machines. arXiv preprint arXiv:2106.08261, 2021.
- Trends and challenges in robot manipulation. Science, 364(6446):eaat8414, 2019.
- Volumetric grasping network: Real-time 6 DOF grasp detection in clutter. In Conference on Robot Learning, pages 1602–1611. PMLR, 2021.
- RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
- What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3):740–757, 2018.
- Open-vocabulary queryable scene representations for real world planning. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11509–11522. IEEE, 2023.
- Shikra: Unleashing multimodal LLM’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- PaLI: A jointly-scaled multilingual language-image model. In ICLR, 2022.
- ComPhy: Compositional physical reasoning of objects and events from videos. arXiv preprint arXiv:2205.01089, 2022.
- PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2019.
- Toward next-generation learned robot manipulation. Science Robotics, 6(54):eabd9461, 2021.
- Dynamic visual reasoning by learning differentiable physics models from video and language. Advances in Neural Information Processing Systems, 34:887–899, 2021.
- Task and motion planning with large language models for object rearrangement. arXiv preprint arXiv:2303.06247, 2023.
- PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
- GraspNet-1Billion: A large-scale benchmark for general object grasping. In CVPR, pages 11444–11453, 2020.
- Robust grasping across diverse sensor qualities: The GraspNet-1Billion dataset. The International Journal of Robotics Research, 2023.
- AnyGrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Transactions on Robotics, 2023.
- Demo2Vec: Reasoning object affordances from online videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2139–2147, 2018.
- Physically grounded vision-language models for robotic manipulation. arXiv preprint arXiv:2309.02561, 2023.
- Tree-Planner: Efficient close-loop task planning with large language models. arXiv preprint arXiv:2310.08582, 2023.
- Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023.
- Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR, 2022.
- Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022.
- Grounded decoding: Guiding text generation with grounded models for robot control. arXiv preprint arXiv:2303.00855, 2023.
- Reasoning about physical interactions with object-oriented prediction and planning. In International Conference on Learning Representations, 2019.
- Synergies between affordance and geometry: 6-DoF grasp detection via implicit representations. arXiv preprint arXiv:2104.01542, 2021.
- Robo-ABC: Affordance generalization beyond categories via semantic correspondence for robot manipulation. arXiv preprint arXiv:2401.07487, 2024.
- Learning to predict where humans look. In 2009 IEEE 12th International Conference on Computer Vision, pages 2106–2113. IEEE, 2009.
- Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7482–7491, 2018.
- LISA: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
- VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
- Can language models understand physical concepts? arXiv preprint arXiv:2305.14057, 2023.
- Visual grounding of learned physical models. In ICML, 2020.
- Code as Policies: Language model programs for embodied control. In IEEE International Conference on Robotics and Automation (ICRA), pages 9493–9500. IEEE, 2023.
- LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- Visual instruction tuning. NeurIPS, 2023.
- OK-Robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024.
- Mind’s eye: Grounded language model reasoning through simulation. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=4rXMRuoJlai.
- An empirical study of scaling instruct-tuned large multimodal models. arXiv preprint arXiv:2309.09958, 2023.
- Multimodal procedural planning via dual text-image prompting. arXiv preprint arXiv:2305.01795, 2023.
- Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv preprint arXiv:1703.09312, 2017.
- Learning ambidextrous robot grasping policies. Science Robotics, 4(26):eaau4984, 2019.
- Matthew T. Mason. Toward robotic manipulation. Annual Review of Control, Robotics, and Autonomous Systems, 1:1–28, 2018.
- VoxNet: A 3D convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
- PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019.
- EmbodiedGPT: Vision-language pre-training via embodied chain of thought. arXiv preprint arXiv:2305.15021, 2023.
- OpenAI. ChatGPT: Generative pre-trained transformer for conversational agents. https://chat.openai.com/, 2023.
- OpenAI. GPT-4 technical report, 2023.
- Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 2023.
- OpenScene: 3D scene understanding with open vocabularies. In CVPR, 2023.
- Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660, 2017.
- Learning long-term visual dynamics with region proposal interaction networks. arXiv preprint arXiv:2008.02265, 2020.
- PointNeXt: Revisiting PointNet++ with improved training and scaling strategies. Advances in Neural Information Processing Systems, 35:23192–23204, 2022.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Planning with large language models via corrective re-prompting. In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022.
- Grasp quality measures: Review and performance. Autonomous Robots, 38:65–88, 2015.
- ProgPrompt: Generating situated robot task plans using large language models. In IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530. IEEE, 2023.
- LLM-Planner: Few-shot grounded planning for embodied agents with large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009, 2023.
- Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Physion++: Evaluating physical scene understanding that requires online inference of different physical properties. Advances in Neural Information Processing Systems, 36, 2024.
- ChatGPT for robotics: Design principles and model abilities. Microsoft Auton. Syst. Robot. Res., 2:20, 2023.
- Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
- The All-Seeing Project: Towards panoptic visual recognition and understanding of the open world. arXiv preprint arXiv:2308.01907, 2023.
- VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. NeurIPS, 2023.
- Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in Neural Information Processing Systems, 28, 2015.
- Physics 101: Learning physical object properties from unlabeled videos. In BMVC, page 7, 2016.
- NExT-GPT: Any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519, 2023.
- 3D ShapeNets: A deep representation for volumetric shapes. In CVPR, June 2015.
- DensePhysNet: Learning dense physical object representations via multi-step dynamic interactions. In Robotics: Science and Systems (RSS), 2019.
- Octopus: Embodied vision-language programmer from environmental feedback. arXiv preprint arXiv:2310.08588, 2023.
- The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, 2023.
- mPLUG-DocOwl: Modularized multimodal large language model for document understanding, 2023.
- GAMMA: Generalizable articulation modeling and manipulation for articulated objects. arXiv preprint arXiv:2309.16264, 2023.
- Plan4MC: Skill reinforcement learning and planning for open-world Minecraft tasks. arXiv preprint arXiv:2303.16563, 2023.
- Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
- InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023.
- LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
- Generalizable long-horizon manipulations with large language models. arXiv preprint arXiv:2310.02264, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- 6-DoF contrastive grasp proposal network. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 6371–6377. IEEE, 2021.
- Learn to grasp with less supervision: A data-efficient maximum likelihood grasp sampling loss. In 2022 International Conference on Robotics and Automation (ICRA), pages 721–727. IEEE, 2022.
- Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144, 2023.
- Interfacing foundation models’ embeddings. arXiv preprint arXiv:2312.07532, 2023.
Authors: Dingkun Guo, Yuqi Xiang, Shuqi Zhao, Xinghao Zhu, Masayoshi Tomizuka, Mingyu Ding, Wei Zhan