
ShapeGrasp: Zero-Shot Task-Oriented Grasping with Large Language Models through Geometric Decomposition (2403.18062v1)

Published 26 Mar 2024 in cs.RO and cs.AI

Abstract: Task-oriented grasping of unfamiliar objects is a necessary skill for robots in dynamic in-home environments. Inspired by the human capability to grasp such objects through intuition about their shape and structure, we present a novel zero-shot task-oriented grasping method leveraging a geometric decomposition of the target object into simple, convex shapes that we represent in a graph structure, including geometric attributes and spatial relationships. Our approach employs minimal essential information - the object's name and the intended task - to facilitate zero-shot task-oriented grasping. We utilize the commonsense reasoning capabilities of LLMs to dynamically assign semantic meaning to each decomposed part and subsequently reason over the utility of each part for the intended task. Through extensive experiments on a real-world robotics platform, we demonstrate that our grasping approach's decomposition and reasoning pipeline is capable of selecting the correct part in 92% of the cases and successfully grasping the object in 82% of the tasks we evaluate. Additional videos, experiments, code, and data are available on our project website: https://shapegrasp.github.io/.
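The abstract describes the pipeline at a high level: decompose the object into simple convex parts, encode the parts in a graph with geometric attributes and spatial relations, and let an LLM label each part and reason about which one suits the task. The sketch below is a minimal illustration of how such a part graph might be structured and serialized into an LLM prompt; the class names, fields, prompt wording, and the mug example are assumptions made for illustration, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code): represent a convex-part
# decomposition as a small graph and serialize it into a prompt that asks
# an LLM to name each part and pick the one to grasp for a given task.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Part:
    part_id: int
    shape: str                                 # e.g. "cylinder", "box"
    dimensions: Tuple[float, float, float]     # approximate extents in meters
    centroid: Tuple[float, float, float]       # position in the object frame

@dataclass
class ShapeGraph:
    object_name: str
    parts: List[Part] = field(default_factory=list)
    # (part_id_a, part_id_b, spatial relation) edges between parts
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def to_prompt(self, task: str) -> str:
        """Serialize the part graph into a text prompt for an LLM."""
        lines = [f"Object: {self.object_name}. Task: {task}.",
                 "Parts (id, shape, dimensions in m, centroid):"]
        for p in self.parts:
            lines.append(f"  {p.part_id}: {p.shape}, {p.dimensions}, {p.centroid}")
        lines.append("Spatial relations:")
        for a, b, rel in self.edges:
            lines.append(f"  part {a} is {rel} part {b}")
        lines.append("Name each part, then return the id of the part that "
                     "should be grasped to accomplish the task.")
        return "\n".join(lines)

# Usage: a mug decomposed into a cylindrical body and a handle.
graph = ShapeGraph(
    object_name="mug",
    parts=[Part(0, "cylinder", (0.08, 0.08, 0.10), (0.00, 0.00, 0.05)),
           Part(1, "torus segment", (0.03, 0.01, 0.06), (0.06, 0.00, 0.05))],
    edges=[(1, 0, "attached to the side of")],
)
print(graph.to_prompt("hand over the mug"))  # prompt text that would be sent to an LLM
```

For a handover task, the expected behavior of the reasoning step would be to label part 0 as the body and part 1 as the handle, then select the body so the handle stays free for the receiver; the actual prompt design and grasp execution are described in the paper and on the project website.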

Authors (7)
  1. Samuel Li (4 papers)
  2. Sarthak Bhagat (15 papers)
  3. Joseph Campbell (36 papers)
  4. Yaqi Xie (23 papers)
  5. Woojun Kim (20 papers)
  6. Katia Sycara (93 papers)
  7. Simon Stepputtis (38 papers)
Citations (6)

