A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter (2302.12610v3)
Abstract: We focus on the task of language-conditioned grasping in clutter, in which a robot must grasp a target object specified by a language instruction. Previous works separately conduct visual grounding to localize the target object and then generate a grasp for that object. However, these works require object labels or visual attributes for grounding, which calls for handcrafted rules in the planner and restricts the range of language instructions. In this paper, we propose to jointly model vision, language, and action with an object-centric representation. Our method handles more flexible language instructions and is not limited by visual grounding errors. Moreover, by utilizing the powerful priors of a pre-trained multi-modal model and a pre-trained grasp model, sample efficiency is effectively improved and the sim-to-real gap is relieved without additional data for transfer. Experiments in simulation and the real world show that our method achieves a higher task success rate with fewer motions under more flexible language instructions, and generalizes better to scenarios with unseen objects and language instructions. Our code is available at https://github.com/xukechun/Vision-Language-Grasping
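The abstract describes a policy that fuses a language instruction, object-centric visual features from a pre-trained multi-modal model, and candidate grasps from a pre-trained grasp detector, then selects a grasp action. Below is a minimal, illustrative PyTorch sketch of one way such a fusion-and-selection step could look. It is not the authors' implementation: all module names, feature dimensions, and tensor shapes are assumptions made for the example, and in the paper's setting the text/visual features would come from a CLIP-style encoder and the grasp candidates from a GraspNet-style 6-DoF grasp detector.

```python
# Minimal sketch (not the paper's implementation) of language-conditioned
# grasp selection over an object-centric representation.
import torch
import torch.nn as nn


class LanguageConditionedGraspPolicy(nn.Module):
    """Scores grasp candidates given object features and an instruction embedding."""

    def __init__(self, feat_dim: int = 512, grasp_dim: int = 7, hidden: int = 256):
        super().__init__()
        # Cross-attention: the instruction embedding attends over object features.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.grasp_proj = nn.Linear(grasp_dim, feat_dim)
        # Scores each (fused context, grasp candidate) pair.
        self.score_head = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, text_feat, obj_feats, grasp_feats):
        # text_feat:   (B, feat_dim)          instruction embedding
        # obj_feats:   (B, N_obj, feat_dim)   per-object visual features
        # grasp_feats: (B, N_grasp, grasp_dim) grasp candidates (e.g. pose + width)
        query = text_feat.unsqueeze(1)                       # (B, 1, feat_dim)
        context, _ = self.attn(query, obj_feats, obj_feats)  # (B, 1, feat_dim)
        g = self.grasp_proj(grasp_feats)                     # (B, N_grasp, feat_dim)
        ctx = context.expand(-1, g.shape[1], -1)             # broadcast the context
        logits = self.score_head(torch.cat([ctx, g], dim=-1)).squeeze(-1)
        return logits                                        # (B, N_grasp)


if __name__ == "__main__":
    policy = LanguageConditionedGraspPolicy()
    text = torch.randn(1, 512)        # stand-in for a CLIP-style text embedding
    objects = torch.randn(1, 6, 512)  # stand-in for 6 object-centric visual features
    grasps = torch.randn(1, 20, 7)    # stand-in for 20 grasp candidates
    logits = policy(text, objects, grasps)
    action = torch.distributions.Categorical(logits=logits).sample()
    print("selected grasp index:", action.item())
```

In a discrete actor-critic setup such as discrete soft actor-critic, per-candidate logits like these could serve as the actor's categorical distribution over grasp candidates, with a parallel critic head estimating a value for each candidate.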
- Kechun Xu
- Shuqi Zhao
- Zhongxiang Zhou
- Zizhang Li
- Huaijin Pi
- Yue Wang
- Rong Xiong