Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation (2401.07487v1)
Abstract: Enabling robotic manipulation that generalizes to out-of-distribution scenes is a crucial step toward open-world embodied intelligence. For human beings, this ability is rooted in an understanding of semantic correspondence among objects, which naturally transfers interaction experience from familiar objects to novel ones. Although robots lack such a reservoir of interaction experience, the vast collection of human videos on the Internet can serve as a valuable resource, from which we extract an affordance memory that records contact points. Inspired by the natural way humans think, we propose Robo-ABC: when confronted with an unfamiliar object that requires generalization, the robot acquires affordances by retrieving visually or semantically similar objects from the affordance memory. The next step is to map the contact points of the retrieved objects onto the new object. While establishing this correspondence may seem formidable at first glance, recent research finds that it naturally emerges from pre-trained diffusion models, enabling affordance mapping even across disparate object categories. Through the Robo-ABC framework, robots can generalize to manipulate out-of-category objects in a zero-shot manner, without any manual annotation, additional training, part segmentation, pre-coded knowledge, or viewpoint restrictions. Quantitatively, Robo-ABC improves the accuracy of visual affordance retrieval by a large margin of 31.6% over state-of-the-art (SOTA) end-to-end affordance models. In real-world cross-category object-grasping experiments, Robo-ABC achieves a success rate of 85.7%, demonstrating its capability in real-world tasks.
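The two-stage pipeline the abstract describes, retrieving a similar object from the affordance memory and then transferring its contact point to the novel object via dense semantic correspondence, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the feature extractors are stubbed with deterministic placeholder tensors, whereas in practice the retrieval embedding would come from a pretrained image encoder (e.g., CLIP) and the dense features from a pretrained diffusion or ViT backbone. All function names and the memory layout below are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def global_embedding(image: torch.Tensor) -> torch.Tensor:
    """Stand-in for a pretrained image encoder used for retrieval (e.g. CLIP)."""
    # Deterministic placeholder features so the demo below is self-consistent.
    torch.manual_seed(int(image.sum().item() * 1000) % (2**31))
    return F.normalize(torch.randn(512), dim=0)


def dense_features(image: torch.Tensor, hw: int = 32) -> torch.Tensor:
    """Stand-in for dense features from a pretrained diffusion/ViT backbone."""
    torch.manual_seed(int(image.sum().item() * 1000) % (2**31))
    return F.normalize(torch.randn(hw, hw, 256), dim=-1)


def retrieve(memory: list, query_img: torch.Tensor) -> dict:
    """Return the memory entry whose stored embedding is closest to the query."""
    q = global_embedding(query_img)
    sims = torch.stack([entry["embedding"] @ q for entry in memory])
    return memory[int(sims.argmax())]


def map_contact_point(src_img, src_point, tgt_img, hw: int = 32):
    """Transfer a (row, col) contact point from the retrieved source image to the
    target image by nearest-neighbour matching on a coarse dense-feature grid."""
    src_feat = dense_features(src_img, hw)                 # (hw, hw, C)
    tgt_feat = dense_features(tgt_img, hw)                 # (hw, hw, C)
    query = src_feat[src_point[0], src_point[1]]           # (C,)
    sims = tgt_feat.reshape(-1, tgt_feat.shape[-1]) @ query
    idx = int(sims.argmax())
    return divmod(idx, hw)                                 # best-matching grid cell


if __name__ == "__main__":
    # Toy "affordance memory": images of known objects plus their contact points.
    known = [torch.rand(3, 224, 224) for _ in range(2)]
    memory = [
        {"image": im, "embedding": global_embedding(im), "contact": (10, 12)}
        for im in known
    ]
    novel = torch.rand(3, 224, 224)     # unseen, possibly out-of-category object
    entry = retrieve(memory, novel)     # step 1: retrieve a similar known object
    point = map_contact_point(entry["image"], entry["contact"], novel)  # step 2: map
    print("mapped contact point (grid coords):", point)
```

The design choice mirrored here is that retrieval and correspondence are decoupled: a cheap global embedding narrows the search over the affordance memory, and dense nearest-neighbour feature matching performs the fine-grained contact-point transfer, even across object categories.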
Authors: Yuanchen Ju, Kaizhe Hu, Guowei Zhang, Gu Zhang, Mingrun Jiang, Huazhe Xu