Keypoint Abstraction using Large Models for Object-Relative Imitation Learning (2410.23254v1)
Abstract: Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have proven effective as a succinct way of capturing essential object features and establishing a reference frame for action prediction, enabling data-efficient learning of robot skills. However, their reliance on manual design and additional human labels limits their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Website: https://kalm-il.github.io/
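The abstract describes predicting actions in keypoint-centric frames so that a policy learned from a few demonstrations transfers across object poses and instances. The sketch below is a minimal, hypothetical illustration of that frame-relative idea, not the paper's implementation: the function names (`keypoint_frame`, `to_keypoint_frame`, `from_keypoint_frame`) and the orientation cue built from neighboring points are assumptions made here purely for illustration.

```python
# Illustrative sketch (NOT the KALM implementation): express a gripper pose
# relative to a frame anchored at a task-relevant keypoint, so the same
# relative action can be mapped back to the world frame when the object moves.
import numpy as np


def keypoint_frame(keypoint_xyz: np.ndarray, neighbor_xyz: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous frame at a keypoint.

    The x-axis points toward the centroid of neighboring object points
    (an assumed, illustrative orientation cue); the remaining axes are
    completed by orthogonalizing the world z-axis against it. Degenerate
    when the cue is nearly parallel to world z.
    """
    x = neighbor_xyz.mean(axis=0) - keypoint_xyz
    x /= np.linalg.norm(x) + 1e-9
    z = np.array([0.0, 0.0, 1.0])
    z = z - x * (z @ x)
    z /= np.linalg.norm(z) + 1e-9
    y = np.cross(z, x)  # right-handed: x cross y = z
    T = np.eye(4)
    T[:3, 0], T[:3, 1], T[:3, 2], T[:3, 3] = x, y, z, keypoint_xyz
    return T


def to_keypoint_frame(T_world_action: np.ndarray, T_world_kp: np.ndarray) -> np.ndarray:
    """Express a world-frame gripper pose in the keypoint-centric frame."""
    return np.linalg.inv(T_world_kp) @ T_world_action


def from_keypoint_frame(A_kp: np.ndarray, T_world_kp: np.ndarray) -> np.ndarray:
    """Map a keypoint-relative action back to the world frame at test time."""
    return T_world_kp @ A_kp


if __name__ == "__main__":
    # A keypoint detected on an object, plus nearby object points.
    kp = np.array([0.4, 0.1, 0.2])
    neighbors = kp + 0.02 * np.random.randn(50, 3)
    T_kp = keypoint_frame(kp, neighbors)

    # A world-frame gripper target 5 cm above the keypoint.
    grasp_world = np.eye(4)
    grasp_world[:3, 3] = kp + np.array([0.0, 0.0, 0.05])

    grasp_rel = to_keypoint_frame(grasp_world, T_kp)
    print(np.allclose(from_keypoint_frame(grasp_rel, T_kp), grasp_world))  # True
```

In this toy setting, recomputing the keypoint frame on a novel object pose and mapping the stored relative pose back to the world frame is what makes the action object-relative; in KALM itself, per the abstract, a learned keypoint-conditioned policy predicts actions in such keypoint-centric frames rather than replaying fixed relative poses.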