DITTO: Demonstration Imitation by Trajectory Transformation (2403.15203v2)

Published 22 Mar 2024 in cs.RO and cs.CV

Abstract: Teaching robots new skills quickly and conveniently is crucial for the broader adoption of robotic systems. In this work, we address the problem of one-shot imitation from a single human demonstration, given by an RGB-D video recording. We propose a two-stage process. In the first stage, we extract the demonstration trajectory offline. This entails segmenting manipulated objects and determining their relative motion with respect to secondary objects such as containers. In the online trajectory generation stage, we first re-detect all objects, then warp the demonstration trajectory to the current scene and execute it on the robot. To complete these steps, our method leverages several ancillary models, including those for segmentation, relative object pose estimation, and grasp prediction. We systematically evaluate different combinations of correspondence and re-detection methods to validate our design decisions across a diverse range of tasks. Specifically, we collect and quantitatively test on demonstrations of ten different tasks, including pick-and-place as well as articulated object manipulation. Finally, we perform extensive evaluations on a real robot system to demonstrate the effectiveness and utility of our approach in real-world scenarios. We make the code publicly available at http://ditto.cs.uni-freiburg.de.
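
The abstract describes a two-stage pipeline: an offline stage that extracts the demonstrated object motion relative to a secondary object, and an online stage that re-detects the objects and warps the trajectory into the current scene. The Python sketch below illustrates only that core geometric idea under stated assumptions; the function names are hypothetical placeholders, and the segmentation, pose-estimation, and grasp-prediction components are stubbed out. It is not the released DITTO implementation.

# Minimal sketch of the two-stage idea from the abstract. All names here are
# illustrative placeholders, not the DITTO codebase; perception modules
# (segmentation, relative pose estimation, grasp prediction) are assumed to
# provide the 4x4 homogeneous poses used below.

import numpy as np


def extract_demo_trajectory(demo_obj_poses, secondary_obj_pose):
    """Offline stage (stub): express the manipulated object's motion relative
    to a secondary object (e.g. a container) so it transfers across scenes."""
    T_sec_inv = np.linalg.inv(secondary_obj_pose)
    return [T_sec_inv @ T for T in demo_obj_poses]


def warp_trajectory(relative_traj, live_secondary_pose):
    """Online stage (stub): re-anchor the object-relative trajectory to the
    re-detected secondary object in the current scene."""
    return [live_secondary_pose @ T for T in relative_traj]


if __name__ == "__main__":
    # Toy example: a demo trajectory of four poses translated along x.
    demo_poses = []
    for x in np.linspace(0.0, 0.3, 4):
        T = np.eye(4)
        T[0, 3] = x
        demo_poses.append(T)

    T_secondary_demo = np.eye(4)                # secondary object during the demo
    T_secondary_live = np.eye(4)
    T_secondary_live[:3, 3] = [0.5, 0.2, 0.0]   # re-detected at a new location

    rel = extract_demo_trajectory(demo_poses, T_secondary_demo)
    warped = warp_trajectory(rel, T_secondary_live)
    print(warped[0][:3, 3], warped[-1][:3, 3])  # trajectory shifted into the new scene

In practice the re-detected poses would come from correspondence or pose-estimation models rather than being given directly, and the warped end-effector poses would be combined with a predicted grasp before execution.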
