Towards Generalizable Zero-Shot Manipulation via Translating Human Interaction Plans (2312.00775v1)

Published 1 Dec 2023 in cs.RO, cs.CV, and cs.LG

Abstract: We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills and show how passive human videos can serve as a rich source of data for learning such generalist robots. Unlike typical robot learning approaches which directly learn how a robot should act from interaction data, we adopt a factorized approach that can leverage large-scale human videos to learn how a human would accomplish a desired task (a human plan), followed by translating this plan to the robot's embodiment. Specifically, we learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We combine this with a translation module that learns a plan-conditioned robot manipulation policy, and allows following human plans for generic manipulation tasks in a zero-shot manner with no deployment-time training. Importantly, while the plan predictor can leverage large-scale human videos for learning, the translation module only requires a small amount of in-domain data, and can generalize to tasks not seen during training. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top manipulation and diverse in-the-wild manipulation. https://homangab.github.io/hopman/

Introduction

In robotics, getting machines to perform tasks they were never explicitly trained on is a tantalizing yet challenging goal. This work tackles the challenge by translating human interaction plans into robot actions, building on the observation that humans perform a vast array of manipulations that robots could learn to emulate.

The Underlying Approach

At the core of this approach is a two-part system. The first part is a human plan predictor: given a current image of the scene and a goal image, it predicts future hand and object configurations, yielding an interaction plan. The second part is a translation module that converts this plan into actions the robot can execute.
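To make the factorization concrete, the sketch below shows how the two modules could fit together in PyTorch. The class names, the simple MLP encoders, and the dimensions are illustrative assumptions; the paper's actual plan predictor is a diffusion model over hand and object masks rather than an MLP.

```python
import torch
import torch.nn as nn

class PlanPredictor(nn.Module):
    """Illustrative stand-in: predicts a short sequence of future hand/object
    configurations from a current image and a goal image. The paper's real
    plan predictor is a diffusion model over masks, not this MLP."""
    def __init__(self, horizon: int = 8, plan_dim: int = 128):
        super().__init__()
        self.horizon, self.plan_dim = horizon, plan_dim
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU())
        self.head = nn.Linear(256, horizon * plan_dim)

    def forward(self, current_img: torch.Tensor, goal_img: torch.Tensor) -> torch.Tensor:
        # Condition jointly on the current observation and the desired goal.
        feat = self.encoder(torch.cat([current_img, goal_img], dim=1))
        return self.head(feat).view(-1, self.horizon, self.plan_dim)

class TranslationPolicy(nn.Module):
    """Illustrative stand-in: a plan-conditioned policy mapping one plan step
    plus the robot's current observation to a low-level robot action."""
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.obs_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256), nn.ReLU())
        self.policy = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.Linear(256, action_dim))

    def forward(self, plan_step: torch.Tensor, current_obs: torch.Tensor) -> torch.Tensor:
        obs_feat = self.obs_encoder(current_obs)
        return self.policy(torch.cat([plan_step, obs_feat], dim=-1))
```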

Rather than relying primarily on robot interaction data, the plan predictor learns from large-scale human videos available on the web. The translation module, by contrast, requires only a small amount of in-domain paired data. Together, the two modules let the system handle a broad range of tasks and objects with no additional training at deployment time.
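Operationally, "no deployment-time training" means the two (hypothetical) modules from the sketch above are simply frozen and chained at inference time. In the sketch below, `camera.capture()` and `robot.step()` are placeholder interfaces, not APIs from the paper.

```python
import torch

@torch.no_grad()
def zero_shot_manipulation(plan_predictor, translation_policy, camera, robot, goal_img):
    """Hypothetical deployment loop: both modules are frozen and chained.
    `camera` and `robot` are placeholder interfaces, not the paper's APIs."""
    current_img = camera.capture()
    plan = plan_predictor(current_img, goal_img)       # (batch, horizon, plan_dim)
    for plan_step in plan.unbind(dim=1):               # step through the predicted plan
        obs = camera.capture()
        action = translation_policy(plan_step, obs)    # plan-conditioned robot action
        robot.step(action)
```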

Learning from Humans

A notable aspect of this system is that it represents interaction plans visually, as hand and object motions, rather than attempting full future-image prediction. Concretely, a diffusion model trained on videos of human interactions generates likely future segmentation masks of the hand and the manipulated object.
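A generic way to train such a mask-prediction diffusion model is the standard DDPM noise-prediction objective, conditioned on the current and goal frames. The sketch below is a minimal version under assumed tensor shapes and a hypothetical `denoiser` signature; it is not the paper's exact training code.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, future_masks, current_img, goal_img, alphas_cumprod):
    """One DDPM-style noise-prediction step for future hand/object masks.
    `future_masks` is assumed to be (batch, horizon, channels, H, W);
    `denoiser(noisy, t, current_img, goal_img)` is a hypothetical signature."""
    b = future_masks.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=future_masks.device)
    noise = torch.randn_like(future_masks)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1, 1)
    # Forward (noising) process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    noisy = a_bar.sqrt() * future_masks + (1.0 - a_bar).sqrt() * noise
    # The denoiser predicts the injected noise, conditioned on current and goal frames.
    pred_noise = denoiser(noisy, t, current_img, goal_img)
    return F.mse_loss(pred_noise, noise)
```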

For physical execution, the translation module is trained on a limited set of paired human-robot data, learning to map human interaction plans onto robot movements that are then executed and evaluated in real-world environments.
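Training the translation module on that paired data can be as simple as supervised behavior cloning, as in the minimal sketch below; the loader format, loss, and optimizer settings are assumptions rather than the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

def train_translation_module(policy, paired_loader, epochs: int = 10, lr: float = 1e-4):
    """Minimal behavior-cloning loop on paired human-robot data.
    `paired_loader` is assumed to yield (human_plan_step, robot_obs, robot_action)
    tuples; the paper's exact data format and losses may differ."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for human_plan_step, robot_obs, robot_action in paired_loader:
            pred_action = policy(human_plan_step, robot_obs)
            loss = F.mse_loss(pred_action, robot_action)   # imitate the demonstrated action
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```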

Experimentation and Generalization

The experiments assessing this framework involved both a table-top setup with a robot arm and in-the-wild manipulation, where the robot operated in unstructured environments such as offices and kitchens. Across a repertoire of over 16 skills and interactions with 40 different objects, the robot showed a significant ability to generalize, demonstrating manipulation skills in diverse, previously unseen situations.

Using a structured evaluation of generalization across object categories, instances, skills, and configurations, the research shows how a robot can acquire manipulation skills without any on-site training. The system proved especially effective at translating human interactions observed in video into robot actions, pushing the envelope for zero-shot manipulation capabilities in robotics.

Authors (4)
  1. Homanga Bharadhwaj
  2. Abhinav Gupta
  3. Vikash Kumar
  4. Shubham Tulsiani