Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots (2402.10329v3)

Published 15 Feb 2024 in cs.RO

Abstract: We present Universal Manipulation Interface (UMI) -- a data collection and policy learning framework that allows direct skill transfer from in-the-wild human demonstrations to deployable robot policies. UMI employs hand-held grippers coupled with careful interface design to enable portable, low-cost, and information-rich data collection for challenging bimanual and dynamic manipulation demonstrations. To facilitate deployable policy learning, UMI incorporates a carefully designed policy interface with inference-time latency matching and a relative-trajectory action representation. The resulting learned policies are hardware-agnostic and deployable across multiple robot platforms. Equipped with these features, UMI framework unlocks new robot manipulation capabilities, allowing zero-shot generalizable dynamic, bimanual, precise, and long-horizon behaviors, by only changing the training data for each task. We demonstrate UMI's versatility and efficacy with comprehensive real-world experiments, where policies learned via UMI zero-shot generalize to novel environments and objects when trained on diverse human demonstrations. UMI's hardware and software system is open-sourced at https://umi-gripper.github.io.

Summary

  • The paper introduces UMI, a novel framework that transfers complex human manipulation skills to robots without requiring real-world robots during training.
  • It combines GoPro cameras, fisheye lenses, and onboard IMU data to capture rich visuomotor data for precise, rapid movements.
  • Experimental results show a 70% zero-shot success rate in dynamic tasks, underscoring UMI's robust generalization across diverse environments.

Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots

The paper discusses the Universal Manipulation Interface (UMI), a novel framework that enables the transfer of complex human manipulation skills to robotic systems without requiring real-world robotic counterparts during training. The research addresses the main challenge of skill transfer from human demonstrations to robotic systems, which is pivotal for enhancing robot dexterity in dynamic and unstructured environments.

Summary of Findings

The UMI framework is a sophisticated yet portable, low-cost system that circumvents the limitations of traditional teleoperation and passive video-based demonstrations. It employs hand-held grippers, strategically augmented with fisheye lenses and side mirrors, for in-the-wild data collection. This setup captures a comprehensive suite of sensory data essential for learning robust visuomotor policies. The key advancements by UMI include:

  1. Intuitive and Rich Data Collection: Fisheye lenses provide wide field-of-view observations, and side mirrors on the gripper add implicit stereo depth cues from a single camera, expanding visual context and depth information without the complexity of a multi-camera rig (see the mirror-crop sketch after this list).
  2. Precision in Rapid Movements: The GoPro camera's onboard IMU data, recorded alongside the video, keeps tracking accurate during fast, dynamic motions and allows recovery of precise, metric-scale actions.
  3. Latency Matching: UMI's policy interface synchronizes observation and action timelines to each platform's measured hardware latencies, so fast, dynamic manipulations remain effective during real-time deployment (see the latency-matching sketch after this list).
  4. Hardware-Agnostic Policy Representation: Actions are expressed as trajectories relative to the current gripper pose rather than in a robot-specific frame, so policies trained on data from the hand-held gripper can be deployed across diverse robot platforms (see the relative-trajectory sketch after this list).
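
The mirror-based depth cue in item 1 can be illustrated with a small sketch: crop the pixel regions covered by the two side mirrors and flip them horizontally so their contents match the orientation of the main view. The ROI coordinates and function name below are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical pixel regions of the two side mirrors in the fisheye frame;
# real coordinates depend on the mirror mounting and camera calibration.
LEFT_MIRROR_ROI = (slice(40, 260), slice(0, 180))       # (rows, cols)
RIGHT_MIRROR_ROI = (slice(40, 260), slice(1740, 1920))

def split_fisheye_observation(frame: np.ndarray) -> dict:
    """Return the main fisheye view plus horizontally flipped mirror crops.

    Flipping the mirror crops restores their left/right orientation, so the
    crops act as extra viewpoints (implicit stereo) from a single camera.
    """
    left = frame[LEFT_MIRROR_ROI][:, ::-1]
    right = frame[RIGHT_MIRROR_ROI][:, ::-1]
    return {"main": frame, "mirror_left": left, "mirror_right": right}
```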
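
The latency matching in item 3 has two parts: aligning observation streams that arrive with different delays, and skipping predicted actions that would already be in the past once the robot can execute them. Below is a rough sketch of both, assuming illustrative latency values and a simple (receive_time, data) buffer format; the released code may structure this differently.

```python
import time

# Illustrative latencies (seconds); real values are measured per camera,
# robot arm, and gripper, and differ across hardware platforms.
CAMERA_LATENCY = 0.10      # image capture -> frame available on host
ROBOT_OBS_LATENCY = 0.01   # robot state capture -> state available on host
EXECUTION_LATENCY = 0.10   # command sent -> robot actually starts moving

def synchronized_observation(image_buffer, state_buffer):
    """Pair the newest image with the robot state captured at the same instant.

    Both buffers hold (receive_time, data) tuples; subtracting each stream's
    latency recovers the physical capture time used for alignment.
    """
    recv_time, image = image_buffer[-1]
    capture_time = recv_time - CAMERA_LATENCY
    _, state = min(
        state_buffer,
        key=lambda item: abs((item[0] - ROBOT_OBS_LATENCY) - capture_time),
    )
    return image, state, capture_time

def executable_actions(action_traj, obs_time, dt=0.1):
    """Drop predicted actions that would land in the past by execution time.

    action_traj[i] is intended to execute at obs_time + i * dt; anything
    scheduled before now + EXECUTION_LATENCY is skipped so that fast,
    dynamic motions stay on schedule despite inference delay.
    """
    now = time.time()
    return [
        (obs_time + i * dt, action)
        for i, action in enumerate(action_traj)
        if obs_time + i * dt >= now + EXECUTION_LATENCY
    ]
```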
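
The hardware-agnostic representation in item 4 amounts to expressing predicted gripper poses relative to the current gripper pose instead of a robot-specific base frame. A minimal sketch with homogeneous transforms (function names are assumptions for illustration):

```python
import numpy as np

def absolute_to_relative(current_pose, future_poses):
    """Re-express future end-effector poses relative to the current pose.

    current_pose and each entry of future_poses are 4x4 homogeneous
    transforms in the same fixed frame (e.g. the SLAM map frame during
    data collection). The returned transforms satisfy
    future = current @ T_rel, so the action no longer depends on where
    that frame happens to be.
    """
    inv_current = np.linalg.inv(current_pose)
    return [inv_current @ T for T in future_poses]

def relative_to_absolute(current_pose, relative_actions):
    """Invert the mapping at deployment time, in any robot's own frame."""
    return [current_pose @ T_rel for T_rel in relative_actions]
```

At training time the relative targets come from the tracked trajectory of the hand-held gripper; at deployment the same mapping is applied around the robot's own end-effector pose, which is what lets one policy run on multiple arms.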

Numerical Outcomes and Experimental Results

The experimental evaluations highlighted several tasks, including complex bimanual manipulations and dynamic object sorting, demonstrating UMI's capability to execute diverse action modalities not feasible with existing systems. Notably, the framework exhibited a 70% success rate in zero-shot generalization across novel environments and objects, showcasing impressive generalization capabilities seldom observed in standard behavior cloning. This outcome underscores the effectiveness of UMI in capturing and deploying skills without additional fine-tuning on target robot platforms.

Implications and Future Directions

UMI's open-sourced hardware and software design broadens the accessibility of robotics research and facilitates collaborative progress in robot skill acquisition. Its demonstrated ability to generalize across environments suggests potential applications in household robotics, autonomous manipulation tasks, and unstructured outdoor settings.

The findings evoke several pathways for future research within AI and robotics:

  • Scalability and Ergonomic Improvements: Refining the gripper's ergonomics and supporting end-effectors with more degrees of freedom could widen the range of human skills the interface can capture.
  • Enhanced Collaborative Data Collection: By fostering distributed data collection from non-expert users globally, UMI can assist in collating vast datasets essential for training comprehensive, adaptable robotic systems.
  • Integration with Advanced Learning Models: Combining UMI data with multi-task and continual learning methods could further improve the adaptability and robustness of robotic systems.

In conclusion, the Universal Manipulation Interface represents a significant stride towards democratizing robotic skill acquisition, leveraging diverse human demonstrations for robust policy training without real-world robot dependencies. The broad, transferable nature of the collected data positions UMI as a key player in revolutionizing how robots learn and interact in dynamic human environments.
