DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (2403.12945v1)

Published 19 Mar 2024 in cs.RO

Abstract: The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.

DROID: Unveiling a Large-Scale In-The-Wild Robot Manipulation Dataset

Introduction to DROID

DROID (Distributed Robot Interaction Dataset) is a large-scale, diverse dataset aimed at advancing robotic manipulation research. It comprises 76k demonstration trajectories, or 350 hours of interaction data, collected across 564 scenes and 86 tasks by 50 data collectors in North America, Asia, and Europe over 12 months. Each trajectory includes synchronized RGB camera streams, camera calibration data, depth information, and natural language instructions, making the dataset a comprehensive resource for developing and testing robotic manipulation policies.
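As a rough illustration of how one might browse this per-trajectory structure, the sketch below loads a single episode from a local copy of the public release, which is distributed in the RLDS/TFDS episode format. The directory path is a placeholder, and the exact field names depend on the release version, so the example only prints the keys it finds rather than assuming a schema.

```python
# Minimal sketch for inspecting DROID trajectories, assuming the public
# RLDS/TFDS release has been downloaded locally.  The path below is a
# placeholder; field names are printed rather than hard-coded because the
# exact schema depends on the dataset version.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("/data/droid/1.0.0")  # hypothetical path
dataset = builder.as_dataset(split="train")

for episode in dataset.take(1):
    steps = list(episode["steps"])  # one demonstration = an ordered sequence of steps
    first = steps[0]
    print("trajectory length:", len(steps))
    print("step keys:", list(first.keys()))                    # e.g. observation, action, ...
    print("observation keys:", list(first["observation"].keys()))
```

From such episodes, a policy-learning pipeline would typically assemble (observation, action) training pairs together with the associated language instruction.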

Dataset Collection and Composition

The DROID dataset is the product of a collaborative effort across 13 institutions, all using a standardized hardware setup built around the Franka Panda robot arm. This hardware consistency across diverse collection sites was instrumental in accumulating a large and varied dataset. Data collection protocols were designed to maximize diversity: collectors were encouraged to record a wide variety of tasks and to frequently rearrange scenes. After collection, all trajectories were post-processed, including crowdsourced natural language annotation, to ensure high-quality, usable data for research.

Overview of Data Diversity

A distinguishing feature of DROID is its diversity along the dimensions that matter most for robotic manipulation research: tasks, objects, scenes, viewpoints, and interaction points. The dataset covers a wide range of manipulation tasks, as reflected in the varied verbs used in the instructions accompanying the trajectories, and features interactions with a broad array of everyday objects across numerous scenes, including industrial and home environments, offices, kitchens, dining rooms, and bedrooms. The data also spans a multitude of camera viewpoints and interaction locations, providing a rich resource for developing policies that generalize across settings.
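To make the verb-diversity point concrete, here is a small, hypothetical sketch of such an analysis: it lemmatizes instruction strings with spaCy and counts distinct verbs. The instructions shown are made-up examples, and the paper's actual annotation analysis may use a different pipeline.

```python
# Illustrative sketch of a verb-diversity analysis over language instructions:
# extract verb lemmas from each instruction and count the distinct ones.
# The instruction strings below are made-up examples.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

instructions = [
    "pick up the mug and place it on the shelf",
    "open the top drawer",
    "wipe the table with the cloth",
]

verbs = Counter()
for text in instructions:
    doc = nlp(text)
    verbs.update(tok.lemma_ for tok in doc if tok.pos_ == "VERB")

print("distinct verbs:", len(verbs))
print(verbs.most_common(5))
```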

Experimental Validation

Experimental evaluation of policies trained using the DROID dataset demonstrates significant improvements in performance and robustness across a variety of tasks and conditions. Policies co-trained with DROID outperform those trained solely on in-domain data or with other existing large-scale datasets. The diversity inherent in DROID, especially in terms of scene variety, plays a crucial role in these advancements. Tasks for evaluation were selected to cover a wide spectrum of real robot usage scenarios, ranging from simple manipulation tasks to more complex, multi-step processes. The resultant policies showed marked improvements in handling both in-distribution and out-of-distribution (OOD) variations, highlighting DROID's efficacy in enhancing policy generalization.
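The co-training setup can be pictured as a simple data-mixing loop: each batch combines in-domain demonstrations with demonstrations sampled from DROID at a fixed ratio. The sketch below is a generic illustration under that assumption; the batch size, mixing fraction, and dataset objects are placeholders rather than the paper's exact configuration.

```python
# Minimal sketch of co-training-style data mixing: each batch draws a fixed
# fraction of samples from DROID and the remainder from a small in-domain
# dataset.  Batch size, fraction, and the toy data are illustrative only.
import random

def cotraining_batches(in_domain, droid, batch_size=64, droid_fraction=0.5):
    """Yield batches mixing in-domain and DROID samples at a fixed fraction."""
    n_droid = int(batch_size * droid_fraction)
    n_in = batch_size - n_droid
    while True:
        batch = random.sample(droid, n_droid) + random.sample(in_domain, n_in)
        random.shuffle(batch)
        yield batch

# Usage with toy stand-in data (a real pipeline would draw trajectory chunks):
in_domain = [{"source": "in_domain", "idx": i} for i in range(200)]
droid = [{"source": "droid", "idx": i} for i in range(10_000)]
batch = next(cotraining_batches(in_domain, droid))
print(sum(x["source"] == "droid" for x in batch), "DROID samples in the batch")
```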

Concluding Remarks and Future Directions

The introduction of DROID represents a milestone in robotic manipulation research, offering a dataset of unprecedented scale and diversity. Its successful application in improving policy performance and robustness signals the dataset's potential as a cornerstone for future research endeavors in the field. The detailed documentation, open-sourced data, and adaptable hardware platform accompanying DROID further its accessibility and utility to the broader research community. Future explorations could delve into optimizing the utilization of this diverse dataset, investigating novel learning paradigms, and expanding the dataset to encompass even wider scenarios and tasks.

The DROID dataset, through its comprehensive design and demonstrated utility, opens new avenues for the development of robust, generalizable robotic manipulation policies, marking a significant step forward in the quest towards versatile and adaptable robotic systems.
