Data Scaling Laws in Imitation Learning for Robotic Manipulation (2410.18647v1)
Abstract: Data scaling has revolutionized fields like natural language processing and computer vision, providing models with remarkable generalization capabilities. In this paper, we investigate whether similar data scaling laws exist in robotics, particularly in robotic manipulation, and whether appropriate data scaling can yield single-task robot policies that deploy zero-shot to any object within the same category in any environment. To this end, we conduct a comprehensive empirical study of data scaling in imitation learning. By collecting data across numerous environments and objects, we study how a policy's generalization performance changes with the number of training environments, objects, and demonstrations. Over the course of this study, we collect more than 40,000 demonstrations and execute over 15,000 real-world robot rollouts under a rigorous evaluation protocol. Our findings reveal several intriguing results. First, the policy's generalization performance follows a roughly power-law relationship with the number of environments and objects. Second, the diversity of environments and objects matters far more than the absolute number of demonstrations: once the number of demonstrations per environment or object reaches a certain threshold, additional demonstrations have minimal effect. Based on these insights, we propose an efficient data-collection strategy: with four data collectors working for one afternoon, we gather enough data for policies on two tasks to achieve roughly 90% success rates in novel environments with unseen objects.
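The power-law finding can be made concrete with a short curve-fitting sketch. The snippet below is a minimal illustration under invented numbers, not the paper's analysis code: it assumes the policy's failure rate decays as a * N^(-b) with the number of training environments N, and recovers the exponent by ordinary least squares in log-log space, the standard way such scaling exponents are estimated.

```python
# Minimal sketch (not the paper's code): estimating a power-law scaling
# exponent from hypothetical generalization measurements. We model the
# *failure* rate as failure ≈ a * N^(-b), so log(failure) is linear in
# log(N) and the exponent can be recovered by least squares.
import numpy as np

# Hypothetical data: number of training environments vs. the policy's
# success rate in unseen test environments (invented for illustration).
num_envs = np.array([1, 2, 4, 8, 16, 32])
success = np.array([0.20, 0.43, 0.60, 0.72, 0.80, 0.86])
failure = 1.0 - success

# Fit log(failure) = -b * log(num_envs) + log(a).
slope, log_a = np.polyfit(np.log(num_envs), np.log(failure), deg=1)
a, b = np.exp(log_a), -slope
print(f"fitted power law: failure ≈ {a:.2f} * N^(-{b:.2f})")

# Extrapolate to an unseen scale; modeling failure (rather than success)
# as a power law keeps the predicted success rate bounded below 1.
print(f"predicted success at N=64: {1 - a * 64 ** -b:.2f}")
```

Fitting the failure rate rather than the raw success rate is one reasonable modeling choice here, since an unbounded power law in success would eventually exceed 100% when extrapolating to larger environment counts.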
Authors: Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, Yang Gao