Diffusion Meets DAgger: Supercharging Eye-in-hand Imitation Learning (2402.17768v2)
Abstract: A common failure mode for policies trained with imitation is compounding execution errors at test time. When the learned policy encounters states that are not present in the expert demonstrations, the policy fails, leading to degenerate behavior. The Dataset Aggregation (DAgger) approach to this problem simply collects more data to cover these failure states. However, in practice, this is often prohibitively expensive. In this work, we propose Diffusion Meets DAgger (DMD), a method that reaps the benefits of DAgger without the cost for eye-in-hand imitation learning problems. Instead of collecting new samples to cover out-of-distribution states, DMD uses recent advances in diffusion models to synthesize these samples. This leads to robust performance from few demonstrations. We compare DMD against a behavior cloning (BC) baseline across four tasks: pushing, stacking, pouring, and shirt hanging. In pushing, DMD achieves an 80% success rate with as few as 8 expert demonstrations, where naive behavior cloning reaches only 20%. In stacking, DMD succeeds on average 92% of the time across 5 cups, versus 40% for BC. When pouring coffee beans, DMD transfers to another cup successfully 80% of the time. Finally, DMD attains a 90% success rate for hanging a shirt on a clothing rack.
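The idea of covering off-trajectory states with corrective labels can be illustrated with a toy sketch. This is not the paper's implementation: the function name `corrective_samples`, the uniform perturbation, and the labeling rule (move from the perturbed state to the expert's next state) are all assumptions made for illustration, and low-dimensional positions stand in for the camera images that DMD would synthesize with a diffusion model.

```python
import random


def corrective_samples(expert_states, num_per_state=2, scale=0.05, seed=0):
    """Create off-trajectory states labeled with actions that return to the expert path.

    expert_states: ordered list of 3D positions (tuples) visited by the expert.
    Returns a list of (perturbed_state, corrective_action) pairs.
    """
    rng = random.Random(seed)
    samples = []
    # Pair each expert state with the state the expert reached next.
    for current, target in zip(expert_states, expert_states[1:]):
        for _ in range(num_per_state):
            # Hypothetical perturbation: a small random offset from the expert state.
            perturbed = tuple(c + rng.uniform(-scale, scale) for c in current)
            # Assumed label: the displacement that brings the perturbed state
            # to the expert's next state. In DMD the observation at `perturbed`
            # would be a synthesized image; here the state is the observation.
            action = tuple(t - p for t, p in zip(target, perturbed))
            samples.append((perturbed, action))
    return samples


# Toy expert demonstration: a straight push along the x axis.
expert = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
augmented = corrective_samples(expert)
```

Training a behavior cloning policy on the expert pairs together with `augmented` would expose it to states slightly off the demonstrated path, which is the role that newly collected data plays in DAgger.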
Authors: Xiaoyu Zhang, Matthew Chang, Pranav Kumar, Saurabh Gupta