DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models (2402.13181v1)

Published 20 Feb 2024 in cs.RO and cs.LG

Abstract: We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.
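
The abstract describes two stages: retrieval of the most visually similar demonstration object using image-level DINO features, and alignment of the end-effector using pixel-level correspondences. The sketch below is a minimal illustration of those two stages, not the authors' released code: the global and dense DINO descriptors are assumed to come from a separate (hypothetical) feature extractor, and the matched 3D point pairs are assumed to be obtained by nearest-neighbour matching of dense ViT patch descriptors lifted to 3D with depth.

```python
# Minimal sketch (assumed interfaces, not the DINOBot implementation) of:
#   1) retrieval with image-level (CLS-token) DINO features, and
#   2) alignment via a least-squares rigid transform (SVD / Kabsch-Arun)
#      fitted to pixel-level correspondences lifted to 3D.
import numpy as np

def retrieve(live_global: np.ndarray, demo_globals: np.ndarray) -> int:
    """Index of the demo object whose image-level feature is most
    cosine-similar to the live object's feature."""
    live = live_global / np.linalg.norm(live_global)
    demos = demo_globals / np.linalg.norm(demo_globals, axis=1, keepdims=True)
    return int(np.argmax(demos @ live))

def best_fit_transform(P: np.ndarray, Q: np.ndarray):
    """Least-squares rigid transform (R, t) aligning 3D points P onto Q,
    computed with the standard SVD solution."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = cq - R @ cp
    return R, t
```

In a pipeline of this kind, the recovered transform could then be used to move the end-effector so that the live view of the novel object is aligned with the view recorded during the retrieved demonstration, after which the demonstrated interaction can be executed; this is the role the abstract assigns to the alignment stage.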

