DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models (2402.13181v1)
Abstract: We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.
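The abstract describes two stages: an image-level retrieval of the most visually similar demonstrated object, followed by a pixel-level alignment of the end-effector with the novel object. The sketch below illustrates one plausible way to realise both stages with the publicly released DINO ViT-S/16 backbone from torch.hub; it is not the authors' implementation (their code is linked at the project page above). The function names, the mutual-nearest-neighbour patch matching, and the SVD-based rigid fit on matched 3D keypoints are illustrative assumptions, and the back-projection of matched patches to 3D points with a depth camera is assumed rather than shown.

```python
# Sketch of a retrieve-then-align pipeline with DINO ViT features.
# Assumes ImageNet-normalised (1, 3, H, W) tensors with H, W divisible by 16.
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval().to(device)


@torch.no_grad()
def image_embedding(img: torch.Tensor) -> torch.Tensor:
    """Image-level feature: the L2-normalised CLS-token output of the DINO ViT."""
    return F.normalize(dino(img.to(device)), dim=-1)            # (1, D)


@torch.no_grad()
def patch_features(img: torch.Tensor) -> torch.Tensor:
    """Pixel-level features: per-patch tokens from the last transformer block."""
    tokens = dino.get_intermediate_layers(img.to(device), n=1)[0]
    return F.normalize(tokens[:, 1:], dim=-1)                    # (1, P, D), CLS dropped


@torch.no_grad()
def retrieve(live_img: torch.Tensor, demo_imgs: list[torch.Tensor]) -> int:
    """Retrieval: index of the demonstrated object most similar to the live view."""
    live = image_embedding(live_img)                             # (1, D)
    demos = torch.cat([image_embedding(d) for d in demo_imgs])   # (N, D)
    return int((demos @ live.T).argmax())


@torch.no_grad()
def match_patches(demo_img: torch.Tensor, live_img: torch.Tensor) -> torch.Tensor:
    """Mutual nearest-neighbour patch correspondences between the two images."""
    fd, fl = patch_features(demo_img)[0], patch_features(live_img)[0]   # (P, D) each
    sim = fd @ fl.T                                              # cosine similarities
    d2l, l2d = sim.argmax(dim=1), sim.argmax(dim=0)
    idx_d = torch.arange(fd.shape[0], device=sim.device)
    mutual = l2d[d2l] == idx_d                                   # keep only mutual matches
    return torch.stack([idx_d[mutual], d2l[mutual]], dim=1)      # (M, 2) patch index pairs


def rigid_transform(P: torch.Tensor, Q: torch.Tensor):
    """Least-squares rigid motion (R, t) aligning matched (M, 3) point sets P onto Q via SVD."""
    cp, cq = P.mean(dim=0), Q.mean(dim=0)
    U, _, Vt = torch.linalg.svd((P - cp).T @ (Q - cq))
    d = torch.sign(torch.det(Vt.T @ U.T))                        # guard against reflections
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp
```

In such a pipeline, the matched patch indices would be converted to pixel coordinates, back-projected into the camera frame using depth, and the resulting (R, t) used to align the end-effector with the novel object before executing the behaviour learned from the retrieved object's demonstration, which is the kind of alignment the abstract describes.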