On the Effectiveness of Retrieval, Alignment, and Replay in Manipulation (2312.12345v1)
Abstract: Imitation learning with visual observations is notoriously inefficient when addressed with end-to-end behavioural cloning methods. In this paper, we explore an alternative paradigm which decomposes reasoning into three phases. First, a retrieval phase, which informs the robot what it can do with an object. Second, an alignment phase, which informs the robot where to interact with the object. And third, a replay phase, which informs the robot how to interact with the object. Through a series of real-world experiments on everyday tasks, such as grasping, pouring, and inserting objects, we show that this decomposition brings unprecedented learning efficiency and effective inter- and intra-class generalisation. Videos are available at https://www.robot-learning.uk/retrieval-alignment-replay.
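The three-phase decomposition lends itself to a small amount of glue code. The sketch below is illustrative only, not the authors' implementation: it assumes object descriptors come from some pretrained visual backbone, models alignment as a rigid-body fit between corresponded 3D keypoints (here via the Kabsch algorithm), and replays the demonstrated end-effector velocities open-loop. All names (`Demonstration`, `retrieve`, `align`, `replay`, `send_velocity`) are hypothetical.

```python
import numpy as np

# Hypothetical demonstration record (illustrative, not the paper's API):
# a pooled visual descriptor of the demo object plus the end-effector
# velocities recorded during the demonstration.
class Demonstration:
    def __init__(self, features: np.ndarray, trajectory: np.ndarray):
        self.features = features      # (D,) object descriptor from a pretrained backbone
        self.trajectory = trajectory  # (T, 6) end-effector twists over the demo

def retrieve(live_features: np.ndarray, demos: list) -> Demonstration:
    """Phase 1 (what): pick the demo whose object best matches the live
    scene, here by cosine similarity between pooled visual descriptors."""
    sims = [
        live_features @ d.features
        / (np.linalg.norm(live_features) * np.linalg.norm(d.features))
        for d in demos
    ]
    return demos[int(np.argmax(sims))]

def align(demo_keypoints: np.ndarray, live_keypoints: np.ndarray):
    """Phase 2 (where): estimate the rigid transform (R, t) that maps the
    demo object onto the live object from N corresponded 3D keypoints,
    via the Kabsch algorithm. Moving the gripper by this transform
    recreates the demo's relative starting pose on the new instance."""
    mu_d, mu_l = demo_keypoints.mean(axis=0), live_keypoints.mean(axis=0)
    H = (demo_keypoints - mu_d).T @ (live_keypoints - mu_l)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_l - R @ mu_d
    return R, t

def replay(demo: Demonstration, send_velocity) -> None:
    """Phase 3 (how): once aligned, replay the demonstrated end-effector
    velocities open-loop through a robot-specific callback."""
    for v in demo.trajectory:
        send_velocity(v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demos = [Demonstration(rng.normal(size=384), rng.normal(size=(50, 6)))
             for _ in range(3)]
    best = retrieve(rng.normal(size=384), demos)   # what can I do?
    pts = rng.normal(size=(8, 3))
    theta = 0.3                                    # synthetic rigid motion
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
    R, t = align(pts, pts @ Rz.T + np.array([0.1, 0.0, 0.2]))  # where?
    replay(best, send_velocity=lambda v: None)     # how do I interact?
```

One consequence of this structure, under the assumptions above: retrieval and replay reuse the demonstration as-is, so only the alignment step has to generalise across object instances, which is where the learning efficiency of the decomposition comes from.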