RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Abstract: The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task generalist agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming action-labelled visual experience spanning a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot and through adaptation using only 100–1,000 examples of the target task. We also show how a trained model can itself be used to generate data for subsequent training iterations, providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer but also becomes more efficient at adapting to new tasks.
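The self-improvement loop described above can be sketched as follows. This is a minimal, hypothetical illustration of the cycle the abstract outlines (adapt the generalist on a small demonstration set, deploy the resulting specialist to self-generate trajectories, then retrain on the grown dataset); the `Agent` class and all method names are toy placeholders, not the authors' actual system or API.

```python
class Agent:
    """Toy stand-in for a goal-conditioned agent; tracks its training corpus."""

    def __init__(self, dataset=()):
        self.dataset = list(dataset)

    def finetune(self, demos):
        # Adaptation on 100-1,000 target-task examples yields a specialist.
        return Agent(self.dataset + list(demos))

    def rollout(self):
        # Placeholder for executing the policy on a robot and logging a trajectory.
        return "self-generated trajectory"

    def train(self, dataset):
        # Retraining the generalist on the enlarged, diversified corpus.
        return Agent(dataset)


def self_improvement_loop(agent, demos, iterations=2, rollouts_per_iter=10):
    for _ in range(iterations):
        # 1. Adapt the generalist to the new task with a few demonstrations.
        specialist = agent.finetune(demos)
        # 2. Use the specialist to generate new experience on the task.
        new_data = [specialist.rollout() for _ in range(rollouts_per_iter)]
        # 3. Fold demonstrations and self-generated data back in and retrain.
        agent = agent.train(agent.dataset + list(demos) + new_data)
    return agent
```

Each pass through the loop grows the training set with self-generated experience, which is the mechanism the paper credits for improved adaptation efficiency as data is diversified.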