
Robot Learning with Sensorimotor Pre-training (2306.10007v2)

Published 16 Jun 2023 in cs.RO, cs.CV, and cs.LG

Abstract: We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and actions, we encode the sequence into tokens, mask out a subset, and train a model to predict the missing content from the rest. We hypothesize that if a robot can predict the masked-out content it will have acquired a good model of the physical world that can enable it to act. RPT is designed to operate on latent visual representations which makes prediction tractable, enables scaling to larger models, and allows fast inference on a real robot. To evaluate our approach, we collected a dataset of 20,000 real-world trajectories over 9 months using a combination of motion planning and grasping algorithms. We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
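The abstract describes a BERT-style masked-prediction objective over interleaved sensorimotor tokens: each timestep contributes a latent visual token, a proprioceptive token, and an action token; a random subset is masked, and the model is trained to reconstruct the missing tokens from the rest. The sketch below illustrates only the masking/target construction, not the paper's actual implementation; the function name, the uniform random masking, the zero vector standing in for a learned [MASK] embedding, and the assumption that all three modalities are encoded to a common latent dimension are illustrative choices, not details from the paper.

```python
import numpy as np

def mask_sensorimotor_sequence(vision, proprio, actions, mask_ratio=0.5, rng=None):
    """Build a masked-prediction training example from a sensorimotor trajectory.

    vision, proprio, actions: arrays of shape (T, D) — per-timestep latent
    tokens, assumed (for this sketch) to share a common embedding dim D.
    Returns (inputs, masked_idx, targets): the token sequence with masked
    positions zeroed out, the indices that were masked, and the original
    tokens at those positions (the reconstruction targets).
    """
    rng = np.random.default_rng(rng)
    T = vision.shape[0]
    # Interleave per-timestep tokens: [v_0, p_0, a_0, v_1, p_1, a_1, ...]
    tokens = np.stack([vision, proprio, actions], axis=1).reshape(3 * T, -1)
    # Choose a random subset of token positions to mask.
    n_mask = int(mask_ratio * len(tokens))
    masked_idx = rng.choice(len(tokens), size=n_mask, replace=False)
    inputs = tokens.copy()
    inputs[masked_idx] = 0.0  # stand-in for a learned [MASK] embedding
    targets = tokens[masked_idx]
    return inputs, masked_idx, targets
```

A Transformer would then consume `inputs` and be trained (e.g. with an L2 loss) to predict `targets` at the masked positions — the "predict the missing content from the rest" objective the abstract hypothesizes yields a useful model of the physical world.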

