On Pre-Training for Visuo-Motor Control: Revisiting a Learning-from-Scratch Baseline (2212.05749v2)

Published 12 Dec 2022 in cs.LG, cs.CV, and cs.RO

Abstract: In this paper, we examine the effectiveness of pre-training for visuo-motor control tasks. We revisit a simple Learning-from-Scratch (LfS) baseline that incorporates data augmentation and a shallow ConvNet, and find that this baseline is surprisingly competitive with recent approaches (PVR, MVP, R3M) that leverage frozen visual representations trained on large-scale vision datasets -- across a variety of algorithms, task domains, and metrics in simulation and on a real robot. Our results demonstrate that these methods are hindered by a significant domain gap between the pre-training datasets and current benchmarks for visuo-motor control, which is alleviated by finetuning. Based on our findings, we provide recommendations for future research in pre-training for control and hope that our simple yet strong baseline will aid in accurately benchmarking progress in this area.
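
The abstract attributes the baseline's strength to two ingredients: a shallow ConvNet encoder trained from scratch and image data augmentation. Below is a minimal sketch of those two ingredients in PyTorch. It is not the authors' released code; the layer sizes, feature dimension, input resolution, and pad width are illustrative assumptions, and the augmentation shown is the common random-shift (pad-and-crop) variant used in pixel-based RL.

```python
# Minimal sketch (assumed details, not the paper's exact architecture):
# a shallow ConvNet encoder plus random-shift image augmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShallowConvEncoder(nn.Module):
    """4-layer ConvNet mapping an 84x84 frame stack to a feature vector."""
    def __init__(self, in_channels=9, feature_dim=50):  # 3 stacked RGB frames
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=1), nn.ReLU(),
        )
        # With 84x84 inputs and the strides above, the spatial map is 35x35.
        self.fc = nn.Linear(32 * 35 * 35, feature_dim)

    def forward(self, obs):
        x = self.convs(obs / 255.0)  # scale raw pixels to [0, 1]
        return self.fc(x.flatten(1))

def random_shift(imgs, pad=4):
    """Pad with edge replication, then randomly crop back to the original
    size: a standard augmentation for visuo-motor control from pixels."""
    imgs = imgs.float()
    n, _, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(n):
        top = torch.randint(0, 2 * pad + 1, (1,)).item()
        left = torch.randint(0, 2 * pad + 1, (1,)).item()
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out

# Usage: augment each sampled batch before encoding, then feed the
# features to the downstream RL or imitation objective.
obs = torch.randint(0, 256, (8, 9, 84, 84), dtype=torch.uint8)
features = ShallowConvEncoder()(random_shift(obs))  # shape (8, 50)
```

In this setup the encoder is optimized jointly with the control objective, so no frozen pre-trained representation is involved; that is the Learning-from-Scratch recipe the paper benchmarks against PVR, MVP, and R3M.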

References (45)
  1. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
  2. RT-1: Robotics Transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
  3. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  4. When vision transformers outperform ResNets without pretraining or strong data augmentations. arXiv preprint arXiv:2106.01548, 2021.
  5. PaLM: Scaling language modeling with Pathways. arXiv preprint arXiv:2204.02311, 2022.
  6. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
  7. Unsupervised visual representation learning by context prediction. In IEEE International Conference on Computer Vision (ICCV), pp. 1422–1430, 2015.
  8. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2021.
  9. Ego4D: Around the world in 3,000 hours of egocentric video. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18995–19012, 2022.
  10. Generalization in reinforcement learning by soft data augmentation. In International Conference on Robotics and Automation (ICRA), 2021.
  11. Stabilizing deep Q-learning with ConvNets and vision transformers under data augmentation. In NeurIPS, 2021.
  12. MoDem: Accelerating visual model-based reinforcement learning with demonstrations. arXiv preprint, 2022.
  13. Temporal difference learning for model predictive control. In ICML, 2022.
  14. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
  15. Momentum contrast for unsupervised visual representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9726–9735, 2020.
  16. Masked autoencoders are scalable vision learners. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15979–15988, 2022.
  17. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649, 2021.
  18. Reinforcement learning with augmented data. arXiv preprint arXiv:2004.14990, 2020.
  19. A simple randomization technique for generalization in deep reinforcement learning. arXiv preprint arXiv:1910.05396, 2019.
  20. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2016.
  21. A comprehensive survey of data augmentation in visual reinforcement learning. arXiv preprint arXiv:2210.04561, 2022.
  22. R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
  23. The (un)surprising effectiveness of pre-trained vision models for control. In ICML, 2022.
  24. Asymmetric actor critic for image-based robot learning. arXiv preprint arXiv:1710.06542, 2017.
  25. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pp. 8748–8763, 2021.
  26. Automatic data augmentation for generalization in deep reinforcement learning. arXiv preprint arXiv:2006.12862, 2020.
  27. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2018.
  28. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115:211–252, 2015.
  29. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  30. Data-efficient reinforcement learning with self-predictive representations. In ICLR, 2021.
  31. Unsupervised perceptual rewards for imitation learning. arXiv preprint arXiv:1612.06699, 2016.
  32. RRL: ResNet as representation for reinforcement learning. arXiv preprint arXiv:2107.03380, 2021.
  33. Reinforcement learning with latent flow. In Neural Information Processing Systems (NeurIPS), 2021.
  34. CURL: Contrastive unsupervised representations for reinforcement learning. In ICML, 2020.
  35. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  36. DeepMind Control Suite. Technical report, DeepMind, 2018.
  37. Domain randomization for transferring deep neural networks from simulation to the real world. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
  38. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  39. VRL3: A data-driven framework for visual deep reinforcement learning. arXiv preprint arXiv:2202.10324, 2022.
  40. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022.
  41. On the feasibility of cross-task transfer with model-based reinforcement learning. arXiv preprint arXiv:2210.10763, 2022.
  42. Improving sample efficiency in model-free reinforcement learning from images. 2019.
  43. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021.
  44. Pre-trained image encoder for generalizable visual reinforcement learning. arXiv preprint arXiv:2212.08860, 2022.
  45. Visual reinforcement learning with self-supervised 3D representations. arXiv preprint arXiv:2210.07241, 2022.
Authors (8)
  1. Nicklas Hansen
  2. Zhecheng Yuan
  3. Yanjie Ze
  4. Tongzhou Mu
  5. Aravind Rajeswaran
  6. Hao Su
  7. Huazhe Xu
  8. Xiaolong Wang
Citations (55)
