
For Pre-Trained Vision Models in Motor Control, Not All Policy Learning Methods are Created Equal (2304.04591v2)

Published 10 Apr 2023 in cs.CV and cs.RO

Abstract: In recent years, increasing attention has been directed to leveraging pre-trained vision models for motor control. While existing works mainly emphasize the importance of this pre-training phase, the arguably equally important role played by downstream policy learning during control-specific fine-tuning is often neglected. It thus remains unclear if pre-trained vision models are consistent in their effectiveness under different control policies. To bridge this gap in understanding, we conduct a comprehensive study on 14 pre-trained vision models using 3 distinct classes of policy learning methods, including reinforcement learning (RL), imitation learning through behavior cloning (BC), and imitation learning with a visual reward function (VRF). Our study yields a series of intriguing results, including the discovery that the effectiveness of pre-training is highly dependent on the choice of the downstream policy learning algorithm. We show that conventionally accepted evaluation based on RL methods is highly variable and therefore unreliable, and further advocate for using more robust methods like VRF and BC. To facilitate more universal evaluations of pre-trained models and their policy learning methods in the future, we also release a benchmark of 21 tasks across 3 different environments alongside our work.
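Of the three policy learning classes studied, behavior cloning (BC) is the simplest: actions are regressed directly onto features from a frozen pre-trained encoder. The sketch below is a minimal, hypothetical illustration of that setup, not the paper's actual pipeline; the stand-in "encoder" (a fixed random projection), the ridge-regression policy head, and all names are assumptions for demonstration only.

```python
import numpy as np

def behavior_clone(features, actions, reg=1e-3):
    """Fit a linear policy head so that a ≈ features @ W, via ridge
    regression in closed form. features: (N, D), actions: (N, A)."""
    D = features.shape[1]
    gram = features.T @ features + reg * np.eye(D)  # regularized Gram matrix
    W = np.linalg.solve(gram, features.T @ actions)
    return W  # (D, A); the encoder producing `features` stays frozen

# Toy demo: a fixed random projection stands in for a pre-trained encoder.
rng = np.random.default_rng(0)
obs = rng.normal(size=(256, 8))            # raw "observations"
encoder = 0.1 * rng.normal(size=(8, 32))   # frozen encoder (stand-in)
feats = np.tanh(obs @ encoder)             # phi(o): frozen features
expert = obs[:, :2]                        # toy "expert" actions
W = behavior_clone(feats, expert)
mse = float(np.mean((feats @ W - expert) ** 2))
```

Because the policy head is fit in closed form on fixed features, BC avoids the optimization instability that the paper identifies in RL-based evaluation, which is one reason the authors advocate BC and VRF as more reliable probes of pre-trained representations.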
