
Generalizable Visual Reinforcement Learning with Segment Anything Model (2312.17116v1)

Published 28 Dec 2023 in cs.LG, cs.CV, and cs.RO

Abstract: Learning policies that can generalize to unseen environments is a fundamental challenge in visual reinforcement learning (RL). While most current methods focus on acquiring robust visual representations through auxiliary supervision, pre-training, or data augmentation, the potential of modern vision foundation models remains underleveraged. In this work, we introduce Segment Anything Model for Generalizable visual RL (SAM-G), a novel framework that leverages the promptable segmentation ability of Segment Anything Model (SAM) to enhance the generalization capabilities of visual RL agents. We use image features from DINOv2 and SAM to find correspondences, which serve as point prompts to SAM; SAM then produces high-quality masked images that are fed directly to the agents. Evaluated across 8 DMControl tasks and 3 Adroit tasks, SAM-G significantly improves visual generalization without altering the RL agents' architecture, changing only their observations. Notably, SAM-G achieves 44% and 29% relative improvements over state-of-the-art methods on the challenging video-hard settings of DMControl and Adroit, respectively. Video and code: https://yanjieze.com/SAM-G/
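The pipeline the abstract describes — match foundation-model features to pick point prompts, segment, then hand the agent a masked observation — can be sketched in a few lines. The sketch below is illustrative only: it uses random arrays in place of real DINOv2/SAM features, cosine similarity as a stand-in correspondence measure, and a fixed mask where a SAM call would go; none of the function names come from the authors' code.

```python
import numpy as np

def select_point_prompts(ref_feat, frame_feats, k=3):
    """Pick the k pixel locations whose features best match a reference
    object feature (cosine similarity); these become point prompts for SAM.
    Hypothetical helper: not the authors' API."""
    H, W, C = frame_feats.shape
    flat = frame_feats.reshape(-1, C)
    sims = flat @ ref_feat / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(ref_feat) + 1e-8
    )
    idx = np.argsort(sims)[-k:]          # top-k most similar pixels
    ys, xs = np.unravel_index(idx, (H, W))
    return np.stack([xs, ys], axis=1)    # (k, 2) array of (x, y) prompts

def mask_observation(obs, mask):
    """Zero out background pixels so the agent observes only the
    segmented task-relevant objects, as SAM-G feeds masked images."""
    return obs * mask[..., None]

# Toy demo with random stand-ins for DINOv2/SAM features.
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(8, 8, 16))
ref_feat = frame_feats[2, 5]             # pretend this pixel is the object
prompts = select_point_prompts(ref_feat, frame_feats, k=3)

obs = rng.uniform(size=(8, 8, 3))
mask = np.zeros((8, 8))
mask[2, 5] = 1.0                          # a real SAM mask would go here
masked = mask_observation(obs, mask)
```

In the actual method, `frame_feats` would come from DINOv2/SAM encoders and the mask from a promptable SAM forward pass; the point is that the RL agent itself is untouched and only `obs` is replaced by `masked`.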

Authors (5)
  1. Ziyu Wang
  2. Yanjie Ze
  3. Yifei Sun
  4. Zhecheng Yuan
  5. Huazhe Xu
Citations (7)
