Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning (2407.15815v2)
Abstract: Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose **Maniwhere**, a generalizable framework tailored for visual reinforcement learning, enabling trained robot policies to generalize across combinations of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with a Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen visual generalization. To demonstrate the effectiveness of Maniwhere, we design 8 tasks encompassing articulated-object, bimanual, and dexterous-hand manipulation, showing Maniwhere's strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://gemcollector.github.io/maniwhere/.
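The abstract names two core mechanisms: a multi-view encoder fused with an STN module, and curriculum-scheduled randomization/augmentation. The sketch below illustrates both ideas in PyTorch; the layer sizes, the InfoNCE-style alignment loss, and the linear curriculum schedule are illustrative assumptions, not the authors' exact architecture or training recipe.

```python
# Minimal sketch of (1) an STN front-end on a visual encoder, (2) a multi-view
# alignment loss, and (3) a curriculum schedule for augmentation strength.
# All hyperparameters and layer shapes are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STN(nn.Module):
    """Predicts a 2x3 affine transform and warps the input image with it."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 16, 7, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, 6),
        )
        # Initialize the regressor to the identity transform for stable training.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

class MultiViewEncoder(nn.Module):
    """STN-warped CNN encoder producing L2-normalized features."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.stn = STN()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim),
        )

    def forward(self, x):
        return F.normalize(self.cnn(self.stn(x)), dim=-1)

def multiview_alignment_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss: pull together features of the same state seen from
    two cameras, push apart features of other states in the batch."""
    logits = z_a @ z_b.t() / temperature
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)

def curriculum_strength(step, total_steps):
    """Linearly ramp randomization/augmentation strength from 0 to 1 over the
    first half of training, then hold it at full strength."""
    return min(1.0, step / (0.5 * total_steps))

# Usage sketch: two synchronized camera views of the same batch of states.
enc = MultiViewEncoder()
view_a, view_b = torch.rand(8, 3, 84, 84), torch.rand(8, 3, 84, 84)
loss = multiview_alignment_loss(enc(view_a), enc(view_b))
```

In this reading, the STN lets the encoder actively re-register inputs from shifted or rotated viewpoints before feature extraction, while the curriculum keeps early RL training on mild augmentations and only gradually exposes the policy to heavy visual randomization.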
- TRANSIC: Sim-to-real policy transfer by learning from online correction. arXiv preprint arXiv:2405.10315, 2024.
- Look closer: Bridging egocentric and third-person views with transformers for robotic manipulation. IEEE Robotics and Automation Letters, 7(2):3046–3053, 2022.
- Visual reinforcement learning with self-supervised 3d representations. IEEE Robotics and Automation Letters, 8(5):2890–2897, 2023.
- Self-supervised policy adaptation during deployment. arXiv preprint arXiv:2007.04309, 2020.
- Multi-view masked world models for visual robotic manipulation. In International Conference on Machine Learning, pages 30613–30632. PMLR, 2023.
- CyberDemo: Augmenting simulated human demonstration for real-world dexterous manipulation. arXiv preprint arXiv:2402.14795, 2024.
- SPIN: Simultaneous perception, interaction and navigation. arXiv preprint arXiv:2405.07991, 2024.
- Dynamic handover: Throw and catch with bimanual hands. arXiv preprint arXiv:2309.05655, 2023.
- Stabilizing deep q-learning with convnets and vision transformers under data augmentation. Advances in Neural Information Processing Systems, 34, 2021.
- Generalization in reinforcement learning by soft data augmentation. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13611–13617. IEEE, 2021.
- Spectrum random masking for generalization in image-based reinforcement learning. Advances in Neural Information Processing Systems, 35:20393–20406, 2022.
- Don’t touch what matters: Task-aware Lipschitz data augmentation for visual reinforcement learning. arXiv preprint arXiv:2202.09982, 2022.
- Pre-trained image encoder for generalizable visual reinforcement learning. Advances in Neural Information Processing Systems, 35:13022–13037, 2022.
- Look where you look! Saliency-guided Q-networks for generalization in visual reinforcement learning. Advances in Neural Information Processing Systems, 35:30693–30706, 2022.
- Generalizable visual reinforcement learning with segment anything model. arXiv preprint arXiv:2312.17116, 2023.
- A comprehensive survey of data augmentation in visual reinforcement learning. arXiv preprint arXiv:2210.04561, 2022.
- Spatial transformer networks. Advances in Neural Information Processing Systems, 28, 2015.
- A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
- An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.
- Robo-abc: Affordance generalization beyond categories via semantic correspondence for robot manipulation. arXiv preprint arXiv:2401.07487, 2024.
- A tale of two features: Stable Diffusion complements DINO for zero-shot semantic correspondence. Advances in Neural Information Processing Systems, 36, 2024.
- RL-ViGen: A reinforcement learning benchmark for visual generalization. Advances in Neural Information Processing Systems, 36, 2024.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012.
- LEAP Hand: Low-cost, efficient, and anthropomorphic hand for robot learning. arXiv preprint arXiv:2309.06440, 2023.
- 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. In Proceedings of Robotics: Science and Systems (RSS), 2024.
- MoVie: Visual model-based policy adaptation for view generalization. Advances in Neural Information Processing Systems, 36, 2024.
- Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
- Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
- Dextreme: Transfer of agile in-hand manipulation from simulation to reality. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5977–5984. IEEE, 2023.
- Sequential dexterity: Chaining dexterous policies for long-horizon manipulation. arXiv preprint arXiv:2309.00987, 2023.
- Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1134–1141. IEEE, 2018.
- Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941, 2024.
- The colosseum: A benchmark for evaluating generalization for robotic manipulation. arXiv preprint arXiv:2402.08191, 2024.
- RVT: Robotic view transformer for 3d object manipulation. In Conference on Robot Learning, pages 694–710. PMLR, 2023.
- Scaling robot learning with semantically imagined experience. arXiv preprint arXiv:2302.11550, 2023.
- Decomposing the generalization gap in imitation learning for visual robotic manipulation. arXiv preprint arXiv:2307.03659, 2023.
- Vision-based manipulation from single human video with open-world object graphs. arXiv preprint arXiv:2405.20321, 2024.
- Normalization enhances generalization in visual reinforcement learning. arXiv preprint arXiv:2306.00656, 2023.
- Understanding what affects generalization gap in visual reinforcement learning: Theory and empirical evidence. arXiv preprint arXiv:2402.02701, 2024.
- Green screen augmentation enables scene generalisation in robotic manipulation. arXiv preprint arXiv:2407.07868, 2024.
- On pre-training for visuo-motor control: Revisiting a learning-from-scratch baseline. arXiv preprint arXiv:2212.05749, 2022.
- Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766, 2023.
- XIRL: Cross-embodiment inverse reinforcement learning. In Conference on Robot Learning, pages 537–546. PMLR, 2022.
- Temporal cycle-consistency learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1801–1810, 2019.
- Natural language can help bridge the sim2real gap. In Robotics: Science and Systems (RSS), 2024.
- For pre-trained vision models in motor control, not all policy learning methods are created equal. In International Conference on Machine Learning, pages 13628–13651. PMLR, 2023.
- Robot learning with sensorimotor pre-training. In Conference on Robot Learning, pages 683–693. PMLR, 2023.
- GNFactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on Robot Learning, pages 284–301. PMLR, 2023.
- PEAC: Unsupervised pre-training for cross-embodiment reinforcement learning. arXiv preprint arXiv:2405.14073, 2024.
- Extraneousness-aware imitation learning. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 2973–2979. IEEE, 2023.
- TACO: Temporal latent action-driven contrastive loss for visual reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=ezCsMOy1w9.
- Premier-TACO is a few-shot policy learner: Pretraining multitask representation via temporal action-driven contrastive loss. arXiv preprint, 2024.
- Visual representation learning with stochastic frame prediction. arXiv preprint arXiv:2406.07398, 2024.
- R3M: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
- Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022.
- Learning manipulation by predicting interaction. In Proceedings of Robotics: Science and Systems (RSS), 2024.
- Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
- H-InDex: Visual reinforcement learning with hand-informed representations for dexterous manipulation. Advances in Neural Information Processing Systems, 36, 2024.
- Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
- Zhecheng Yuan
- Tianming Wei
- Shuiqi Cheng
- Gu Zhang
- Yuanpei Chen
- Huazhe Xu