Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning (2407.15815v2)

Published 22 Jul 2024 in cs.RO, cs.AI, and cs.CV

Abstract: Can we endow visuomotor robots with generalization capabilities to operate in diverse open-world scenarios? In this paper, we propose Maniwhere, a generalizable framework tailored for visual reinforcement learning, enabling the trained robot policies to generalize across a combination of multiple visual disturbance types. Specifically, we introduce a multi-view representation learning approach fused with Spatial Transformer Network (STN) module to capture shared semantic information and correspondences among different viewpoints. In addition, we employ a curriculum-based randomization and augmentation approach to stabilize the RL training process and strengthen the visual generalization ability. To exhibit the effectiveness of Maniwhere, we meticulously design 8 tasks encompassing articulate objects, bi-manual, and dexterous hand manipulation tasks, demonstrating Maniwhere's strong visual generalization and sim2real transfer abilities across 3 hardware platforms. Our experiments show that Maniwhere significantly outperforms existing state-of-the-art methods. Videos are provided at https://gemcollector.github.io/maniwhere/.

Authors (6)
  1. Zhecheng Yuan (18 papers)
  2. Tianming Wei (3 papers)
  3. Shuiqi Cheng (1 paper)
  4. Gu Zhang (33 papers)
  5. Yuanpei Chen (28 papers)
  6. Huazhe Xu (93 papers)
Citations (9)

Summary

A Visual Generalizable Framework for Reinforcement Learning

The paper "Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning" by Zhecheng Yuan et al. presents Maniwhere, a novel framework designed to improve the generalizability of visuomotor robots in diverse open-world scenarios. The core focus of Maniwhere is to enable reinforcement learning (RL) agents to perform consistently across various visual conditions without requiring camera recalibration, which is often a significant obstacle in real-world deployments of robotic policies.

Methodology

Maniwhere relies on three key components to achieve this level of generalization:

  1. Multi-View Representation Learning: Maniwhere incorporates a multi-view representation objective that leverages images from fixed and moving viewpoints to extract viewpoint-invariant features. It uses an InfoNCE-based contrastive loss (Equation 1) to align representations from different viewpoints and an additional alignment loss (Equation 2) to enforce feature-map consistency across views; a code sketch of this objective follows the list.
  2. Spatial Transformer Network (STN): The framework integrates a Spatial Transformer Network within the visual encoder to handle variations in camera perspectives. The STN module performs perspective transformations, enhancing the model's robustness to spatial changes in the visual inputs.
  3. Curriculum-Based Domain Randomization: To stabilize RL training while preserving the benefits of domain randomization, Maniwhere employs a curriculum that gradually increases the magnitude of the randomization parameters over the course of training, preventing divergence and enabling effective sim2real transfer; a schedule sketch follows the code example below.
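
To make the multi-view objective concrete, here is a minimal PyTorch-style sketch. It assumes an encoder that returns both a global embedding and an intermediate feature map per view; the encoder interface, the temperature, and the use of a mean-squared alignment term are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def multi_view_losses(encoder, view_a, view_b, temperature=0.1):
    """Illustrative multi-view objective: an InfoNCE term that pulls together
    embeddings of the same scene from two viewpoints (in the spirit of Equation 1),
    plus an alignment term on intermediate feature maps (in the spirit of
    Equation 2). The interface and hyperparameters are assumptions."""
    z_a, feat_a = encoder(view_a)            # z: (B, D) embedding, feat: (B, C, H, W)
    z_b, feat_b = encoder(view_b)

    # InfoNCE: matching (i, i) pairs across views are positives,
    # every other pair in the batch acts as a negative.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature     # (B, B) cross-view similarity
    labels = torch.arange(z_a.size(0), device=z_a.device)
    info_nce = F.cross_entropy(logits, labels)

    # Feature-map alignment across views (a simple L2 penalty here).
    align = F.mse_loss(feat_a, feat_b)

    return info_nce, align
```

In practice these two terms would be added, with weighting coefficients, to the standard RL objective used to train the policy.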

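The curriculum over randomization strength can likewise be sketched as a schedule that widens the randomization ranges as training progresses. The warm-up length, ramp length, and the specific randomized quantities below are assumptions chosen for illustration, not the paper's actual configuration.

```python
import random

def randomization_scale(step, warmup_steps=100_000, ramp_steps=400_000):
    """Curriculum schedule: no randomization during an initial warm-up,
    then a linear ramp of the randomization magnitude up to full strength."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / ramp_steps)

def sample_randomized_params(step):
    """Scale illustrative randomization ranges (camera pose jitter, lighting,
    texture strength) by the current curriculum factor."""
    s = randomization_scale(step)
    return {
        "camera_yaw_deg": random.uniform(-30.0, 30.0) * s,
        "camera_pitch_deg": random.uniform(-15.0, 15.0) * s,
        "light_intensity_scale": 1.0 + random.uniform(-0.5, 0.5) * s,
        "texture_blend": random.uniform(0.0, 1.0) * s,
    }
```

At each environment reset, the sampled parameters would be applied to the simulator, so early training sees near-canonical scenes and later training sees increasingly varied ones.
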
Experimental Setup and Results

Maniwhere was evaluated on eight tasks spanning a variety of robotic embodiments, including single-arm and bi-manual setups, dexterous hand manipulation, and articulated-object handling. The framework was benchmarked against several state-of-the-art baselines: SRM, PIE-G, SGQN, MoVie, and MV-MWM.

The findings illustrate that Maniwhere substantially outperforms these baselines in both simulation and real-world tests:

  • Simulation Results: Maniwhere demonstrated superior generalization across different viewpoints and visual appearances, maintaining high success rates despite variations. Table 1 shows a +68.5% boost in average performance compared to the leading baselines.
  • Real-World Performance: The framework was tested in real-world conditions with three types of robotic arms and two types of dexterous hands. Results indicated a strong zero-shot sim2real transferability (Table 3), with significant performance margins over competitors.
  • Cross-Embodiment Generalization: Maniwhere was also adept at transferring learned skills across different robotic embodiments, showcasing its versatility and robustness.

Ablation Studies

The paper includes comprehensive ablation experiments to isolate the impact of key components such as the multi-view representation learning objective and the STN module. The results (Table 4) highlight the critical role of multi-view learning in achieving viewpoint invariance and the effectiveness of the STN module in improving robustness to spatial changes in the input.

Implications and Future Directions

The theoretical and practical implications of Maniwhere are significant:

  • Practical: The ability to generalize across various visual conditions without camera recalibration can drastically reduce the deployment time and costs in real-world robotic applications.
  • Theoretical: The integration of multi-view representation learning with spatial transformation and curriculum randomization provides a new paradigm for addressing the sim2real gap in visual RL.

Future work could extend Maniwhere to more complex, long-horizon manipulation tasks and investigate its application to mobile manipulation scenarios.

Conclusion

Maniwhere represents a robust and versatile framework for enhancing the visual generalization capabilities of RL agents. By combining multi-view representation learning, spatial transformations, and curriculum-based randomization, it sets a new benchmark in zero-shot sim2real transfer for visuomotor control tasks. The framework's significant performance improvements over existing methods highlight its potential for real-world robotic applications, paving the way for more adaptive and resilient AI systems in dynamic environments.
