RVT: Robotic View Transformer for 3D Object Manipulation (2306.14896v1)
Abstract: For 3D object manipulation, methods that build an explicit 3D representation perform better than those relying only on camera images. But using explicit 3D representations like voxels comes at a large computing cost, adversely affecting scalability. In this work, we propose RVT, a multi-view transformer for 3D manipulation that is both scalable and accurate. Key features of RVT are an attention mechanism for aggregating information across views and re-rendering of the camera input from virtual views around the robot workspace. In simulation, we find that a single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing state-of-the-art method (PerAct). It also trains 36X faster than PerAct to reach the same performance and achieves 2.3X PerAct's inference speed. Further, RVT can perform a variety of manipulation tasks in the real world with just a few ($\sim$10) demonstrations per task. Visual results, code, and the trained model are provided at https://robotic-view-transformer.github.io/.
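The cross-view aggregation mentioned above can be sketched as plain scaled dot-product self-attention over per-view features. This is a hypothetical, heavily simplified illustration, not RVT's actual architecture: the real model applies multi-head attention over image-patch tokens rendered from each virtual view, whereas here each view is collapsed to a single feature vector.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_view_attention(view_feats):
    """Let every virtual view attend to every other view.

    view_feats: list of per-view feature vectors (one per virtual view
    around the workspace). Returns one attended vector per view.
    Hypothetical simplification: single head, views as single tokens.
    """
    d = len(view_feats[0])
    out = []
    for q in view_feats:
        # Score this view's query against every view's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in view_feats]
        weights = softmax(scores)
        # Weighted sum of value vectors (values = raw view features here).
        attended = [sum(w * v[i] for w, v in zip(weights, view_feats))
                    for i in range(d)]
        out.append(attended)
    return out
```

With identical views the attention weights are uniform and each output equals the input feature, which is a quick sanity check that the weighting sums to one.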
- IFOR: Iterative flow minimization for robotic object rearrangement. arXiv preprint arXiv:2202.00732, 2022.
- BC-Z: Zero-shot task generalization with robotic imitation learning. In 5th Annual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=8kbp23tSGYv.
- CLIPort: What and where pathways for robotic manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL), 2021.
- Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, pages 726–747. PMLR, 2021.
- Coarse-to-fine q-attention: Efficient learning for visual robotic manipulation via discretisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13739–13748, 2022.
- Perceiver-Actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning (CoRL), 2022.
- RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.
- Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–4664. PMLR, 2021.
- Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
- Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019.
- RMA: Rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034, 2021.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Learning to fly: computational controller design for hybrid uavs with reinforcement learning. ACM Transactions on Graphics (TOG), 38(4):1–12, 2019.
- On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
- A. Goyal and J. Deng. Packit: A virtual environment for geometric planning. In International Conference on Machine Learning, pages 3700–3710. PMLR, 2020.
- ViNG: Learning open-world navigation with visual goals. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 13215–13222. IEEE, 2021.
- Learning vision-guided quadrupedal locomotion end-to-end with cross-modal transformers. arXiv preprint arXiv:2107.03996, 2021.
- Generalization in dexterous manipulation via geometry-aware multi-task learning. arXiv preprint arXiv:2111.03062, 2021.
- A system for general in-hand object re-orientation. Conference on Robot Learning, 2021.
- Visual-locomotion: Learning to walk on complex terrains with vision. In 5th Annual Conference on Robot Learning, 2021. URL https://openreview.net/forum?id=NDYbXf-DvwZ.
- Offline reinforcement learning for visual navigation. In 6th Annual Conference on Robot Learning, 2022. URL https://openreview.net/forum?id=uhIfIEIiWm_.
- Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019.
- Embed to control: A locally linear latent dynamics model for control from raw images. Advances in neural information processing systems, 28, 2015.
- Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=GY6-6sTvGaf.
- Mastering visual continuous control: Improved data-augmented reinforcement learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=_SJ-_yyes8.
- RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- MIRA: Mental imagery for robotic affordances. In K. Liu, D. Kulic, and J. Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 1916–1927. PMLR, 14–18 Dec 2023.
- Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
- Robotic grasping of novel objects. Advances in neural information processing systems, 19, 2006.
- Multi-task policy search for robotics. In 2014 IEEE international conference on robotics and automation (ICRA), pages 3876–3881. IEEE, 2014.
- MT-Opt: Continuous multi-task robotic reinforcement learning at scale. arXiv preprint arXiv:2104.08212, 2021.
- Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE international conference on robotics and automation (ICRA), pages 2169–2176. IEEE, 2017.
- Distral: Robust multitask reinforcement learning. Advances in neural information processing systems, 30, 2017.
- Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- Rel3D: A minimally contrastive benchmark for grounding spatial relations in 3D. Advances in Neural Information Processing Systems, 33:10514–10525, 2020.
- C. Lynch and P. Sermanet. Language conditioned imitation learning over unstructured data. arXiv preprint arXiv:2005.07648, 2020.
- What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters, 7(4):11205–11212, 2022.
- Learning language-conditioned robot behavior from offline data and crowd-sourced annotation. In Conference on Robot Learning, pages 1303–1315. PMLR, 2022.
- A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
- ProgPrompt: Generating situated robot task plans using large language models. In ICRA, 2022.
- Differentiable spatial planning using transformers. In International Conference on Machine Learning, pages 1484–1495. PMLR, 2021.
- Assistive tele-op: Leveraging transformers to collect robotic task demonstrations. arXiv preprint arXiv:2112.05129, 2021.
- Motion planning transformers: One model to plan them all. arXiv preprint arXiv:2106.02791, 2021.
- Transformer-based meta-imitation learning for robotic manipulation. In Neural Information Processing Systems, Workshop on Robot Learning, 2020.
- Transformer-based deep imitation learning for dual-arm robot manipulation. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8965–8972. IEEE, 2021.
- S. Dasari and A. Gupta. Transformers for one-shot visual imitation. In Conference on Robot Learning, pages 2071–2084. PMLR, 2021.
- Look closer: Bridging egocentric and third-person views with transformers for robotic manipulation. IEEE Robotics and Automation Letters, 7(2):3046–3053, 2022.
- StructFormer: Learning spatial structure for language-guided semantic rearrangement of novel objects. In 2022 International Conference on Robotics and Automation (ICRA), pages 6322–6329. IEEE, 2022.
- Revisiting point cloud shape classification with a simple and effective baseline. In International Conference on Machine Learning, pages 3809–3820. PMLR, 2021.
- MVTN: Multi-view transformation network for 3D shape recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2021a.
- Voint cloud: Multi-view point cloud representation for 3d understanding. arXiv preprint arXiv:2111.15363, 2021b.
- Multi-view transformer for 3d visual grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15524–15533, 2022.
- NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- E. Johns. Coarse-to-fine imitation learning: Robot manipulation from a single demonstration. In 2021 IEEE international conference on robotics and automation (ICRA), pages 4613–4619. IEEE, 2021.
- Accelerating 3D deep learning with PyTorch3D. arXiv preprint arXiv:2007.08501, 2020.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- V-REP: A versatile and scalable robot simulation framework. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1321–1326. IEEE, 2013.
- G. Sánchez and J.-C. Latombe. A single-query bi-directional probabilistic roadmap planner with lazy collision checking. In Int. Symp. Robotics Research, pages 403–417, 2001.
- The open motion planning library. IEEE Robotics & Automation Magazine, 19(4):72–82, 2012.
- Large batch optimization for deep learning: Training bert in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
- A modular robotic arm control stack for research: Franka-Interface and FrankaPy. arXiv preprint arXiv:2011.02398, 2020.