RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective (2404.12281v3)
Abstract: Precise robot manipulation requires rich spatial information in imitation learning. Image-based policies model object positions from fixed cameras and are therefore sensitive to changes in camera view. Policies that use 3D point clouds usually predict keyframes rather than continuous actions, which makes them ill-suited to dynamic and contact-rich scenarios. To utilize 3D perception efficiently, we present RISE, an end-to-end baseline for real-world imitation learning that predicts continuous actions directly from single-view point clouds. RISE compresses the point cloud into tokens with a sparse 3D encoder, adds sparse positional encoding, and featurizes the tokens with a transformer; a diffusion head then decodes the features into robot actions. Trained with 50 demonstrations per real-world task, RISE surpasses representative 2D and 3D policies by a large margin, showing significant advantages in both accuracy and efficiency. Experiments also demonstrate that RISE is more general and more robust to environmental changes than previous baselines. Project website: rise-policy.github.io.
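To make the pipeline described in the abstract concrete (point cloud → sparse tokens → transformer features → diffusion-decoded actions), below is a minimal PyTorch sketch. It is an illustration under loose assumptions, not the authors' implementation: the sparse 3D encoder and the diffusion head are stubbed with plain MLPs (the paper uses a sparse convolutional backbone and a denoising diffusion decoder), and every name and dimension here (`RISESketch`, `noise_head`, `d_model`, etc.) is hypothetical.

```python
import torch
import torch.nn as nn

class RISESketch(nn.Module):
    """Illustrative stand-in for the RISE pipeline; not the authors' code."""

    def __init__(self, point_dim=6, d_model=256, action_dim=10, horizon=20):
        super().__init__()
        # Stand-in for the sparse 3D encoder that compresses the point cloud
        # into tokens (the paper uses sparse 3D convolutions, not a per-point MLP).
        self.encoder = nn.Sequential(
            nn.Linear(point_dim, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        # Stand-in for the sparse positional encoding over xyz coordinates.
        self.pos_enc = nn.Linear(3, d_model)
        # Transformer that featurizes the tokens.
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=4,
        )
        # Stand-in for the diffusion head: predicts the noise added to a chunk
        # of future actions, conditioned on the pooled scene feature and timestep.
        self.noise_head = nn.Sequential(
            nn.Linear(d_model + horizon * action_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )

    def forward(self, points, noisy_actions, t):
        # points: (B, N, 6) xyz+rgb; noisy_actions: (B, horizon*action_dim); t: (B, 1).
        tokens = self.encoder(points) + self.pos_enc(points[..., :3])
        feat = self.transformer(tokens).mean(dim=1)  # pooled scene feature
        return self.noise_head(torch.cat([feat, noisy_actions, t], dim=-1))
```

Action generation would then run the head inside an iterative denoising loop. A DDPM-style reverse process, again with illustrative hyperparameters and schedule, might look like:

```python
@torch.no_grad()
def sample_actions(model, points, steps=50, dim=200):
    # Start from Gaussian noise and iteratively denoise into an action chunk.
    acts = torch.randn(points.shape[0], dim)
    betas = torch.linspace(1e-4, 0.02, steps)  # illustrative noise schedule
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        tt = torch.full((points.shape[0], 1), t / steps)
        eps = model(points, acts, tt)
        # Posterior mean of the DDPM reverse step.
        acts = (acts - betas[t] / (1.0 - abar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            acts = acts + betas[t].sqrt() * torch.randn_like(acts)
    return acts

# Example: sample a 20-step, 10-DoF action chunk for a batch of 4 point clouds.
actions = sample_actions(RISESketch(), torch.randn(4, 1024, 6))
```

The chunk of continuous actions is what distinguishes this design from keyframe-based 3D policies: the diffusion head emits a short trajectory of poses rather than a single keypose, which is why the abstract emphasizes dynamic and contact-rich scenarios.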
Authors: Chenxi Wang, Hongjie Fang, Hao-Shu Fang, Cewu Lu