Act3D: 3D Feature Field Transformers for Multi-Task Robotic Manipulation (2306.17817v2)
Abstract: 3D perceptual representations are well suited for robot manipulation as they easily encode occlusions and simplify spatial reasoning. Many manipulation tasks require high spatial precision in end-effector pose prediction, which typically demands high-resolution 3D feature grids that are computationally expensive to process. As a result, most manipulation policies operate directly in 2D, forgoing 3D inductive biases. In this paper, we introduce Act3D, a manipulation policy transformer that represents the robot's workspace using a 3D feature field whose resolution adapts to the task at hand. The model lifts 2D pre-trained features to 3D using sensed depth and attends to them to compute features for sampled 3D points. It samples 3D point grids in a coarse-to-fine manner, featurizes them using relative-position attention, and selects where to focus the next round of point sampling. In this way, it efficiently computes 3D action maps at high spatial resolution. Act3D sets a new state of the art on RLBench, an established manipulation benchmark, where it achieves a 10% absolute improvement over the previous SOTA 2D multi-view policy on 74 RLBench tasks, and a 22% absolute improvement, with 3x less compute, over the previous SOTA 3D policy. We quantify the importance of relative spatial attention, large-scale vision-language pre-trained 2D backbones, and weight tying across coarse-to-fine attentions in ablation experiments. Code and videos are available on our project website: https://act3d.github.io/.
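To make the coarse-to-fine mechanism concrete, below is a minimal PyTorch sketch of the inference loop the abstract describes: sampled 3D query points cross-attend to a featurized point cloud (2D backbone features lifted to 3D via depth), the highest-scoring point becomes the center of the next, finer sampling round, and the sampling window shrinks each round. This is an illustrative sketch under simplifying assumptions, not the authors' implementation: `FieldAttention`, `coarse_to_fine`, and all shapes and hyperparameters are hypothetical, and plain multi-head attention with additive 3D position embeddings stands in for the paper's rotary relative-position attention.

```python
# Illustrative sketch only: hypothetical names and shapes, not the authors' code.
import torch
import torch.nn as nn


class FieldAttention(nn.Module):
    """Scores sampled 3D query points by cross-attending to a featurized
    scene point cloud (2D backbone features lifted to 3D via sensed depth).
    Additive xyz embeddings stand in for rotary relative-position attention."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.pos_embed = nn.Linear(3, dim)  # embed xyz coordinates
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)      # per-point action score

    def forward(self, query_xyz, scene_xyz, scene_feat):
        # query_xyz: (B, Q, 3); scene_xyz: (B, N, 3); scene_feat: (B, N, D)
        q = self.pos_embed(query_xyz)                # featurize sampled points
        kv = scene_feat + self.pos_embed(scene_xyz)  # geometry + lifted features
        out, _ = self.attn(q, kv, kv)
        return self.score(out).squeeze(-1)           # (B, Q) scores


def coarse_to_fine(model, scene_xyz, scene_feat, center, extent,
                   levels: int = 3, grid: int = 8):
    """Zooms the point sampling into the highest-scoring region each round."""
    for _ in range(levels):
        axes = [torch.linspace(-1.0, 1.0, grid) for _ in range(3)]
        offsets = torch.stack(
            torch.meshgrid(*axes, indexing="ij"), dim=-1).reshape(-1, 3)
        queries = center + extent * offsets.unsqueeze(0)  # (B, grid**3, 3)
        scores = model(queries, scene_xyz, scene_feat)
        best = scores.argmax(dim=1)                       # focus next round here
        center = queries[torch.arange(queries.size(0)), best].unsqueeze(1)
        extent = extent / grid                            # shrink sampling window
    return center.squeeze(1)                              # predicted 3D position


# Usage: a toy scene of 1024 points with random 128-d "lifted" features.
model = FieldAttention()
scene_xyz, scene_feat = torch.randn(1, 1024, 3), torch.randn(1, 1024, 128)
pos = coarse_to_fine(model, scene_xyz, scene_feat,
                     center=torch.zeros(1, 1, 3), extent=torch.tensor(1.0))
print(pos.shape)  # torch.Size([1, 3])
```

With `grid=8` and `levels=3`, the loop evaluates 3 × 512 queries yet localizes the action at an effective per-axis resolution of 8³ = 512 bins, versus 512³ ≈ 1.3 × 10⁸ points for a dense grid of the same resolution; this fixed attention budget per round is the efficiency argument the abstract makes.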
- Perceiver-Actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.
- Instruction-driven history-aware policies for robotic manipulations. In Conference on Robot Learning, pages 175–187. PMLR, 2023.
- Instruction-following agents with jointly pre-trained vision-language models. arXiv preprint arXiv:2210.13431, 2022.
- Coarse-to-fine Q-attention: Efficient learning for visual robotic manipulation via discretisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13739–13748, 2022.
- RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
- Graph-structured visual imitation. In Conference on Robot Learning, pages 979–989. PMLR, 2020.
- Sample efficient grasp learning using equivariant models, 2022.
- Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on Robot Learning, pages 726–747. PMLR, 2021.
- 3D-OES: Viewpoint-invariant object-factorized environment simulators. arXiv preprint arXiv:2011.06464, 2020.
- B. Graham. Sparse 3D convolutional neural networks, 2015.
- 4D spatio-temporal ConvNets: Minkowski convolutional neural networks, 2019.
- Perceiver: General perception with iterative attention, 2021.
- Auto-Lambda: Disentangling dynamic task relationships. arXiv preprint arXiv:2202.03091, 2022.
- Spatial-language attention policies for efficient robot learning. arXiv preprint arXiv:2304.11235, 2023.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
- RLBench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020.
- Behavior transformers: Cloning k modes with one stone. Advances in Neural Information Processing Systems, 35:22955–22968, 2022.
- A generalist agent. arXiv preprint arXiv:2205.06175, 2022.
- BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022.
- Learning to rearrange deformable cables, fabrics, and bags with goal-conditioned transporter networks, 2021.
- CLIPort: What and where pathways for robotic manipulation. In Conference on Robot Learning, pages 894–906. PMLR, 2022.
- Energy-based models as zero-shot planners for compositional scene rearrangement. arXiv preprint arXiv:2304.14391, 2023.
- R3M: A universal visual representation for robot manipulation, 2022.
- The unsurprising effectiveness of pre-trained vision models for control, 2022.
- Learning to see before learning to act: Visual pre-training for manipulation, 2021.
- Open-world object manipulation using pre-trained vision-language models, 2023.
- HyperDynamics: Meta-learning object and agent dynamics with hypernetworks. arXiv preprint arXiv:2103.09439, 2021.
- Learning transferable visual models from natural language supervision, 2021.
- Learning spatial common sense with geometry-aware recurrent networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2595–2603, 2019.
- Learning from unlabelled videos using contrastive predictive neural 3d mapping. arXiv preprint arXiv:1906.03764, 2019.
- Self-attention with relative position representations, 2018.
- Swin Transformer V2: Scaling up capacity and resolution, 2022.
- Point Transformer V2: Grouped vector attention and partition-based pooling, 2022.
- Swin3D: A pretrained transformer backbone for 3D indoor scene understanding, 2023.
- S. James and A. J. Davison. Q-attention: Enabling efficient learning for vision-based robotic manipulation. IEEE Robotics and Automation Letters, 7(2):1612–1619, 2022.
- Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
- RRT-Connect: An efficient approach to single-query path planning. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), volume 2, pages 995–1001. IEEE, 2000.
- The open motion planning library. IEEE Robotics & Automation Magazine, 19(4):72–82, 2012.
- Reducing the barrier to entry of complex robotic software: a MoveIt! case study. arXiv preprint arXiv:1404.3785, 2014.
- Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
- Grounded decoding: Guiding text generation with grounded models for robot control. arXiv preprint arXiv:2303.00855, 2023.
- Text2Motion: From natural language instructions to feasible plans. arXiv preprint arXiv:2303.12153, 2023.
- Theophile Gervet
- Zhou Xian
- Nikolaos Gkanatsios
- Katerina Fragkiadaki