Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds (2404.12440v1)
Abstract: In recent years, advances in deep learning and large-scale datasets have led to impressive progress in 3D instance segmentation, grasp pose estimation, and robotics. Together, these enable accurate detection directly in 3D scenes, object- and environment-aware grasp prediction, and robust, repeatable robotic manipulation. This work integrates these recent methods into a comprehensive framework for robotic interaction and manipulation in human-centric environments. Specifically, we leverage 3D reconstructions from a commodity 3D scanner for open-vocabulary instance segmentation, combined with grasp pose estimation, to demonstrate dynamic object picking and drawer opening. We evaluate the performance and robustness of our framework in two sets of real-world experiments, dynamic object retrieval and drawer opening, reporting success rates of 51% and 82%, respectively. Code and videos are available at: https://spot-compose.github.io/.
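To make the open-vocabulary retrieval step concrete, below is a minimal sketch of how a free-form text query can be matched against segmented 3D instances. It assumes per-instance CLIP image features have already been extracted by an open-vocabulary 3D instance segmentation pipeline (e.g., in the spirit of OpenMask3D); the feature file name, array layout, and helper function are illustrative assumptions, not the released framework's actual interface.

```python
# Sketch: rank segmented 3D instances against a language query via CLIP.
# Assumes `instance_clip_features.npy` holds one L2-normalized CLIP image
# feature per instance (illustrative file name and layout, not the
# framework's actual format).
import numpy as np
import torch
import clip


def select_instance(query: str, instance_features: np.ndarray) -> int:
    """Return the index of the instance whose CLIP feature best matches `query`.

    instance_features: (num_instances, d) array of L2-normalized CLIP
    image features, one per segmented 3D instance.
    """
    model, _ = clip.load("ViT-B/32", device="cpu")
    with torch.no_grad():
        tokens = clip.tokenize([query])
        text_feat = model.encode_text(tokens).float()
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
    # With normalized features, cosine similarity is a plain dot product.
    scores = instance_features @ text_feat.squeeze(0).numpy()
    return int(scores.argmax())


# Hypothetical usage: pick the instance best matching a language query,
# which downstream grasp pose estimation would then target.
features = np.load("instance_clip_features.npy")
target_idx = select_instance("a coffee mug", features)
```

The grasp pose estimator then only needs the points belonging to the selected instance, which keeps the language grounding and the geometric grasping stages cleanly decoupled.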