Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds (2404.12440v1)

Published 18 Apr 2024 in cs.RO and cs.CV

Abstract: In recent years, modern techniques in deep learning and large-scale datasets have led to impressive progress in 3D instance segmentation, grasp pose estimation, and robotics. This allows for accurate detection directly in 3D scenes, object- and environment-aware grasp prediction, as well as robust and repeatable robotic manipulation. This work aims to integrate these recent methods into a comprehensive framework for robotic interaction and manipulation in human-centric environments. Specifically, we leverage 3D reconstructions from a commodity 3D scanner for open-vocabulary instance segmentation, alongside grasp pose estimation, to demonstrate dynamic picking of objects, and opening of drawers. We show the performance and robustness of our model in two sets of real-world experiments including dynamic object retrieval and drawer opening, reporting a 51% and 82% success rate respectively. Code of our framework as well as videos are available on: https://spot-compose.github.io/.

References (46)
  1. K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in International Conference on Computer Vision (ICCV), 2017.
  2. K. Genova, X. Yin, A. Kundu, C. Pantofaru, F. Cole, A. Sud, B. Brewington, B. Shucker, and T. Funkhouser, “Learning 3D Semantic Segmentation with only 2D Image Supervision,” in Conference on 3D Vision (3DV), 2021.
  3. J. Schult, F. Engelmann, A. Hermans, O. Litany, S. Tang, and B. Leibe, “Mask3D: Mask Transformer for 3D Semantic Instance Segmentation,” in International Conference on Robotics and Automation (ICRA), 2023.
  4. J. Sun, C. Qing, J. Tan, and X. Xu, “Superpoint Transformer for 3D Scene Instance Segmentation,” in Association for the Advancement of Artificial Intelligence (AAAI), 2023.
  5. R. Huang, S. Peng, A. Takmaz, F. Tombari, M. Pollefeys, S. Song, G. Huang, and F. Engelmann, “Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels,” arXiv, 2023.
  6. L. Kreuzberg, I. Zulfikar, S. Mahadevan, F. Engelmann, and B. Leibe, “4D-StOP: Panoptic Segmentation of 4D LiDAR using Spatio-temporal Object Proposal Generation and Aggregation,” in European Conference on Computer Vision (ECCV) Workshops, 2022.
  7. H. Lei, N. Akhtar, and A. Mian, “Spherical Kernel for Efficient Graph Convolution on 3D Point Clouds,” Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2020.
  8. Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2016.
  9. Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “PointCNN: Convolution on X-Transformed Points,” International Conference on Neural Information Processing Systems (NeurIPS), 2018.
  10. X. Wang, S. Liu, X. Shen, C. Shen, and J. Jia, “Associatively Segmenting Instances and Semantics in Point Clouds,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  11. A. Takmaz, E. Fedele, R. W. Sumner, M. Pollefeys, F. Tombari, and F. Engelmann, “OpenMask3D: Open-Vocabulary 3D Instance Segmentation,” in International Conference on Neural Information Processing Systems (NeurIPS), 2023.
  12. J. Xu, S. Liu, A. Vahdat, W. Byeon, X. Wang, and S. D. Mello, “Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  13. F. Engelmann, F. Manhardt, M. Niemeyer, K. Tateno, M. Pollefeys, and F. Tombari, “OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views,” in International Conference on Learning Representations (ICLR), 2024.
  14. X. Chen, S. Li, S. N. Lim, A. Torralba, and H. Zhao, “Open-vocabulary Panoptic Segmentation with Embedding Modulation,” in International Conference on Computer Vision (ICCV), 2023.
  15. S. Peng, K. Genova, C. M. Jiang, A. Tagliasacchi, M. Pollefeys, and T. Funkhouser, “OpenScene: 3D Scene Understanding with Open Vocabularies,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  16. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning Transferable Visual Models from Natural Language Supervision,” in International Conference on Machine Learning (ICML), 2021.
  17. X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid Loss for Language Image Pre-training,” in International Conference on Computer Vision (ICCV), 2023.
  18. M. F. Naeem, Y. Xian, X. Zhai, L. Hoyer, L. Van Gool, and F. Tombari, “SILC: Improving Vision Language Pretraining With Self-Distillation,” arXiv preprint arXiv:2310.13355, 2023.
  19. S. Kumra, S. Joshi, and F. Sahin, “Antipodal Robotic Grasping Using Generative Residual Convolutional Neural Network,” in Conference on Intelligent Robots and Systems (IROS), 2020.
  20. A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, and E. Romo, “Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching,” International Journal of Robotics Research (IJRR), 2022.
  21. A. Hundt, V. Jain, C.-H. Lin, C. Paxton, and G. D. Hager, “The Costar Block Stacking Dataset: Learning with Workspace Constraints,” in Conference on Intelligent Robots and Systems (IROS), 2019.
  22. K. Karunratanakul, J. Yang, Y. Zhang, M. J. Black, K. Muandet, and S. Tang, “Grasping Field: Learning Implicit Representations for Human Grasps,” in Conference on 3D Vision (3DV), 2020.
  23. S. Ainetter and F. Fraundorfer, “End-to-end Trainable Deep Neural Network for Robotic Grasp Detection and Semantic Segmentation from RGB,” in International Conference on Robotics and Automation (ICRA), 2021.
  24. F.-J. Chu, R. Xu, and P. A. Vela, “Real-World Multiobject, Multigrasp Detection,” Robotics and Automation Letters (RAL), 2018.
  25. R. Zurbrügg, Y. Liu, F. Engelmann, S. Kumar, M. Hutter, V. Patil, and F. Yu, “ICGNet: A Unified Approach for Instance-Centric Grasping,” in International Conference on Robotics and Automation (ICRA), 2024.
  26. A. Ten Pas, M. Gualtieri, K. Saenko, and R. Platt, “Grasp Pose Detection in Point Clouds,” International Journal of Robotics Research (IJRR), 2017.
  27. A. Mousavian, C. Eppner, and D. Fox, “6-DoF GraspNet: Variational Grasp Generation for Object Manipulation,” in International Conference on Computer Vision (ICCV), 2019.
  28. H. Duan, P. Wang, Y. Huang, G. Xu, W. Wei, and X. Shen, “Robotics Dexterous Grasping: The Methods Based on Point Cloud and Deep Learning,” Frontiers in Neurorobotics, 2021.
  29. H.-S. Fang, C. Wang, M. Gou, and C. Lu, “GraspNet-1Billion: A Large-Scale Benchmark for General Object Grasping,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  30. D.-C. Hoang, J. A. Stork, and T. Stoyanov, “Voting and Attention-Based Pose Relation Learning for Object Pose Estimation From 3D Point Clouds,” Robotics and Automation Letters (RAL), 2022.
  31. H.-S. Fang, C. Wang, H. Fang, M. Gou, J. Liu, H. Yan, W. Liu, Y. Xie, and C. Lu, “AnyGrasp: Robust and Efficient Grasp Perception in Spatial and Temporal Domains,” IEEE Transactions on Robotics, 2023.
  32. A. Rosinol, A. Gupta, M. Abate, J. Shi, and L. Carlone, “3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans,” Robotics: Science and Systems (RSS), 2020.
  33. M. Chang, T. Gervet, M. Khanna, S. Yenamandra, D. Shah, S. Y. Min, K. Shah, C. Paxton, S. Gupta, D. Batra et al., “Goat: Go to Any Thing,” arXiv preprint arXiv:2311.06430, 2023.
  34. M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman et al., “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances,” arXiv preprint arXiv:2204.01691, 2022.
  35. N. Yokoyama, A. W. Clegg, E. Undersander, S. Ha, D. Batra, and A. Rai, “Adaptive Skill Coordination for Robotic Mobile Manipulation,” arXiv preprint arXiv:2304.00410, 2023.
  36. P. Liu, Y. Orru, C. Paxton, N. M. M. Shafiullah, and L. Pinto, “Ok-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics,” arXiv preprint arXiv:2401.12202, 2024.
  37. Boston Dynamics, “Spot: The Agile Mobile Robot,” https://bostondynamics.com/products/spot/, 2023, accessed: 2024-03-10.
  38. ——, “Spot SDK Documentation,” https://dev.bostondynamics.com/, 2024, accessed: 2024-03-10.
  39. Laan Labs, “3D Scanner App,” https://apps.apple.com/us/app/3d-scanner-app/id1419913995, accessed: 2023-12-26.
  40. A. Delitzas, A. Takmaz, F. Tombari, R. Sumner, M. Pollefeys, and F. Engelmann, “SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  41. G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLO,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
  42. M. Arduengo, C. Torras, and L. Sentis, “Robust and Adaptive Door Operation with a Mobile Robot,” Intelligent Service Robotics, 2021.
  43. M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography,” Communications of the ACM, 1981.
  44. A. Takmaz, J. Schult, I. Kaftan, M. Akcay, B. Leibe, R. Sumner, F. Engelmann, and S. Tang, “3D Segmentation of Humans in Point Clouds with Synthetic Data,” in International Conference on Computer Vision (ICCV), 2023.
  45. Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation,” in International Conference on Neural Information Processing Systems (NeurIPS), 2022.
  46. M. Minderer, A. Gritsenko, and N. Houlsby, “Scaling Open-Vocabulary Object Detection,” in International Conference on Neural Information Processing Systems (NeurIPS), 2023.

Summary

  • The paper introduces a novel framework that leverages open-vocabulary 3D segmentation and adaptive grasping to achieve robust object retrieval and drawer manipulation.
  • It integrates methods like OpenMask3D and AnyGrasp with joint pose optimization, enabling dynamic navigation and precise interaction in point clouds.
  • Experimental results show a 51% success rate for object retrieval and 82% for drawer manipulation, indicating both advancements and challenges in robotic perception.

Advanced Robotic Manipulation in Human-Centric Environments: An Analysis of Spot-Compose

The paper "Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds" presents a sophisticated framework for robotic interaction within human-centric environments using modern techniques in deep learning and robotics manipulation. This framework integrates open-vocabulary instance segmentation and grasp pose estimation from point clouds to enhance robotic capabilities in dynamic object retrieval and drawer manipulation. The research utilizes the Boston Dynamics Spot robot, demonstrating an iterative advancement in robotic perception and manipulation technologies.

Framework and Methodological Components

The framework's primary technical components are 3D instance segmentation, grasp pose estimation, adaptive navigation, and drawer detection. Together, these components allow the robot to interact with diverse objects and with concealed spaces such as drawers.

  1. 3D Instance Segmentation and Object Localization: The framework uses OpenMask3D for open-vocabulary 3D instance segmentation, so the robot can resolve natural-language queries against the reconstructed 3D scene. This yields a semantic map of the environment and precise localization of objects of interest (a minimal localization sketch follows this list).
  2. Adaptive Grasping: Using AnyGrasp, the framework estimates grasp poses directly on point clouds. Grasp selection additionally accounts for the object's center of mass, which improves grasp stability, and multiple detection iterations broaden the set of candidate grasps (see the grasp-ranking sketch below).
  3. Adaptive Navigation and Joint Optimization: Navigation determines where the robot should stand for retrieval, jointly optimizing for a collision-free approach and a body placement from which the chosen grasp is reachable (a toy version of this trade-off is sketched below).
  4. Dynamic Drawer Detection and Axis Motion Estimation: The system combines pre-scanned 3D data with live RGB-D input to detect drawers and estimate their opening axis, which is essential for accessing concealed storage in human environments (see the axis-estimation sketch below).
  5. Potential for Capability Expansion: The paper also outlines possible extensions, such as mobile-search tasks and natural-language interfaces for more intuitive human-robot interaction.
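
To make the open-vocabulary localization step concrete, the following is a minimal sketch (not the paper's implementation) of how a text query can be matched against per-instance features. It assumes an OpenMask3D-style pipeline has already produced one CLIP feature vector and one centroid per 3D instance mask, and that the query has been embedded with the matching text encoder; the arrays below are random placeholders standing in for those embeddings.

```python
import numpy as np

def locate_object(query_embedding, instance_features, instance_centroids):
    """Rank per-instance features against a text query embedding and return
    the best match together with its 3D centroid."""
    # Cosine similarity reduces to a dot product for unit-normalized vectors.
    sims = instance_features @ query_embedding
    best = int(np.argmax(sims))
    return best, float(sims[best]), instance_centroids[best]

# Random placeholders standing in for real CLIP mask features and centroids.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 512))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
query = feats[2] + 0.1 * rng.normal(size=512)
query /= np.linalg.norm(query)
centroids = rng.uniform(-2.0, 2.0, size=(5, 3))

idx, score, target = locate_object(query, feats, centroids)
print(f"best instance: {idx}, similarity: {score:.2f}, centroid: {target}")
```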
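
For the center-of-mass-aware grasp selection in item 2, a hedged sketch of one way to re-rank detector output is shown below. It assumes the grasp detector (AnyGrasp in the paper) returns candidate grasp positions with confidence scores; the centroid-based stability penalty and the weighting are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def rank_grasps(grasp_positions, grasp_scores, object_points, com_weight=0.5):
    """Re-rank grasp candidates so grasps close to the object's (approximate)
    center of mass are preferred, as a proxy for stability.

    grasp_positions: (N, 3) candidate grasp points from a grasp detector.
    grasp_scores:    (N,)   detector confidences in [0, 1].
    object_points:   (M, 3) the segmented object's point cloud.
    Returns candidate indices sorted from best to worst.
    """
    com = object_points.mean(axis=0)                 # centroid as CoM proxy
    dists = np.linalg.norm(grasp_positions - com, axis=1)
    penalty = dists / (dists.max() + 1e-9)           # normalize to [0, 1]
    combined = (1.0 - com_weight) * grasp_scores - com_weight * penalty
    return np.argsort(-combined)

# Toy example: the second candidate is closest to the object centroid.
pts = np.random.default_rng(1).uniform(-0.05, 0.05, size=(200, 3))
grasps = np.array([[0.10, 0.0, 0.0], [0.01, 0.0, 0.0], [0.12, 0.0, 0.0]])
scores = np.array([0.9, 0.8, 0.95])
print(rank_grasps(grasps, scores, pts))
```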
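
The joint optimization in item 3 trades off obstacle clearance against reachability when choosing where the robot should stand. The grid search below is a toy illustration of that trade-off under assumed values (a nominal reach of 0.9 m and a clearance of 0.35 m); it does not reproduce the paper's actual optimization.

```python
import numpy as np

def choose_base_pose(target, obstacle_points,
                     radii=np.linspace(0.6, 1.2, 7),
                     n_angles=72, clearance=0.35, nominal_reach=0.9):
    """Grid-search a standing position on rings around the target that keeps
    an assumed clearance from obstacles while staying near a nominal reach
    distance. Returns ((x, y), yaw_toward_target) and its cost, or (None, inf)
    if every candidate is in collision."""
    best_pose, best_cost = None, np.inf
    for r in radii:
        for theta in np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False):
            pos = target[:2] + r * np.array([np.cos(theta), np.sin(theta)])
            d_obs = np.min(np.linalg.norm(obstacle_points[:, :2] - pos, axis=1))
            if d_obs < clearance:                    # footprint too close
                continue
            # Penalize deviation from the nominal reach and reward clearance.
            cost = abs(r - nominal_reach) + 0.5 / d_obs
            if cost < best_cost:
                yaw = np.arctan2(target[1] - pos[1], target[0] - pos[0])
                best_cost, best_pose = cost, (pos, yaw)
    return best_pose, best_cost

# Toy scene: a target object with a wall of obstacle points behind it.
target = np.array([2.0, 0.0, 0.8])
wall = np.column_stack([np.full(50, 2.4), np.linspace(-1, 1, 50), np.zeros(50)])
pose, cost = choose_base_pose(target, wall)
print(pose, cost)
```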
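
For item 4, once points on a drawer front have been segmented, the opening direction of a prismatic drawer can be approximated by the front panel's plane normal. The sketch below fits that plane with an SVD; orienting the axis toward the robot is a placeholder convention, and the paper may instead rely on RANSAC plane fitting or detected handle geometry.

```python
import numpy as np

def drawer_pull_axis(front_points, robot_position):
    """Estimate a drawer's opening direction from points on its front panel:
    fit the panel plane via SVD and take the normal, oriented toward the
    robot, as the prismatic pull axis."""
    centroid = front_points.mean(axis=0)
    # The smallest right singular vector of the centered points is the normal.
    _, _, vt = np.linalg.svd(front_points - centroid)
    normal = vt[-1]
    # Flip the normal so it points out of the cabinet, toward the robot.
    if np.dot(normal, robot_position - centroid) < 0:
        normal = -normal
    return centroid, normal / np.linalg.norm(normal)

# Toy front panel: a roughly planar patch whose normal is the +x direction.
rng = np.random.default_rng(2)
panel = np.column_stack([
    rng.normal(0.0, 0.002, 300),        # small thickness noise along x
    rng.uniform(-0.2, 0.2, 300),        # panel width  (y)
    rng.uniform(-0.1, 0.1, 300),        # panel height (z)
])
handle, axis = drawer_pull_axis(panel, robot_position=np.array([1.0, 0.0, 0.0]))
print(handle, axis)   # axis should be close to [1, 0, 0]
```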

Experimental Evaluation and Results

The framework is evaluated in real-world experiments on dynamic object retrieval and drawer manipulation, with reported success rates of 51% and 82%, respectively. Challenges remain in detection accuracy and physical manipulation, largely due to perception errors and the complexity of dynamic, human-centric environments; these findings motivate further work on robust 3D perception and tactile manipulation.

Conclusion and Future Perspectives

Spot-Compose is presented as an accessible framework that brings together state-of-the-art machine perception and robotic manipulation methods. By composing these components into a working system and leaving room for emerging technologies to be integrated, it marks a meaningful step toward more capable robots in spaces designed for humans. The authors identify refining grasp trajectory planning and improving navigation toward target objects as directions for addressing the current limitations.

Future developments are likely to focus on stronger perception and on integrating higher-level, AI-based decision-making, further narrowing the gap between humans and robots collaborating in shared environments. As AI and robotics research matures, systems like Spot-Compose can serve as templates for human-centric robotic applications.
