ZeroPS: High-quality Cross-modal Knowledge Transfer for Zero-Shot 3D Part Segmentation (2311.14262v3)

Published 24 Nov 2023 in cs.CV

Abstract: Zero-shot 3D part segmentation is a challenging and fundamental task. In this work, we propose a novel pipeline, ZeroPS, which achieves high-quality knowledge transfer from 2D pretrained foundation models (FMs), SAM and GLIP, to 3D object point clouds. We aim to explore the natural relationship between multi-view correspondence and the FMs' prompt mechanism and build bridges on it. In ZeroPS, the relationship manifests as follows: 1) lifting 2D to 3D by leveraging co-viewed regions and SAM's prompt mechanism, 2) relating 1D classes to 3D parts by leveraging 2D-3D view projection and GLIP's prompt mechanism, and 3) enhancing prediction performance by leveraging multi-view observations. Extensive evaluations on the PartNetE and AKBSeg benchmarks demonstrate that ZeroPS significantly outperforms the SOTA method across zero-shot unlabeled and instance segmentation tasks. ZeroPS does not require additional training or fine-tuning for the FMs. ZeroPS applies to both simulated and real-world data. It is hardly affected by domain shift. The project page is available at https://luis2088.github.io/ZeroPS_page.

"ZeroPS: High-quality Cross-modal Knowledge Transfer for Zero-Shot 3D Part Segmentation" presents a novel approach to utilize 2D pretrained foundational models for the advancement of zero-shot 3D part segmentation. The introduction of ZeroPS is predicated on leveraging the inherent relationship between multi-view correspondences of 2D images and the prompting mechanisms within foundational models to enable effective knowledge transfer to 3D point clouds.

Methodology Breakdown:

  1. Self-Extension Component: This component extends 2D groups into 3D space. Starting from a single viewpoint, it grows spatially coherent global-level 3D groups by prompting SAM on co-viewed regions across neighboring views, so that the resulting 3D groups stay consistent with the 2D observations (a minimal lifting sketch follows this list).
  2. Multi-Modal Labeling Component:
    • Two-Dimensional Checking Mechanism: For each view, the 2D bounding boxes predicted by GLIP from the text prompt are matched against the projections of the part-level 3D groups, and each box votes for the group it best corresponds to; the votes are accumulated into a Vote Matrix (see the sketch after this list).
    • Class Non-Highest Vote Penalty Function: This function refines the Vote Matrix by penalizing votes that do not correspond to the highest-voted matches, suppressing spurious box-to-group assignments and improving the accuracy of part labeling.
    • A final merging algorithm is used to consolidate part-level 3D groups, creating a seamless integration of the 2D and 3D information.
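
To make the 2D-to-3D lifting concrete, the following minimal sketch (not the authors' implementation) assigns each point of the cloud the 2D group it is observed in most often across views. It assumes the 2D group ids have already been made consistent across views, which is what the Self-Extension component achieves by prompting SAM on co-viewed regions; `lift_masks_to_3d_groups`, `view_masks`, and `view_projections` are hypothetical names introduced here for illustration.

```python
import numpy as np

def lift_masks_to_3d_groups(points, view_masks, view_projections, min_views=2):
    """Minimal multi-view lifting sketch (not ZeroPS' exact algorithm).

    points:           (N, 3) point cloud.
    view_masks:       per-view (H, W) integer images whose pixels store a
                      2D group id (0 = background), assumed consistent
                      across views.
    view_projections: per-view callables mapping (N, 3) points to (N, 2)
                      integer pixel coordinates (x, y) in that view.
    """
    n = len(points)
    votes = [dict() for _ in range(n)]  # point index -> {group id: #views}

    for mask, project in zip(view_masks, view_projections):
        px = project(points)
        h, w = mask.shape
        inside = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
        for i in np.flatnonzero(inside):
            gid = int(mask[px[i, 1], px[i, 0]])  # row = y, column = x
            if gid != 0:
                votes[i][gid] = votes[i].get(gid, 0) + 1

    # Keep the most frequently observed group per point; points seen in
    # fewer than `min_views` views stay unassigned (-1).
    labels = np.full(n, -1, dtype=np.int64)
    for i, v in enumerate(votes):
        if v:
            gid, count = max(v.items(), key=lambda kv: kv[1])
            if count >= min_views:
                labels[i] = gid
    return labels
```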
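
Similarly, the two-dimensional checking and penalty steps can be illustrated with a small Vote Matrix sketch. The exact form of ZeroPS' Class Non-Highest Vote Penalty is not reproduced here; the version below simply down-weights, for every class, the votes it received in groups other than its top-voted group, which is one plausible reading of the mechanism. `label_groups`, `vote_matrix`, and `penalty` are illustrative names.

```python
import numpy as np

def label_groups(vote_matrix, penalty=0.5):
    """Assign a text-prompt class to each part-level 3D group (sketch only).

    vote_matrix: (G, C) array; entry (g, c) counts how often, across views,
                 a GLIP bounding box for class c was matched to group g.
    penalty:     factor in (0, 1] applied to a class's votes in every group
                 where that class is not its top-voted group.
    """
    vm = vote_matrix.astype(float).copy()

    # For each class (column), keep its votes intact only in the group where
    # it scored highest; down-weight its votes everywhere else.
    top_group_per_class = vm.argmax(axis=0)
    for c, g_top in enumerate(top_group_per_class):
        other = np.ones(vm.shape[0], dtype=bool)
        other[g_top] = False
        vm[other, c] *= penalty

    # Each group takes the class with the highest penalized vote count;
    # groups that received no votes stay unlabeled (-1).
    labels = vm.argmax(axis=1)
    labels[vm.max(axis=1) == 0] = -1
    return labels

# Example: three groups, three prompted classes.
vm = np.array([[5, 1, 0],
               [0, 4, 2],
               [1, 0, 3]])
print(label_groups(vm))  # -> [0 1 2]
```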

Evaluation and Results:

Extensive experiments were conducted on three zero-shot segmentation tasks using the PartNetE benchmark, with additional evaluation on AKBSeg. The results show substantial gains, with improvements of 19.6%, 5.2%, and 4.9% over the existing state-of-the-art method across the three tasks.

Highlights:

  • Zero Training/Fine-tuning Required: A significant advantage of ZeroPS is that it operates entirely without the need for training, fine-tuning, or any learnable parameters, making it exceptionally efficient and easy to deploy.
  • Robustness to Domain Shift: ZeroPS demonstrates high resilience to domain shifts, maintaining its performance across different data variations and scenarios.
  • Code Availability: With a commitment to open science, the authors have indicated that the code for ZeroPS will be released, facilitating further research and replication efforts.

In summary, ZeroPS stands out as a strong method for zero-shot 3D part segmentation, leveraging pretrained 2D foundation models in a cross-modal fashion to achieve high accuracy and robustness on 3D point clouds without any training.

Authors (4)
  1. Yuheng Xue
  2. Nenglun Chen
  3. Jun Liu
  4. Wenyun Sun