PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models (2212.01558v2)
Abstract: Generalizable 3D part segmentation is important but challenging in vision and robotics. Training deep models via conventional supervised methods requires large-scale 3D datasets with fine-grained part annotations, which are costly to collect. This paper explores an alternative way for low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model, GLIP, which achieves superior performance on open-vocabulary 2D detection. We transfer the rich knowledge from 2D to 3D through GLIP-based part detection on point cloud rendering and a novel 2D-to-3D label lifting algorithm. We also utilize multi-view 3D priors and few-shot prompt tuning to boost performance significantly. Extensive evaluation on PartNet and PartNet-Mobility datasets shows that our method enables excellent zero-shot 3D part segmentation. Our few-shot version not only outperforms existing few-shot approaches by a large margin but also achieves highly competitive results compared to the fully supervised counterpart. Furthermore, we demonstrate that our method can be directly applied to iPhone-scanned point clouds without significant domain gaps.
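The abstract names a 2D-to-3D label lifting step without detailing it. As a rough illustration only, and not the paper's algorithm, the sketch below shows one simple way part labels from 2D detection boxes could be voted onto 3D points across several rendered views. The function name `lift_box_labels`, the 3x4 projection matrices, and the `(x0, y0, x1, y1, label)` box format are all assumptions made for this example; it ignores occlusion and detector confidence, which a real pipeline would need to handle.

```python
import numpy as np


def lift_box_labels(points, views, num_labels):
    """Vote 2D box labels onto 3D points (hypothetical helper, not from the paper).

    points: (N, 3) array of 3D coordinates.
    views:  list of (P, boxes); P is an assumed 3x4 projection matrix and
            boxes is a list of (x0, y0, x1, y1, label) in pixel coordinates.
    Returns an (N,) int array; -1 marks points that received no votes.
    """
    n = points.shape[0]
    votes = np.zeros((n, num_labels), dtype=np.int64)
    homo = np.hstack([points, np.ones((n, 1))])  # homogeneous coords, (N, 4)

    for P, boxes in views:
        proj = homo @ P.T                         # (N, 3): [u*w, v*w, w]
        depth = proj[:, 2]
        in_front = depth > 1e-8                   # skip points behind the camera
        safe_depth = np.where(in_front, depth, 1.0)
        u = proj[:, 0] / safe_depth
        v = proj[:, 1] / safe_depth
        for x0, y0, x1, y1, label in boxes:
            inside = in_front & (u >= x0) & (u <= x1) & (v >= y0) & (v <= y1)
            votes[inside, label] += 1             # one vote per view and box

    labels = votes.argmax(axis=1)
    labels[votes.sum(axis=1) == 0] = -1           # no box ever covered the point
    return labels


# Toy usage: an orthographic-style projection (u = x, v = y, depth = 1).
P = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [10.0, 10.0, 0.0]])
views = [
    (P, [(-1, -1, 1, 1, 0), (4, 4, 6, 6, 1)]),
    (P, [(4, 4, 6, 6, 1)]),
]
labels = lift_box_labels(points, views, num_labels=2)
print(labels.tolist())  # [0, 1, -1]
```

Plain majority voting is the simplest possible aggregation; it is used here only to make the idea of transferring 2D detections to 3D points concrete.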
- Minghua Liu
- Yinhao Zhu
- Hong Cai
- Shizhong Han
- Zhan Ling
- Fatih Porikli
- Hao Su