Segment3D: Learning Fine-Grained Class-Agnostic 3D Segmentation without Manual Labels (2312.17232v1)
Abstract: Current 3D scene segmentation methods are heavily dependent on manually annotated 3D training datasets. Such manual annotations are labor-intensive, and often lack fine-grained details. Importantly, models trained on this data typically struggle to recognize object classes beyond the annotated classes, i.e., they do not generalize well to unseen domains and require additional domain-specific annotations. In contrast, 2D foundation models demonstrate strong generalization and impressive zero-shot abilities, inspiring us to incorporate these characteristics from 2D models into 3D models. Therefore, we explore the use of image segmentation foundation models to automatically generate training labels for 3D segmentation. We propose Segment3D, a method for class-agnostic 3D scene segmentation that produces high-quality 3D segmentation masks. It improves over existing 3D segmentation models (especially on fine-grained masks), and enables easily adding new training data to further boost the segmentation performance -- all without the need for manual training labels.
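The abstract does not say how 2D masks become 3D training labels. One common way to lift per-pixel mask IDs into 3D is to back-project them through a depth map with a pinhole camera model. The sketch below assumes RGB-D frames with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a mask-ID image already produced by some 2D segmentation model. The function name and the data layout are hypothetical, chosen for illustration, and are not taken from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def backproject_mask_labels(
    depth: List[List[float]],
    mask_ids: List[List[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Tuple[Point, int]]:
    """Lift per-pixel 2D mask IDs to labeled 3D points in the camera frame.

    Assumptions (illustrative, not from the paper):
    - depth[v][u] is metric depth along the optical axis; 0 means invalid.
    - mask_ids[v][u] is a class-agnostic mask index; -1 means unlabeled.
    """
    labeled_points = []
    for v, (depth_row, id_row) in enumerate(zip(depth, mask_ids)):
        for u, (z, mask_id) in enumerate(zip(depth_row, id_row)):
            if z <= 0 or mask_id < 0:
                continue  # skip pixels without depth or without a mask
            # Standard pinhole back-projection of pixel (u, v) at depth z.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            labeled_points.append(((x, y, z), mask_id))
    return labeled_points


# Tiny 2x2 example: one invalid-depth pixel and one unlabeled pixel are dropped.
depth = [[1.0, 2.0], [0.0, 2.0]]
mask_ids = [[0, 1], [0, -1]]
points = backproject_mask_labels(depth, mask_ids, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Each returned point carries only a mask index, never a class name, which is what "class-agnostic" means for labels of this kind.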
Authors:
- Rui Huang
- Songyou Peng
- Ayca Takmaz
- Federico Tombari
- Marc Pollefeys
- Shiji Song
- Gao Huang
- Francis Engelmann