SAMPart3D: Zero-Shot 3D Part Segmentation
- SAMPart3D is a zero-shot 3D part segmentation framework that segments arbitrary objects into semantically meaningful parts at user-specified granularities.
- It leverages text-agnostic distillation from DINOv2 and a scale-conditioned grouping mechanism to overcome limitations of fixed, prompt-dependent segmentation.
- Semantic labels are applied post hoc via multimodal vision-language models, enabling practical applications in 3D content editing, robotics, and dataset generation.
SAMPart3D is a scalable, zero-shot 3D part segmentation framework that segments arbitrary 3D objects into semantically meaningful parts at user-specified granularities, without requiring predefined part categories or text prompts. The method relies on text-agnostic vision foundation model distillation to learn generic 3D priors from large unlabeled mesh datasets, and introduces a scale-conditioned grouping mechanism to enable flexible part decomposition. Semantic part labeling is decoupled from geometric segmentation and is provided post hoc by multimodal vision-language models (VLMs) based on multi-view renderings. SAMPart3D addresses the rigidity and scalability bottlenecks of prior 2D-to-3D prompt-driven methods and supports applications in robotics, 3D content creation, and editing (Yang et al., 2024).
1. Motivation and Contributions
SAMPart3D was developed to overcome two critical limitations of existing zero-shot 3D part segmentation pipelines that transfer knowledge from 2D vision-language models:
- Prompt Dependence and Scalability: Prior approaches, such as those using GLIP or CLIP, require the user to specify a discrete, fixed set of part labels (text prompts) at inference, limiting scalability to uncurated, large-scale object repositories and restricting the flexibility needed to handle ambiguous or unnamed parts.
- Granularity Rigidity: Text- or prompt-driven approaches typically force segmentation at a single, user-defined part vocabulary, impeding interactive or hierarchical part decomposition.
SAMPart3D introduces several advances:
- Text-agnostic distillation: It leverages DINOv2—an open, self-supervised 2D vision foundation model—to distill geometry-aware visual priors into a 3D backbone from large-scale, unannotated datasets (e.g., Objaverse).
- Scale-conditioned grouping: Instead of a fixed part vocabulary, it learns a scale-conditioned field supporting arbitrary levels of part granularity, producing both coarse and fine subdivisions for any object.
- Plug-in semantic labeling: Once parts are segmented geometrically, a VLM is queried with multi-view renderings to assign freeform semantic names, removing the need for text prompts during 3D grouping.
- Benchmark contribution: The framework introduces PartObjaverse-Tiny, a challenging benchmark of 200 diverse and complex objects with fine-grained part annotations, advancing standardization in the field.
These innovations position SAMPart3D as the first promptless, scale-flexible zero-shot 3D part segmentation framework that generalizes to open-world, heterogeneous 3D objects (Yang et al., 2024).
2. Architectural Design and Algorithmic Pipeline
The SAMPart3D architecture is composed of three sequential modules: (a) text-agnostic 2D-to-3D feature distillation, (b) scale-conditioned part-aware feature grouping, and (c) decoupled semantic part labeling.
2.1. Foundation Model Distillation
A 3D backbone $\mathcal{E}_{3D}$, based on a modified Point Transformer V3 (PTv3-object), is trained to align its features with those of a frozen DINOv2 model. The process is as follows:
- Surface sampling: $N$ points are sampled on the mesh surface; $\mathcal{E}_{3D}$ computes 3D features $f_{3D}$ for these points.
- Multi-view rendering and feature extraction: The mesh is rendered from $K$ views, and DINOv2 features are extracted and projected back to the 3D points using occlusion-aware depth mapping, yielding $f_{2D}^{k}$ per view and an average feature $\bar{f}_{2D}$.
- Feature regression: The backbone is supervised via an L2 loss,

$$\mathcal{L}_{\text{distill}} = \big\| f_{3D} - \bar{f}_{2D} \big\|_2^2,$$

imparting dense 2D visual semantics to the 3D features without ever using text labels.
After distillation on 200K objects from Objaverse, $\mathcal{E}_{3D}$ serves as a universal open-world shape encoder with geometry-aware features.
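The following is a minimal sketch of the distillation objective, assuming a generic point-cloud backbone and precomputed per-point DINOv2 targets; `sample_surface` and `project_dino_features` are hypothetical helpers standing in for the rendering-and-lifting pipeline:

```python
import torch
import torch.nn.functional as F

def distillation_loss(backbone, points, dino_targets):
    """L2 regression of 3D backbone features onto averaged,
    back-projected DINOv2 features (hypothetical signatures).

    points:       (N, 3) surface samples from the mesh
    dino_targets: (N, C) per-point average of multi-view DINOv2 features
    """
    f3d = backbone(points)                # (N, C) 3D features
    return F.mse_loss(f3d, dino_targets)  # text-free supervision

# Training-loop sketch over an unlabeled mesh corpus (e.g., Objaverse):
# for mesh in dataset:
#     points = sample_surface(mesh)                  # hypothetical helper
#     targets = project_dino_features(mesh, points)  # render K views, extract
#                                                    # DINOv2, lift to points
#     distillation_loss(backbone, points, targets).backward()
#     optimizer.step(); optimizer.zero_grad()
```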
2.2. Scale-conditioned Part Grouping
Following feature pretraining, the backbone is frozen. A lightweight "grouping field" $f_\sigma$ is learned, which, for a specified scale $\sigma$, maps each 3D point $x$ to a part-aware embedding $f_\sigma(x)$.
- Mask extraction: The object is rendered again; SAM provides multi-view 2D masks, with each mask back-projected onto the surface.
- Scale estimation: For each mask, compute

$$\sigma = \sqrt{\sigma_x^2 + \sigma_y^2 + \sigma_z^2},$$

where $\sigma_x, \sigma_y, \sigma_z$ are standard deviations of the mask's back-projected 3D points along each axis, and $\sigma$ is normalized to a common range across masks.
- Embedding formulation: Embeddings encode both backbone features and position for each point, parameterized by the mask-derived scale.
- Contrastive grouping loss:

$$\mathcal{L}_{\text{group}} = \begin{cases} \big\| f_\sigma(x_i) - f_\sigma(x_j) \big\|_2^2, & x_i, x_j \text{ in the same mask}, \\ \max\!\big(0,\, m - \big\| f_\sigma(x_i) - f_\sigma(x_j) \big\|_2 \big)^2, & \text{otherwise}, \end{cases}$$

where $m$ is a margin parameter.
At inference, $f_\sigma(x)$ is recomputed for a user-selected scale $\sigma$ and clustered (via HDBSCAN or similar) to partition the object at the desired granularity. Smaller $\sigma$ produces finer splits; larger $\sigma$ yields coarser segmentation.
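A condensed sketch of the grouping stage appears below. It assumes SAM masks have already been lifted to point indices; the scale formula mirrors the reconstruction above, while the pair-sampling scheme and margin value are illustrative choices, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def mask_scale(mask_points):
    # Spread of the mask's back-projected 3D points along each axis,
    # collapsed to one scalar scale (illustrative formula).
    return mask_points.std(dim=0).norm()

def grouping_loss(feats, mask_ids, margin=1.0, n_pairs=4096):
    """Contrastive loss over random point pairs from one rendered view.

    feats:    (N, D) scale-conditioned embeddings f_sigma(x_i)
    mask_ids: (N,)   SAM mask index per point at this scale
    """
    i, j = torch.randint(0, feats.shape[0], (2, n_pairs))
    d = (feats[i] - feats[j]).norm(dim=-1)
    same = mask_ids[i] == mask_ids[j]
    pull = d[same].pow(2).sum()                    # same mask: attract
    push = F.relu(margin - d[~same]).pow(2).sum()  # different masks: repel
    return (pull + push) / n_pairs
```

At inference, the embeddings for the chosen scale can be clustered with, e.g., `hdbscan.HDBSCAN(min_cluster_size=30).fit_predict(feats)`, taking the cluster labels as part assignments (the cluster-size threshold here is illustrative).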
2.3. Zero-shot Semantic Assignment
After geometric part segmentation, semantic labels are assigned as follows:
- The mesh is rendered from canonical viewpoints.
- For each part, the view in which the part occupies the largest image area is selected; its pixels are highlighted and included in a prompt to a multimodal LLM, e.g., GPT-4o.
- The LLM is instructed to describe the highlighted region (e.g., “What is this part?”). Output strings (“chair back,” “lamp shade”) serve as labels.
This decoupling enables arbitrary vocabulary coverage and relieves the need to pre-specify part names during the 3D grouping phase (Yang et al., 2024).
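As a hedged sketch of this step using the OpenAI Python client, assuming the rendering-and-highlighting stage has produced `highlighted_view.png`, and with prompt wording that is illustrative rather than the paper's exact instruction:

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("highlighted_view.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This is a rendering of a 3D object with one part "
                     "highlighted. What is this part? Answer with a short "
                     "noun phrase."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
label = response.choices[0].message.content.strip()  # e.g., "chair back"
```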
3. Comparison to Related 3D Segmentation Approaches
SAMPart3D's pipeline differs fundamentally from previous and contemporary approaches:
- PartSAM (Zhu et al., 2025) implements a dual-branch triplane encoder trained natively on large-scale 3D part annotations and supports promptable interactive segmentation directly in 3D, integrating both learned and fixed 2D priors.
- PartSTAD (Kim et al., 2024) extends 2D bounding-box-based semantic proposals to 3D using GLIP and SAM, with refinement of part boundaries via mask weighting and 2D-to-3D lifting, but ultimately remains prompt- and category-dependent.
- P3-SAM (Ma et al., 2025) introduces a point-promptable, multi-head segmentation model operating natively on point clouds, emphasizing automation and interactive part labeling without text prompts.
- SAMesh (Tang et al., 2024) utilizes multimodal 2D renderings (normals, SDF) and projects 2D SAM-segmented masks to 3D via match graphs, focusing on fully zero-shot segmentation using 2D foundation models but without learned 3D priors.
- PatchAlign3D (Hadgi et al., 2026) leverages multi-view SAM segmentation and VLM captioning to create part-annotated training sets, then pretrains a transformer for patch-level alignment and efficient zero-shot inference.
SAMPart3D's distinctive technical contributions are its text-agnostic, scale-adaptive grouping and its complete decoupling of geometric and semantic segmentation processes.
4. Experimental Validation
The performance of SAMPart3D has been validated with both quantitative benchmarks and qualitative analysis:
- PartObjaverse-Tiny Benchmark: 200 complex objects across 8 categories (Humans, Animals, Daily objects, Buildings, Transport, Plants, Food, Electronics), each with fine-grained part labels.
- Metrics: Class-agnostic mIoU (geometry only), semantic mIoU (per-category), and mAP₅₀ (instance segmentation); a sketch of the class-agnostic matching follows the table below.
| Dataset | Task | SAMPart3D | Prior Best |
|---|---|---|---|
| PartObjaverse-Tiny | Zero-shot semantic mIoU | 34.7% | 24.3% (PartSLIP) |
| PartObjaverse-Tiny | Class-agnostic mIoU | 53.7% | 43.6% (SAM3D) |
| PartObjaverse-Tiny | Instance mAP₅₀ | 30.2% | 16.3% (PartSLIP) |
| PartNetE | mIoU | 41.2% | 39.9% (PartDistill) |
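For reference, a minimal sketch of the class-agnostic matching behind the mIoU numbers above, under the simplifying assumption that each ground-truth part greedily takes the best-overlapping predicted part (the benchmark's exact protocol may differ):

```python
import numpy as np

def class_agnostic_miou(pred, gt):
    """pred, gt: (N,) integer part labels per point or face of one object."""
    ious = []
    for g in np.unique(gt):
        g_mask = gt == g
        best = 0.0
        for p in np.unique(pred):
            p_mask = pred == p
            union = np.logical_or(g_mask, p_mask).sum()
            inter = np.logical_and(g_mask, p_mask).sum()
            best = max(best, inter / union if union else 0.0)
        ious.append(best)        # best-matching prediction for this GT part
    return float(np.mean(ious))  # average over ground-truth parts
```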
Qualitative results show robust handling of atypical, highly non-standard objects (e.g., fantasy creatures, unconventional furniture), clean transitions between fine and coarse part decompositions, and interactive adaptation via a scale "slider."
5. Practical Applications
SAMPart3D serves various downstream roles, including:
- 3D content editing: Part-aware assignment of materials or colors in environments such as Blender.
- Interactive segmentation: Users specify a 3D point and scale to obtain hierarchical (coarse-to-fine) decompositions, supporting high-level control (see the sketch after this list).
- Dataset generation: Serves as an automatic annotator to boost large-scale 3D part-labeled data for supervised model training.
- 3D perception in robotics: Zero-shot adaptation to arbitrary objects supports manipulation and planning tasks in unstructured environments.
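A tiny sketch of the point-and-scale interaction from the second bullet above, assuming per-scale cluster labels have already been computed as in Section 2.2 (names are illustrative):

```python
import numpy as np

def query_part(point_idx, scale, labels_by_scale):
    """Return indices of all points in the same part as the query point.

    labels_by_scale: dict mapping scale -> (N,) cluster labels from
                     HDBSCAN over the scale-conditioned embeddings.
    """
    labels = labels_by_scale[scale]
    return np.flatnonzero(labels == labels[point_idx])

# Sweeping the scale "slider" from small to large moves the returned
# part from fine (e.g., a chair leg) to coarse (the whole chair base).
```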
6. Limitations and Future Directions
Key limitations include:
- The grouping field must be retrained per object, incurring a per-object cost for SAM mask generation and lightweight MLP fitting (~5 minutes).
- Errors in initial 2D SAM masks, especially at very fine scales, may propagate into the 3D grouping.
- Semantic part queries depend on the capabilities of the off-the-shelf VLM; rare or highly ambiguous parts can yield noisy or inconsistent labels.
Open research directions:
- Amortization: Collapsing the per-object grouping retraining into a global scale-conditioned feed-forward 3D network.
- Self-training: Leveraging imperfect masks for semi-supervised or self-supervised learning on large, unlabeled collections.
- Scene-level segmentation: Extending from object-level to full-scene, context-aware part decomposition.
SAMPart3D thus establishes a new paradigm in open-world, zero-shot 3D part segmentation, characterized by promptless, multi-scale grouping and fully decoupled semantic labeling, and lays the foundation for future advances in large-scale, flexible 3D perception (Yang et al., 2024).