S²AM3D: Scale-Controllable 3D Segmentation
- S²AM3D is a framework that integrates 2D segmentation priors with 3D contrastive learning to achieve globally consistent part segmentation.
- It employs a two-stage pipeline with a point-consistent encoder and a scale-aware decoder, enabling real-time granularity control.
- Empirical results show that scale control boosts mean IoU by over 10 points on benchmarks like PartNet-E and Objaverse-Tiny.
S²AM3D denotes a spectrum of algorithms and systems targeting 3D part-level segmentation by leveraging 2D segmentation priors, strong global consistency via 3D contrastive learning, and real-time granularity control through scale-aware prompt decoding. The term primarily refers to a recent state-of-the-art framework, S²AM3D: Scale-controllable Part Segmentation of 3D Point Cloud (Su et al., 30 Nov 2025), but is also used in the mesh domain for Segment Any Mesh (same acronym; Tang et al., 24 Aug 2024). The following exposition focuses on the point cloud methodology as defined in (Su et al., 30 Nov 2025), providing comprehensive coverage of its architecture, algorithms, data, evaluation, and significance relative to prior art.
1. Motivation and Problem Setting
Part-level segmentation of 3D data is fundamental in computational geometry, robotics, and digital content creation; it requires both global instance consistency (e.g., all segments of a handle receive the same label) and precise local boundary detection. Two major challenges have hindered progress: (a) the scarcity of annotated 3D part data (existing benchmarks such as ShapeNet-Part and PartNet provide only tens of thousands of annotated models with limited category coverage), and (b) the inherent inconsistency of transferring 2D vision models (e.g., the Segment Anything Model, SAM) to 3D through multi-view rendering, which yields cross-view label conflicts due to occlusions and topological ambiguities. S²AM3D addresses both issues by combining multi-view 2D distillation with a native 3D contrastive learning objective and by curating a new, much larger dataset for training and evaluation (Su et al., 30 Nov 2025).
2. Model Architecture and Training Pipeline
S²AM3D is structured as a two-stage pipeline:
- Point-Consistent Part Encoder: A voxel-based backbone (PVCNN) encodes the input point cloud, producing latent tri-plane features. Slices of these tri-planes, rendered from random views, are supervised by frozen 2D segmentation features obtained from large pre-trained models (e.g., SAM). To guarantee global consistency, intra-instance supervised contrastive learning in 3D is conducted using normalized features for sampled points, maximizing intra-part similarity and minimizing inter-part similarity via the objective
$$\mathcal{L}_{\mathrm{con}} = -\frac{1}{|\mathcal{I}|}\sum_{i \in \mathcal{I}} \frac{1}{|\mathcal{P}(i)|}\sum_{j \in \mathcal{P}(i)} \log \frac{\exp\!\big(\hat{f}_i^{\top}\hat{f}_j / \tau\big)}{\sum_{k \neq i}\exp\!\big(\hat{f}_i^{\top}\hat{f}_k / \tau\big)},$$
where $\hat{f}_i = f_i / \lVert f_i \rVert_2$ is the normalized feature of sampled point $i$, $\mathcal{P}(i)$ is the set of sampled points sharing its part label, $\mathcal{I}$ is the set of sampled points, and $\tau$ is a temperature parameter (a code sketch of this objective follows this list).
- Scale-Aware Prompt Decoder: Decoupled from the encoder, the decoder enables interactive segmentation at controllable granularities. It takes a point prompt and a continuous scale parameter, constructing a sinusoidal embedding with learnable frequencies and phases that modulates intermediate decoder states via FiLM. Bi-directional cross-attention layers fuse the global per-point feature matrix with the prompt-anchored feature, yielding per-point mask logits through a feedforward layer and sigmoid activation. The decoder is trained with a hybrid loss: a dynamically weighted BCE term plus a Dice term to address the class imbalance typical of part segmentation (a sketch of this loss also follows this list).
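A minimal sketch of the intra-part contrastive objective is given below; the PyTorch formulation, tensor shapes, and temperature value are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def part_contrastive_loss(feats: torch.Tensor, part_ids: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """Intra-instance supervised contrastive loss over sampled points.

    feats:    (N, D) features of N points sampled from one shape instance.
    part_ids: (N,)   integer part label of each sampled point.
    """
    f = F.normalize(feats, dim=-1)                        # L2-normalized features
    sim = (f @ f.t()) / tau                               # (N, N) scaled similarities
    n = feats.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=feats.device)
    pos_mask = (part_ids[:, None] == part_ids[None, :]) & ~self_mask  # same-part pairs

    # softmax denominator over all points except the anchor itself
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # average log-probability over each anchor's positive (same-part) points
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_point = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(dim=1) / pos_counts

    valid = pos_mask.any(dim=1)   # ignore anchors whose part has no other sample
    return loss_per_point[valid].mean()
```

Anchoring every sampled point against all other points of the same part pulls per-part features together regardless of which rendered view supplied the 2D supervision, which is what enforces the global consistency discussed above.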
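The decoder's hybrid mask loss can be sketched in the same spirit; the particular form of dynamic weighting shown here (a foreground-frequency-based positive weight) and the Dice smoothing constant are assumptions, since the description above does not specify them.

```python
import torch
import torch.nn.functional as F

def hybrid_mask_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Hybrid mask loss: dynamically weighted BCE plus a soft Dice term.

    logits: (B, N) per-point mask logits from the decoder.
    target: (B, N) binary ground-truth masks.
    """
    target = target.float()
    # dynamic positive weight: up-weight the (usually rare) foreground points
    pos_frac = target.mean(dim=1, keepdim=True).clamp(1e-4, 1 - 1e-4)
    pos_weight = (1.0 - pos_frac) / pos_frac
    bce = F.binary_cross_entropy_with_logits(logits, target, pos_weight=pos_weight)

    # soft Dice loss per object, averaged over the batch (smoothing constant = 1)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=1)
    dice = 1.0 - (2.0 * inter + 1.0) / (prob.sum(dim=1) + target.sum(dim=1) + 1.0)
    return bce + dice.mean()
```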
3. Scale-Controlled Interactive Segmentation
A key innovation of S²AM3D is explicit scale control. The model accepts a continuous scale prompt, allowing users to smoothly interpolate segmentation granularity from fine (small screws) to coarse (entire subassembly), in real time. The sinusoidal scale embedding, after linear transformation and FiLM modulation, provides this control in the decoder's Transformer layers. This enables, for example, easy toggling between different abstraction levels of part segmentation for the same point prompt.
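A minimal sketch of this scale-conditioning mechanism follows, assuming a PyTorch module; the embedding width, hidden size, and the exact layer at which FiLM is applied are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScaleFiLM(nn.Module):
    """Continuous scale conditioning: a sinusoidal embedding of the scale prompt
    with learnable frequencies and phases, mapped to FiLM (gamma, beta) parameters
    that modulate intermediate decoder features."""

    def __init__(self, num_freqs: int = 16, hidden_dim: int = 256):
        super().__init__()
        self.freqs = nn.Parameter(torch.randn(num_freqs))    # learnable frequencies
        self.phases = nn.Parameter(torch.zeros(num_freqs))   # learnable phases
        self.to_film = nn.Linear(2 * num_freqs, 2 * hidden_dim)

    def forward(self, scale: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # scale: (B,) continuous scale prompt; h: (B, N, hidden_dim) decoder features
        arg = scale[:, None] * self.freqs[None, :] + self.phases[None, :]   # (B, F)
        emb = torch.cat([torch.sin(arg), torch.cos(arg)], dim=-1)           # (B, 2F)
        gamma, beta = self.to_film(emb).chunk(2, dim=-1)                    # (B, D) each
        # FiLM modulation in residual form: features scaled and shifted per scale prompt
        return h * (1.0 + gamma[:, None, :]) + beta[:, None, :]
```

Under this sketch, sweeping the scale prompt (e.g., from 0.1 to 0.9) while holding the point prompt fixed would produce progressively coarser masks, matching the interactive behavior described above.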
Empirical results show that inclusion of scale-aware modulation improves mean IoU by more than 10 points on PartNet-E (54.39% to 70.63%) and Objaverse-Tiny (41.70% to 53.75%) relative to point-prompted baselines without scale conditioning (Su et al., 30 Nov 2025).
4. Dataset Design and Preprocessing
S²AM3D is trained on a newly curated dataset assembled from Objaverse, containing more than 100,000 shapes spanning over 400 categories and approximately 1.2 million part-level labels. The data curation pipeline includes:
- Surface area–proportional sampling and annotation using PartNet-style part semantics (a sampling sketch follows this list).
- Filtering using a lightweight PointNet validator trained to detect and remove annotation errors.
- Connectivity refinement: DBSCAN clustering splits spatially disconnected regions that share the same semantic label, with the cluster radius determined by scaling the bounding-box diagonal (a refinement sketch also follows this list).
- Shape exclusion criteria: shapes with fewer than 2 or more than 50 parts are removed to maintain class balance.
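A sketch of the surface area–proportional sampling step, assuming triangle-mesh input; the sample count and NumPy formulation are illustrative choices.

```python
import numpy as np

def sample_surface(vertices: np.ndarray, faces: np.ndarray, n: int = 10000) -> np.ndarray:
    """Sample n points on a triangle mesh with probability proportional to face area.

    vertices: (V, 3) float array; faces: (F, 3) integer array of vertex indices.
    """
    tris = vertices[faces]                                        # (F, 3, 3)
    e1, e2 = tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0]
    areas = 0.5 * np.linalg.norm(np.cross(e1, e2), axis=1)        # per-face areas
    face_idx = np.random.choice(len(faces), size=n, p=areas / areas.sum())

    # uniform barycentric coordinates inside each selected triangle
    u, v = np.random.rand(n), np.random.rand(n)
    flip = u + v > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    t = tris[face_idx]
    return t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])
```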
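The connectivity-refinement step can be sketched as follows; the radius fraction of the bounding-box diagonal and the minimum cluster size are assumptions, as only the general rule is stated above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_disconnected_parts(points: np.ndarray, labels: np.ndarray,
                             eps_frac: float = 0.02, min_samples: int = 10) -> np.ndarray:
    """Split spatially disconnected point groups that share one semantic label.

    points: (N, 3) xyz coordinates; labels: (N,) integer part labels.
    The DBSCAN radius is a fraction of the shape's bounding-box diagonal.
    """
    diag = np.linalg.norm(points.max(axis=0) - points.min(axis=0))
    eps = eps_frac * diag
    new_labels = labels.copy()
    next_id = labels.max() + 1
    for part in np.unique(labels):
        idx = np.where(labels == part)[0]
        clusters = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points[idx])
        for c in np.unique(clusters):
            if c <= 0:   # keep the first cluster (and noise points) under the original label
                continue
            new_labels[idx[clusters == c]] = next_id   # each extra component becomes a new part
            next_id += 1
    return new_labels
```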
This scale and diversity yield significant improvements in robustness and generalization compared with training on smaller corpora such as PartNet alone.
5. Experimental Evaluation and Quantitative Analysis
Evaluation uses mean per-object IoU (mIoU) for both single point-prompted (“interactive”) and full multi-part segmentation. Key results (mIoU, %) include:
| Method | Obj-Tiny | PartNet-E | Avg. |
|---|---|---|---|
| Point-SAM | 31.46 | 50.23 | 40.85 |
| P³-SAM | 35.05 | 39.98 | 37.52 |
| Ours (no scale) | 41.70 | 54.39 | 48.05 |
| Ours (+scale) | 53.75 | 70.63 | 62.19 |
Ablation studies confirm that 3D contrastive learning yields the largest performance gain (roughly 9 points averaged over the benchmarks), followed by dataset size and diversity and by the inclusion of scale embeddings (which helps even when the scale prompt is not exposed at test time). On full all-parts segmentation, S²AM3D outperforms Find3D, SAMPart3D, and PartField, setting a new state of the art on PartNet-E.
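For reference, the per-object mean IoU reported above can be computed along the following lines; the exact averaging protocol (per-prompt IoUs averaged within each object, then across objects) is an assumption consistent with the metric's description.

```python
import numpy as np

def binary_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between predicted and ground-truth per-point binary masks."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                      # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum()) / float(union)

def mean_per_object_iou(per_object_ious: list) -> float:
    """Average IoUs within each object first, then average across objects."""
    return float(np.mean([np.mean(ious) for ious in per_object_ious]))
```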
6. Robustness, Applications, and Limitations
Fusing 2D segmentation priors with 3D geometric cues and enforcing global part-level consistency allows S²AM3D to resolve occlusion artifacts and ambiguous boundaries that confound view-by-view transfer of 2D segmentations. The continuous scale parameter provides direct, continuous control over segmentation granularity, which is valuable for robotics, CAD model editing, and content creation. However, the method currently supports only point-plus-scale prompts, and very large scenes must be downsampled to 10,000 points, which can lead to loss of extremely fine structure.
7. Future Directions
Promising avenues for further development include integration of textual or multimodal prompts (e.g., “select the door handle”), cross-modal editing loops involving shape diffusion and part re-generation, and automatic part hierarchy induction through hierarchical or tree-structured decoding. Addressing these would further close the gap toward fully interactive, cross-modal, and semantically rich 3D part understanding (Su et al., 30 Nov 2025).
S²AM3D establishes a new paradigm in scale-controllable, globally consistent 3D part segmentation. Its modular architecture fuses the strengths of 2D vision models with native 3D learning and is validated on a substantially expanded dataset, and its ability to control segmentation granularity on the fly addresses longstanding needs in both research and application settings.