Target-Tooth-Centroid Multi-Label Learning
- Target-tooth-centroid prompted multi-label learning uses explicit tooth centroids as spatial cues to guide and refine segmentation in dental imaging.
- It is instantiated in methods such as SAMTooth, DArch, and centroid-prompted CBCT segmentation, which achieve accurate, instance-aware segmentation in both point cloud and volumetric data.
- Experimental results show significant gains in mIoU and Dice under weak supervision, demonstrating that the approach maintains shape fidelity and compensates for sparse labeling.
Target-tooth-centroid prompted multi-label learning is a framework in dental and craniofacial segmentation that leverages explicit or inferred tooth centroid locations to guide dense multi-label segmentation tasks. The core paradigm is to use spatial centroids as prompts or anchors that focus learning and inference on individual teeth, improving segmentation accuracy, shape fidelity, and robustness under sparse supervision. This approach is realized across both point cloud and volumetric representations, and has been validated in intra-oral scans, 3D dental models, and CBCT imagery.
1. Conceptual Foundations and Motivation
The essential principle of target-tooth-centroid prompted multi-label learning is to use the centroid of each target object (tooth) as a spatial prior that prompts individualized, label-aware instance segmentation. This strategy directly addresses the challenges of:
- High cost of dense point-wise or voxel-wise annotation in dental imaging.
- Shape ambiguities and anatomical variations, particularly in cases of adhesion or distorted boundaries.
- Weak, sparse, or noisy supervision typical in large clinical datasets.
By structuring learning around these centroids—either labeled, detected, or inferred—the method enables multi-label segmentation with significantly reduced annotation burden and improves both localization and discrimination between closely apposed structures (Liu et al., 3 Sep 2024, Qiu et al., 2022, Ji et al., 21 Nov 2025).
2. Core Methodological Variants
2.1 SAMTooth for Sparse Point Cloud Segmentation
The SAMTooth pipeline uses a hybrid architecture, leveraging the Segment Anything Model (SAM) in 2D to supplement extremely sparse 3D supervision (one labeled point per tooth, ∼0.1% labeled points):
- 3D–2D Rendering and Mapping: Each 3D scan is rendered into multi-view 2D images with correspondence to 3D points.
- Coarse Segmentation and Confidence Estimation: A ViT-B/16 (Point-BERT-style) backbone with a confidence head outputs per-point class scores and confidences.
- Confidence-Aware Prompt Generation (CPG): Labeled points and high-confidence predictions are pooled for each predicted tooth class and filtered by an empirical confidence threshold. Centroids of the filtered groups are computed and projected into 2D image space to serve as SAM prompts (a minimal sketch of this step follows the subsection).
- SAM Mask Acquisition and Reprojection: The prompted SAM outputs a binary mask for each tooth in each view, which is then reprojected to 3D, yielding updated segmentations for representation learning.
- Mask-Guided Representation Learning (MRL): Foreground contrastive losses are imposed on features within each mask and background constraints on non-tooth points.
- Joint Optimization: Warmup epochs optimize only the confidence-aware co-segmentation loss, after which contrastive and background losses are incorporated into the objective.
This pipeline leverages precise centroid localization and prompt design as the mechanism for transferring shape cues from 2D to the unlabeled 3D point set, making possible robust multi-label segmentation from minimal annotation (Liu et al., 3 Sep 2024).
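The following is a minimal sketch of confidence-aware prompt generation under simplified assumptions: per-point class predictions and confidences are given, a single pinhole projection matrix maps 3D points into one rendered view, and the projected centroid becomes a point prompt for SAM. The function name, threshold value, and projection setup are illustrative, not the paper's implementation.

```python
import numpy as np

def generate_centroid_prompts(points, cls_pred, confidence, proj_matrix,
                              conf_thresh=0.8):
    """Hypothetical confidence-aware prompt generation (CPG) sketch.

    points      : (N, 3) xyz coordinates of the intra-oral scan.
    cls_pred    : (N,) predicted tooth class per point (0 = background).
    confidence  : (N,) per-point confidence from the confidence head.
    proj_matrix : (3, 4) camera projection matrix for one rendered view.
    Returns a dict {tooth_class: (u, v)} of 2D point prompts for SAM.
    """
    prompts = {}
    for c in np.unique(cls_pred):
        if c == 0:                                   # skip background
            continue
        keep = (cls_pred == c) & (confidence > conf_thresh)
        if not keep.any():                           # no reliable points for this class
            continue
        centroid = points[keep].mean(axis=0)         # 3D centroid of confident points
        hom = np.append(centroid, 1.0)               # homogeneous coordinates
        u, v, w = proj_matrix @ hom                  # project into the rendered view
        prompts[int(c)] = (u / w, v / w)             # pixel location used as SAM prompt
    return prompts

# Usage with the segment-anything point-prompt API (foreground label 1), assuming
# a SamPredictor already initialized on the rendered view:
#   predictor.set_image(rendered_view)
#   masks, _, _ = predictor.predict(point_coords=np.array([prompts[c]]),
#                                   point_labels=np.array([1]),
#                                   multimask_output=False)
```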
2.2 DArch: Centroid-Prompted Patch-Based Segmentation
DArch decomposes instance segmentation into centroid detection and patch-based segmentation, explicitly modeling the arch prior:
- Tooth Centroid Detection: VoteNet with a dental-arch branch regresses a parametric cubic Bézier curve, refined by a GCN, to model the dental arch. Arch-Aware Point Sampling (APS) is used to select seeds close to the arch. Grouped seeds feed into MLP heads to output centroid proposals.
- Patch Segmentation: For each centroid, a patch of nearest 3D points is extracted. Features for each patch are constructed relative to its centroid. A localized segmentation network (PointNet++) outputs a binary mask for tooth/background within the patch.
- Multi-label Fusion: Patch masks are merged by averaging, generating a full multi-tooth segmentation. Weak supervision is possible by providing only centroid locations and partial masks.
Subdividing the scan by centroid prompts gives each tooth individual attention and yields a correct multi-label output from minimal annotation, exploiting the dental-arch spatial prior to place centroids in anatomical context (Qiu et al., 2022); a sketch of the patch extraction and fusion appears below.
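The sketch below illustrates the two centroid-driven steps under simplified assumptions: a cubic Bézier evaluation of the arch prior, and patch extraction plus averaging-based fusion around detected centroids. The placeholder `patch_net`, the patch size `k`, and the 17-class convention (16 teeth plus background) are illustrative assumptions rather than DArch's exact configuration.

```python
import numpy as np

def bezier_arch(p0, p1, p2, p3, t):
    """Cubic Bezier curve modeling the dental-arch prior; t is an array in [0, 1]."""
    t = np.asarray(t)[:, None]
    return ((1 - t) ** 3) * p0 + 3 * ((1 - t) ** 2) * t * p1 \
           + 3 * (1 - t) * (t ** 2) * p2 + (t ** 3) * p3

def segment_by_centroid_patches(points, centroids, patch_net, k=2048,
                                num_classes=17):
    """Hypothetical centroid-prompted patch segmentation sketch.

    points    : (N, 3) scan points.
    centroids : list of (tooth_label, (3,) centroid) pairs from the detector.
    patch_net : callable mapping a (k, 3) centroid-centered patch -> (k,) tooth probability.
    Returns (N,) per-point labels (0 = background/gum).
    """
    score = np.zeros((points.shape[0], num_classes))    # accumulated class scores
    count = np.zeros(points.shape[0])                    # patches covering each point
    for label, c in centroids:
        d = np.linalg.norm(points - c, axis=1)
        idx = np.argsort(d)[:k]                          # k nearest points form the patch
        patch = points[idx] - c                          # features relative to the centroid
        prob = patch_net(patch)                          # binary tooth/background prediction
        score[idx, label] += prob
        count[idx] += 1
    covered = count > 0
    score[covered] /= count[covered, None]               # merge overlapping patches by averaging
    labels = score.argmax(axis=1)
    labels[score.max(axis=1) < 0.5] = 0                  # low evidence -> background
    return labels
```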
2.3 CBCT Volumetric Segmentation with Centroid Channel Prompt
In volumetric CBCT segmentation, a three-module system is used:
- Dentition Detection: An initial U-Net segments teeth vs. background to localize the region of interest (ROI).
- Centroid Clustering: A 3D offset network predicts voxel-wise offsets towards centroids. Density-based clustering (Rodriguez et al.) yields candidate tooth centroids, which are normalized to ROI space.
- Centroid-Prompted Segmentation: For each target tooth, a binary centroid-prompt volume is encoded (filled at the centroid location). The segmentation network receives the original ROI volume concatenated with the centroid-prompt (two channels), enabling the encoder to propagate the prompt signal throughout the network.
- Multi-label Segmentation and Shape Preservation: Network jointly predicts multi-class masks, contours, and signed-distance maps for anatomical fidelity. Losses are multi-task: multi-label Dice, cross-entropy, boundary, and shape-aware regression.
This explicit fusion of centroid-prompted context disentangles overlapping or adherent teeth and preserves boundary integrity (Ji et al., 21 Nov 2025); a minimal sketch of the two-channel input construction follows.
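Below is a minimal sketch of the centroid-prompt channel construction, assuming a detected centroid already expressed in ROI voxel coordinates; the ball radius, tensor shapes, and function name are illustrative rather than the published implementation.

```python
import torch

def make_centroid_prompt(roi, centroid_vox, radius=3):
    """Build a two-channel input: ROI intensities + binary centroid prompt.

    roi          : (D, H, W) float tensor of CBCT intensities inside the ROI.
    centroid_vox : (3,) integer voxel coordinates of the target tooth centroid.
    radius       : voxel radius of the ball marking the centroid (illustrative).
    Returns a (1, 2, D, H, W) tensor ready for a 3D segmentation network.
    """
    D, H, W = roi.shape
    zz, yy, xx = torch.meshgrid(torch.arange(D), torch.arange(H),
                                torch.arange(W), indexing="ij")
    cz, cy, cx = centroid_vox
    dist2 = (zz - cz) ** 2 + (yy - cy) ** 2 + (xx - cx) ** 2
    prompt = (dist2 <= radius ** 2).float()          # binary ball at the centroid
    x = torch.stack([roi, prompt], dim=0)            # channel 0: image, channel 1: prompt
    return x.unsqueeze(0)                            # add batch dimension

# Usage sketch (seg_net is any 2-channel 3D segmentation network):
#   logits = seg_net(make_centroid_prompt(roi, torch.tensor([40, 64, 64])))
```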
3. Formulations and Loss Structures
All variants ground the multi-label learning via explicit centroids, but their loss frameworks reflect their respective modalities:
- Point cloud confidence-aware loss (SAMTooth): a confidence-weighted co-segmentation loss supervises per-point class predictions using the sparse labeled points and the reprojected SAM masks.
- Foreground and background contrastive losses (SAMTooth): foreground terms pull features within each reprojected tooth mask together, while background terms constrain non-tooth points; the total objective is a weighted sum of the co-segmentation, foreground-contrastive, and background terms.
- Patch-based segmentation loss (DArch): a per-patch binary tooth/background loss computed relative to each detected centroid, with the merged multi-label prediction supervised where masks are available.
- CBCT multi-label and shape losses: multi-label Dice and cross-entropy on the class masks, a boundary loss on contours, and regression of signed-distance maps for shape preservation.
These formulations allow tuning for robust target-specific mask learning and structural preservation; an illustrative composite form is sketched below.
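Since the exact loss expressions are not reproduced here, the following is an illustrative composite objective of the general form these frameworks share; the weighting factors $\lambda$ and term names are placeholders rather than the papers' notation.

$$
\mathcal{L}_{\text{SAMTooth}} = \mathcal{L}_{\text{seg}}
  + \lambda_{\text{fg}}\,\mathcal{L}_{\text{fg-contrast}}
  + \lambda_{\text{bg}}\,\mathcal{L}_{\text{bg}},
\qquad
\mathcal{L}_{\text{CBCT}} = \mathcal{L}_{\text{Dice}}
  + \lambda_{\text{ce}}\,\mathcal{L}_{\text{CE}}
  + \lambda_{\text{bd}}\,\mathcal{L}_{\text{boundary}}
  + \lambda_{\text{sdf}}\,\mathcal{L}_{\text{SDF}}
$$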
4. Impact, Performance, and Ablations
The introduction of centroid prompts yields measurable improvements in accuracy and robustness under sparse or weak annotation:
- SAMTooth (0.1% label rate): mIoU 76.47%, surpassing prior best weakly-supervised methods by over 9 percentage points, and approaching fully supervised performance (Oracle 83.59%). CPG and MRL provide additive benefit (ablation: CPG +4.63pp mIoU over AGG; full MRL outperforms foreground-only or background-only masking).
- DArch: With centroid and arch priors, achieves 99.68% centroid detection accuracy and 95.42% IoU / 97.38% Dice in weakly-supervised settings, outperforming prior works even when only ~20% of ground-truth masks are provided.
- CBCT (Shape-Preserving Segmentation): An ablation shows a +4.9 pp Dice gain and a ~2.8 mm reduction in Hausdorff Distance (HD) when centroid prompts are added. The full system yields Dice 0.9408 (across internal and external generalization) and HD 1.20 mm, outperforming recent state-of-the-art baselines (Ji et al., 21 Nov 2025).
The consistent effect is a marked improvement in instance- and shape-aware multi-label segmentation: centroid prompts enable precise localization and robust separation even under severe tooth adhesion or minimal labeling.
| Method | Modality | Main Prompt Mechanism | Sample Performance (Dice/IoU) | Reference |
|---|---|---|---|---|
| SAMTooth | 3D point cloud | Confidence-filtered centroid | mIoU 76.47% (0.1% labels) | (Liu et al., 3 Sep 2024) |
| DArch | 3D point cloud | VoteNet+Arch APS centroid | Dice 97.38% (weak) | (Qiu et al., 2022) |
| CMS-Net | CBCT volume | Centroid-channel input | Dice 0.9408, HD 1.20 mm | (Ji et al., 21 Nov 2025) |
5. Limitations and Extensions
Documented limitations of centroid-based prompting include:
- Reliance on View/Rendering Quality: In SAMTooth, the accuracy of prompt projection and mask quality may degrade under occlusion or poor lighting.
- Inference Cost: Multi-view SAM evaluation increases compute time.
- Automatic Centroid Detection: While centroid detection is robust in DArch and CBCT frameworks, erroneous centroids can propagate segmentation errors; refinement mechanisms are required.
- Differentiability: Many pipelines include non-differentiable steps (e.g., projection, clustering), limiting end-to-end training; future work proposes differentiable 2D–3D mappings.
Potential extensions include multi-view mask fusion, application to other organ segmentation problems under sparse supervision, and integration into clinical workflows as annotation-efficient frameworks (Liu et al., 3 Sep 2024, Qiu et al., 2022, Ji et al., 21 Nov 2025).
6. Relationship to Broader Research Themes
Centroid-prompted learning aligns with trends in prompt-based AI (e.g., SAM and other foundation models), spatial prior integration (e.g., dental arch modeling), and weakly- or semi-supervised segmentation research. The approach provides a unified mathematical and architectural scheme for leveraging minimal labels, geometric context, and structure-aware regularization. A plausible implication is that these strategies transfer to other medical imaging problems and to general 3D semantic/instance segmentation.
7. Conclusions
Target-tooth-centroid prompted multi-label learning constitutes a set of methodological advances that improve 3D and volumetric dental segmentation under minimal supervision. By leveraging explicit or inferred centroids as spatial prompts and fusing centroid-derived signals directly into network inputs, loss formulations, and guidance constraints, these frameworks systematically enhance label efficiency, shape fidelity, and semantic disentanglement. The documented results across diverse modalities and datasets highlight its utility and generality for robust tooth segmentation and similar multi-label instance tasks in complex clinical data (Liu et al., 3 Sep 2024, Qiu et al., 2022, Ji et al., 21 Nov 2025).