3DTeethSAM: 3D Dental Segmentation Framework

Updated 19 December 2025
  • 3DTeethSAM is a model architecture for 3D dental segmentation that adapts a 2D segment-anything model to process intraoral meshes using prompt-driven learning and lightweight adapters.
  • It incorporates a prompt embedding generator, mask refiner, and deformable global attention modules to enhance segmentation accuracy and efficiency while bridging 2D-3D domain gaps.
  • Empirical evaluations on high-resolution dental benchmarks demonstrate state-of-the-art performance improvements in metrics like T-mIoU, with rapid training convergence.

3DTeethSAM is a model architecture and algorithmic framework for 3D dental segmentation that adapts generalized segment-anything models—specifically SAM2—for the instance and semantic segmentation of teeth in 3D intraoral meshes. Designed to address the complex topology and class imbalance of dental anatomy, 3DTeethSAM integrates advanced neural modules with specialized projective geometry and visually grounded learning to achieve state-of-the-art results on high-resolution dental scan benchmarks (Lu et al., 12 Dec 2025).

1. Problem Definition and Context

3D dental segmentation requires precise localization and classification of individual teeth on detailed digital representations, typically high-resolution triangle meshes produced from intraoral scans. Traditional 3D neural segmentation models demand either large, fully annotated meshes or sophisticated topological feature engineering. The Segment Anything Model 2 (SAM2) is a prompt-driven general foundation model for 2D image and video segmentation. However, applying a class-agnostic 2D foundation model to fine-grained semantic 3D tooth segmentation confronts several technical challenges: bridging the 3D data structure with the 2D input requirements of the model, generating and aligning prompts for discrete tooth instances, and remedying the blurry or misaligned segmentation boundaries that are ubiquitous in direct transfer settings (Lu et al., 12 Dec 2025).

2. Rendering Pipeline and 2D–3D Linkage

The approach begins by rendering the 3D mesh $\mathcal{M}(P, F)$, where $P$ is the vertex set and $F$ the face set, into a discrete set of 2D views. Each mesh is centered and aligned canonically with the crown oriented upwards, and $V$ images at $512 \times 512$ resolution are produced from predefined viewpoints (frontal, rear, lateral). Image formation for each view $v$ uses the standard pinhole model:

$$I_v = \Pi_v(\mathcal{M}), \quad \forall\, v = 1, \ldots, V,$$

where the projection is $\pi(K\,[R_v \mid t_v]\,p_n)$ for camera intrinsics $K$ and extrinsics $(R_v, t_v)$. Simultaneously, a 16-channel semantic tooth ground-truth mask $Y_v$ is rendered per image, encoding tooth identity per channel (Lu et al., 12 Dec 2025).
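The projection step can be sketched as follows. This is a minimal NumPy illustration of the pinhole model above with hypothetical helper names; the paper's pipeline rasterizes full meshes rather than projecting vertices in isolation, and occlusion handling is omitted here.

```python
import numpy as np

def project_vertices(P, K, R, t, image_size=512):
    """Project 3D mesh vertices into one rendered view with a pinhole camera.

    P: (N, 3) vertex positions; K: (3, 3) intrinsics; R: (3, 3), t: (3,) extrinsics.
    Returns (N, 2) pixel coordinates and a visibility mask for points in front
    of the camera and inside the image bounds. (Hypothetical helper, not the
    paper's rendering code.)
    """
    # Transform into the camera frame: p_cam = R @ p + t
    P_cam = P @ R.T + t                          # (N, 3)
    in_front = P_cam[:, 2] > 1e-6                # points with positive depth

    # Perspective division (clamp z to avoid divide-by-zero for masked points),
    # then apply the intrinsics K.
    z = np.clip(P_cam[:, 2:3], 1e-6, None)
    p_norm = P_cam[:, :2] / z                    # (N, 2)
    homog = np.concatenate([p_norm, np.ones((len(P), 1))], axis=1)
    uv = (homog @ K.T)[:, :2]                    # pixel coordinates

    inside = (uv >= 0).all(axis=1) & (uv < image_size).all(axis=1)
    return uv, in_front & inside
```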

3. Architecture: SAM2 Adaptation via Lightweight Adapters

3DTeethSAM avoids fine-tuning SAM2’s large backbone and instead introduces three lightweight, learnable modules (roughly 3M parameters in total) that operate atop the image token embeddings and are trained specifically for dental anatomy.
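A generic sketch of this adapter-style setup, freezing the backbone and counting the remaining trainable parameters (the module names below are hypothetical, not taken from the paper's code):

```python
import torch

def freeze_backbone_and_count(model):
    """Freeze all backbone parameters and report how many remain trainable in
    the added adapter modules (the text states roughly 3M). Generic sketch;
    `model` and its submodule names are assumptions."""
    adapter_keys = ("prompt_embedding_generator", "mask_refiner",
                    "mask_classifier", "dgap")
    for name, param in model.named_parameters():
        # Only parameters belonging to the added modules stay trainable.
        param.requires_grad = any(key in name for key in adapter_keys)
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```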

Prompt Embedding Generator (PEG):

A 6-layer Transformer decoder, initialized with $N_q = 16$ learnable queries (one per tooth), ingests flattened image embeddings $\mathbf{F} \in \mathbb{R}^{HW \times d}$ from the SAM2 image encoder. Each layer involves self-attention (modeling inter-tooth dependencies) followed by cross-attention to the image tokens:

$$Q_i^\ell = \mathrm{SA}(Q_i^{\ell-1}) + Q_i^{\ell-1}, \qquad Q_i^\ell = \mathrm{CA}(Q_i^\ell, \mathbf{F}) + Q_i^\ell.$$

The final query vectors $e_i^{\mathrm{prompt}}$ seed the prompt inputs of the frozen SAM2 mask decoder, producing coarse 16-channel masks $\widehat{M}_v$ (Lu et al., 12 Dec 2025).
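A minimal PyTorch sketch of such a prompt decoder follows; only the 6 layers and $N_q = 16$ queries come from the text, while the hidden width, head count, and the use of `nn.TransformerDecoderLayer` are assumptions.

```python
import torch
import torch.nn as nn

class PromptEmbeddingGenerator(nn.Module):
    """Sketch of the PEG: 16 learnable tooth queries attend to frozen SAM2
    image tokens and emit one prompt embedding per tooth."""

    def __init__(self, d_model=256, n_queries=16, n_layers=6, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Each decoder layer: self-attention over the queries, then
        # cross-attention to the HW image tokens, both with residuals.
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, image_tokens):                       # (B, HW, d_model)
        B = image_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)    # (B, 16, d)
        # The output seeds the frozen SAM2 mask decoder as per-tooth prompts.
        return self.decoder(tgt=q, memory=image_tokens)    # (B, 16, d)
```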

Mask Refiner:

To correct frequent boundary artifacts and coarse predictions, a UNet-inspired CNN operates on a concatenation of (i) the original image $I_v$, (ii) the coarse segmentation output $\widehat{M}_v$, and (iii) the SAM2 embeddings $\mathbf{F}$. Each encoder stage processes the three streams with independent convolutions and merges them by channel concatenation, retaining both low- and high-level information, while skip connections propagate multi-scale features. Supervision leverages a composite objective:

$$\begin{align*} L_{\mathrm{ce}} &= -\sum_{c=0}^{16} \sum_{x} y_{v,c,x}\, \log \tilde{m}_{v,c,x}, \\ L_{\mathrm{dice}} &= 1 - \frac{2 \sum_x y\, \tilde{m}}{\sum_x y + \sum_x \tilde{m} + \epsilon}, \\ L_{\mathrm{bd}} &= \sum_x \left| \nabla \tilde{m}_x - \nabla y_x \right|_1, \end{align*}$$

aggregated as $L_{\mathrm{MR}} = \lambda_{\mathrm{ce}} L_{\mathrm{ce}} + \lambda_{\mathrm{dice}} L_{\mathrm{dice}} + \lambda_{\mathrm{bd}} L_{\mathrm{bd}}$ (Lu et al., 12 Dec 2025).
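The composite objective can be written compactly as follows; this is an illustrative PyTorch implementation under the assumption of a 17-channel (background plus 16 teeth) prediction, with placeholder loss weights rather than the paper's values.

```python
import torch
import torch.nn.functional as F

def mask_refiner_loss(logits, target, lam_ce=1.0, lam_dice=1.0, lam_bd=1.0, eps=1e-6):
    """Composite objective L_MR = λ_ce·L_ce + λ_dice·L_dice + λ_bd·L_bd.

    logits: (B, 17, H, W) raw scores (background + 16 teeth);
    target: (B, H, W) integer labels in [0, 16].
    """
    # Pixel-wise cross-entropy over the 17 channels.
    l_ce = F.cross_entropy(logits, target)

    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()

    # Soft Dice term on per-channel probabilities vs. one-hot targets.
    inter = (probs * onehot).sum(dim=(2, 3))
    denom = probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3))
    l_dice = (1.0 - 2.0 * inter / (denom + eps)).mean()

    # Boundary term: L1 distance between finite-difference spatial gradients.
    dx_p = probs[..., :, 1:] - probs[..., :, :-1]
    dy_p = probs[..., 1:, :] - probs[..., :-1, :]
    dx_t = onehot[..., :, 1:] - onehot[..., :, :-1]
    dy_t = onehot[..., 1:, :] - onehot[..., :-1, :]
    l_bd = (dx_p - dx_t).abs().mean() + (dy_p - dy_t).abs().mean()

    return lam_ce * l_ce + lam_dice * l_dice + lam_bd * l_bd
```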

Mask Classifier:

To assign predicted mask channels to anatomical tooth identities, a parallel 6-layer Transformer decoder produces class logits per mask via an MLP, mapping each query output to one of 16 tooth classes or background. The loss is standard cross-entropy:

$$L_{\mathrm{MC}} = -\sum_{i=1}^{16} \sum_{c=0}^{16} y_i(c)\, \log p_i(c).$$
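A sketch of such a classification head is given below; the hidden sizes, head count, and use of `nn.TransformerDecoderLayer` are assumptions, with only the 6 layers, 16 queries, and 17-way output taken from the text.

```python
import torch
import torch.nn as nn

class MaskClassifier(nn.Module):
    """Parallel query decoder plus MLP mapping each of the 16 query outputs to
    one of 16 tooth classes or background (17 logits per mask)."""

    def __init__(self, d_model=256, n_heads=8, n_layers=6, n_classes=17):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, n_classes)
        )
        self.criterion = nn.CrossEntropyLoss()              # L_MC in the text

    def forward(self, queries, image_tokens, labels=None):
        # queries: (B, 16, d); image_tokens: (B, HW, d); labels: (B, 16) in [0, 16]
        logits = self.head(self.decoder(queries, image_tokens))  # (B, 16, 17)
        if labels is None:
            return logits
        return logits, self.criterion(logits.flatten(0, 1), labels.flatten())
```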

Deformable Global Attention Plugins (DGAP):

Inserted into the frozen SAM2 hierarchical Vision Transformer encoder, DGAPs replace standard attention sampling by predicting per-head, per-query sampling offsets, introducing local deformation:

$$k = W_k \sum_{j=1}^{K} w_j\, F(p_j + \Delta p_{qh}), \qquad v = W_v \sum_{j=1}^{K} w_j\, F(p_j + \Delta p_{qh}),$$

where $F$ is the feature map and the offsets $\Delta p_{qh}$ are predicted by a small offset network $\Phi$. This residual-fused deformable attention injects morphology-aware global context, yielding both higher accuracy (+1.29% T-mIoU) and 30% faster training convergence (Lu et al., 12 Dec 2025).
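The sampling mechanism can be illustrated with a simplified deformable-attention module; the sizes, the bilinear `grid_sample`, and the per-query (rather than per-head) offsets here are simplifications and assumptions, not the exact DGAP design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSampling(nn.Module):
    """Simplified DGAP-style plugin: a small offset network predicts K sampling
    offsets per query, features are bilinearly sampled at the deformed
    locations, and aggregated into keys/values. In 3DTeethSAM the plugin is
    residual-fused inside the frozen SAM2 encoder."""

    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_net = nn.Linear(dim, 2 * n_points)   # Δp per query (Φ)
        self.weight_net = nn.Linear(dim, n_points)       # w per sample point
        self.proj_k = nn.Linear(dim, dim)                # W_k
        self.proj_v = nn.Linear(dim, dim)                # W_v

    def forward(self, queries, ref_points, feat):
        # queries: (B, Q, C); ref_points: (B, Q, 2) in [-1, 1]; feat: (B, C, H, W)
        B, Q, _ = queries.shape
        offsets = self.offset_net(queries).view(B, Q, self.n_points, 2)
        weights = self.weight_net(queries).softmax(dim=-1)        # (B, Q, K)
        loc = (ref_points.unsqueeze(2) + offsets).clamp(-1, 1)    # (B, Q, K, 2)

        # Bilinear sampling of F(p + Δp) at the deformed locations.
        sampled = F.grid_sample(feat, loc, align_corners=False)   # (B, C, Q, K)
        sampled = sampled.permute(0, 2, 3, 1)                     # (B, Q, K, C)
        agg = (weights.unsqueeze(-1) * sampled).sum(dim=2)        # (B, Q, C)
        return self.proj_k(agg), self.proj_v(agg)                 # keys, values
```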

4. 2D–3D Label Reconstruction and Postprocessing

After inference on all views, the model back-projects the 2D mask predictions onto the 3D mesh. Each pixel label is mapped back to its mesh vertex via the known projection $\Pi_v^{-1}$, and each vertex is assigned the majority class among its per-view predictions. To eliminate projection artifacts and sharpen boundaries, a 3D Graph Cut step further refines the segmentation using unary (majority label) and pairwise (adjacency consistency) terms (Lu et al., 12 Dec 2025).
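The voting step might look like the following NumPy sketch; the data layout is hypothetical and the subsequent graph-cut refinement is omitted.

```python
import numpy as np

def fuse_view_labels(num_vertices, per_view_votes, num_classes=17):
    """Fuse per-view 2D predictions into per-vertex 3D labels by majority vote.

    per_view_votes: list of (vertex_idx, labels) integer arrays, one per
    rendered view, obtained by mapping each labeled pixel back to its mesh
    vertex. (Hypothetical data layout, not the paper's code.)
    """
    counts = np.zeros((num_vertices, num_classes), dtype=np.int64)
    for vertex_idx, labels in per_view_votes:
        np.add.at(counts, (vertex_idx, labels), 1)   # one vote per pixel hit
    # Majority class per vertex; vertices never seen by any view default to
    # class 0 (background/gingiva), since argmax of an all-zero row is 0.
    return counts.argmax(axis=1)
```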

5. Quantitative Evaluation and Component Analysis

Training employs the Teeth3DS benchmark: 1,800 intraoral scans (900 patients) with 17 annotated classes (16 teeth, 1 gingiva). The model uses AdamW optimization, cosine annealing, label smoothing, and moderate image augmentations. Metrics include overall accuracy (OA), tooth-wise mean Intersection over Union (T-mIoU), boundary IoU (B-IoU), and Dice similarity.
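An illustrative training setup consistent with this recipe is sketched below; the learning rate, weight decay, epoch count, and the single-loss step are placeholders, not the paper's exact configuration.

```python
import torch

def train_adapters(model, train_loader, epochs=100, lr=1e-4):
    """Illustrative adapter training loop: AdamW, cosine annealing, and label
    smoothing, as stated in the text. Hyperparameters are placeholders."""
    trainable = [p for p in model.parameters() if p.requires_grad]  # frozen SAM2 excluded
    optimizer = torch.optim.AdamW(trainable, lr=lr, weight_decay=1e-2)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), targets)   # simplified single-loss step
            loss.backward()
            optimizer.step()
        scheduler.step()
```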

| Model/Component | T-mIoU (%) | OA (%) | Dice (%) | B-IoU (%) | Inference (s/scan) |
|---|---|---|---|---|---|
| 3DTeethSAM (full) | 91.90 | 95.48 | 94.33 | 70.05 | 1.2 |
| ToothGroupNet (prev. SOTA) | 90.16 | – | – | – | – |
| 3DTeethSAM w/o DGAP | 90.61 | – | – | 66.64 | – |
| 3DTeethSAM w/o PEG | 52.46 | – | – | – | – |
| 3DTeethSAM w/o Mask Refiner | 91.10 | – | – | 68.43 | – |
| 3DTeethSAM w/o Mask Classifier | 91.31 | – | – | 67.56 | – |

Removing the PEG causes a catastrophic drop of 39.44 T-mIoU points (91.90% to 52.46%), confirming that prompt learning is indispensable for 2D–3D model transfer. The mask refiner and mask classifier each contribute smaller, yet significant, gains in accuracy and boundary sharpness, and inference latency remains comparable to specialized 3D architectures (Lu et al., 12 Dec 2025).

6. Related Approaches

Related approaches, such as SAMTooth, also transfer prompt-driven foundation models to 3D tooth segmentation, but focus primarily on weakly supervised point-cloud settings (e.g., 0.1% annotation density). SAMTooth combines confidence-aware prompt generation with mask-guided representation learning, in which reprojected 2D masks guide a contrastive feature loss in 3D, and it achieves mIoU = 76.47% with minimal supervision, outperforming weakly and semi-supervised baselines such as the Π-Model, MT, PSD, and SQN (Liu et al., 3 Sep 2024). This suggests that prompt-tuned 2D foundation models can provide substantial supervision and regularization for 3D segmentation even under sparse labels, though they incur projection overhead and remain sensitive to prompt quality and confidence estimation.

3D-U-SAM, another related method, leverages pre-trained SAM weights with 3D convolution approximations and U-Net inspired skip connections for CBCT dental image segmentation, addressing sample scarcity without detailed point-wise annotation. The precise architectural differences with 3DTeethSAM (SAM2 adapters vs. 3D-U-Net skip fusion and convolutional approximation) reflect evolving strategies for multi-dimensional adaptation (Zhang et al., 2023).

7. Limitations and Prospects

3DTeethSAM requires multi-view image rendering and computation of 2D–3D correspondences, which may add pre/post-processing constraints compared to pure 3D or point-based models. The overall pipeline depends on the fidelity of view alignment and the ability of learned prompts to robustly generalize for anatomical variability. A plausible implication is that direct end-to-end learning of prompt embeddings conditioned on 3D spatial context, combined with adaptive thresholding and multi-view consistency, could further enhance segmentation robustness and transferability. Application of deformable attention within the backbone, as pioneered with DGAP, exemplifies the potential for morphology-aware adaptation of generic vision transformers to medical and structural segmentation tasks (Lu et al., 12 Dec 2025, Liu et al., 3 Sep 2024).

In summary, 3DTeethSAM defines a paradigm for adapting prompt-driven, class-agnostic 2D vision models to structured 3D anatomical segmentation, achieving high accuracy with minimal trainable adaptation and establishing new quantitative benchmarks in dental mesh analysis.
