Teeth Segmentation Accuracy

Updated 9 February 2026

Teeth Segmentation Accuracy (TSA) is defined as a set of metrics that quantify how precisely dental image segmentation models delineate individual teeth relative to expert annotations.
It incorporates methods such as IoU, DSC, and boundary-specific metrics across diverse modalities like 2D radiographs, 3D scans, and CBCT to ensure robust evaluation.
High TSA supports clinical workflows by enabling accurate diagnosis, improved treatment planning, and reliable model validation in digital dentistry.

Teeth Segmentation Accuracy (TSA) is a central metric family for evaluating the fidelity of automated dental image segmentation models across 2D radiographs, 3D intraoral scans, and volumetric imaging such as cone-beam CT (CBCT). TSA quantifies how accurately a system delineates individual teeth (or tooth regions) relative to expert-annotated ground truth, typically capturing pixelwise or pointwise correctness, spatial overlap, and boundary integrity. Robust TSA metrics underpin the validation and comparison of deep learning models in digital dentistry, supporting precise diagnosis, treatment planning, and a variety of clinical workflows.

1. Core Definitions and Variants of TSA

The most widely adopted TSA metrics are derived from classical segmentation performance indices but are tailored to the dental domain’s instance- and boundary-specific challenges.

Intersection-over-Union (IoU):

For binary or multi-class masks $P$ (prediction) and $G$ (ground truth),

$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$

This is frequently averaged over all tooth instances, yielding a mean IoU (mIoU) as a global measure (Dhar et al., 2023, Mustakim et al., 23 Nov 2025, Xi et al., 31 Mar 2025).

Dice Similarity Coefficient (DSC):

$\mathrm{DSC} = \frac{2|P \cap G|}{|P| + |G|}$

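Dice is closely related to IoU: because $|P| + |G| = |P \cup G| + |P \cap G|$, dividing the numerator and denominator of DSC by $|P \cup G|$ gives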

$\mathrm{DSC} = \frac{2\,\mathrm{IoU}}{1+\mathrm{IoU}}$
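The two overlap formulas can be illustrated with a minimal sketch in plain Python. It assumes masks are represented as sets of pixel coordinates (real pipelines would more likely use array masks), and the function names and the empty-mask convention are illustrative choices, not taken from any cited work.

```python
def iou(pred: set, gt: set) -> float:
    """Intersection-over-Union of two masks given as sets of pixel coordinates."""
    union = pred | gt
    if not union:
        # Both masks empty: treated here as perfect agreement (a convention, not a standard).
        return 1.0
    return len(pred & gt) / len(union)


def dice(pred: set, gt: set) -> float:
    """Dice Similarity Coefficient of two masks given as sets of pixel coordinates."""
    total = len(pred) + len(gt)
    if total == 0:
        return 1.0
    return 2 * len(pred & gt) / total


def mean_iou(pairs) -> float:
    """Mean IoU over an iterable of (prediction, ground truth) tooth-instance pairs."""
    scores = [iou(p, g) for p, g in pairs]
    return sum(scores) / len(scores)


# Toy example: a 4x4 ground-truth square and a prediction shifted by one column.
gt_mask = {(r, c) for r in range(4) for c in range(4)}
pred_mask = {(r, c) for r in range(4) for c in range(1, 5)}

j = iou(pred_mask, gt_mask)    # 12 shared pixels / 20 pixels in the union
d = dice(pred_mask, gt_mask)   # 2 * 12 / (16 + 16)
print(j, d, 2 * j / (1 + j))   # the last value equals d, matching the identity above
```

Because Dice is a monotonic function of IoU for a single mask pair, the two metrics rank individual predictions identically; they can diverge only after averaging over instances or images.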

Accuracy (ACC/OA):

Pointwise or pixelwise accuracy,

$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$

where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives (Dhar et al., 2022, Mustakim et al., 23 Nov 2025, Zhao et al., 2022).
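As a small worked illustration of the accuracy formula, the sketch below counts the four confusion terms for a binary tooth/background labeling and compares accuracy with the foreground IoU computed from the same counts. The toy data and helper names are assumptions made for this example only.

```python
def confusion_counts(pred, gt):
    """Count TP, TN, FP, FN for equal-length sequences of 0/1 labels."""
    tp = tn = fp = fn = 0
    for p, g in zip(pred, gt):
        if p and g:
            tp += 1
        elif p and not g:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Pixelwise accuracy: fraction of all pixels labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def iou_from_counts(tp, fp, fn):
    """Foreground IoU expressed with the same confusion counts."""
    return tp / (tp + fp + fn)


# Toy image: 100 pixels, 10 of them tooth; the model finds only half of the tooth.
gt_labels = [1] * 10 + [0] * 90
pred_labels = [1] * 5 + [0] * 95

tp, tn, fp, fn = confusion_counts(pred_labels, gt_labels)
acc = accuracy(tp, tn, fp, fn)
fg_iou = iou_from_counts(tp, fp, fn)
print(acc, fg_iou)
```

In this toy case accuracy stays high because background pixels dominate, while the foreground IoU exposes the missed half of the tooth. This is one reason accuracy is usually reported alongside overlap metrics rather than alone.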

Boundary Metrics:
- Rotated IoU (RIoU): Quantifies the overlap between oriented bounding boxes (OBBs) of predicted and ground truth masks (Dhar et al., 2023).
- Boundary IoU: Restricts overlap computation to crown–gingiva or tooth-tooth junctions (Xi et al., 31 Mar 2025).
- Hausdorff Distance (HD): Measures the maximal deviation between predicted and ground-truth (GT) boundaries (Zhang et al., 2024).
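
Of the boundary metrics above, the Hausdorff distance is the simplest to sketch. The following brute-force version assumes boundaries are given as lists of point coordinates; it is quadratic in the number of points and is meant only to make the definition concrete, not to reproduce any cited implementation.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance: the worse of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy boundaries: the ground truth has one point far from the prediction.
pred_boundary = [(0, 0), (0, 1), (0, 2)]
gt_boundary = [(0, 0), (0, 1), (0, 5)]

print(directed_hausdorff(pred_boundary, gt_boundary))  # small: every predicted point is near GT
print(directed_hausdorff(gt_boundary, pred_boundary))  # larger: GT point (0, 5) is far from the prediction
print(hausdorff(pred_boundary, gt_boundary))
```

Because the metric is a maximum, a single outlying boundary point determines the score, which is why it is sensitive to isolated segmentation errors.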
Composite/Challenge Scores:

Leading entries in public challenges report composite TSA scores that aggregate multiple terms, such as:

$\mathrm{TSA\,Score} = w_1\,\mathrm{Dice} + w_2\,\mathrm{IoU} + w_3\,(1 - H(d))$

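where $H(d)$ denotes a Hausdorff-distance term, presumably normalized to $[0, 1]$ so that smaller boundary deviations raise the score, and $w_1$, $w_2$, $w_3$ are empirically derived weights (Zhang et al., 2024).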
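
A minimal sketch of such a composite score follows. The weights, the `hd_max` normalization constant, and the clipping of the distance term to $[0, 1]$ are all hypothetical choices made for illustration; the source does not state the actual values or normalization used in any challenge.

```python
def composite_tsa(dice, iou, hd, hd_max, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of Dice, IoU, and (1 - normalized Hausdorff distance).

    `weights` and `hd_max` are illustrative assumptions, not published values.
    """
    if hd_max <= 0:
        raise ValueError("hd_max must be positive")
    w1, w2, w3 = weights
    h = min(hd / hd_max, 1.0)  # normalize the distance into [0, 1]
    return w1 * dice + w2 * iou + w3 * (1.0 - h)


# Example: Dice 0.90, IoU 0.82, Hausdorff distance 5 px with a 50 px normalizer.
score = composite_tsa(dice=0.90, iou=0.82, hd=5.0, hd_max=50.0)
print(round(score, 3))
```

With weights that sum to 1 and all three terms in $[0, 1]$, the composite score also stays in $[0, 1]$, which keeps it comparable across submissions.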

Instance-Level F1/TSA (3D challenges):

$\mathrm{TSA} = \frac{1}{|\mathcal{T}|} \sum_{t\in\mathcal{T}} F1_t$

where $\mathcal{T}$ is the set of tooth instances and $F1_t$ is the harmonic mean of precision and recall for tooth $t$ (Ben-Hamadou et al., 2023).
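
The instance-level average can be sketched as below, following the formula as written here. The example assumes per-point integer labels with 0 for gingiva or background and takes $\mathcal{T}$ to be the teeth present in the ground truth; the label values and function names are illustrative and do not reproduce the official challenge evaluation code.

```python
def tooth_f1(pred_labels, gt_labels, tooth):
    """F1 score for one tooth label over per-point predictions."""
    tp = sum(1 for p, g in zip(pred_labels, gt_labels) if p == tooth and g == tooth)
    fp = sum(1 for p, g in zip(pred_labels, gt_labels) if p == tooth and g != tooth)
    fn = sum(1 for p, g in zip(pred_labels, gt_labels) if p != tooth and g == tooth)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def instance_tsa(pred_labels, gt_labels, background=0):
    """Mean per-tooth F1 over the teeth present in the ground truth."""
    teeth = sorted({g for g in gt_labels if g != background})
    if not teeth:
        raise ValueError("ground truth contains no tooth labels")
    return sum(tooth_f1(pred_labels, gt_labels, t) for t in teeth) / len(teeth)


# Toy scan with two teeth (labels 11 and 12) and two gingiva points (label 0).
gt_points = [11, 11, 11, 11, 12, 12, 12, 12, 0, 0]
pred_points = [11, 11, 11, 12, 12, 12, 12, 12, 0, 0]

tsa = instance_tsa(pred_points, gt_points)
print(round(tsa, 4))
```

Averaging per tooth gives small teeth the same weight as large ones, so a badly segmented incisor is not masked by well-segmented molars.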

2. Quantification and Evaluation Protocols

TSA computation is protocol- and modality-dependent:

2D Panoramic X-rays:
- Per-pixel comparison of predicted and GT binary or multiclass masks.
- Instance metrics (IoU, Dice) are averaged over the set of teeth in each image and then across the dataset (Dhar et al., 2023, Ghafoor et al., 2023).
- Some models include orientation-aware metrics (RIoU) using OBBs derived via PCA over the mask contour (Dhar et al., 2023).
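
The PCA step behind an oriented bounding box can be sketched as follows. This assumes a contour given as 2D points and uses the closed-form principal-axis angle of the 2x2 covariance matrix. It only derives the box orientation and extents; computing RIoU would additionally require intersecting two rotated rectangles, which is not shown. The function names are hypothetical.

```python
import math


def principal_angle(points):
    """Angle (radians) of the major principal axis of a 2D point set."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Closed-form orientation of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    return 0.5 * math.atan2(2 * sxy, sxx - syy)


def oriented_bounding_box(points):
    """Return (angle, length, width) of the PCA-aligned bounding box."""
    theta = principal_angle(points)
    c, s = math.cos(theta), math.sin(theta)
    u = [x * c + y * s for x, y in points]    # coordinates along the major axis
    v = [-x * s + y * c for x, y in points]   # coordinates along the minor axis
    return theta, max(u) - min(u), max(v) - min(v)


# Toy contour lying on the diagonal y = x, so the expected orientation is 45 degrees.
contour = [(0, 0), (1, 1), (2, 2), (3, 3)]
angle, length, width = oriented_bounding_box(contour)
print(math.degrees(angle), length, width)
```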
3D Scans/Meshes:
- Per-point or per-face label comparison; TSA is reported as overall accuracy or mIoU (Xiong et al., 2022, Xi et al., 31 Mar 2025).
- For challenge settings, TSA is averaged per tooth instance (instance-level F1), as in 3DTeethSeg’22 (Ben-Hamadou et al., 2023).
- Additional evaluation of boundary quality is realized through boundary IoU or HD (Xi et al., 31 Mar 2025, Zhang et al., 2024).
Volumetric Imaging (CBCT):
- Voxelwise segmentation accuracy using Dice, IoU, and surface-based distances (HD, ASSD) (Jang et al., 2021, Chung et al., 2020).
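
Of the surface-based distances, the average symmetric surface distance (ASSD) can be sketched as below. The example assumes surfaces are already extracted as lists of 3D points in physical units (voxel indices multiplied by voxel spacing); surface extraction itself is not shown, and the brute-force nearest-neighbor search is for illustration only.

```python
import math


def assd(surface_a, surface_b):
    """Average symmetric surface distance between two 3D point sets."""
    d_ab = [min(math.dist(p, q) for q in surface_b) for p in surface_a]
    d_ba = [min(math.dist(q, p) for p in surface_a) for q in surface_b]
    return (sum(d_ab) + sum(d_ba)) / (len(d_ab) + len(d_ba))


# Toy surfaces: mostly one unit apart, with one ground-truth point farther away.
pred_surface = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
gt_surface = [(0, 0, 1), (1, 0, 1), (2, 0, 3)]

print(round(assd(pred_surface, gt_surface), 4))
```

Unlike the Hausdorff distance, which reports only the worst deviation, ASSD averages over all surface points, so the two are often reported together to separate typical error from outliers.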
Domain-Aware/Hybrid Metrics:

Frameworks such as ViSTooth compute additional metrics (shape via Hu moments, position, angle) and aggregate them for composite TSA assessment (Zhu et al., 2024).

Common Experimental Setups

Cross-validation and held-out test sets are standard for reporting TSA (Dhar et al., 2023, Dhar et al., 2022).
Challenge protocols require locked test sets and no access to ground truth, demanding strict generalizability (Ben-Hamadou et al., 2023, Zhang et al., 2024).
Ablation studies systematically measure TSA gains from architectural innovations, loss terms, and boundary-focused strategies.

3. Model Architectures and Loss Functions Optimizing TSA

Architectures achieving state-of-the-art TSA integrate domain-specific advances:

Encoder–Decoder and Attention Mechanisms:

Examples include EfficientNet-B7 encoders, grid-based attention gates, and parallel squeeze-excitation modules in FUSegNet (Dhar et al., 2023); recurrent convolutional modules and residual bridges in S-R2F2U-Net (Dhar et al., 2022); and an M-Net U-shaped design with Swin transformers and tooth-dedicated attention blocks (TAB) (Ghafoor et al., 2023).

Boundary-aware Modules:

Examples include reverse-attention-based boundary extraction (BFEM) and feature cross-fusion (FCFM) (Zhang et al., 2024), an instance-boundary loss (Cai et al., 30 Dec 2025), and contrastive learning at the tooth–gingiva interface (Xi et al., 31 Mar 2025).

Proposal-free Instance Segmentation:

Transformer-style mask embedding heads enable robust handling of missing or malposed teeth (Cai et al., 30 Dec 2025).

Geometry-guided Losses:

Curvature-aware focal losses upweight high-curvature (boundary) points, leading to improved TSA and smoother boundaries (Xiong et al., 2023, Xiong et al., 2022, Cai et al., 30 Dec 2025).
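
The idea of upweighting high-curvature points can be sketched with a binary focal loss whose per-point weight grows with curvature. The weighting form `1 + beta * curvature`, the parameter values, and the assumption that curvature is pre-normalized to $[0, 1]$ are illustrative; the cited papers may define their losses differently.

```python
import math


def curvature_focal_loss(probs, labels, curvatures, gamma=2.0, beta=1.0, eps=1e-7):
    """Mean focal loss with a per-point weight that increases with curvature.

    probs: predicted probability of the positive class for each point
    labels: 0/1 ground-truth label for each point
    curvatures: per-point curvature, assumed normalized to [0, 1]
    """
    total = 0.0
    for p, y, k in zip(probs, labels, curvatures):
        p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
        p_t = min(max(p_t, eps), 1.0 - eps)       # clamp to avoid log(0)
        focal = -((1.0 - p_t) ** gamma) * math.log(p_t)
        total += (1.0 + beta * k) * focal         # hypothetical curvature weighting
    return total / len(probs)


# The same prediction costs more on a high-curvature (boundary) point.
flat = curvature_focal_loss([0.7], [1], [0.0])
edge = curvature_focal_loss([0.7], [1], [1.0])
print(flat, edge)
```

With `beta=1.0`, a point of maximal curvature contributes twice the loss of a flat point with the same prediction, which pushes optimization toward boundary regions.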

Hybrid and Regularization Losses:

Reported choices include hybrid Dice + focal/cross-entropy losses (Dhar et al., 2023, Dhar et al., 2022), squared Dice for class imbalance (Ghafoor et al., 2023), L2 regularization (Budagam et al., 2024), and composite losses balancing global and boundary-focused terms (Zhang et al., 2024).
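
A hybrid overlap-plus-pointwise loss can be sketched as a weighted sum of a soft Dice loss and binary cross-entropy. The mixing weight `alpha`, the smoothing constants, and the function names are assumptions for illustration; the cited models use their own formulations and weightings.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - soft Dice, computed on predicted probabilities rather than hard masks."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy with clamped probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def hybrid_loss(probs, targets, alpha=0.5):
    """Weighted sum of Dice and cross-entropy terms; `alpha` is a hypothetical mixing weight."""
    return alpha * soft_dice_loss(probs, targets) + (1.0 - alpha) * bce_loss(probs, targets)


targets = [1, 0, 1, 0]
good = hybrid_loss([0.9, 0.1, 0.8, 0.2], targets)
poor = hybrid_loss([0.6, 0.5, 0.4, 0.5], targets)
print(good, poor)
```

The Dice term responds to region-level overlap and is less affected by background dominance, while the cross-entropy term supplies a per-pixel signal; combining them is a common way to balance the two.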

4. Quantitative Benchmarks

Multiple models and challenges provide direct comparative values for TSA across varying datasets and modalities:

| Model / Setting | TSA Metric | Value(s) | Dataset / Task |
| --- | --- | --- | --- |
| DE-KAN (Mustakim et al., 23 Nov 2025) | Accuracy / Dice | 98.91% / 97.1% | CDPR 2D radiographs |
| iMeshSegNet (Wu et al., 2021) | Dice | 0.964 ± 0.054 | 3D intraoral mesh |
| FUSegNet+AG+P-scSE (Dhar et al., 2023) | IoU / Dice / RIoU | 82.43% / 90.37% / 82.82% | Panoramic X-ray |
| S-R2F2U-Net (Dhar et al., 2022) | Accuracy / Dice | 97.31% / 93.26% | Dental X-ray |
| CGIP@3DTeethSeg'22 (Ben-Hamadou et al., 2023) | Instance-level F1 | 0.9859 | 3D challenge |
| BATISNet (Cai et al., 30 Dec 2025) | mIoU / mAP | 84.42% / 81.93% | Point cloud instance |
| BFFNet (Zhang et al., 2024) (STS Challenge) | "TSA score" | 0.91 | 2D challenge |
| TSegFormer (Xiong et al., 2023) | TSA (Acc) / mIoU | 97.97% / 94.34% | 3D IOS, 16,000 scans |
| CrossTooth (Xi et al., 31 Mar 2025) | mIoU / Bound. IoU | 95.86% / 82.06% | 3D mesh, boundary |
| TSGCN (Zhao et al., 2022) | OA / mIoU | 96.96% / 91.69% | Mesh, 3D scanner |

The strongest reported 2D deep learning models (DE-KAN, FUSegNet variants, OralBBNet) reach up to roughly 99% accuracy and 97% Dice, while 3D transformer-based methods (TSegFormer, TFormer) achieve ≥97.8% overall pointwise accuracy with mIoU in the mid-90% range. On the more challenging boundary and instance-level metrics, recent instance-aware designs (BATISNet, TSegFormer, CrossTooth) show gains of 2–7 percentage points over prior semantic baselines, with improved tooth separation in adverse scenarios (e.g., malposed or missing teeth).

5. Critical Factors Impacting High TSA

The primary determinants of elevated TSA include:

Boundary modeling: Explicit attention to boundaries, via reverse-attention, boundary-aware losses, or curvature-guided sampling, significantly improves TSA in regions prone to tooth adhesion or annotation ambiguity (Zhang et al., 2024, Xiong et al., 2023, Xi et al., 31 Mar 2025, Cai et al., 30 Dec 2025).
Instance-awareness: Proposal-free instance heads and transformer-based querying overcome the limitations of semantic-only approaches, especially for cases of missing, fused, or supernumerary teeth (Cai et al., 30 Dec 2025).
Multi-scale and multi-modal fusion: Integration of local and global features through cross-fusion modules, multi-path encoders, or dual-stream graph networks enhances anatomical coherence (Ghafoor et al., 2023, Mustakim et al., 23 Nov 2025, Zhao et al., 2022).
Annotation quality and data augmentation: Comprehensive annotations and augmentation strategies expand generalization over variable anatomy, ages, and imaging artifacts (e.g., metal in CBCT) (Xiong et al., 2023, Xi et al., 31 Mar 2025, Chung et al., 2020).
Loss function design: Incorporation of hybrid/weighted loss functions provides a more balanced optimization for both foreground (teeth) and challenging margins (Dhar et al., 2023, Zhang et al., 2024).

6. Limitations and Future Directions

Despite significant advances, several challenges persist:

Boundary degradation under label noise or for highly worn, supernumerary, or missing teeth remains a major source of error (Zhang et al., 2024, Xi et al., 31 Mar 2025).
Generalization to atypical anatomies (edentulous jaws, complex prosthetics) requires further expansion of training sets and semantically-aware post-processing (Xi et al., 31 Mar 2025).
Computational cost: High-accuracy models (DE-KAN, TSegFormer, BATISNet) may require greater memory and inference time; efficiency improvements and knowledge distillation are noted areas for future research (Mustakim et al., 23 Nov 2025).
Weak or unsupervised learning: Reducing annotation burden and addressing out-of-distribution generalization can further broaden clinical applicability (Zhang et al., 2024, Kunzo et al., 2023).
Clinical metrics: Continued refinement of composite TSA scores combining shape, position, orientation, boundary, and overlap metrics is advocated for translational robustness and explainability (Zhu et al., 2024).

7. Research Directions and Best Practices

Hybrid multi-objective losses and deep supervision across decoder stages have been reported to raise TSA (Zhang et al., 2024, Ghafoor et al., 2023).
Boundary-aware training and augmentation, such as focal losses on high-curvature points, yield sharper boundaries and improve segmentation where it is most clinically relevant (Cai et al., 30 Dec 2025, Xi et al., 31 Mar 2025, Xiong et al., 2023).
Model selection and validation should leverage multiple TSA metrics (Dice, IoU, boundary scores, instance-level F1, and domain-specific attributes) to avoid overfitting to a single index (Ben-Hamadou et al., 2023, Zhu et al., 2024).
Human-in-the-loop retraining and visual analytics frameworks (e.g., ViSTooth’s glyph + scatterplot) accelerate the discovery and remediation of rare segmentation errors, supporting continual improvement (Zhu et al., 2024).

Teeth Segmentation Accuracy thus encapsulates a suite of metrics and practices. Continued progress depends not just on absolute scores but on a nuanced, region- and instance-aware assessment of segmentation reliability, especially at clinically critical boundaries. This evolving landscape calls for coordinated improvement in modeling, loss design, data curation, and metric selection, as exemplified by the leading approaches in 2D/3D imaging and international benchmark challenges.