ZXYFormer for Dental CBCT Segmentation

Updated 17 October 2025

ZXYFormer is a neural architecture that uses a coarse-to-fine segmentation framework with weight transfer to efficiently process large volumetric CBCT images.
It features an encoder–decoder design integrated with an inverse feature fusion transformer module to combine global morphological cues with detailed local features.
The model incorporates an uncertainty estimation branch that refines weak boundaries and quantifies segmentation confidence, enhancing clinical reliability.

ZXYFormer is a neural architecture developed for the simultaneous segmentation of teeth and root canals from clinical CBCT (cone beam computed tomography) images. It targets three difficulties of volumetric dental imaging: the large data size, the complex and dissimilar morphologies of teeth and root canals, and weak or ambiguous anatomical boundaries. To handle these, ZXYFormer combines a coarse-to-fine segmentation framework with weight transfer, an encoder–decoder backbone built around an inverse feature fusion transformer (the "ZXYFormer" module), and an uncertainty-estimation-guided refinement branch.

1. Coarse-to-Fine Segmentation Framework with Weight Transfer

CBCT images used in dental practice are volumetrically large (on the order of $672 \times 688 \times 688$ voxels), while the anatomical structures of interest occupy only a small part of this space. Applying a segmentation network directly to full-resolution volumes is prohibitively expensive in computation and memory, and conventional downsampling risks losing fine detail, especially at weak boundaries.
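To make the scale concrete, the arithmetic below counts the voxels in a volume of the quoted size and the memory a single array of that size occupies. The float32 storage (4 bytes per voxel) is an assumption for illustration; the text does not state a data type.

```python
# Voxel count and raw memory for one full-resolution CBCT volume.
# Assumption: float32 storage (4 bytes per voxel); the source gives no dtype.
full_shape = (672, 688, 688)      # volume size quoted in the text
coarse_shape = (128, 128, 128)    # coarse-stage input size quoted in the text
BYTES_PER_VOXEL = 4


def voxel_count(shape):
    """Number of voxels in a (depth, height, width) volume."""
    d, h, w = shape
    return d * h * w


full_voxels = voxel_count(full_shape)
coarse_voxels = voxel_count(coarse_shape)
full_gib = full_voxels * BYTES_PER_VOXEL / 2**30

print(f"full volume:   {full_voxels:,} voxels, {full_gib:.2f} GiB per channel")
print(f"coarse volume: {coarse_voxels:,} voxels")
print(f"ratio:         {full_voxels / coarse_voxels:.1f}x")
```

Under these assumptions one full-resolution array holds about 318 million voxels (roughly 1.2 GiB per channel), about 152 times more than a $128 \times 128 \times 128$ cube. Every additional feature channel multiplies that figure.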

ZXYFormer implements a two-stage scheme:

Coarse Stage: A 3D CNN is applied to aggressively downsampled image cubes (e.g., $128 \times 128 \times 128$) to generate a rough segmentation mask and to localize the target regions. This stage efficiently aggregates global (macro) contextual cues.
Weight Transfer and Fine Stage: Macro-level features and learned parameters from the coarse network are transferred to the fine segmentation network and used to initialize it. The fine stage, applied to the high-resolution image regions delimited by the coarse mask, uses this context for precise, detail-oriented segmentation.
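The localization step can be sketched as follows: take the bounding box of the coarse mask, scale it back to full-resolution indices, and crop the region handed to the fine stage. This is a minimal NumPy illustration, not the authors' code; the function names, the `margin` padding, and the toy shapes are all assumptions.

```python
import numpy as np


def bounding_box(mask, margin=2):
    """Return (start, stop) indices of the foreground box, padded by `margin` voxels."""
    coords = np.argwhere(mask)
    if coords.size == 0:
        raise ValueError("coarse mask is empty; nothing to localize")
    start = np.maximum(coords.min(axis=0) - margin, 0)
    stop = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return start, stop


def crop_full_resolution(volume, coarse_mask, margin=2):
    """Map the coarse-mask box to full-resolution indices and crop the volume."""
    scale = np.array(volume.shape) / np.array(coarse_mask.shape)
    start, stop = bounding_box(coarse_mask, margin)
    lo = np.floor(start * scale).astype(int)
    hi = np.minimum(np.ceil(stop * scale).astype(int), volume.shape)
    region = volume[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]]
    return region, (lo, hi)


# Toy example: a 64^3 "full-resolution" volume and a 16^3 coarse mask.
volume = np.random.default_rng(0).normal(size=(64, 64, 64)).astype(np.float32)
coarse_mask = np.zeros((16, 16, 16), dtype=bool)
coarse_mask[4:8, 5:9, 6:10] = True   # stand-in for the coarse network's prediction

region, (lo, hi) = crop_full_resolution(volume, coarse_mask, margin=1)
print(region.shape, lo, hi)
```

The margin keeps some surrounding context so that structures touching the edge of the coarse prediction are not clipped before the fine stage sees them.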

This design ensures that fine segmentation is informed by global context while operating at the spatial granularity needed to resolve dental and root-canal anatomy.
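One simple way to picture the weight transfer is to copy every coarse-stage parameter whose name and shape match a parameter in the fine network, leaving the rest at their own initialization. The text does not say which layers are transferred or how they are matched, so the matching rule and the parameter names below are hypothetical.

```python
import numpy as np


def transfer_weights(coarse_params, fine_params):
    """Copy each coarse parameter whose name and shape match a fine parameter.

    Returns the initialized fine parameters and the list of transferred names.
    """
    initialized, transferred = {}, []
    for name, value in fine_params.items():
        source = coarse_params.get(name)
        if source is not None and source.shape == value.shape:
            initialized[name] = source.copy()
            transferred.append(name)
        else:
            initialized[name] = value.copy()   # keep the fine network's own init
    return initialized, transferred


rng = np.random.default_rng(0)
# Hypothetical parameter names; shapes are (out_ch, in_ch, kD, kH, kW).
coarse_params = {
    "encoder.conv1.weight": rng.normal(size=(16, 1, 3, 3, 3)),
    "encoder.conv2.weight": rng.normal(size=(32, 16, 3, 3, 3)),
    "head.weight": rng.normal(size=(3, 32, 1, 1, 1)),
}
fine_params = {
    "encoder.conv1.weight": np.zeros((16, 1, 3, 3, 3)),
    "encoder.conv2.weight": np.zeros((32, 16, 3, 3, 3)),
    "fusion.proj.weight": np.zeros((32, 64, 1, 1, 1)),  # exists only in the fine net
    "head.weight": np.zeros((3, 64, 1, 1, 1)),          # same name, different shape
}

fine_init, transferred = transfer_weights(coarse_params, fine_params)
print(transferred)
```

In this toy setup only the two encoder convolutions are transferred; the fusion layer and the reshaped head keep their original values.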

2. Encoder–Decoder Architecture with ZXYFormer Integration

The fine segmentation network in ZXYFormer adopts a standard encoder–decoder topology:

Encoder: Stacks of 3D convolutional layers extract localized high-level feature maps denoted as $F$, progressively reducing the spatial dimensions while increasing semantic abstraction.
ZXYFormer Modules: At key points after the encoder and along skip connections in the decoder, the ZXYFormer module acts to enhance and refine representations by integrating global, morphological, and channel-wise information. The placement within skip connections allows fusing both bottom-up and top-down cues.
Decoder: Features are upsampled using convolutional upsampling operators; skip connections inject multi-resolution features, which are further modulated by ZXYFormer before outputting the segmentation mask.

This design allows the network to progressively refine coarse masks with higher-resolution cues, with particular benefit in indistinct boundary regions.
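The data flow of such an encoder–decoder can be traced with a small shape-level sketch. Real 3D convolutions are replaced by random $1 \times 1 \times 1$ projections, and the ZXYFormer module on each skip connection is an identity placeholder, so this only illustrates where features are downsampled, upsampled, and merged. The channel widths, the number of levels, and the three output classes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)


def project(x, out_ch):
    """Random 1x1x1 projection over the channel axis of a (C, D, H, W) array."""
    w = rng.normal(scale=x.shape[0] ** -0.5, size=(out_ch, x.shape[0]))
    return np.einsum("oc,cdhw->odhw", w, x)


def conv_block(x, out_ch):
    """Stand-in for a stack of 3D convolutions: projection followed by ReLU."""
    return np.maximum(project(x, out_ch), 0.0)


def downsample(x):
    """2x average pooling over the three spatial axes."""
    c, d, h, w = x.shape
    return x.reshape(c, d // 2, 2, h // 2, 2, w // 2, 2).mean(axis=(2, 4, 6))


def upsample(x):
    """2x nearest-neighbour upsampling over the three spatial axes."""
    return x.repeat(2, axis=1).repeat(2, axis=2).repeat(2, axis=3)


def refine_skip(skip, deep):
    """Placeholder for the ZXYFormer module on a skip connection (identity here)."""
    return skip


def encoder_decoder(x, widths=(8, 16, 32), n_classes=3):
    skips = []
    for ch in widths[:-1]:                 # encoder: more channels, less resolution
        x = conv_block(x, ch)
        skips.append(x)
        x = downsample(x)
    x = conv_block(x, widths[-1])          # bottleneck
    for skip in reversed(skips):           # decoder: upsample and merge skips
        skip = refine_skip(skip, x)
        x = np.concatenate([upsample(x), skip], axis=0)
        x = conv_block(x, skip.shape[0])
    return project(x, n_classes)           # per-voxel class logits


patch = rng.normal(size=(1, 16, 16, 16))   # one-channel toy input patch
logits = encoder_decoder(patch)
print(logits.shape)
```

The output has the same spatial size as the input patch, with one channel per class.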

3. ZXYFormer Module: Inverse Feature Fusion Transformer

The ZXYFormer module is the architectural core, designed to "invert" conventional feature fusion by projecting deep, global morphological representations toward shallower layers. It comprises three sequential processes:

a) Z Process

Function: Upsamples high-level features, expands channels using $1 \times 1 \times 1$ convolutions, and applies feature normalization.
Role: Prepares high-level morphological information for fusion, ensuring compatibility in shape and dimensionality with earlier feature maps.
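A minimal NumPy sketch of these three steps is given below. The nearest-neighbour interpolation, the per-channel normalization, and the shapes are assumptions, since the text does not specify the upsampling mode or the normalization type; a $1 \times 1 \times 1$ convolution without bias is written as a matrix product over the channel axis.

```python
import numpy as np


def z_process(deep, target_shape, weight, eps=1e-5):
    """Upsample deep features, remap channels with a 1x1x1 projection, normalize.

    deep:         (C_in, d, h, w) high-level feature map
    target_shape: (D, H, W) spatial size of the shallower map (integer multiples)
    weight:       (C_out, C_in) matrix, i.e. a 1x1x1 convolution without bias
    """
    factors = [t // s for t, s in zip(target_shape, deep.shape[1:])]
    up = deep
    for axis, f in enumerate(factors, start=1):
        up = up.repeat(f, axis=axis)                  # nearest-neighbour upsampling
    projected = np.einsum("oc,cdhw->odhw", weight, up)
    mean = projected.mean(axis=(1, 2, 3), keepdims=True)
    var = projected.var(axis=(1, 2, 3), keepdims=True)
    return (projected - mean) / np.sqrt(var + eps)    # per-channel normalization


rng = np.random.default_rng(0)
deep = rng.normal(size=(16, 4, 4, 4))      # toy deep feature map
weight = rng.normal(size=(32, 16))         # expands 16 channels to 32
z_out = z_process(deep, target_shape=(8, 8, 8), weight=weight)
print(z_out.shape)
```

After this step the deep features have the spatial size and channel count of the shallower map they will be fused with.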

b) X Process

Mechanism: Implements a Deformable Reverse Cross Transformer (DRCT).
- The reverse directionality indicates that deeper, global features guide the modulation of shallower representations rather than the traditional bottom-up flow.
- A multi-head cross-attention mechanism enables localization of structurally important regions.
- Deformable convolution (DC) is incorporated to adaptively model the highly variable shapes of teeth and root canals and to enhance responses to weak boundaries or irregularities.
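The cross-attention part of this step can be sketched in NumPy as below. The sketch assumes that tokens from the shallower map act as queries and tokens from the deeper map supply keys and values, which is one reading of "deeper features guide shallower representations"; the residual connection is also an assumption. The deformable convolution is omitted entirely, so this is plain multi-head cross-attention rather than the full DRCT.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(shallow, deep, wq, wk, wv, wo, n_heads):
    """Multi-head cross-attention in which shallow tokens query deep tokens.

    shallow: (N_s, C) tokens from the shallower feature map (queries)
    deep:    (N_d, C) tokens from the deeper feature map (keys and values)
    wq, wk, wv, wo: (C, C) projection matrices
    """
    n_s, c = shallow.shape
    head_dim = c // n_heads

    def split(x):  # (N, C) -> (heads, N, head_dim)
        return x.reshape(x.shape[0], n_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(shallow @ wq), split(deep @ wk), split(deep @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)    # (heads, N_s, N_d)
    attn = softmax(scores, axis=-1)
    merged = (attn @ v).transpose(1, 0, 2).reshape(n_s, c)   # (N_s, C)
    return shallow + merged @ wo, attn                       # assumed residual


rng = np.random.default_rng(0)
C, HEADS = 32, 4
shallow_tokens = rng.normal(size=(512, C))   # e.g. an 8x8x8 map flattened to tokens
deep_tokens = rng.normal(size=(64, C))       # e.g. a 4x4x4 map flattened to tokens
wq, wk, wv, wo = (rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(4))

x_out, attn = cross_attention(shallow_tokens, deep_tokens, wq, wk, wv, wo, HEADS)
print(x_out.shape, attn.shape)
```

Each row of `attn` is a distribution over deep tokens, so every shallow location receives a weighted summary of the global features.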

c) Y Process

Operation: Applies a standard feed-forward network (FFN) to the fusion output, then compresses channels to match the required dimensions for subsequent processing stages.
Outcome: Delivers an enhanced representation encapsulating both morphological overview and microstructural nuances, crucial for resolving ambiguous regions.
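A position-wise feed-forward network followed by a channel-reducing projection can be sketched as follows. The hidden width, the ReLU activation, the residual connection, and the output width are assumptions; the text only states that an FFN is applied and that channels are then compressed.

```python
import numpy as np


def y_process(tokens, w1, b1, w2, b2, w_out):
    """Position-wise FFN with a residual connection, then channel compression.

    tokens: (N, C) fused features, one row per spatial location
    w1, b1: (C, H), (H,) expansion layer
    w2, b2: (H, C), (C,) projection back to C channels
    w_out:  (C, C_out) compression to the channel count needed downstream
    """
    hidden = np.maximum(tokens @ w1 + b1, 0.0)   # ReLU; the activation is assumed
    fused = tokens + hidden @ w2 + b2            # (N, C)
    return fused @ w_out                         # (N, C_out)


rng = np.random.default_rng(0)
N, C, H, C_OUT = 512, 32, 128, 16
tokens = rng.normal(size=(N, C))
w1, b1 = rng.normal(scale=C ** -0.5, size=(C, H)), np.zeros(H)
w2, b2 = rng.normal(scale=H ** -0.5, size=(H, C)), np.zeros(C)
w_out = rng.normal(scale=C ** -0.5, size=(C, C_OUT))

y_out = y_process(tokens, w1, b1, w2, b2, w_out)
print(y_out.shape)
```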

The sequential Z–X–Y design transfers macro-level context to finer representations, which specifically improves the discrimination of weak dental boundaries.

4. Uncertainty Estimation and Auxiliary Branch

Segmentation ambiguity, especially at weakly defined edges or in the presence of anatomical anomalies, is addressed via a dedicated uncertainty estimation mechanism:

Architecture: The fine segmentation network features a main classifier branch and an auxiliary branch.
Quantification: For each voxel, the Kullback-Leibler (KL) divergence between the predicted probabilities $p_{\mathrm{main}}$ (main branch) and $p_{\mathrm{aux}}$ (auxiliary branch) is used to compute uncertainty:

$L_{\mathrm{Un}} = \exp\left(p_{\mathrm{aux}} \cdot \log \frac{p_{\mathrm{main}}}{p_{\mathrm{aux}}}\right)$

Training Objective: The total loss is the sum of the standard cross-entropy loss, the Dice loss (main branch), and $L_{\mathrm{Un}}$. Minimizing $L_{\mathrm{Un}}$ encourages the network to resolve uncertain predictions in ambiguous regions, and the loss term itself adds no parameters.
Significance: This mechanism both improves the delineation of weak edges and provides a quantifiable estimate of segmentation confidence, an important consideration in clinical settings.
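The per-voxel quantities can be computed as in the sketch below, which evaluates the expression for $L_{\mathrm{Un}}$ exactly as written above. It assumes that the product in the exponent is summed over classes; under that reading the exponent is the negative of $\mathrm{KL}(p_{\mathrm{aux}} \,\|\, p_{\mathrm{main}})$, so the expression lies in $(0, 1]$ and equals 1 where the two branches agree. The class-axis convention, the `eps` stabilizer, and the function names are assumptions, and the text does not specify how the voxel-wise values are reduced during training.

```python
import numpy as np


def softmax(logits, axis=0):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def uncertainty_terms(main_logits, aux_logits, eps=1e-8):
    """Per-voxel KL map and the exponential term from the text.

    main_logits, aux_logits: (n_classes, D, H, W) outputs of the two branches
    Returns (kl_map, l_un), each of shape (D, H, W).
    """
    p_main = softmax(main_logits, axis=0)
    p_aux = softmax(aux_logits, axis=0)
    # Assumed reading: sum over classes of p_aux * log(p_main / p_aux).
    exponent = (p_aux * (np.log(p_main + eps) - np.log(p_aux + eps))).sum(axis=0)
    kl_map = -exponent              # KL(p_aux || p_main), non-negative
    l_un = np.exp(exponent)         # expression as written in the text
    return kl_map, l_un


rng = np.random.default_rng(0)
main_logits = rng.normal(size=(3, 4, 4, 4))
aux_logits = main_logits + rng.normal(scale=0.5, size=main_logits.shape)

kl_map, l_un = uncertainty_terms(main_logits, aux_logits)
kl_same, l_same = uncertainty_terms(main_logits, main_logits)
print(kl_map.mean(), l_un.mean())
```

Voxels where the two branches disagree have a larger `kl_map` value, which is what makes the map usable as a confidence indicator.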

5. Advances over Preceding Segmentation Methods

The ZXYFormer architecture introduces several improvements relative to extant CBCT segmentation approaches:

Challenge	Previous Methods	ZXYFormer Approach
Large Image Volume	Downsampling, risk of information loss	Coarse-to-fine with weight transfer
Morphological Diversity	Edge- or center-based cues only	Global-local integration via transformer
Weak Boundary Segmentation	Simple/edge-specific masks	Inverse feature fusion, uncertainty guidance

ZXYFormer demonstrates enhanced Dice scores and more accurate edge localization in empirical evaluations on 157 high-resolution clinical CBCT datasets. The incorporation of both global and local cues, facilitated by the ZXYFormer module and enabled by a two-stage pipeline, renders the system robust to the anatomical complexities inherent in dental and endodontic imaging.
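For reference, the Dice score used in such evaluations measures the overlap between a predicted mask and the ground truth as twice the intersection divided by the sum of the two mask sizes. The sketch below computes it for one binary mask; evaluating it separately per class (for example teeth and root canals) and the `eps` guard against empty masks are assumptions, and the toy masks are not data from the study.

```python
import numpy as np


def dice_score(pred, target, eps=1e-8):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)


# Toy masks: two 4x4x4 cubes offset by one voxel along each axis.
pred = np.zeros((10, 10, 10), dtype=bool)
target = np.zeros((10, 10, 10), dtype=bool)
pred[2:6, 2:6, 2:6] = True
target[3:7, 3:7, 3:7] = True

dice = dice_score(pred, target)
print(round(dice, 4))
```

The two cubes hold 64 voxels each and share 27, giving a score of 54/128, or about 0.42.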

6. Architectural Significance and Practical Implications

By systematically addressing computational scalability, object-level morphological variation, and boundary ambiguity, ZXYFormer presents a comprehensive framework tailored for volumetric biomedical image segmentation. The architecture combines convolutional and transformer-based components to draw on the strengths of each: the spatial locality of convolutions and the global context of attention.

A plausible implication is that the architectural principles underlying ZXYFormer—most notably, inverse feature fusion and staged uncertainty quantification—could transfer to other volumetric segmentation tasks characterized by large data, fine boundaries, and structural heterogeneity. The method's reliance on cross-attention with deformable kernels is particularly relevant for tasks where shape adaptation is critical.

While the approach demonstrates clear empirical benefits for CBCT-based dental imaging, the generalizability of inverse feature fusion transformers and task-specific uncertainty refinement merits investigation in broader contexts.

7. Summary

ZXYFormer represents a specialized architecture that fuses a coarse-to-fine segmentation pipeline, an inverse feature fusion transformer (ZXYFormer module), and explicit uncertainty estimation for improved performance on teeth and root canal segmentation in CBCT volumes. This composite strategy achieves robustness and accuracy in delineating complex, ambiguous dental structures, with demonstrated advantages over conventional volumetric segmentation methodologies.
