3D Medical Image Segmentation
- 3D Medical Image Segmentation is the process of labeling every voxel in a volumetric scan, enabling precise delineation of anatomical and pathological structures.
- Recent advances integrate probabilistic pseudo-labeling, cross-teaching between 3D and 2D networks, and transformer-based architectures to boost accuracy while reducing annotation effort.
- Ongoing challenges focus on balancing efficient label usage with high segmentation precision, managing computational demands, and ensuring robustness across diverse imaging modalities.
Three-dimensional (3D) medical image segmentation is the process of assigning a semantic label to every voxel within a volumetric medical scan, such as computed tomography (CT) or magnetic resonance imaging (MRI). Accurately delineating anatomical structures and pathological regions in 3D is vital for clinical diagnosis, treatment planning, and quantitative disease assessment. Modern research in this area spans supervised deep learning with full annotation, weak and semi-supervised paradigms, transformer-based architectures, computationally efficient and resource-constrained networks, as well as advances in active learning, pre-training, interactivity, and domain adaptation.
1. Core Problem and Motivation
The objective of 3D medical image segmentation is to construct a mapping $f: X \rightarrow Y$, where $X \in \mathbb{R}^{D \times H \times W}$ denotes a volumetric image (with $D$ slices, height $H$, and width $W$), and $Y \in \{1, \dots, K\}^{D \times H \times W}$ denotes the corresponding voxel-wise semantic label map for $K$ anatomical or pathological classes. Fully supervised segmentation methods rely on densely annotated 3D ground-truth labels for every training volume, but such annotation is extremely labor-intensive, often requiring over an hour per scan. This bottleneck has motivated research into more label-efficient regimes and new learning paradigms, such as weakly supervised learning, semi-supervised learning, unsupervised clustering, and domain-adaptive transfer (Jiang et al., 2024).
2. Algorithmic Frameworks and Architectural Advances
State-of-the-art 3D segmentation frameworks are typically based on encoder-decoder architectures extended from U-Net and V-Net. These are further enhanced by integrating transformers, local/global self-attention, and multi-scale context modules. Several notable algorithmic strategies have recently emerged:
- Probabilistic-aware Weakly Supervised Segmentation: (Jiang et al., 2024)
- Pseudo Label Generation: Sparse surface points are sampled using farthest-point sampling on an eroded organ mask (e.g., 400 for liver, 200 for spleen), and each point generates a probabilistic map via an isotropic Gaussian. These are summed and normalized to obtain a dense pseudo-label field, which encodes spatial confidence and uncertainty.
- Probabilistic Multi-head Self-Attention (PMSA): Extends transformer attention to model uncertainty by treating attention scores as Gaussian-distributed (with parameters predicted by an MLP) and using the reparameterization trick for differentiable stochastic sampling. Multiple Monte Carlo attention samples are averaged at inference.
- Loss: Combines Dice loss on thresholded pseudo-label foreground, a probability-weighted cross-entropy (adapts supervision strength by annotation confidence), and a KL divergence term regularizing attention distributions.
- Results: Surpasses point- and scribble-supervised baselines on BTCV and CHAOS datasets by large Dice margins (up to 18.1%/58.4%) and approaches or exceeds the performance of some fully supervised methods, while requiring only sparse point annotation.
- Cross-Teaching Between 3D and 2D Networks from Sparse Slices: (Cai et al., 2023)
- Framework: Trains a primary 3D V-Net on sparsely annotated volumes, alongside two 2D U-Nets on transverse/coronal planes. Each network generates predictions and shares pseudo-labels with its peers using confidence-based thresholding and fusion strategies.
- Pseudo-label selection: Employs hard-soft thresholding for 3D→2D and intersection-consistency for 2D→3D supervision. Training loss is a weighted hybrid of cross-entropy and Dice, with ramped-up pseudo-label weights.
- Results: Attains Dice that matches or exceeds fully supervised performance (82.67% vs. 81.69%) using only 16% of slices with cross-annotation.
- Self-Supervised and Pre-Training Strategies: (Tadokoro et al., 2024)
- Primitive Geometry Segment Pre-training (PrimGeoSeg): Pre-trains on synthetic volumes containing multiple randomly parameterized 3D primitives (e.g., cones, prisms, ellipses) using multi-class segmentation. Fine-tuning on medical data improves SwinUNETR Dice by up to +4.4% over scratch and achieves performance competitive with state-of-the-art self-supervised methods.
- Boundary- and Geometry-Focused Mechanisms:
- vMixer (LVSA+GVM): Combines local volume-based self-attention (high-res stages for precise boundaries) and volumetric MLP-Mixer blocks (low-res global context) in a hybrid encoder-decoder. Achieves state-of-the-art HD95 and Dice across Synapse, MSD Liver, and Pancreas datasets by capturing both fine and coarse spatial dependencies (Kareem et al., 2024).
- TokenSeg: Compresses a 3D volume into sparse, boundary-aware tokens via a hierarchical encoder and VQ-VAE quantization, then reconstructs masks via progressive upsampling. Achieves 94.49% Dice on breast DCE-MRI and reduces memory/inference time by over 60%, with >60% of tokens focused near boundaries (Zeng et al., 8 Jan 2026).
- GCNV-Net: Introduces Nonvoid Voxelization (sparsifies input by discarding spatial background), a Tri-directional Dynamic Nonvoid Voxel Transformer (partitioned attention along axial planes), and Geometry-aware Cross-Attention for multi-scale fusion. Cuts FLOPs/latency by >56%/>68% and achieves SOTA on five benchmarks, especially in boundary-sensitive metrics (Yuan et al., 7 Apr 2026).
- Transformer Architectures & Complexity Reduction:
- 3D TransUNet: Hybridizes U-Net with transformer encoder and/or decoder; transformer encoders excel at multi-organ context, decoders at small/complex regions. Outperforms nnU-Net, nnFormer, MedNeXt on several public datasets (Chen et al., 2023).
- UNETR++: Efficient Paired Attention reduces transformer attention complexity from quadratic to linear by branching into spatial/channel pathways with shared Q/K projections; achieves 87.2% Dice on Synapse with 71% fewer parameters/FLOPs than nnFormer (Shaker et al., 2022).
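The pseudo-label generation step described for the probabilistic-aware weakly supervised method above (farthest-point sampling followed by a sum of isotropic Gaussians) can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the value of `sigma`, the toy cube standing in for an eroded organ mask, and the choice to normalize by the field maximum are all assumptions, since the text only states that the maps are "summed and normalized".

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy farthest-point sampling over an (N, 3) array of voxel coordinates."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    chosen = [int(rng.integers(len(points)))]  # random starting point
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(min(n_samples, len(points)) - 1):
        nxt = int(np.argmax(min_dist))  # point farthest from the chosen set
        chosen.append(nxt)
        new_dist = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return points[chosen]

def gaussian_pseudo_label(shape, points, sigma=2.0):
    """Sum one isotropic Gaussian per point, then scale the field to [0, 1].

    Assumption: normalization divides by the maximum of the summed field.
    """
    axes = [np.arange(s) for s in shape]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (D, H, W, 3)
    field = np.zeros(shape, dtype=float)
    for p in points:
        sq_dist = np.sum((grid - p) ** 2, axis=-1)
        field += np.exp(-sq_dist / (2.0 * sigma ** 2))
    peak = field.max()
    return field / peak if peak > 0 else field

# Toy example: a cube stands in for an (already eroded) organ mask.
mask = np.zeros((16, 16, 16), dtype=bool)
mask[4:12, 4:12, 4:12] = True
pts = farthest_point_sampling(np.argwhere(mask), n_samples=8)
field = gaussian_pseudo_label(mask.shape, pts, sigma=2.0)
```

Values of `field` near a sampled point are close to 1 and decay with distance, which is what lets a probability-weighted loss apply stronger supervision where the annotation is more certain.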
3. Label-Efficiency, Active Learning, and Unsupervised/Semi-supervised Regimes
Efficient use of annotation is a critical research theme:
- Attention-Guided Active Learning: (Zhang et al., 2019)
- Dual attention modules embedded in a 3D U-Net are used for per-slice uncertainty estimation. Annotators are asked to label only the slices with the highest predicted uncertainty, reducing annotation requirements to ~15–20% of slices for brain extraction and ~30–35% for tissue segmentation, while maintaining full-annotation accuracy.
- Cross-Teaching Semi-Supervision: (Cai et al., 2023)
- Leverages complementary 2D/3D views for selective pseudo-label propagation, ensuring high-quality supervision even when only a small fraction of slices are manually labeled.
- Unsupervised Deep Feature Clustering: (Moriya et al., 2018)
- Alternates between clustering deep features from a 3D CNN (using agglomerative clustering) and updating the network (cross-entropy on the cluster pseudo-labels), before a final k-means segmentation. Achieves higher normalized mutual information than purely intensity-based clustering or Otsu thresholding.
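The slice-selection idea behind the active-learning approach above (label only the most uncertain slices) can be illustrated with a short sketch. Note the substitution: the cited method derives uncertainty from attention modules, whereas this example uses mean predictive entropy per slice as a simple stand-in. The function name and the `budget_fraction` parameter are hypothetical.

```python
import numpy as np

def select_slices_for_annotation(probs, budget_fraction=0.2, eps=1e-8):
    """Rank slices by mean voxel-wise entropy and return the most uncertain ones.

    probs: array of shape (K, D, H, W) holding per-class probabilities.
    Assumption: entropy replaces the attention-based uncertainty of the paper.
    """
    entropy = -np.sum(probs * np.log(probs + eps), axis=0)  # (D, H, W)
    per_slice = entropy.mean(axis=(1, 2))                   # (D,)
    n_select = max(1, int(round(budget_fraction * per_slice.shape[0])))
    ranked = np.argsort(per_slice)[::-1]                    # most uncertain first
    return np.sort(ranked[:n_select]), per_slice

# Toy two-class volume: confident everywhere except slices 3 and 7.
probs = np.empty((2, 10, 8, 8))
probs[0] = 0.99
probs[0, [3, 7]] = 0.5
probs[1] = 1.0 - probs[0]
selected, scores = select_slices_for_annotation(probs, budget_fraction=0.2)
```

In an active-learning loop, the returned slice indices would be sent to an annotator, the model retrained, and the selection repeated until the annotation budget is spent.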
4. Domain-Transfer and Foundation Models
The adaptation of large-scale vision foundation models for 3D medical segmentation is an active frontier:
- SAM and Its Adaptations:
- MA-SAM: Injects 3D adapters and low-rank fine-tuning increments into each block of SAM's ViT backbone, equipping its 2D architecture to process 3D stacks. Achieves 0.9–2.6% higher Dice than nnU-Net on BTCV and MRI prostate, and with a prompt, can double segmentation accuracy on challenging tumors (Chen et al., 2023).
- SAM2-3dMed: Extends SAM2’s video segmentation capabilities to 3D by introducing modules for slice relative position prediction (SRPP, for bidirectional inter-slice context) and explicit boundary detection, thus bridging the temporal-to-spatial gap. Outperforms all baselines (e.g., +1.7–3.0% in Dice over best prior), achieving SOTA on lung, spleen, and pancreas (Yang et al., 10 Oct 2025).
- Interactive SAM2: Zero-shot application on 3D volumes with single-slice annotation and bidirectional mask propagation significantly reduces manual effort, matching or nearly matching fully supervised models for large organs but with remaining challenges for small/deformable structures (Shen et al., 2024).
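The control flow of single-slice annotation with bidirectional mask propagation, mentioned for interactive SAM2 above, can be sketched independently of any particular model. This is not the SAM 2 API: `step_fn` is a hypothetical callable standing in for whatever model propagates a mask from one slice to its neighbour, and the toy `step_fn` below is only a brightness test.

```python
import numpy as np

def propagate_bidirectional(volume, seed_index, seed_mask, step_fn):
    """Propagate a mask from one annotated slice to all other slices.

    step_fn(prev_slice, prev_mask, cur_slice) -> cur_mask is a hypothetical
    stand-in for the model's slice-to-slice propagation step.
    """
    n = len(volume)
    masks = [None] * n
    masks[seed_index] = seed_mask
    # Forward pass: seed slice towards the last slice.
    for i in range(seed_index + 1, n):
        masks[i] = step_fn(volume[i - 1], masks[i - 1], volume[i])
    # Backward pass: seed slice towards the first slice.
    for i in range(seed_index - 1, -1, -1):
        masks[i] = step_fn(volume[i + 1], masks[i + 1], volume[i])
    return masks

def toy_step(prev_slice, prev_mask, cur_slice):
    # Toy rule: keep pixels that were masked before and are bright now.
    return prev_mask & (cur_slice > 0.5)

# Toy volume: a bright 2x2 object present in slices 1 to 3 only.
volume = np.zeros((5, 4, 4))
volume[1:4, 1:3, 1:3] = 1.0
seed_mask = volume[2] > 0.5
masks = propagate_bidirectional(volume, seed_index=2, seed_mask=seed_mask, step_fn=toy_step)
areas = [int(m.sum()) for m in masks]
```

Because each slice depends only on its already-processed neighbour, errors can accumulate with distance from the annotated slice, which is consistent with the remaining difficulty on small or deformable structures noted above.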
5. Computational Efficiency, Memory, and Full-volume Training
3D segmentation is computationally demanding due to volumetric scale. Several approaches address this challenge:
- Compact and Efficient Architectures:
- CAN3D: Employs a single down/up path with compact context aggregation (multi-dilated convs, AdaIN), enabling full-volume training and inference on limited GPU memory, with 0.17M params compared to >1M in U-Net. Achieves higher accuracy, lower surface distances, and ~60% lower runtime (Dai et al., 2021).
- Data-swapping for Large Inputs: Systematically swaps intermediate U-Net features between GPU and CPU memory to allow unpatched (192³) volume training with standard GPUs, increasing mean Dice by up to 5.3% and decreasing total training time by 3.5× versus patching (Imai et al., 2018).
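A back-of-the-envelope calculation helps show why full-volume training on 192³ inputs strains GPU memory and motivates the approaches above. The sketch below counts only one float32 feature map per encoder level. The channel widths and number of levels are illustrative assumptions, not values from the cited papers, and a real network stores several such maps per level plus decoder activations and gradients.

```python
def activation_bytes(spatial, channels, bytes_per_value=4):
    """Bytes needed to store one feature map of the given size (float32 by default)."""
    d, h, w = spatial
    return d * h * w * channels * bytes_per_value

def encoder_activation_gib(spatial=(192, 192, 192), base_channels=32, levels=4):
    """Sum one feature map per encoder level, halving resolution and doubling channels.

    Assumption: a generic U-Net-style encoder, used here only for illustration.
    """
    d, h, w = spatial
    channels = base_channels
    total = 0
    for _ in range(levels):
        total += activation_bytes((d, h, w), channels)
        d, h, w = d // 2, h // 2, w // 2
        channels *= 2
    return total / 2**30

first_level = activation_bytes((192, 192, 192), 32)  # 905,969,664 bytes (about 0.84 GiB)
total_gib = encoder_activation_gib()                 # about 1.12 GiB for four maps
```

The full-resolution level dominates the total, which is why strategies such as moving intermediate features to CPU memory or using very compact networks target the high-resolution stages first.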
6. Evaluation and Comparative Benchmarks
Standardized datasets (BTCV, Synapse, MSD, BraTS, LiTS, CHAOS, AMOS2022, ACDC) are widely used, with common metrics including the Dice similarity coefficient, Jaccard index, 95th-percentile Hausdorff distance (HD95), and average surface distance. Notable performance highlights include:
| Model | Dataset | Dice (%) | HD95 (mm) | Annotation Regime |
|---|---|---|---|---|
| Probabilistic-Aware (Jiang et al., 2024) | BTCV Spleen | 82.79 | N/A | Points only |
| vMixer | Synapse | 86.53 | 6.78 | Full supervision |
| TokenSeg | Breast DCE-MRI | 94.49 | 3.8 | Full supervision |
| GCNV-Net | BraTS2021 | 92.06 | 1.22 | Full supervision |
| 3D-2D Cross-Teaching | MMWHS | 82.67 | 8.60 | ~16% label slices |
| CAN3D | OAI-ZIB Knee FC | 87.1 | 8.24 | Full supervision |
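HD95 is a boundary-error measure for which lower values are better, so a higher HD95 indicates less precise boundaries. The rows above come from different datasets and are therefore not directly comparable; in general, weakly supervised methods accept some boundary degradation in exchange for a drastic reduction in annotation effort.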
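The overlap and boundary metrics reported in the table can be computed for binary masks as in the following NumPy sketch. Conventions for HD95 differ between toolkits; this version takes the larger of the two directed 95th-percentile surface distances, uses brute-force pairwise distances (suitable only for small masks), and assumes both masks are non-empty. The function names are illustrative.

```python
import numpy as np

def dice_and_jaccard(pred, gt):
    """Dice similarity coefficient and Jaccard index for two binary masks."""
    pred, gt = np.asarray(pred, bool), np.asarray(gt, bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    dice = 2.0 * inter / total if total else 1.0
    jaccard = inter / union if union else 1.0
    return float(dice), float(jaccard)

def surface_points(mask):
    """Coordinates of mask voxels with at least one 6-connected background neighbour."""
    mask = np.asarray(mask, bool)
    padded = np.pad(mask, 1, constant_values=False)
    inner = (slice(1, -1),) * mask.ndim
    interior = np.ones_like(mask)
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            interior &= np.roll(padded, shift, axis=axis)[inner]
    return np.argwhere(mask & ~interior)

def hd95(pred, gt, spacing=(1.0, 1.0, 1.0)):
    """Symmetric 95th-percentile Hausdorff distance in physical units (e.g. mm)."""
    a = surface_points(pred) * np.asarray(spacing)
    b = surface_points(gt) * np.asarray(spacing)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = dists.min(axis=1)  # each predicted surface voxel to nearest GT surface voxel
    b_to_a = dists.min(axis=0)  # each GT surface voxel to nearest predicted surface voxel
    return float(max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95)))

# Toy example: a 4x4x4 cube and the same cube shifted by one voxel.
gt = np.zeros((8, 8, 8), dtype=bool)
gt[2:6, 2:6, 2:6] = True
pred = np.zeros_like(gt)
pred[3:7, 2:6, 2:6] = True
dice, jaccard = dice_and_jaccard(pred, gt)  # 0.75 and 0.6
hd = hd95(pred, gt)                         # 1.0 (one voxel at unit spacing)
```

Passing the scan's voxel spacing to `hd95` matters in practice, since anisotropic volumes would otherwise report distances in voxels rather than millimetres.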
7. Ongoing Challenges and Future Directions
Key challenges in 3D medical image segmentation include:
- Balancing annotation efficiency and segmentation accuracy, particularly for small, flat, or highly variable anatomical structures.
- Reducing memory, FLOPs, and inference time while maintaining or increasing accuracy, for both research translation and clinical deployment.
- Robustness across imaging modalities, contrast protocols, scanner vendors (domain shift), and rare/pathological anatomies.
- Effective fusion of self-supervised, pre-trained, and foundation models for highly data-efficient, generalizable segmentation.
- Integration of user-in-the-loop or interactive refinement, especially leveraging strong foundation model priors with clinical editing tools.
Potential future extensions include semi-supervised mixtures of dense and sparse labels, adaptive uncertainty modeling (e.g., learnable σ in probabilistic pseudo-labeling), active learning for optimal annotation selection, and more generalizable architectures combining CNN, transformer, and MLP-mixer blocks at different stages.
References
- (Jiang et al., 2024) Enhancing Weakly Supervised 3D Medical Image Segmentation through Probabilistic-aware Learning
- (Cai et al., 2023) 3D Medical Image Segmentation with Sparse Annotation via Cross-Teaching between 3D and 2D Networks
- (Tadokoro et al., 2024) Primitive Geometry Segment Pre-training for 3D Medical Image Segmentation
- (Kareem et al., 2024) Improving 3D Medical Image Segmentation at Boundary Regions using Local Self-attention and Global Volume Mixing
- (Zeng et al., 8 Jan 2026) TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression
- (Yuan et al., 7 Apr 2026) Geometrical Cross-Attention and Nonvoid Voxelization for Efficient 3D Medical Image Segmentation
- (Chen et al., 2023) 3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers
- (Shaker et al., 2022) UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation
- (Zhang et al., 2019) A sparse annotation strategy based on attention-guided active learning for 3D medical image segmentation
- (Chen et al., 2023) MA-SAM: Modality-agnostic SAM Adaptation for 3D Medical Image Segmentation
- (Yang et al., 10 Oct 2025) SAM2-3dMed: Empowering SAM2 for 3D Medical Image Segmentation
- (Shen et al., 2024) Interactive 3D Medical Image Segmentation with SAM 2
- (Moriya et al., 2018) Unsupervised Segmentation of 3D Medical Images Based on Clustering and Deep Representation Learning
- (Imai et al., 2018) Fast and Accurate 3D Medical Image Segmentation with Data-swapping Method
- (Dai et al., 2021) CAN3D: Fast 3D Medical Image Segmentation via Compact Context Aggregation