Medical Slice Transformer in 3D Imaging
- Medical Slice Transformer (MST) is a transformer-based architecture that tokenizes individual 2D slices from 3D imaging modalities to capture both intra- and inter-slice dependencies.
- It integrates pretrained backbones like DINOv2 and ResNet with transformer encoders to deliver robust performance in classification, segmentation, lesion detection, registration, and triaging.
- MST emphasizes explainability by generating attention maps that highlight key slices and anatomical regions, thereby enhancing clinical interpretability and decision support.
The Medical Slice Transformer (MST) is a transformer-based deep learning architecture designed to aggregate and analyze slice-wise information from 3D medical imaging modalities such as MRI and CT. MST frameworks leverage advances in vision transformers and self-supervised learning to achieve parameter efficiency, diagnostic accuracy, and improved explainability in clinical tasks including classification, segmentation, lesion detection, registration, and triaging. Crucially, MST treats individual 2D slices or patches as tokens, allowing the transformer’s self-attention mechanism to capture both intra-slice and inter-slice dependencies. Notable instantiations employ pretrained backbones (e.g., DINOv2, ResNet) for 2D feature extraction, followed by transformer encoders to integrate slice representations into context-aware predictions.
1. Architectural Principles and Core Processing Pipeline
MST architectures process a 3D image volume with $S$ slices, each sized $H \times W$, by encoding each slice through a pretrained 2D feature extractor, most commonly a vision transformer backbone such as DINOv2 (Müller-Franzes et al., 24 Nov 2024, Nguyen et al., 8 Nov 2025, Nascimento et al., 2 Sep 2025) or lightweight CNNs (Jun et al., 2021).
Pipeline Steps:
- Slice Feature Extraction: Each slice is processed by the 2D backbone (usually frozen or fine-tuned), generating a feature embedding $z_i \in \mathbb{R}^{d}$.
- Slice-Sequence Tokenization: A learnable [CLS] token is prepended; positional encodings $p_i$ indexed by slice number are optionally added: $t_i = z_i + p_i$, yielding the sequence $[z_{\mathrm{CLS}}, t_1, \dots, t_S]$.
- Transformer Encoder: The sequence is fed into an $L$-layer transformer encoder, exploiting multi-head self-attention to correlate slices across the volume.
- Classification/Segmentation Head: The [CLS] output is linearly mapped to logits for classification, or slice tokens are further processed for segmentation; in lesion detection and registration, token-wise predictions are made for each slice.
- Explainability via Attention Maps: Final-layer attention scores provide explanatory saliency maps, localizing relevant slices and anatomical regions in the predictions.
Example equation for transformer self-attention per layer:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ are learned linear projections of the slice tokens and $d_k$ is the per-head key dimension.
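A minimal PyTorch sketch of this pipeline for the classification case follows; the 2D backbone is assumed to return one embedding per slice, and all names and hyperparameter values here are illustrative rather than taken from the cited papers.

```python
import torch
import torch.nn as nn

class MSTClassifier(nn.Module):
    """Sketch of an MST: per-slice 2D features -> transformer over slices -> [CLS] logits."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 num_layers: int = 4, num_heads: int = 8, max_slices: int = 64):
        super().__init__()
        self.backbone = backbone  # frozen or fine-tuned 2D encoder: (N, C, H, W) -> (N, feat_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.slice_pos = nn.Parameter(torch.zeros(1, max_slices + 1, feat_dim))
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        # volume: (B, S, C, H, W) -- S slices per 3D scan
        B, S = volume.shape[:2]
        z = self.backbone(volume.flatten(0, 1)).view(B, S, -1)  # slice embeddings z_i
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), z], dim=1)
        tokens = tokens + self.slice_pos[:, : S + 1]            # positional encoding by slice index
        out = self.encoder(tokens)                              # self-attention across slices
        return self.head(out[:, 0])                             # logits from the [CLS] token
```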
Backbones used in MST variants include DINOv2 ViT encoders (Müller-Franzes et al., 24 Nov 2024, Nguyen et al., 8 Nov 2025), ResNet18 in the multi-view Medical Transformer (Jun et al., 2021), and custom U-Net encoders for segmentation (Yan et al., 2021).
2. Self-Supervised Learning, Pretraining, and Domain Adaptation
MST frameworks commonly rely on self-supervised pretraining to overcome annotation scarcity (Müller-Franzes et al., 24 Nov 2024, Jun et al., 2021, Nascimento et al., 2 Sep 2025). DINOv2, adapted in several MST pipelines, uses student-teacher knowledge distillation across augmented views, minimizing the cross-entropy

$$\mathcal{L} = -\sum_{c} p_t^{(c)} \log p_s^{(c)},$$

where $p_t$ are teacher probabilities, $p_s$ are student outputs, and the teacher's network weights are momentum-averaged copies of the student's. Data augmentations are extensive (random flips, rotations, noise, inversion, color jitter).
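A compact sketch of this distillation objective and the momentum (EMA) teacher update, following the generic DINO formulation; temperatures, the centering term, and the momentum value are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between sharpened, centered teacher probabilities p_t
    and student log-probabilities p_s over two augmented views."""
    p_t = F.softmax((teacher_logits - center) / t_teacher, dim=-1).detach()
    log_p_s = F.log_softmax(student_logits / t_student, dim=-1)
    return -(p_t * log_p_s).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher weights are a momentum (exponential moving) average of the student's."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```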
Medical Transformer (Jun et al., 2021) utilizes a masked encoding prediction task for transfer learning, masking 10% of slice tokens in each anatomical plane (sag/cor/ax) and optimizing reconstruction loss across the tokenized sequence.
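A minimal sketch of this masking scheme, assuming slice tokens of shape (B, S, D); the 10% ratio follows the description above, while the helper name and everything else is illustrative.

```python
import torch

def mask_slice_tokens(tokens: torch.Tensor, mask_token: torch.Tensor, ratio: float = 0.1):
    """Randomly replace ~10% of slice tokens with a learnable [MASK] embedding.
    tokens: (B, S, D); mask_token: (D,). Returns masked tokens and the boolean mask."""
    B, S, _ = tokens.shape
    mask = torch.rand(B, S, device=tokens.device) < ratio
    masked = tokens.clone()
    masked[mask] = mask_token.to(tokens.dtype)   # broadcast (D,) into masked positions
    return masked, mask

# Reconstruction objective on masked positions only, applied per anatomical plane:
#   loss = F.mse_loss(decoder(encoder(masked))[mask], tokens[mask])
```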
In MST+KAN approaches (Nascimento et al., 2 Sep 2025), per-slice embeddings are extracted with the backbone frozen; downstream classifiers are trained, allowing rapid adaptation to new cohorts while maintaining consistency and robustness across scanner vendors.
3. Downstream Tasks and Quantitative Results
MST frameworks are evaluated on diagnosis, lesion detection, triaging, registration, and segmentation across diverse modalities:
Classification and Disease Diagnosis (Müller-Franzes et al., 24 Nov 2024, Jun et al., 2021, Nguyen et al., 8 Nov 2025, Nascimento et al., 2 Sep 2025):
| Dataset | Model | AUC (mean ± std) | p-value (DeLong) |
|---|---|---|---|
| Breast MRI | 3D ResNet50 | 0.91 ± 0.02 | – |
| Breast MRI | MST-DINOv2 | 0.94 ± 0.01 | 0.02 |
| Chest CT | 3D ResNet50 | 0.92 ± 0.02 | – |
| Chest CT | MST-DINOv2 | 0.95 ± 0.01 | 0.13 |
| Knee MRI | 3D ResNet50 | 0.69 ± 0.05 | – |
| Knee MRI | MST-DINOv2 | 0.85 ± 0.04 | 0.001 |
MST+KAN for breast DCE-MRI achieves a higher AUC than direct transformer classification heads (Nascimento et al., 2 Sep 2025).
Triaging in Breast MRI (Nguyen et al., 8 Nov 2025):
Contrast-enhanced and non-contrast protocols are assessed:

| Sequence | AUC ± std | Triage rate @ 97.5% sensitivity |
|---|---|---|
| T1sub+T2w | 0.77 ± 0.04 | 19% ± 7% |
| DWI1500+T2w | 0.74 ± 0.04 | 17% ± 11% |
At 97.5% sensitivity, MST safely triages 17–19% of cases without BI-RADS ≥4, supporting workload reduction.
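For concreteness, one way to choose such an operating point is to fix the decision threshold at the target sensitivity on a validation set and report the fraction of cases that fall below it; this sketch uses scikit-learn's ROC utilities, and the function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve

def triage_rate_at_sensitivity(y_true, y_score, target_sens=0.975):
    """Threshold with sensitivity >= target, plus the fraction of triageable cases."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    idx = int(np.argmax(tpr >= target_sens))      # first threshold meeting the target
    threshold = thresholds[idx]
    triaged = np.asarray(y_score) < threshold     # predicted negative -> triage candidates
    return threshold, float(triaged.mean())
```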
Segmentation (Yan et al., 2021):
AFTer-UNet (axial fusion transformer) improves Dice scores over U-Net, TransUNet, CoTr, and Swin-Unet across thoracic and abdominal datasets; for example, its average Dice on Thorax-85 exceeds that of TransUNet.
Universal Lesion Detection (Li et al., 2022):
SATr block (a plug-and-play MST module) boosts sensitivity in five multi-slice lesion detection backbones by 1–5% at fixed false positive rates, with no cost in network complexity.
Registration (Xu et al., 2022):
SVoRT, an MST for fetal brain MRI, achieves state-of-the-art accuracy (lowest target registration error among compared methods; PSNR of 25.26 dB) and generalizes from synthetic to real motion-corrupted acquisitions.
4. Explainability and Saliency Analysis
MST architectures explicitly address explainability through transformer attention:
- Attention Map Generation: Slice-level and patch-level attention weights are extracted from the final transformer layers (Müller-Franzes et al., 24 Nov 2024); a minimal extraction sketch follows this list.
- Radiologist Scoring: Saliency correctness is assessed on slice and lesion localization; MST shows 80–98% good or moderate ratings on external validation sets (Nguyen et al., 8 Nov 2025).
- Comparison with CNNs: Grad-CAM on 3D CNNs often yields diffuse or anatomically imprecise heatmaps; MST attention highlights correct slices and focal regions (Müller-Franzes et al., 24 Nov 2024).
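A minimal sketch of the slice-level saliency extraction referenced above, assuming the final layer's attention weights have already been captured (e.g., via a forward hook); shapes and names are illustrative.

```python
import torch

@torch.no_grad()
def cls_slice_attention(attn_weights: torch.Tensor) -> torch.Tensor:
    """attn_weights: (B, heads, 1+S, 1+S) from the final transformer layer.
    Returns the [CLS]-to-slice attention as a normalized slice-level saliency map."""
    attn = attn_weights.mean(dim=1)        # average over attention heads
    cls_to_slices = attn[:, 0, 1:]         # [CLS] row, excluding attention to itself
    return cls_to_slices / cls_to_slices.sum(dim=-1, keepdim=True)
```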
A plausible implication is that transformer-based saliency offers more actionable visual feedback for radiological decision support and adverse event auditing.
5. Specialized Instantiations: Modular Innovations and Adaptations
- KAN Head Classifiers (Nascimento et al., 2 Sep 2025): Adaptive B-spline activations in Kolmogorov–Arnold Networks provide locally flexible neural nonlinearities, addressing class imbalance and lesion heterogeneity. Spline coefficients are updated by gradient descent, with the basis functions evaluated via the Cox–de Boor recurrence, conferring interpretability (see the basis-function sketch after this list).
- Multi-Protocol Breast MRI Triaging (Nguyen et al., 8 Nov 2025): MST adapts to T1-weighted subtraction, DWI, and T2-weighted protocols, combining single or multi-sequence inputs via slice-wise tokenization.
- Axial Fusion Blocks (Yan et al., 2021): Separate intra-slice and inter-slice self-attention within transformer layers enables AFTer-UNet to leverage both local and long-range 3D cues, with a computational cost comparable to 2D transformers and a parameter count (41.5M) below other 3D transformer baselines.
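To make the KAN head concrete, the following is a generic sketch of the Cox–de Boor recurrence referenced above (not the cited papers' code); a KAN edge activation is then a learnable linear combination of these basis functions.

```python
import numpy as np

def bspline_basis(x: np.ndarray, knots: np.ndarray, i: int, k: int) -> np.ndarray:
    """Cox-de Boor recurrence: the i-th B-spline basis of degree k evaluated at x."""
    if k == 0:
        return ((knots[i] <= x) & (x < knots[i + 1])).astype(float)
    left_den = knots[i + k] - knots[i]
    right_den = knots[i + k + 1] - knots[i + 1]
    left = (x - knots[i]) / left_den * bspline_basis(x, knots, i, k - 1) if left_den > 0 else 0.0
    right = (knots[i + k + 1] - x) / right_den * bspline_basis(x, knots, i + 1, k - 1) if right_den > 0 else 0.0
    return left + right

# A KAN activation is phi(x) = sum_i c_i * B_{i,k}(x); the coefficients c_i are the
# gradient-trained parameters, which keeps each learned nonlinearity inspectable.
```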
These innovations suggest MST serves as a foundation model that can be repurposed with varying architectural heads, optimizing for clinical requirements (accuracy, interpretability, portability).
6. Implementation and Practical Guidelines
Preprocessing:
- Resample all volumes to a standardized input size matched to the 2D backbone's expected resolution (a preprocessing sketch follows this list).
- Normalize intensity (min-max, zero-mean/unit variance, or window-clamped depending on modality).
- Augment data with flips, rotations, random noise, signal inversion per slice.
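A minimal preprocessing sketch under these guidelines; the target grid, normalization, and augmentation choices are illustrative and should be matched to the modality and backbone.

```python
import torch
import torch.nn.functional as F

def preprocess_volume(vol: torch.Tensor, out_size=(32, 224, 224)) -> torch.Tensor:
    """Resample a (D, H, W) volume to a fixed grid, z-score it, and randomly flip."""
    vol = vol[None, None].float()                            # (1, 1, D, H, W)
    vol = F.interpolate(vol, size=out_size, mode="trilinear", align_corners=False)
    vol = (vol - vol.mean()) / vol.std().clamp_min(1e-6)     # zero-mean/unit variance
    if torch.rand(()) < 0.5:                                 # random in-plane flip
        vol = torch.flip(vol, dims=[-1])
    return vol[0, 0]                                         # back to (D, H, W)
```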
Training:
- Optimize with AdamW; learning rates are typically smaller for fine-tuning MST than for training 3D CNNs from scratch (a training-loop sketch follows this list).
- Batch size limited by GPU memory (often as low as 1–2 volumes).
- Early stopping after fixed epochs without validation AUC improvement.
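A compact training-loop sketch combining these choices (AdamW, small batches, early stopping on validation AUC); the learning rate, patience, and helper names are illustrative.

```python
import torch

def train(model, train_loader, compute_val_auc, lr=1e-4, patience=10, max_epochs=200):
    """AdamW optimization with early stopping once validation AUC stops improving."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_auc, epochs_without_gain = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for volumes, labels in train_loader:        # batches of 1-2 volumes are typical
            opt.zero_grad()
            loss_fn(model(volumes), labels).backward()
            opt.step()
        auc = compute_val_auc(model)                # validation AUC after each epoch
        if auc > best_auc:
            best_auc, epochs_without_gain = auc, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:     # early stopping
                break
```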
Architecture Parameters:
- Transformer: up to 12 encoder layers; token dimensions up to 768; up to 12 attention heads; a [CLS] token for aggregation.
- U-Net encoders/decoders: block counts and channel widths follow the segmentation variant (Yan et al., 2021).
- Multi-sequence or multi-view inputs via concatenation or multiple MST streams.
Attention and MLP block depths can be reduced for memory efficiency. Positional embeddings should encode both patch position and slice index to maximize inter-slice ordering fidelity.
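One way to realize such positional encoding is a factorized additive embedding over in-slice patch position and slice index, as in this illustrative sketch:

```python
import torch
import torch.nn as nn

class PatchSlicePositionalEmbedding(nn.Module):
    """Adds separate learnable embeddings for patch position and slice index,
    so the transformer can recover 3D ordering from 2D patch tokens."""

    def __init__(self, dim: int, num_patches: int, max_slices: int):
        super().__init__()
        self.patch_pos = nn.Parameter(torch.zeros(1, 1, num_patches, dim))
        self.slice_pos = nn.Parameter(torch.zeros(1, max_slices, 1, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, S, P, D) -- P patch tokens for each of S slices
        B, S, P, _ = tokens.shape
        return tokens + self.patch_pos[:, :, :P] + self.slice_pos[:, :S]
```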
7. Limitations, Error Analysis, and Future Directions
Limitations:
- Downsampling volumes to the 2D backbone's input resolution may lose fine structure relevant to thin anatomical regions or subtle lesions (Müller-Franzes et al., 24 Nov 2024, Nguyen et al., 8 Nov 2025).
- Most studies employ single-sequence MRI or non-contrast-only CT; multimodal integration remains underexplored (Müller-Franzes et al., 24 Nov 2024).
Error Analysis:
- False negatives are predominantly small (millimeter-scale) or non-mass enhancement lesions (Nguyen et al., 8 Nov 2025).
- Domain shift in external validation (e.g., histology-based vs. BI-RADS-based reference standards) warrants further calibration.
Research Directions:
- Multi-stream or multimodal MST: joint analysis of T1, T2, DWI, or clinical metadata (Müller-Franzes et al., 24 Nov 2024).
- Hierarchical transformer integration for true 3D patch tokenization (Müller-Franzes et al., 24 Nov 2024).
- Large-scale prospective trials and multicenter validation.
- Architecture search for further reduction in parameter and memory footprint.
This suggests MST may evolve into the core of robust, explainable, and scalable AI models for medical image analysis, provided training protocols and interpretability methods continue to advance.