SRPP in 3D Medical Image Segmentation
- The paper introduces SRPP as a self-supervised auxiliary module that predicts relative slice positions, enhancing inter-slice anatomical consistency.
- SRPP integrates a Transformer encoder and a lightweight MLP to model inter-slice dependencies, improving segmentation overlap and boundary delineation.
- Ablation results show that SRPP raises Dice scores and reduces boundary errors, supporting its use as a training-time regularizer in 3D medical segmentation.
Slice Relative Position Prediction (SRPP) is a self-supervised auxiliary module introduced in the SAM2-3dMed framework to give feature representations explicit bidirectional anatomical awareness for 3D medical image segmentation. Video-centric architectures such as SAM2 rely on temporal continuity, which does not map directly onto the spatial contiguity of volumetric medical data. SRPP addresses this mismatch by guiding the network to model inter-slice dependencies: it predicts the relative position between arbitrary slice pairs from their embeddings. The module operates only during training, where it acts as a regularizer that encourages anatomically consistent feature hierarchies, and it is reported to improve segmentation overlap, boundary delineation, and inter-slice mask coherence (Yang et al., 10 Oct 2025).
1. Architectural Placement and Integration
SRPP is positioned in parallel with the segmentation and boundary detection branches within the SAM2-3dMed framework. The frozen SAM2 Image Encoder produces volumetric features $F \in \mathbb{R}^{C \times D \times H \times W}$, where $C$ is the channel count, $D$ the number of slices, and $H \times W$ the spatial dimensions. Each slice's feature map, $F_i \in \mathbb{R}^{C \times H \times W}$, is processed by the SRPP module. Feature vectors corresponding to individual slices are aggregated and passed through a dedicated Transformer encoder, yielding contextually enriched slice embeddings. These embeddings are the basis for pairwise relative position prediction across all slice combinations, and the resulting gradients are backpropagated through the features produced by the frozen backbone to encourage feature discriminability aligned with anatomical continuity.
2. Input/Output Specification and Data Flow
The SRPP module receives the entire per-volume feature tensor $F$ as input. Slice-wise features $F_i$ and $F_j$ are either spatially flattened or globally pooled, then projected or embedded into vectors $z_i$ and $z_j$ of dimension $d$. For two distinct slices $i \neq j$, their embeddings $\tilde{z}_i$ and $\tilde{z}_j$ (post-Transformer) are concatenated and fed to a lightweight two-layer MLP. The output is a scalar $\hat{\Delta}_{ij}$, an estimate of the ground-truth positional offset $\Delta_{ij}$, which is nonzero for every ordered slice pair. The prediction function is formally:

$$\hat{\Delta}_{ij} = \mathrm{MLP}\big([\tilde{z}_i \,;\, \tilde{z}_j]\big), \qquad i \neq j.$$
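The module's inputs and outputs at each stage are summarized below.
| Input | Output | Notes |
|---|---|---|
| Slice features $F_i$, $F_j$ for all slice pairs | $z_i$, $z_j$ | Flattened or pooled, then projected |
| $z_i$, $z_j$ | $\tilde{z}_i$, $\tilde{z}_j$ | Embedding via Transformer encoder |
| $[\tilde{z}_i \,;\, \tilde{z}_j]$ | Scalar prediction $\hat{\Delta}_{ij}$ | Relative position estimate per pair |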
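The first stage of this data flow can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: it assumes global average pooling for the per-slice reduction and a signed index offset `j - i` as the target, and the names `pool_slices` and `ordered_pairs` are hypothetical.

```python
import numpy as np

def pool_slices(features):
    """Global-average-pool a (C, D, H, W) feature volume into D vectors of size C."""
    # Average over the spatial axes (H, W), then put the slice axis first.
    return features.mean(axis=(2, 3)).T  # shape (D, C)

def ordered_pairs(num_slices):
    """Return all ordered pairs (i, j) with i != j and their signed offsets j - i."""
    pairs = [(i, j) for i in range(num_slices) for j in range(num_slices) if i != j]
    offsets = [j - i for i, j in pairs]  # assumed sign convention; never zero
    return pairs, offsets

features = np.random.default_rng(0).normal(size=(8, 4, 5, 5))  # C=8, D=4, H=W=5
slice_vectors = pool_slices(features)
pairs, offsets = ordered_pairs(num_slices=features.shape[1])
```

With `D` slices there are `D * (D - 1)` ordered pairs, so the number of supervision targets grows quadratically with volume depth even though no manual labels are needed.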
3. Feature Extraction and Embedding Mechanism
To ensure high-capacity relational reasoning, the SRPP extracts per-slice features and processes them with a Transformer encoder. Slice features $F_i$ are mapped to embeddings $z_i \in \mathbb{R}^{d}$ (either via flattening or a learned linear projection), and the set $\{z_i\}_{i=1}^{D}$ is input to a Transformer encoder with $L$ layers of multi-head self-attention, each followed by a feed-forward MLP and layer normalization. The output collection $\{\tilde{z}_i\}_{i=1}^{D}$ enables global context modeling, with each embedding containing information about both its own slice and its context within the entire volume. The self-supervised objective drives the embedding space to obey relative ordering constraints, facilitating globally aware volumetric feature representations.
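To make the global-context step concrete, here is a minimal NumPy sketch of one single-head self-attention layer with a residual connection and layer normalization, applied over the slice embeddings of one volume. It is a simplified stand-in, not the module from the paper: the single head, the random weights, the omitted feed-forward sublayer, and the name `self_attention_layer` are all assumptions made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention_layer(z, w_q, w_k, w_v):
    """One single-head self-attention layer over slice embeddings z of shape (D, d)."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    scores = q @ k.T / np.sqrt(z.shape[1])        # (D, D) slice-to-slice affinities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # each row sums to 1
    return layer_norm(z + attn @ v), attn         # residual connection, then norm

rng = np.random.default_rng(0)
num_slices, dim = 6, 16
z = rng.normal(size=(num_slices, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
z_tilde, attn = self_attention_layer(z, w_q, w_k, w_v)
```

Each row of `attn` mixes information from every slice in the volume, which is how a single output embedding comes to encode both its own slice and its surrounding context.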
4. Prediction Head, Loss Function, and Training Protocol
The pairwise position predictor is a simple MLP with the following architecture:
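- Input: $[\tilde{z}_i \,;\, \tilde{z}_j]$ (dimension $2d$)
- Hidden layer: $h_{ij} = \sigma\big(W_1 [\tilde{z}_i \,;\, \tilde{z}_j] + b_1\big)$ with nonlinearity $\sigma$, dimension $d_h$
- Output: $\hat{\Delta}_{ij} = W_2 h_{ij} + b_2$, dimension 1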
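A head of this shape could look like the sketch below. The ReLU nonlinearity, the hidden width of 32, and the Gaussian initialization are assumptions chosen for illustration, and `PairOffsetHead` is a hypothetical name; only the two-layer structure over a concatenated pair embedding comes from the description above.

```python
import numpy as np

class PairOffsetHead:
    """Two-layer MLP mapping a concatenated pair of slice embeddings to one scalar."""

    def __init__(self, embed_dim, hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        # Weights start from random values, independent of any pretrained model.
        self.w1 = rng.normal(scale=0.02, size=(2 * embed_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(scale=0.02, size=(hidden_dim, 1))
        self.b2 = np.zeros(1)

    def __call__(self, z_i, z_j):
        pair = np.concatenate([z_i, z_j], axis=-1)        # (..., 2 * embed_dim)
        hidden = np.maximum(pair @ self.w1 + self.b1, 0)  # ReLU (assumed)
        return (hidden @ self.w2 + self.b2)[..., 0]       # one scalar per pair

head = PairOffsetHead(embed_dim=16)
embeddings = np.random.default_rng(1).normal(size=(6, 16))  # 6 slices
idx_i, idx_j = np.array([0, 0, 5]), np.array([1, 5, 2])     # three ordered pairs
predictions = head(embeddings[idx_i], embeddings[idx_j])
```

Because the two embeddings are concatenated rather than summed, the head can distinguish the pair `(i, j)` from `(j, i)`, which a signed offset target requires.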
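All MLP head weights are initialized independently and trained from scratch. The training loss for SRPP is a mean-squared error over all ordered slice pairs:

$$\mathcal{L}_{\mathrm{SRPP}} = \frac{1}{D(D-1)} \sum_{i \neq j} \big(\hat{\Delta}_{ij} - \Delta_{ij}\big)^2,$$

where $\hat{\Delta}_{ij}$ is the predicted offset and $\Delta_{ij}$ the ground-truth offset for the ordered pair $(i, j)$, and $D$ is the number of slices.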
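The pairwise loss can be computed as in the following sketch. It assumes the target is the signed index offset `j - i` and that the mean is taken over all `D * (D - 1)` ordered pairs; `srpp_loss` is a hypothetical name.

```python
import numpy as np

def srpp_loss(pred_offsets):
    """Mean-squared error between a (D, D) matrix of predicted offsets and j - i.

    pred_offsets[i, j] is the prediction for the ordered pair (i, j); the
    diagonal (i == j) is ignored.
    """
    num_slices = pred_offsets.shape[0]
    idx = np.arange(num_slices)
    target = idx[None, :] - idx[:, None]    # target[i, j] = j - i
    mask = ~np.eye(num_slices, dtype=bool)  # keep only pairs with i != j
    return np.mean((pred_offsets[mask] - target[mask]) ** 2)

idx = np.arange(4)
perfect = (idx[None, :] - idx[:, None]).astype(float)
print(srpp_loss(perfect))        # 0.0
print(srpp_loss(perfect + 0.5))  # 0.25
```

A constant error of 0.5 on every pair gives a loss of 0.25, which confirms that the diagonal is excluded and the mean runs over off-diagonal entries only.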
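The full network's objective combines segmentation loss ($\mathcal{L}_{\mathrm{seg}}$, e.g., Dice loss), boundary detection loss ($\mathcal{L}_{\mathrm{bd}}$), and SRPP loss, weighted by hyperparameters $\lambda_1$ and $\lambda_2$:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{seg}} + \lambda_1 \mathcal{L}_{\mathrm{bd}} + \lambda_2 \mathcal{L}_{\mathrm{SRPP}},$$

with $\lambda_1$ and $\lambda_2$ set by grid search (Yang et al., 10 Oct 2025).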
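As an illustration of a weighted objective and a grid search over its weights, the sketch below combines three scalar losses and picks the weight pair with the best validation score. The names `lambda_bd` and `lambda_srpp`, the candidate grid, and the toy `validation_dice` function (whose optimum is arbitrary) are all hypothetical; in practice each candidate would require a full training run.

```python
from itertools import product

def total_loss(seg_loss, bd_loss, srpp_loss, lambda_bd, lambda_srpp):
    """Weighted sum of the segmentation, boundary, and SRPP losses."""
    return seg_loss + lambda_bd * bd_loss + lambda_srpp * srpp_loss

def grid_search(candidates, score_fn):
    """Return the (lambda_bd, lambda_srpp) pair with the highest score."""
    return max(product(candidates, repeat=2), key=lambda pair: score_fn(*pair))

def validation_dice(lambda_bd, lambda_srpp):
    """Toy stand-in for 'train with these weights, then measure validation Dice'."""
    return 1.0 - (lambda_bd - 0.5) ** 2 - (lambda_srpp - 0.1) ** 2

best = grid_search([0.1, 0.5, 1.0], validation_dice)
print(best)  # (0.5, 0.1)
print(total_loss(0.30, 0.20, 4.0, *best))
```

The SRPP loss is measured in squared slice offsets and can be much larger than a Dice loss bounded by 1, which is one reason its weight generally needs tuning rather than being fixed at 1.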
5. Role During Training and Influence on Segmentation
SRPP operates exclusively during training. Its auxiliary gradients, backpropagated through the features produced by the frozen encoder, encourage the downstream segmentation branches (Memory Attention, Mask Decoder) to exploit anatomy-aware representations. As a result, the network produces fewer slice-to-slice discontinuities (“slice jumps”) and smoother, more volumetrically consistent segmentation outputs. At inference, the SRPP branch is disabled; its only role is to have imposed spatial relational structure on the features during training.
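The train-only behavior amounts to a conditional branch in the forward pass. The sketch below uses placeholder callables rather than real SAM2 components, and `SegmenterWithAux` and its arguments are hypothetical names.

```python
class SegmenterWithAux:
    """Wraps a main head and an auxiliary head that runs only in training mode."""

    def __init__(self, encoder, seg_head, aux_head):
        self.encoder, self.seg_head, self.aux_head = encoder, seg_head, aux_head
        self.training = True

    def forward(self, volume):
        features = self.encoder(volume)
        outputs = {"mask": self.seg_head(features)}
        if self.training:
            # The auxiliary prediction is only needed to compute the training loss.
            outputs["aux"] = self.aux_head(features)
        return outputs

# Placeholder callables standing in for the real networks.
model = SegmenterWithAux(
    encoder=lambda v: v, seg_head=lambda f: "mask", aux_head=lambda f: "offsets"
)
train_out = model.forward("volume")
model.training = False
infer_out = model.forward("volume")
print(sorted(train_out), sorted(infer_out))  # ['aux', 'mask'] ['mask']
```

Since the auxiliary head is skipped at inference, it adds no cost to deployment.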
6. Quantitative Impact and Empirical Evidence
Removing SRPP in the ablation study degrades segmentation and boundary localization metrics on every dataset tested. For example, on the Medical Segmentation Decathlon (MSD) Lung dataset, mean Dice drops from 0.7627 (with SRPP) to 0.7535 (without SRPP), and the 95th-percentile Hausdorff distance (HD95) increases from 3.5148 mm to 8.5837 mm. Similar trends are observed on the Spleen and Pancreas datasets. The following table summarizes the change in Dice:
| Dataset | Full Dice | w/o SRPP Dice | ΔDice |
|---|---|---|---|
| Lung | 0.7627 | 0.7535 | –0.0092 |
| Spleen | 0.9727 | 0.9672 | –0.0055 |
| Pancreas | 0.7039 | 0.6459 | –0.0580 |
Boundary-sensitive scores (HD95, NSD) also deteriorate without SRPP; on Lung, for instance, HD95 more than doubles. These results support its role in stable, accurate volumetric segmentation (Yang et al., 10 Oct 2025).
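The deltas in the table and the HD95 change on Lung can be rechecked directly from the reported numbers. The snippet below only does arithmetic on the values quoted above; the variable names are illustrative.

```python
# Dice with and without SRPP, as reported in the ablation table.
dice = {
    "Lung": (0.7627, 0.7535),
    "Spleen": (0.9727, 0.9672),
    "Pancreas": (0.7039, 0.6459),
}

for name, (full, ablated) in dice.items():
    delta = ablated - full
    relative = 100 * delta / full
    print(f"{name}: delta={delta:+.4f} ({relative:+.2f}% relative)")

# HD95 on MSD Lung, in millimetres.
hd95_full, hd95_ablated = 3.5148, 8.5837
print(f"HD95 ratio: {hd95_ablated / hd95_full:.2f}x")
```

The absolute Dice drop is largest on Pancreas, while on Lung the Dice change is small and the boundary metric moves far more, so the two kinds of metric are worth reading together.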
7. Significance and Broader Implications
SRPP demonstrates a principled approach to equipping video-centric foundation models (e.g., SAM2) with a spatial-relational inductive bias for medical imaging tasks, bridging the gap between temporal and anatomical continuity. By explicitly requiring the network to reason about inter-slice ordering and relationships, the method offers a template for adapting other foundation models to volumetric data in healthcare and beyond. A plausible implication is an extension to other modalities where bidirectional spatial dependencies are vital, and where label efficiency or self-supervised anatomical priors are advantageous.