SRPP in 3D Med Image Segmentation

Updated 25 March 2026
  • The paper introduces SRPP as a self-supervised auxiliary module that predicts relative slice positions, enhancing inter-slice anatomical consistency.
  • SRPP combines a Transformer encoder with a lightweight MLP to model inter-slice dependencies, improving segmentation overlap and boundary delineation.
  • Quantitative results show that SRPP raises Dice scores and reduces boundary errors, supporting its effectiveness as a training regularizer in 3D medical segmentation.

Slice Relative Position Prediction (SRPP) is a self-supervised auxiliary module introduced in the SAM2-3dMed framework to impose explicit bidirectional anatomical awareness on feature representations for 3D medical image segmentation. Video-centric architectures such as SAM2 were originally built around temporal continuity, which does not map directly onto the spatially contiguous context of volumetric medical data. SRPP addresses this mismatch by guiding the network to model inter-slice dependencies: it predicts the relative positions between arbitrary slice pairs in the embedding space. The SRPP module operates only during training, where it acts as a regularizer that encourages anatomically consistent feature hierarchies, resulting in improved segmentation overlap, boundary delineation, and inter-slice mask coherence (Yang et al., 10 Oct 2025).

1. Architectural Placement and Integration

SRPP is positioned in parallel with the segmentation and boundary detection branches within the SAM2-3dMed framework. The frozen SAM2 Image Encoder produces volumetric features $Z \in \mathbb{R}^{C \times D \times H' \times W'}$, where $C$ is the channel count, $D$ the number of slices, and $H', W'$ the spatial dimensions. Each slice's feature map, $Z_i \in \mathbb{R}^{C \times H' \times W'}$, is processed by the SRPP module. Feature vectors corresponding to individual slices are aggregated and passed through a dedicated Transformer encoder, yielding contextually enriched slice embeddings. These embeddings are the basis for pairwise relative position prediction across all slice combinations, with the resulting gradients flowing back through the features of the frozen backbone to encourage feature discriminability aligned with anatomical continuity.

2. Input/Output Specification and Data Flow

The SRPP module receives the entire per-volume feature tensor $Z$ as input. Slice-wise features $Z_i$ and $Z_j$ are either spatially flattened or globally pooled, then projected or embedded into vectors of dimension $d_{model}$. For two distinct slices $i \neq j$, their embeddings $E_i, E_j \in \mathbb{R}^{d_{model}}$ (post-Transformer) are concatenated and fed to a lightweight two-layer MLP. The output is a scalar $(P_{pos})_{i,j}$, an estimate of the ground-truth positional offset $(GT_{pos})_{i,j} = j - i$, which is nonzero for every ordered slice pair. The prediction function is formally:

$$(P_{pos})_{i,j} = \mathrm{MLP}([E_i; E_j])$$

The module's inputs and outputs are summarized below.

| Input | Output | Notes |
| $Z$ | $(P_{pos})_{i,j}$ | $i \neq j$, for all slice pairs |
| $Z_i$, $Z_j$ | $E_i$, $E_j$ | Embedding via Transformer encoder |
| $[E_i; E_j]$ | Scalar prediction | Relative position estimate per pair |
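The pairing scheme in this section can be made concrete with a small sketch. It enumerates every ordered slice pair with distinct indices and records the signed target offset j - i. The function name `ground_truth_offsets` is hypothetical and is not taken from the paper.

```python
from itertools import permutations

def ground_truth_offsets(num_slices):
    """Map each ordered pair (i, j) with i != j to the signed offset j - i.

    Slices are indexed 1..D to match the notation in the text.
    """
    indices = range(1, num_slices + 1)
    return {(i, j): j - i for i, j in permutations(indices, 2)}

offsets = ground_truth_offsets(4)
print(len(offsets))      # 12, i.e. D * (D - 1) ordered pairs for D = 4
print(offsets[(1, 4)])   # 3
print(offsets[(4, 1)])   # -3
```

Because the pairs are ordered, each unordered pair contributes two targets of opposite sign, which is what makes the supervision bidirectional.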

3. Feature Extraction and Embedding Mechanism

To support relational reasoning across the volume, SRPP extracts per-slice features and processes them with a Transformer encoder. Each slice feature $Z_i$ is mapped to a token (either via flattening or a learned linear projection), and the resulting sequence derived from $\{Z_1, \ldots, Z_D\}$ is input to a Transformer encoder with $N$ layers of multi-head self-attention, each followed by a feed-forward MLP and layer normalization. The output collection $E \in \mathbb{R}^{D \times d_{model}}$ enables global context modeling, with each embedding $E_i$ containing information about both its own slice and its context within the entire volume. The self-supervised objective drives the embedding space to obey relative ordering constraints, facilitating globally aware volumetric feature representations.
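The data flow described here (per-slice reduction followed by attention across slices) can be illustrated with a heavily simplified sketch. It assumes global average pooling as the slice reduction and uses a single attention head with identity query, key, and value projections. Learned projections, multiple heads, the feed-forward sublayer, and layer normalization are all omitted, so this shows only how each output vector mixes information from every slice, not the module as published.

```python
import math

def global_average_pool(slice_feature):
    """Collapse one slice's C x H' x W' feature map into a length-C vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in slice_feature
    ]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    contextual = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        contextual.append([
            sum(w * value[c] for w, value in zip(weights, tokens))
            for c in range(dim)
        ])
    return contextual

# Toy volume: D = 3 slices, C = 2 channels, H' = W' = 2 (values are made up).
volume = [
    [[[0.0, 0.2], [0.1, 0.1]], [[1.0, 1.0], [1.0, 1.0]]],
    [[[0.5, 0.5], [0.5, 0.5]], [[0.8, 0.6], [0.7, 0.7]]],
    [[[1.0, 0.9], [1.1, 1.0]], [[0.2, 0.2], [0.2, 0.2]]],
]
tokens = [global_average_pool(s) for s in volume]   # D vectors of length C
embeddings = self_attention(tokens)                  # D contextual vectors
print(len(embeddings), len(embeddings[0]))           # 3 2
```

Each row of `embeddings` is a weighted average of all slice tokens, which is the sense in which an embedding carries context from the whole volume.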

4. Prediction Head, Loss Function, and Training Protocol

The pairwise position predictor is a simple MLP with the following architecture:

  • Input: $[E_i; E_j]$ (dimension $2d_{model}$)
  • Hidden layer: $hidden = \mathrm{ReLU}(W_1 [E_i; E_j] + b_1)$, dimension $d_h$
  • Output: $P_{pos}(i,j) = W_2\, hidden + b_2$, dimension 1
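A minimal sketch of the forward pass through such a head follows. The layer sizes, the uniform weight range, and the zero biases are illustrative assumptions, and the helper names `make_linear`, `linear`, and `predict_offset` are hypothetical.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def linear(weights, bias, x):
    """Compute W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def predict_offset(e_i, e_j, layer1, layer2):
    """Two-layer MLP head: concatenate, ReLU hidden layer, scalar output."""
    pair = list(e_i) + list(e_j)                      # [E_i; E_j], length 2 * d_model
    hidden = [max(0.0, h) for h in linear(*layer1, pair)]
    return linear(*layer2, hidden)[0]

rng = random.Random(0)
d_model, d_h = 4, 8                                   # assumed sizes for illustration
layer1 = make_linear(2 * d_model, d_h, rng)
layer2 = make_linear(d_h, 1, rng)
e_i = [rng.random() for _ in range(d_model)]
e_j = [rng.random() for _ in range(d_model)]
print(predict_offset(e_i, e_j, layer1, layer2))       # untrained, so the value is arbitrary
```

The concatenation order matters: swapping the two embeddings presents a different input, which lets the head learn that the target for (j, i) is the negative of the target for (i, j).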

All MLP head weights are initialized independently and trained from scratch. The training loss for SRPP is a mean-squared error over all ordered slice pairs:

$$\mathcal{L}_{srpp} = \frac{1}{D(D-1)} \sum_{i=1}^{D} \sum_{j=1, j \neq i}^{D} \left(P_{pos}(i,j) - (j-i)\right)^2$$
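The loss can be checked on small inputs with a direct transcription of the formula. In this sketch, `predictions[i][j]` is an assumed layout holding the predicted offset for the ordered pair (i, j), and diagonal entries are ignored. Indices are 0-based here, which leaves the offsets j - i unchanged.

```python
def srpp_loss(predictions):
    """Mean-squared error between predicted and true offsets over ordered pairs."""
    d = len(predictions)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                total += (predictions[i][j] - (j - i)) ** 2
    return total / (d * (d - 1))

perfect = [[j - i for j in range(3)] for i in range(3)]
zeros = [[0.0] * 3 for _ in range(3)]
print(srpp_loss(perfect))  # 0.0
print(srpp_loss(zeros))    # 2.0, since the squared offsets sum to 12 over 6 pairs
```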

The full network's objective combines the segmentation loss ($\mathcal{L}_{seg}$, e.g., Dice loss) with the SRPP loss and the boundary detection loss ($\mathcal{L}_{bd}$), weighted by hyperparameters $\lambda_1$ and $\lambda_2$ respectively:

$$\mathcal{L}_{total} = \mathcal{L}_{seg} + \lambda_1 \mathcal{L}_{srpp} + \lambda_2 \mathcal{L}_{bd}$$

with $\lambda_1 = 0.01$ and $\lambda_2 = 0.1$ set by grid search (Yang et al., 10 Oct 2025).
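As a worked illustration of the weighting, the sketch below combines three loss values using the reported coefficients. The individual loss values are made up for the example and do not come from the paper.

```python
def total_loss(seg_loss, srpp_loss_value, bd_loss, lambda_srpp=0.01, lambda_bd=0.1):
    """Weighted sum of the segmentation, SRPP, and boundary detection losses."""
    return seg_loss + lambda_srpp * srpp_loss_value + lambda_bd * bd_loss

# Hypothetical per-batch loss values, chosen only to show the relative scale.
combined = total_loss(seg_loss=0.30, srpp_loss_value=2.0, bd_loss=0.5)
print(round(combined, 4))  # 0.37
```

With these weights, an SRPP loss of 2.0 contributes only 0.02 to the total, so the auxiliary term nudges the features without dominating the segmentation objective.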

5. Role During Training and Influence on Segmentation

SRPP operates exclusively during training. Its auxiliary gradients, backpropagated through the frozen encoder's features, induce the downstream segmentation branches (Memory Attention, Mask Decoder) to exploit anatomy-aware representations. As a result, the network reduces slice-to-slice discontinuities ("slice jumps") and produces smoother, volumetrically consistent segmentation outputs. At inference, the SRPP branch is disabled; its only role is to have imposed spatial relational structure on the features during training.
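The training-only pattern can be sketched with a toy model whose auxiliary branch runs only while a `training` flag is set. The class `ToyVolumeSegmenter` and both of its branches are placeholders invented for illustration; they stand in for the real segmentation and SRPP branches and do not reproduce them.

```python
class ToyVolumeSegmenter:
    """Stand-in model with a main branch and a training-only auxiliary branch."""

    def __init__(self):
        self.training = True

    def segment(self, slice_scores):
        # Placeholder main branch: threshold each slice score into a 0/1 label.
        return [1 if s > 0.5 else 0 for s in slice_scores]

    def predict_offsets(self, slice_scores):
        # Placeholder auxiliary branch: one value per ordered slice pair.
        n = len(slice_scores)
        return {(i, j): slice_scores[j] - slice_scores[i]
                for i in range(n) for j in range(n) if i != j}

    def forward(self, slice_scores):
        outputs = {"mask": self.segment(slice_scores)}
        if self.training:
            # The auxiliary output exists only to supply a training loss.
            outputs["srpp"] = self.predict_offsets(slice_scores)
        return outputs

model = ToyVolumeSegmenter()
train_out = model.forward([0.2, 0.7, 0.9])
model.training = False
infer_out = model.forward([0.2, 0.7, 0.9])
print(sorted(train_out))  # ['mask', 'srpp']
print(sorted(infer_out))  # ['mask']
```

The mask output is identical in both modes, so dropping the auxiliary branch at inference adds no cost and changes no predictions.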

6. Quantitative Impact and Empirical Evidence

Ablating SRPP degrades segmentation and boundary localization metrics on every reported test dataset. For example, on the Medical Segmentation Decathlon (MSD) Lung dataset, mean Dice drops from 0.7627 (with SRPP) to 0.7535 (without SRPP), and the 95th-percentile Hausdorff distance (HD95) increases from 3.5148 mm to 8.5837 mm. Similar trends are observed on the Spleen and Pancreas datasets. The following table summarizes the change in Dice performance:

| Dataset | Dice (full model) | Dice (w/o SRPP) | ΔDice |
| Lung | 0.7627 | 0.7535 | -0.0092 |
| Spleen | 0.9727 | 0.9672 | -0.0055 |
| Pancreas | 0.7039 | 0.6459 | -0.0580 |
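The deltas in the table can be reproduced from the reported scores with a few lines of arithmetic. The sketch also expresses the Lung HD95 change as a ratio. All numbers are the ones quoted in this section, and `dice_delta` is a hypothetical helper name.

```python
ablation = {
    "Lung": (0.7627, 0.7535),
    "Spleen": (0.9727, 0.9672),
    "Pancreas": (0.7039, 0.6459),
}

def dice_delta(full, ablated):
    """Change in Dice when SRPP is removed (negative means worse)."""
    return round(ablated - full, 4)

for dataset, (full, ablated) in ablation.items():
    print(f"{dataset}: {dice_delta(full, ablated):+.4f}")
# Lung: -0.0092
# Spleen: -0.0055
# Pancreas: -0.0580

hd95_full, hd95_ablated = 3.5148, 8.5837
print(f"Lung HD95 ratio: {hd95_ablated / hd95_full:.2f}")  # 2.44
```

On Lung, the Dice change is under one point while HD95 more than doubles, which suggests the module's effect there shows up mainly at the boundaries.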

Boundary-sensitive scores (HD95, NSD) also deteriorate without SRPP, supporting its value for stable, accurate volumetric segmentation (Yang et al., 10 Oct 2025).

7. Significance and Broader Implications

SRPP demonstrates a principled approach to equipping video-centric foundation models (e.g., SAM2) with a spatial-relational inductive bias for medical imaging tasks, bridging the gap between temporal continuity and anatomical continuity. By explicitly forcing networks to reason about inter-slice ordering and relationships, the method offers a template for adapting a wide range of foundation models to volumetric data in healthcare and beyond. A plausible implication is that the approach could extend to other modalities where bidirectional spatial dependencies are vital, and where label efficiency or self-supervised anatomical priors are advantageous.

References (1)