Offset-Adjusted Mask2Former
- The paper introduces algebraic offset adjustment strategies within deformable attention, achieving segmentation gains of up to +13.6 mean Dice over an nnU-Net baseline, with the largest improvements on small anatomical structures.
- It integrates a fourth-stage CNN feature map as a coarse spatial prior to guide attention towards compact organs and reduce irrelevant background influence.
- An auxiliary FCN segmentation head with Dice loss is employed to reinforce foreground learning, mitigate background distractions, and accelerate model convergence.
Offset-Adjusted Mask2Former is a transformer-based segmentation framework designed to enhance accuracy for mid-sized and small organ segmentation in medical images. Building upon Mask2Former with deformable attention modules, this approach introduces offset adjustment strategies, leverages the fourth CNN feature map for a coarse organ location prior, and adds a fully convolutional network (FCN) auxiliary head with Dice loss. These architectural innovations specifically address the unreliable sampling patterns and convergence challenges encountered when segmenting small, compact anatomical structures using generic transformer architectures (Zhang et al., 6 Jun 2025).
1. Baseline Framework and Deformable Attention
The foundation of Offset-Adjusted Mask2Former is Mask2Former, which applies multi-scale CNN backbone features to transformer decoding for universal segmentation tasks. Instead of prohibitive dense attention over all pixels, Mask2Former uses the multi-scale deformable attention (MSDeformAttn) mechanism introduced in Deformable DETR, where each query is parameterized by $H$ attention heads, $L$ feature levels, and $K$ sampling points per head/level.
The multi-scale deformable attention output for a query $z_q$ at spatial reference $\hat{p}_q$ is:

$$
\mathrm{MSDeformAttn}\big(z_q, \hat{p}_q, \{F_l\}_{l=1}^{L}\big) \;=\; \sum_{h=1}^{H} W_h \left[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{hlk} \cdot W'_h\, F_l\big(\phi_l(\hat{p}_q) + \Delta p_{hlk}\big) \right]
$$

Here:
- $F_l$ is the $l$-th feature map from the CNN backbone, and $\phi_l$ rescales the reference point to its resolution.
- $\Delta p_{hlk}$ is the learned offset.
- $A_{hlk}$ is the attention weight, normalized over the $L \times K$ sampling points.
- $W_h$ (together with the value projection $W'_h$) projects the sampled feature into the decoder embedding space.
This approach reduces self- and cross-attention complexity from $O(N_q N_k)$ over all $N_k$ pixel tokens to $O(N_q H L K)$ over the sampled points, making the framework tractable for large-scale and 2D/3D hybrid medical image inputs.
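The sampling step can be made concrete with a short sketch. The snippet below is a minimal, single-query illustration of the multi-scale deformable attention read-out in PyTorch; the tensor shapes, the `value_proj` layer, and the normalization of reference points are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def deformable_sample(feature_maps, ref_point, offsets, attn_weights, value_proj):
    """Minimal single-query sketch of multi-scale deformable attention.

    feature_maps : list of L tensors, each of shape (1, C, H_l, W_l)
    ref_point    : tensor of shape (2,), normalized (x, y) location in [0, 1]
    offsets      : tensor of shape (H, L, K, 2), learned sampling offsets (normalized units)
    attn_weights : tensor of shape (H, L, K), softmax-normalized over the L*K points
    value_proj   : nn.Linear mapping a sampled C-dim feature to the decoder dimension
    """
    H, L, K, _ = offsets.shape
    out = 0.0
    for h in range(H):
        for l, feat in enumerate(feature_maps):
            for k in range(K):
                # sampling location, rescaled to [-1, 1] as expected by grid_sample
                p = (ref_point + offsets[h, l, k]) * 2.0 - 1.0
                grid = p.view(1, 1, 1, 2)
                sampled = F.grid_sample(feat, grid, align_corners=False)  # (1, C, 1, 1)
                out = out + attn_weights[h, l, k] * value_proj(sampled.flatten(1))
    return out  # (1, d_model): aggregated value for this query
```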
2. Offset Adjustment Strategies for Compact Organ Segmentation
Naïve, unconstrained offset sampling in Mask2Former often lets queries attend to irrelevant background around small organ regions. Offset-Adjusted Mask2Former introduces three per-point algebraic strategies that constrain the learned raw offsets before use:
- Threshold Clipping: for each offset vector $\Delta p$,
$$
\Delta p^{\mathrm{adj}} =
\begin{cases}
\Delta p / c, & \lVert \Delta p \rVert > \tau \\
\Delta p, & \text{otherwise}
\end{cases}
$$
with threshold $\tau$ and divisor $c$.
- Softmax Retraction:
$$
w_p = \frac{\exp\!\left(-\lVert \Delta p \rVert\right)}{\sum_{p'} \exp\!\left(-\lVert \Delta p' \rVert\right)}, \qquad \Delta p^{\mathrm{adj}} = w_p\, \Delta p,
$$
where the sum runs over the sampling points of the same query, so that larger offset magnitudes are down-weighted.
- Scaled Softmax (best in practice):
$$
\Delta p^{\mathrm{adj}} = \gamma\, w_p\, \Delta p,
$$
where $w_p$ is the softmax weight from above and $\gamma$ is a scale factor (empirically, this scaled variant yields the best results).
In all cases, the adjusted offset $\Delta p^{\mathrm{adj}}$ replaces the default offset in deformable attention. The third strategy ("Sigmoid*2") provides the best convergence and segmentation quality for compact anatomical targets; a sketch of all three strategies appears below.
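The following is a minimal sketch of the three adjustments, assuming the reconstructed formulas above; the threshold `tau`, divisor `c`, and scale `gamma` defaults are illustrative placeholders rather than the paper's tuned values.

```python
import torch

def adjust_offsets(raw, strategy="scaled_softmax", tau=1.0, c=2.0, gamma=2.0):
    """Sketch of the three offset-adjustment strategies.

    raw : tensor of shape (..., K, 2), raw offsets from the offset head.
    """
    mag = raw.norm(dim=-1, keepdim=True)              # per-point offset magnitude
    if strategy == "clip":
        # threshold clipping: shrink offsets whose magnitude exceeds tau
        return torch.where(mag > tau, raw / c, raw)
    # softmax over the K sampling points; larger magnitudes receive smaller weights
    w = torch.softmax(-mag, dim=-2)
    if strategy == "softmax":
        return w * raw                                 # softmax retraction
    return gamma * w * raw                             # scaled softmax ("Sigmoid*2"-style)
```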
3. Fourth-Stage Feature Map as Coarse Location Prior
Conventional Mask2Former uses only the first three CNN stages ($F_1$–$F_3$) as encoder memory, discarding the fourth, deepest feature map. Offset-Adjusted Mask2Former incorporates the fourth-stage feature $F_4$ to generate a coarse spatial prior distinguishing organ from background:
- $F_4$ is processed by two convolutional layers (with ReLU activations) to produce a coarse map $M_{\mathrm{coarse}}$.
- $M_{\mathrm{coarse}}$ is flattened and concatenated with the memory tokens from $F_1$–$F_3$.
- In the decoder, MSDeformAttn is extended so that, after the standard output $y_{\mathrm{orig}}$, a secondary output $y_{\mathrm{coarse}}$ is computed by attending to $M_{\mathrm{coarse}}$ using its own level of offsets and weights:
$$
y = y_{\mathrm{orig}} + \lambda\, y_{\mathrm{coarse}},
$$
where $\lambda$ is a fixed fusion weight.
This steers query attention toward likely organ regions, which is especially beneficial for compact structures.
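A minimal sketch of this pathway is given below; the channel widths, kernel sizes, and the fusion weight `lam` are assumptions for illustration, since the paper's exact layer configuration is not reproduced here.

```python
import torch
import torch.nn as nn

class CoarsePrior(nn.Module):
    """Sketch: process the fourth-stage feature F4 into a coarse location prior."""

    def __init__(self, c4=2048, d_model=256, lam=1.0):
        super().__init__()
        # two conv layers with ReLU, projecting F4 to the decoder width
        self.proj = nn.Sequential(
            nn.Conv2d(c4, d_model, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.lam = lam  # fusion weight lambda

    def forward(self, f4):
        return self.proj(f4)  # M_coarse: (B, d_model, H4, W4), later flattened into memory tokens

    def fuse(self, y_orig, y_coarse):
        # blend the secondary (coarse) deformable-attention output into the standard one
        return y_orig + self.lam * y_coarse
```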
4. Auxiliary FCN Head and Dice Loss Integration
To further mitigate background distraction and accelerate training, Offset-Adjusted Mask2Former adds a lightweight FCN segmentation head on top of the fourth-stage feature map:
- Architecture: two convolutional layers with a channel projection, followed by bilinear upsampling to match the input dimensions.
- Output: coarse per-class segmentation masks over all $C$ classes (including background).
- Loss: class-wise Dice loss,
$$
\mathcal{L}_{\mathrm{aux}} = \frac{1}{C} \sum_{c=1}^{C} \left( 1 - \frac{2 \sum_{i} p_{c,i}\, g_{c,i} + \epsilon}{\sum_{i} p_{c,i} + \sum_{i} g_{c,i} + \epsilon} \right)
$$
for classes $c = 1, \dots, C$, with predicted probabilities $p_{c,i}$, ground-truth labels $g_{c,i}$, and a small smoothing constant $\epsilon$.
The final loss sums the standard Mask2Former objective (per-query class and mask losses) and the weighted auxiliary Dice-supervised loss:
$$
\mathcal{L} = \mathcal{L}_{\mathrm{Mask2Former}} + \alpha\, \mathcal{L}_{\mathrm{aux}},
$$
where $\alpha$ weights the auxiliary term.
This auxiliary pathway both constrains the main transformer and directly reinforces learning from likely foreground.
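As a sketch of the auxiliary supervision, the snippet below implements a class-wise Dice loss in the form given above; the softmax-plus-one-hot formulation and the smoothing constant `eps` value are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def classwise_dice_loss(logits, target, eps=1e-5):
    """Sketch of the auxiliary class-wise Dice loss.

    logits : (B, C, H, W) auxiliary FCN predictions (background class included).
    target : (B, H, W) integer ground-truth labels in [0, C).
    """
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes=logits.shape[1])   # (B, H, W, C)
    onehot = onehot.permute(0, 3, 1, 2).float()               # (B, C, H, W)
    inter = (probs * onehot).sum(dim=(0, 2, 3))               # per-class intersection
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = (2.0 * inter + eps) / (denom + eps)
    return (1.0 - dice).mean()                                # averaged over classes
```

In training, this term would be scaled by the auxiliary weight and added to the standard Mask2Former losses.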
5. Training Protocols, Efficiency, and Implementation Details
Key implementation aspects include:
- Datasets: HaN-Seg (33 CT for validation, 42 CT+MR for testing); SegRap2023 (100 train, 7 val, 10 test CT).
- Preprocessing (“three-channel trick”, SegRap2023): stack the original slice with an upsampled and a downsampled copy to form the three input channels (see the sketch after this list).
- Backbone: ResNet-50, extracting stages $F_1$–$F_4$.
- Batch size: 2 per GPU, on 8× RTX 4090 GPUs.
- Optimization: Adam optimizer with weight decay.
- Schedule: 40k iteration warmup, 120k total iterations.
- Resource Efficiency: deformable attention yields substantial speed and memory savings over dense attention, enabling affordable 2D–3D hybridization on constrained hardware.
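One plausible reading of the three-channel trick is sketched below: the rescaled copies are resized back to the original grid before stacking. The scale factors `up` and `down` are placeholders, since the exact factors are not stated here.

```python
import torch
import torch.nn.functional as F

def three_channel_slice(ct_slice, up=2.0, down=0.5):
    """Sketch of the SegRap2023 'three-channel trick'.

    ct_slice : tensor of shape (1, 1, H, W), a single CT slice.
    Returns a (1, 3, H, W) tensor stacking the original slice with
    rescaled copies brought back to the original resolution.
    """
    h, w = ct_slice.shape[-2:]
    up_slice = F.interpolate(ct_slice, scale_factor=up, mode="bilinear", align_corners=False)
    down_slice = F.interpolate(ct_slice, scale_factor=down, mode="bilinear", align_corners=False)
    # resize back so all three channels share the original spatial grid
    up_slice = F.interpolate(up_slice, size=(h, w), mode="bilinear", align_corners=False)
    down_slice = F.interpolate(down_slice, size=(h, w), mode="bilinear", align_corners=False)
    return torch.cat([ct_slice, up_slice, down_slice], dim=1)
```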
6. Quantitative Performance and Ablation Studies
The following summarizes the evaluation on benchmark datasets:
| Dataset/Metric | Baseline (nnU-Net) | Naïve Mask2Former | Offset-Adjusted Mask2Former |
|---|---|---|---|
| HaN-Seg (35 CT, mDice) | 58.69 | — | 72.26 |
| HaN-Seg (42 CT+MR, mDice) | — | — | 81.60 |
| HaN-Seg (42 CT+MR, mIoU) | — | — | 70.44 |
| SegRap2023 (CT only, mDice) | 84.65 | 84.18 | 87.77 |
- On HaN-Seg, a gain of +13.6 Dice over nnU-Net and +0.35 Dice over prior SOTA (SegReg) was observed.
- On SegRap2023, Offset-Adjusted Mask2Former outperformed previous top results (mean Dice 87.77 vs 84.65 for nnU-Net).
- Ablations revealed that the largest improvements arise from the combination of offset adjustment ("Sigmoid*2") and the background location head(s). Qualitatively, the largest target-specific gains appeared for Cochlea and Optic Nerve, with consistent improvements for Mandible and Spinal Cord.
7. Full Inference Workflow
The following pseudocode captures the inference steps:
```
for decoder_layer in decoder_layers:
    Q = SelfAttentionBlock(Q)          # standard transformer self-attn
    new_Q = []
    for q in Q:                        # for each object query
        # 1) compute raw offsets and attn weights
        r = W_off(q)                   # shape (H, L, K, 2)
        a = Softmax_over_K(W_att(q))   # shape (H, L, K)

        # 2) apply offset adjustment
        if strategy == 1:
            D = clip_threshold(r, tau, c)           # strategy 1: threshold clipping
        elif strategy == 2:
            w = softmax_over_magnitudes(norm(r))    # weight w_p
            D = w * r                               # strategy 2: softmax retraction
        else:                                       # strategy 3: scaled softmax
            w = softmax_over_magnitudes(norm(r))
            D = gamma * w * r

        # 3) sample features from F1..F3
        y_orig = 0
        for h in range(H):
            for l in [1, 2, 3]:
                for k in range(K):
                    pt = proj_spatial(q.ref_point) + D[h, l, k]
                    feat = bilinear_sample(F_l, pt)
                    y_orig += a[h, l, k] * W_v(feat)

        # 4) sample coarse features from F4 (level index 4 = coarse prior)
        y_coarse = 0
        for h in range(H):
            for k in range(K):
                pt = proj_spatial(q.ref_point) + D[h, 4, k]
                feat = bilinear_sample(M_coarse, pt)
                y_coarse += a[h, 4, k] * W_v(feat)

        # 5) fuse and feed to FFN
        y = y_orig + lam * y_coarse
        new_q = FFN(y + q)             # residual connection
        new_Q.append(new_q)
    Q = new_Q
```
8. Context and Implications
Offset-Adjusted Mask2Former demonstrates state-of-the-art performance on two prominent datasets (HaN-Seg and SegRap2023), especially for mid-sized and small structures where standard Transformer-based methods typically underperform. By algebraically adjusting the deformable offsets, integrating deeper semantic features, and guiding convergence with auxiliary Dice losses, it addresses the chief limitations of previous architectures that relied solely on unconstrained attention or local CNN features. Performance gains are concentrated in anatomically challenging regions and compact organs, suggesting that offset-constrained attention effectively integrates fine-scale foreground context while avoiding background confusion (Zhang et al., 6 Jun 2025).
A plausible implication is that algebraic regularization of attention sampling points can serve as a generic principle for transformer-based segmentation models in domains where compact foregrounds predominate and class imbalance is severe.