
Offset-Adjusted Mask2Former

Updated 29 December 2025
  • The paper introduces algebraic offset adjustment strategies within deformable attention, achieving significant segmentation gains (up to +13.6 Dice improvement) on small anatomical structures.
  • It integrates a fourth-stage CNN feature map as a coarse spatial prior to guide attention towards compact organs and reduce irrelevant background influence.
  • An auxiliary FCN segmentation head with Dice loss is employed to reinforce foreground learning, mitigate background distractions, and accelerate model convergence.

Offset-Adjusted Mask2Former is a transformer-based segmentation framework designed to enhance accuracy for mid-sized and small organ segmentation in medical images. Building upon Mask2Former with deformable attention modules, this approach introduces offset adjustment strategies, leverages the fourth CNN feature map for a coarse organ location prior, and adds a fully convolutional network (FCN) auxiliary head with Dice loss. These architectural innovations specifically address the unreliable sampling patterns and convergence challenges encountered when segmenting small, compact anatomical structures using generic transformer architectures (Zhang et al., 6 Jun 2025).

1. Baseline Framework and Deformable Attention

The foundation of Offset-Adjusted Mask2Former is Mask2Former, which feeds multi-scale CNN backbone features into a transformer decoder for universal segmentation tasks. Instead of prohibitively expensive dense attention over all $H \times W$ pixels, Mask2Former uses the multi-scale deformable attention mechanism introduced in Deformable DETR, where each query $q \in \mathbb{R}^C$ is parameterized by $H$ heads, $L$ feature levels, and $K$ sampling points per head and level (in the attention formulas, $H$ denotes the number of heads rather than the image height).

The multi-scale deformable attention output for a query $q$ at spatial reference point $p_q$ is:

$$y_q = \mathrm{MSDeformAttn}(q, \{F_l\}) = \sum_{h=1}^{H} \sum_{l=1}^{L} \sum_{k=1}^{K} \alpha_{h,l,k}(q) \, W_v F_l(p_q + \Delta p_{h,l,k}(q))$$

Here:

  • $F_l \in \mathbb{R}^{C \times H_l \times W_l}$ is the $l$-th feature map from the CNN backbone.
  • $\Delta p_{h,l,k}(q) = W_\mathrm{off}^{(h,l,k)} q \in \mathbb{R}^2$ is the learned offset.
  • $\alpha_{h,l,k}(q) = \mathrm{Softmax}_k(W_\mathrm{att}^{(h,l)} q)$ is the attention weight.
  • $W_v$ projects the sampled feature into the decoder embedding space.

This approach reduces self- and cross-attention complexity from $O((HW)^2)$ to $O(HLK N_q)$, where $N_q$ is the number of queries, making the framework tractable for large-scale and 2D/3D hybrid medical image inputs.
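The sum over heads, levels, and sampling points can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions that are not from the paper: feature maps are scalar-valued 2D grids (so $W_v$ reduces to the identity), offsets are given in pixel units of each level, the reference point is normalized to $[0, 1]$, and the function names `bilinear_sample` and `ms_deform_attn` are hypothetical.

```python
import math

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at continuous (x, y).
    Neighbours that fall outside the grid contribute zero."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    total = 0.0
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < h and 0 <= xi < w:
                total += wy * wx * fmap[yi][xi]
    return total

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]

def ms_deform_attn(ref, feature_maps, offsets, logits):
    """ref: (x, y) in [0, 1]; feature_maps: list of L grids;
    offsets[h][l][k] = (dx, dy); logits[h][l][k] = raw attention score.
    Assumption: scalar features, so the value projection W_v is the identity."""
    y_q = 0.0
    for h in range(len(offsets)):
        for l, fmap in enumerate(feature_maps):
            alpha = softmax(logits[h][l])        # softmax over the K points
            px = ref[0] * (len(fmap[0]) - 1)     # reference point in level-l pixels
            py = ref[1] * (len(fmap) - 1)
            for k, (dx, dy) in enumerate(offsets[h][l]):
                y_q += alpha[k] * bilinear_sample(fmap, px + dx, py + dy)
    return y_q

# One head, two levels, two sampling points per level, zero offsets.
maps = [[[1.0] * 5 for _ in range(5)], [[1.0] * 3 for _ in range(3)]]
zero_offsets = [[[(0.0, 0.0), (0.0, 0.0)] for _ in maps]]
flat_logits = [[[0.0, 0.0] for _ in maps]]
y = ms_deform_attn((0.5, 0.5), maps, zero_offsets, flat_logits)
```

Each query touches only $H \cdot L \cdot K$ sampled locations, which is where the reduced cost comes from.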

2. Offset Adjustment Strategies for Compact Organ Segmentation

Unconstrained offset sampling in naïve Mask2Former often causes queries for small organ regions to attend to irrelevant background. Offset-Adjusted Mask2Former introduces three per-point algebraic strategies that constrain the learned raw offsets $r_{h,l,k} = W_\mathrm{off}^{(h,l,k)} q$ before they are used:

  1. Threshold Clipping: For each offset vector $r_p$,

$$\Delta_p^{(1)} = r_p \cdot \min\left(1, \frac{\tau}{\|r_p\|}\right)$$

with threshold $\tau$ and divisor $c>1$.

  2. Softmax Retraction:

$$w_p = \frac{\exp(\|r_p\|)}{\sum_{j=1}^P \exp(\|r_j\|)}, \qquad \Delta_p^{(2)} = w_p r_p$$

so that each offset is scaled by a weight $w_p \in (0, 1]$, which shortens it and pulls the sampling point back toward the reference point.

  3. Scaled Softmax (best in practice):

$$\Delta_p^{(3)} = \gamma w_p r_p$$

where $w_p$ is the softmax weight defined above and the scale $\gamma>1$ (empirically, $\gamma=2$ yields the best results).

In all cases, the adjusted $\Delta p_{h,l,k}$ replaces the default offset in deformable attention. The third strategy (labeled "Sigmoid*2" in the ablations) gave the best convergence and segmentation quality for compact anatomical targets.
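The three adjustment formulas can be sketched in plain Python as follows. This is an illustrative reading of the equations as written here, not the authors' code: the helper names are hypothetical, offsets are 2D tuples, the softmax is taken over whichever list of offsets is passed in, and the divisor $c$ is left out because it does not appear in the clipping formula.

```python
import math

def norm(v):
    """Euclidean length of a 2D offset."""
    return math.hypot(v[0], v[1])

def clip_offsets(raw, tau):
    """Strategy 1: rescale any offset longer than tau back to length tau."""
    out = []
    for r in raw:
        n = norm(r)
        scale = 1.0 if n <= tau else tau / n   # min(1, tau / ||r||), safe for n == 0
        out.append((r[0] * scale, r[1] * scale))
    return out

def softmax_retraction(raw, gamma=1.0):
    """Strategy 2 (gamma = 1) and strategy 3 (gamma > 1):
    scale each offset by gamma times a softmax weight over offset magnitudes."""
    mags = [norm(r) for r in raw]
    m = max(mags)
    exps = [math.exp(x - m) for x in mags]     # shifted for numerical stability
    z = sum(exps)
    return [(gamma * e / z * r[0], gamma * e / z * r[1]) for e, r in zip(exps, raw)]

raw_offsets = [(3.0, 4.0), (0.0, 1.0)]          # hypothetical raw offsets
clipped = clip_offsets(raw_offsets, tau=2.5)    # first offset shrinks to length 2.5
retracted = softmax_retraction(raw_offsets)     # strategy 2
scaled = softmax_retraction(raw_offsets, 2.0)   # strategy 3 with gamma = 2
```

With $\gamma = 1$ every adjusted offset is no longer than its raw counterpart; $\gamma = 2$ doubles the retracted offsets, restoring part of the sampling range.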

3. Fourth-Stage Feature Map as Coarse Location Prior

Conventional Mask2Former uses only the first three CNN stages (feature levels $l=1,2,3$) as encoder memory, discarding the fourth, deepest feature map. Offset-Adjusted Mask2Former incorporates the fourth-stage feature $F_4$ to generate a coarse spatial prior that distinguishes organ from background:

  • $F_4$ is processed by two $3 \times 3$ convolutional layers (with ReLU activations) to produce $M_\mathrm{coarse} \in \mathbb{R}^{C_\mathrm{decoder} \times H_4 \times W_4}$.
  • $M_\mathrm{coarse}$ is flattened and concatenated with the memory tokens from $F_1, F_2, F_3$.
  • In the decoder, MSDeformAttn is extended so that, after the standard output $y_q^\mathrm{orig}$, a secondary output $y_q^\mathrm{coarse}$ is computed by attending to $M_\mathrm{coarse}$ using level $l=4$ offsets and weights:

$$y_q = y_q^\mathrm{orig} + \lambda y_q^\mathrm{coarse}$$

with $\lambda=1.0$ by default.

This steers query attention toward likely organ regions, which is especially beneficial for compact structures.
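The bookkeeping behind the concatenation and fusion steps can be illustrated with a short Python sketch. The level shapes below are hypothetical values chosen for the example, and `flatten_levels` and `fuse_outputs` are illustrative names, not functions from the paper or from any library.

```python
def flatten_levels(shapes):
    """Return the start index of each level inside the flattened memory
    sequence, plus the total number of memory tokens."""
    starts, total = [], 0
    for height, width in shapes:
        starts.append(total)
        total += height * width
    return starts, total

def fuse_outputs(y_orig, y_coarse, lam=1.0):
    """y_q = y_orig + lambda * y_coarse, applied channel by channel."""
    return [a + lam * b for a, b in zip(y_orig, y_coarse)]

# Hypothetical spatial sizes for F_1, F_2, F_3 and the coarse map M_coarse.
shapes = [(32, 32), (16, 16), (8, 8), (4, 4)]
starts, total = flatten_levels(shapes)

# Tokens belonging to M_coarse occupy the tail of the memory sequence.
coarse_slice = (starts[3], total)

fused = fuse_outputs([1.0, 2.0], [0.5, -1.0])   # lambda = 1.0 by default
```

Because the coarse map is the smallest level, appending it adds comparatively few tokens to the memory sequence.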

4. Auxiliary FCN Head and Dice Loss Integration

To further mitigate background distraction and accelerate training, Offset-Adjusted Mask2Former adds a lightweight FCN segmentation head on top of $F_4$:

  • Architecture: Two $3 \times 3$ convolutional layers, projecting $C_4 \rightarrow 128 \rightarrow N_\textrm{classes}$ channels, with bilinear upsampling to match input dimensions.
  • Output: Coarse $(N_\textrm{classes}+1)$-way segmentation masks (including background).
  • Loss: Class-wise Dice loss,

$$\mathcal{L}_\mathrm{Dice} = 1 - \frac{2 \sum_i p_{i,c} g_{i,c} + \varepsilon}{\sum_i p_{i,c} + \sum_i g_{i,c} + \varepsilon}$$

for classes $c=0 \ldots N$, with $\varepsilon=10^{-5}$.

The final loss sums the standard Mask2Former objective (per-query class/mask losses) and weighted auxiliary Dice-supervised loss:

$$\mathcal{L} = \mathcal{L}_\mathrm{Mask2F} + \alpha \mathcal{L}_\mathrm{Dice-aux}$$

where $\alpha=0.5$.

This auxiliary pathway both constrains the main transformer and directly reinforces learning from likely foreground.
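As a rough sketch, the Dice term and the combined objective can be written in plain Python as below. Averaging the per-class losses is an assumption made for this example (the text does not say how classes are aggregated), the Mask2Former loss is passed in as an already computed number, and all function names are hypothetical.

```python
def dice_loss_per_class(probs, targets, eps=1e-5):
    """1 - (2 * sum(p * g) + eps) / (sum(p) + sum(g) + eps) for one class.
    probs and targets are flat lists over pixels."""
    intersection = sum(p * g for p, g in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)

def aux_dice_loss(probs_by_class, targets_by_class, eps=1e-5):
    """Assumption: mean of the class-wise Dice losses."""
    losses = [
        dice_loss_per_class(p, g, eps)
        for p, g in zip(probs_by_class, targets_by_class)
    ]
    return sum(losses) / len(losses)

def total_loss(mask2former_loss, dice_aux, alpha=0.5):
    """L = L_Mask2F + alpha * L_Dice-aux."""
    return mask2former_loss + alpha * dice_aux

# Two classes (background, organ) over four pixels, hypothetical values.
probs = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8]]
masks = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
loss = total_loss(mask2former_loss=1.0, dice_aux=aux_dice_loss(probs, masks))
```

The $\varepsilon$ term keeps the ratio defined when a class is absent from both prediction and ground truth, in which case the loss for that class is zero.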

5. Training Protocols, Efficiency, and Implementation Details

Key implementation aspects include:

  • Datasets: HaN-Seg (33 CT for validation, 42 CT+MR for testing); SegRap2023 (100 train, 7 val, 10 test CT).
  • Preprocessing (“three-channel trick”, SegRap2023): Stack the original slice, a $2\times$ upsample, and a $0.5\times$ downsample to form input channels.
  • Backbone: ResNet-50, extracting stages $l=1 \ldots 4$.
  • Batch Size/GPU: 2 per GPU, 8 $\times$ RTX 4090.
  • Optimization: Adam, learning rate $10^{-4}$, weight decay $10^{-4}$.
  • Schedule: 40k iteration warmup, 120k total iterations.
  • Resource Efficiency: Deformable attention yields $5$-$10\times$ speed and memory improvement over dense attention ($O(HLK N_q)$ vs $O((HW)^2)$), enabling affordable 2D–3D hybridization on constrained hardware.
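To make the two complexity terms concrete, the short Python sketch below counts attention interactions for one hypothetical configuration. All sizes are made-up values for illustration, and the counts cover only the attention lookups, so they are not expected to match the measured speed and memory figures quoted above, which also include projections, the backbone, and memory traffic.

```python
def dense_attention_pairs(height, width):
    """Number of pixel-to-pixel interactions in dense attention: (H * W)^2."""
    n = height * width
    return n * n

def deformable_samples(num_heads, num_levels, num_points, num_queries):
    """Number of sampled locations in deformable attention: H * L * K * N_q."""
    return num_heads * num_levels * num_points * num_queries

# Hypothetical configuration: 64 x 64 feature map, 8 heads, 4 levels,
# 4 points per head and level, 100 queries.
dense = dense_attention_pairs(64, 64)
sparse = deformable_samples(8, 4, 4, 100)
ratio = dense / sparse
```

The deformable count is independent of the feature-map resolution, which is why the gap widens as inputs grow.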

6. Quantitative Performance and Ablation Studies

The following summarizes the evaluation on benchmark datasets:

| Dataset/Metric | Baseline (nnU-Net) | Naïve Mask2Former | Offset-Adjusted Mask2Former |
|---|---|---|---|
| HaN-Seg (35 CT, mDice) | 58.69 | - | 72.26 |
| HaN-Seg (42 CT+MR, mDice) | - | - | 81.60 |
| HaN-Seg (42 CT+MR, mIoU) | - | - | 70.44 |
| SegRap2023 (CT only, mDice) | 84.65 | 84.18 | 87.77 |

  • On HaN-Seg, a gain of +13.6 Dice over nnU-Net and +0.35 Dice over prior SOTA (SegReg) was observed.
  • On SegRap2023, Offset-Adjusted Mask2Former outperformed previous top results (mean Dice 87.77 vs 84.65 for nnU-Net).
  • Ablations revealed the largest improvements arise from the combination of offset adjustment ("Sigmoid*2") and background location head(s). Qualitatively, the largest target-specific gains appeared for Cochlea and Optic Nerve, with consistent improvements for Mandible and Spinal Cord.

7. Full Inference Workflow

The following pseudocode summarizes the decoder steps at inference:

for decoder_layer in decoder_layers:
  Q = SelfAttentionBlock(Q)                   # standard transformer self-attn
  new_Q = []
  for q in Q:                                # for each object query
    # 1) compute raw offsets and attn weights
    r   = W_off(q)                          # shape (H,L,K,2)
    a   = Softmax_over_K(W_att(q))          # shape (H,L,K)
    # 2) apply offset adjustment
    if strategy == 1:
      Δ = clip_threshold(r, τ, c)            # eqn (strategy 1)
    elif strategy == 2:
      w = softmax_over_magnitudes(||r||)     # eqn for w_p
      Δ = w * r                              # eqn strategy 2
    else: # strategy 3
      w = softmax_over_magnitudes(||r||)
      Δ = γ * w * r                          # eqn strategy 3

    # 3) sample features from F1..F3
    y_orig = 0
    for h in range(H):
      for l in [1,2,3]:
        for k in range(K):
          pt = proj_spatial(q.ref_point) + Δ[h,l,k]
          feat = bilinear_sample(F_l, pt)
          y_orig += a[h,l,k] * W_v(feat)

    # 4) sample coarse features from F4
    y_coarse = 0
    for h in range(H):
      for k in range(K):
        pt = proj_spatial(q.ref_point) + Δ[h,4,k]
        feat = bilinear_sample(M_coarse, pt)
        y_coarse += a[h,4,k] * W_v(feat)

    # 5) fuse and feed to FFN
    y = y_orig + λ * y_coarse
    new_q = FFN(y + q)    # residual
    new_Q.append(new_q)
  Q = new_Q

Key subroutines: bilinear sampling from feature maps and spatial projection of query reference points.

8. Context and Implications

Offset-Adjusted Mask2Former demonstrates state-of-the-art performance on two prominent datasets (HaN-Seg and SegRap2023), especially for mid-sized and small structures where standard Transformer-based methods typically underperform. By algebraically adjusting the deformable offsets, integrating deeper semantic features, and guiding convergence with auxiliary Dice losses, it addresses the chief limitations of previous architectures that relied solely on unconstrained attention or local CNN features. Performance gains are concentrated in anatomically challenging regions and compact organs, suggesting that offset-constrained attention effectively integrates fine-scale foreground context while avoiding background confusion (Zhang et al., 6 Jun 2025).

A plausible implication is that algebraic regularization of attention sampling points can serve as a generic principle for transformer-based segmentation models in domains where compact foregrounds predominate and class imbalance is severe.
