Dual Self-Distillation for VM-UNet

Updated 3 February 2026
  • The paper introduces dual self-distillation to align global and local features in VM-UNet, enhancing semantic consistency without increasing inference complexity.
  • It employs global projection and local progressive MSE-based losses to supervise encoder and decoder representations for improved feature alignment.
  • Experimental results on ISIC and Synapse benchmarks demonstrate that DSVM-UNet achieves superior segmentation metrics while maintaining computational efficiency.

Dual Self-distillation for VM-UNet (DSVM-UNet) augments the VM-UNet architecture—a Vision Mamba-based, U-shaped encoder–decoder network for medical image segmentation—by integrating a dual self-distillation framework focused on global and local feature alignment. Unlike prior work, which predominantly sought accuracy improvements through increasingly complex architectural modifications, DSVM-UNet aims to improve semantic consistency and feature utilization without increasing inference overhead, leveraging self-distillation losses to enhance network training and generalization (Shao et al., 27 Jan 2026).

1. Architectural Background: VM-UNet and Vision-State-Space Blocks

VM-UNet inherits the canonical U-shaped encoder–decoder architecture with skip connections. The input image $x \in \mathbb{R}^{H \times W \times 3}$ is divided into non-overlapping patches and embedded to $C$ channels by a patch-embedding layer. Both encoder and decoder consist of $M$ stages, each outputting feature maps: for the encoder, $f^e_l \in \mathbb{R}^{2^{l-1}C \times \frac{H}{2^{l+1}} \times \frac{W}{2^{l+1}}}$, and equivalently $f^d_l$ for the decoder.
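The stage-wise feature dimensions follow directly from the downsampling schedule above. A minimal sketch (the helper name is illustrative, and $C = 96$ is an assumed embedding width, not stated in this summary):

```python
def encoder_shape(l, C, H, W):
    """Shape (channels, height, width) of encoder feature f^e_l,
    per f^e_l in R^{2^(l-1) C x H/2^(l+1) x W/2^(l+1)}."""
    return (2 ** (l - 1) * C, H // 2 ** (l + 1), W // 2 ** (l + 1))

# Example: assumed C=96, 256x256 input, four stages.
shapes = [encoder_shape(l, 96, 256, 256) for l in range(1, 5)]
```

Note that stage 1 comes out at $C \times \frac H4 \times \frac W4$, which matches the common projection target used by the global distillation loss in Section 3.1.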

The core building block is the Vision-State-Space (VSS) block derived from the Vision Mamba family. VSS layers adapt state-space models (SSMs), such as S4 or Mamba, that propagate an input sequence $x(t)$ using

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$. Spatial feature maps are flattened into sequences to enable globally aware, linear-time context modeling across 2D images.
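The recurrence can be illustrated with a toy discretized scan. This is a minimal sketch assuming a simple Euler discretization and fixed (non-selective) parameters; the actual Mamba/S4 kernels use more sophisticated discretizations and input-dependent parameters:

```python
import numpy as np

def ssm_scan(x, A, B, C, dt=0.1):
    """Run a discretized linear SSM over a 1-D input sequence x.
    Euler step (illustrative only): h_k = (I + dt*A) h_{k-1} + dt*B x_k,
    readout y_k = C h_k."""
    N = A.shape[0]
    A_bar = np.eye(N) + dt * A   # discretized state transition
    B_bar = dt * B               # discretized input matrix
    h = np.zeros((N, 1))
    y = np.empty(len(x))
    for k, xk in enumerate(x):
        h = A_bar @ h + B_bar * xk   # state update
        y[k] = (C @ h).item()        # C here is the SSM readout matrix
    return y

# Tiny example: N=2 state with stable decay dynamics.
rng = np.random.default_rng(0)
A = -np.eye(2)
B = rng.standard_normal((2, 1))
C = rng.standard_normal((1, 2))
out = ssm_scan(rng.standard_normal(8), A, B, C)
```

The sequential loop is $O(L)$ in the sequence length, which is the linear-time property VSS blocks exploit over flattened 2D feature maps.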

Final segmentation logits are produced by a $1 \times 1$ convolution on $f^d_1$.

2. Motivation: Rationale for Dual Self-distillation

Conventional strategies for improving VM-UNet performance focus on increasing architectural depth, width, or inter-stage connectivity. Such strategies frequently result in diminishing returns due to misaligned feature semantics, redundancy, and poor cross-scale feature transfer. Self-distillation—wherein deeper (teacher) layers supervise shallower (student) layers—provides an alternative, architecture-agnostic regularization technique.

DSVM-UNet implements two complementary self-distillation objectives:

  • Global (projection) distillation: Enforces global semantic alignment by projecting all intermediate encoder and decoder features to a common space and aligning them to the deepest decoder feature.
  • Local (progressive) distillation: Enforces local consistency by aligning representations between neighboring network stages in a stepwise, hierarchical manner.

This approach enables the network to learn representations that are both semantically consistent and detail-sensitive without increasing inference complexity (Shao et al., 27 Jan 2026).

3. Dual Self-distillation Losses and Training Objective

Let $M$ denote the number of encoder and decoder stages, and $f^e_l$, $f^d_l$ their respective feature maps at level $l$.

3.1 Global (Projection) Feature Alignment

Features at all stages are projected to a uniform spatial size $\frac H4 \times \frac W4$ using a linear resize $\mathrm{Lin}(\cdot)$, then reduced to $C$ channels with a $1 \times 1$ convolution:

$$\hat f_l = \mathrm{Conv1D}\bigl(\mathrm{Lin}(f_l)\bigr), \qquad \hat f_l \in \mathbb{R}^{C \times \frac H4 \times \frac W4}$$

Each projected encoder/decoder feature is supervised with mean squared error (MSE) against the deepest decoder feature $f^d_1$, yielding the global distillation loss:

$$\mathcal L_{\mathrm{global}} = \sum_{l=1}^{M} \mathrm{MSE}(\hat f^e_l, f^d_1) + \sum_{l=1}^{M-1} \mathrm{MSE}(\hat f^d_l, f^d_1)$$
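A framework-agnostic sketch of this loss, using nearest-neighbour resizing as a stand-in for the paper's linear resize and an einsum for the $1 \times 1$ channel projection (all function names and shapes are illustrative):

```python
import numpy as np

def resize_nearest(f, H, W):
    """Nearest-neighbour resize of a (c, h, w) feature map -- a stand-in
    for the paper's linear resize Lin(.)."""
    c, h, w = f.shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return f[:, rows][:, :, cols]

def project(f, W1x1, H, W):
    """Resize then 1x1-conv (per-pixel channel mixing) to C x H x W."""
    g = resize_nearest(f, H, W)               # (c_in, H, W)
    return np.einsum('oc,chw->ohw', W1x1, g)  # 1x1 conv == matmul over channels

def global_distill_loss(enc_feats, dec_feats, W1x1s, teacher):
    """Sum of MSEs between every projected encoder/decoder feature and the
    deepest decoder feature f^d_1 (the teacher, passed separately and
    excluded from dec_feats)."""
    C, H, W = teacher.shape
    loss = 0.0
    for f, Wp in zip(enc_feats + dec_feats, W1x1s):
        f_hat = project(f, Wp, H, W)
        loss += np.mean((f_hat - teacher) ** 2)
    return loss

# Toy shapes: 2 encoder stages, 1 non-teacher decoder stage.
rng = np.random.default_rng(1)
C, H, W = 4, 8, 8
teacher = rng.standard_normal((C, H, W))                         # f^d_1
enc = [rng.standard_normal((4 * 2 ** l, 8 // 2 ** l, 8 // 2 ** l))
       for l in range(2)]
dec = [rng.standard_normal((8, 4, 4))]
Ws = [rng.standard_normal((C, f.shape[0])) for f in enc + dec]
loss = global_distill_loss(enc, dec, Ws, teacher)
```

Gradients through the projection heads flow back into the encoder and decoder during training; the heads themselves are discarded at inference.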

3.2 Local (Progressive) Feature Alignment

Adjacent encoder features $(f^e_{l-1}, f^e_l)$ are channel-matched via a $1 \times 1$ convolution and upsampled to enable MSE alignment:

$$\tilde f^e_{l-1} = \mathrm{Upsample}\bigl(\mathrm{Conv2D}(f^e_l)\bigr), \qquad \tilde f^e_{l-1} \in \mathbb{R}^{C \times \frac H{2^l} \times \frac W{2^l}}$$

and similarly for decoder features. The local distillation loss is:

$$\mathcal L_{\mathrm{local}} = \sum_{l=2}^{M} \bigl[\mathrm{MSE}(\tilde f^e_{l-1}, f^e_{l-1}) + \mathrm{MSE}(\tilde f^d_l, f^d_l)\bigr]$$
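The stepwise alignment over one branch can be sketched as follows, again with nearest-neighbour upsampling and an einsum-based $1 \times 1$ projection as illustrative stand-ins:

```python
import numpy as np

def upsample2x(f):
    """2x nearest-neighbour upsampling of a (c, h, w) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def local_distill_loss(feats, W1x1s):
    """Align each deeper feature f_l with its shallower neighbour f_{l-1}:
    channel-match the deeper map via 1x1 conv, upsample 2x, then MSE."""
    loss = 0.0
    for l in range(1, len(feats)):
        deep, shallow = feats[l], feats[l - 1]
        matched = np.einsum('oc,chw->ohw', W1x1s[l - 1], deep)
        loss += np.mean((upsample2x(matched) - shallow) ** 2)
    return loss

# Toy pyramid: channels double and resolution halves at each stage.
rng = np.random.default_rng(2)
feats = [rng.standard_normal((4, 16, 16)),
         rng.standard_normal((8, 8, 8)),
         rng.standard_normal((16, 4, 4))]
Ws = [rng.standard_normal((4, 8)), rng.standard_normal((8, 16))]
loss = local_distill_loss(feats, Ws)
```

In DSVM-UNet this runs over both the encoder and decoder branches, so the total local term sums $2(M-1)$ pairwise MSEs.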

3.3 Segmentation Loss and Training Objective

For binary segmentation, the loss combines binary cross-entropy (BCE) and Dice loss:

$$\mathcal L_{\mathrm{seg}} = \lambda_1\,\mathcal L_{\mathrm{BCE}} + \lambda_2\,\mathcal L_{\mathrm{Dice}}$$

with multiclass variants substituting cross-entropy for BCE.
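A minimal sketch of the combined segmentation loss on per-pixel probabilities (the smoothing constant `eps` and exact Dice formulation are assumptions; implementations vary):

```python
import numpy as np

def bce_dice_loss(probs, target, lam1=1.0, lam2=1.0, eps=1e-6):
    """Weighted BCE + Dice loss for binary segmentation.
    probs: predicted foreground probabilities in [0, 1]; target: {0, 1}."""
    probs = np.clip(probs, eps, 1 - eps)  # avoid log(0)
    bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    return lam1 * bce + lam2 * dice

target = np.array([0.0, 1.0, 1.0, 0.0])
perfect = bce_dice_loss(target.copy(), target)   # near zero
uniform = bce_dice_loss(np.full(4, 0.5), target)  # clearly positive
```

The BCE term penalizes per-pixel miscalibration while the Dice term directly targets region overlap, which is why the combination is standard for imbalanced medical masks.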

The total loss is:

$$\mathcal L_{\mathrm{total}} = \mathcal L_{\mathrm{seg}} + \lambda_g\,\mathcal L_{\mathrm{global}} + \lambda_l\,\mathcal L_{\mathrm{local}}$$

Default weights: $\lambda_g = 1$, $\lambda_l = 0.5$, $\lambda_1 = \lambda_2 = 1$ (Shao et al., 27 Jan 2026).
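The full objective is just the weighted sum with the reported defaults:

```python
def total_loss(l_seg, l_global, l_local, lam_g=1.0, lam_l=0.5):
    """L_total = L_seg + lambda_g * L_global + lambda_l * L_local,
    with the paper's default weights lambda_g = 1, lambda_l = 0.5."""
    return l_seg + lam_g * l_global + lam_l * l_local
```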

4. Integration and Computational Characteristics

During training, feature maps from each VSS block are utilized for self-distillation; projection distillation aligns stage-wise features globally, and local distillation operates between adjacent blocks. All distillation heads and losses are training-only constructs—there is no added cost at inference.

A comparative summary:

Model        Params (M)   FLOPs (G)   Inference Overhead
VM-UNet      27.42        4.11        —
VM-UNetV2    22.77        4.40        —
DSVM-UNet    22.63        3.65        None

DSVM-UNet thus maintains or reduces computational complexity relative to prior VM-UNet variants (Shao et al., 27 Jan 2026).

5. Experimental Evaluation

Experiments were conducted across multiple medical imaging benchmarks:

  • ISIC2017 and ISIC2018 (dermoscopic images): input size $256 \times 256$.
  • Synapse (multi-organ CT): 30 training volumes, $256 \times 256$ slices, eight organ classes.

Metrics included mIoU, DSC, accuracy, specificity, sensitivity, and HD95 (for Synapse). Key segmentation results:

Dataset    Metric        DSVM-UNet   VM-UNetV2
ISIC2017   mIoU          82.57%      82.34%
ISIC2017   DSC           90.62%      90.31%
ISIC2017   Specificity   98.34%      97.67%
ISIC2017   Sensitivity   92.08%      91.89%
ISIC2018   mIoU          81.51%      81.37%
ISIC2018   DSC           90.45%      89.73%
ISIC2018   Accuracy      95.43%      95.06%
Synapse    DSC           81.68%      81.21%
Synapse    HD95 (px)     19.32       20.64

Ablation experiments indicated performance gains for each distillation type and the greatest improvement when both are combined.

6. Implementation Protocols

Training employed AdamW ($\mathrm{lr} = 1\mathrm{e}{-3}$, cosine annealing to $1\mathrm{e}{-5}$ over 300 epochs), batch size 32, PyTorch on an NVIDIA RTX A40, input size $256 \times 256$, and ImageNet-1k pre-trained VMamba-S initialization. All distillation heads and auxiliary loss terms are removed after training; final deployment uses only the segmentation head.
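The reported schedule corresponds to a cosine decay between the two endpoints; a sketch of the per-epoch learning rate (warmup and exact step placement are not specified here, so this is an assumed plain cosine):

```python
import math

def cosine_lr(epoch, total_epochs=300, lr_max=1e-3, lr_min=1e-5):
    """Cosine-annealed learning rate from lr_max down to lr_min,
    matching the reported 1e-3 -> 1e-5 over 300 epochs."""
    t = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

In PyTorch this is typically handled by `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=300` and `eta_min=1e-5`.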

7. Limitations and Prospects

DSVM-UNet’s dual self-distillation strengthens intra-network semantic and local feature alignment without requiring more complex architectures. However, the current distillation relies only on MSE over raw feature maps; potential future directions include:

  • Incorporating perceptual or attention-based distillation.
  • Cross-model or multi-task distillation regimes.
  • Soft-label (temperature-scaled) distillation at the logit level.
  • Extension to volumetric (3D) architectures and multi-modal input.
  • Adaptive per-layer or per-task distillation weighting (Shao et al., 27 Jan 2026).

Dual Self-distillation for VM-UNet underscores the potential of leveraging feature-level regularization strategies to achieve segmentation improvements in medical imaging, maintaining computational efficiency and architectural simplicity. The method aligns conceptually with other dual self-distillation strategies in U-shaped networks, such as the volumetric approach in Banerjee et al. (Banerjee et al., 2023), though DSVM-UNet uniquely targets linear-time, Vision Mamba-based 2D UNet architectures.
