Dual Self-Distillation for VM-UNet

Updated 3 February 2026
  • The paper introduces dual self-distillation to align global and local features in VM-UNet, enhancing semantic consistency without increasing inference complexity.
  • It employs global projection and local progressive MSE-based losses to supervise encoder and decoder representations for improved feature alignment.
  • Experimental results on ISIC and Synapse benchmarks demonstrate that DSVM-UNet achieves superior segmentation metrics while maintaining computational efficiency.

Dual Self-distillation for VM-UNet (DSVM-UNet) augments the VM-UNet architecture—a Vision Mamba-based, U-shaped encoder–decoder network for medical image segmentation—by integrating a dual self-distillation framework focused on global and local feature alignment. Unlike prior work, which predominantly sought accuracy improvements through increasingly complex architectural modifications, DSVM-UNet aims to improve semantic consistency and feature utilization without increasing inference overhead, leveraging self-distillation losses to enhance network training and generalization (Shao et al., 27 Jan 2026).

1. Architectural Background: VM-UNet and Vision-State-Space Blocks

VM-UNet inherits the canonical U-shaped encoder–decoder architecture with skip connections. The input image $x \in \mathbb{R}^{H \times W \times 3}$ is divided into non-overlapping patches and embedded to $C$ channels by a patch-embedding layer. Both encoder and decoder consist of $M$ stages, each outputting feature maps: for the encoder, $f^e_l \in \mathbb{R}^{2^{l-1}C \times \frac{H}{2^{l+1}} \times \frac{W}{2^{l+1}}}$, and equivalently $f^d_l$ for the decoder.
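The stage-wise feature dimensions follow directly from the downsampling schedule above. A minimal sketch (the helper name is illustrative, and $C = 96$ is an assumed embedding width, not stated in this summary):

```python
def encoder_shape(l, C, H, W):
    """Shape (channels, height, width) of encoder feature f^e_l,
    per f^e_l in R^{2^(l-1) C x H/2^(l+1) x W/2^(l+1)}."""
    return (2 ** (l - 1) * C, H // 2 ** (l + 1), W // 2 ** (l + 1))

# Example: assumed C=96, 256x256 input, four stages.
shapes = [encoder_shape(l, 96, 256, 256) for l in range(1, 5)]
```

Note that stage 1 comes out at $C \times \frac H4 \times \frac W4$, which matches the common projection target used by the global distillation loss in Section 3.1.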

The core building block is the Vision-State-Space (VSS) block derived from the Vision Mamba family. VSS layers adapt state-space models (SSMs), such as S4 or Mamba, that propagate an input sequence $x(t)$ using

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times 1}$, $C \in \mathbb{R}^{1 \times N}$. Spatial feature maps are flattened into sequences to enable globally aware, linear-time context modeling across 2D images.
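The recurrence can be illustrated with a toy discretized scan. This is a minimal sketch assuming a simple Euler discretization and fixed (non-selective) parameters; the actual Mamba/S4 kernels use more sophisticated discretizations and input-dependent parameters:

```python
import numpy as np

def ssm_scan(x, A, B, C, dt=0.1):
    """Run a discretized linear SSM over a 1-D input sequence x.
    Euler step (illustrative only): h_k = (I + dt*A) h_{k-1} + dt*B x_k,
    readout y_k = C h_k."""
    N = A.shape[0]
    A_bar = np.eye(N) + dt * A   # discretized state transition
    B_bar = dt * B               # discretized input matrix
    h = np.zeros((N, 1))
    y = np.empty(len(x))
    for k, xk in enumerate(x):
        h = A_bar @ h + B_bar * xk   # state update
        y[k] = (C @ h).item()        # C here is the SSM readout matrix
    return y

# Tiny example: N=2 state with stable decay dynamics.
rng = np.random.default_rng(0)
A = -np.eye(2)
B = rng.standard_normal((2, 1))
C = rng.standard_normal((1, 2))
out = ssm_scan(rng.standard_normal(8), A, B, C)
```

The sequential loop is $O(L)$ in the sequence length, which is the linear-time property VSS blocks exploit over flattened 2D feature maps.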

Final segmentation logits are produced by a $1 \times 1$ convolution on $f^d_1$.

2. Motivation: Rationale for Dual Self-distillation

Conventional strategies for improving VM-UNet performance focus on increasing architectural depth, width, or inter-stage connectivity. Such strategies frequently result in diminishing returns due to misaligned feature semantics, redundancy, and poor cross-scale feature transfer. Self-distillation—wherein deeper (teacher) layers supervise shallower (student) layers—provides an alternative, architecture-agnostic regularization technique.

DSVM-UNet implements two complementary self-distillation objectives:

  • Global (projection) distillation: Enforces global semantic alignment by projecting all intermediate encoder and decoder features to a common space and aligning them to the deepest decoder feature.
  • Local (progressive) distillation: Enforces local consistency by aligning representations between neighboring network stages in a stepwise, hierarchical manner.

This approach enables the network to learn representations that are both semantically consistent and detail-sensitive without increasing inference complexity (Shao et al., 27 Jan 2026).

3. Dual Self-distillation Losses and Training Objective

Let $M$ denote the number of encoder and decoder stages, and $f^e_l$, $f^d_l$ their respective feature maps at level $l$.

3.1 Global (Projection) Feature Alignment

Features at all stages are projected to a uniform spatial size $\frac H4 \times \frac W4$ using a linear resize $\mathrm{Lin}(\cdot)$, then reduced to $C$ channels with a $1 \times 1$ convolution:

$$\hat f_l = \mathrm{Conv1D}\bigl(\mathrm{Lin}(f_l)\bigr), \qquad \hat f_l \in \mathbb{R}^{C \times \frac H4 \times \frac W4}$$

Each projected encoder/decoder feature is supervised with mean squared error (MSE) against the deepest decoder feature $f^d_1$, yielding the global distillation loss:

$$\mathcal L_{\mathrm{global}} = \sum_{l=1}^{M} \mathrm{MSE}(\hat f^e_l, f^d_1) + \sum_{l=1}^{M-1} \mathrm{MSE}(\hat f^d_l, f^d_1)$$
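A framework-agnostic sketch of this loss, using nearest-neighbour resizing as a stand-in for the paper's linear resize and an einsum for the $1 \times 1$ channel projection (all function names and shapes are illustrative):

```python
import numpy as np

def resize_nearest(f, H, W):
    """Nearest-neighbour resize of a (c, h, w) feature map -- a stand-in
    for the paper's linear resize Lin(.)."""
    c, h, w = f.shape
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return f[:, rows][:, :, cols]

def project(f, W1x1, H, W):
    """Resize then 1x1-conv (per-pixel channel mixing) to C x H x W."""
    g = resize_nearest(f, H, W)               # (c_in, H, W)
    return np.einsum('oc,chw->ohw', W1x1, g)  # 1x1 conv == matmul over channels

def global_distill_loss(enc_feats, dec_feats, W1x1s, teacher):
    """Sum of MSEs between every projected encoder/decoder feature and the
    deepest decoder feature f^d_1 (the teacher, passed separately and
    excluded from dec_feats)."""
    C, H, W = teacher.shape
    loss = 0.0
    for f, Wp in zip(enc_feats + dec_feats, W1x1s):
        f_hat = project(f, Wp, H, W)
        loss += np.mean((f_hat - teacher) ** 2)
    return loss

# Toy shapes: 2 encoder stages, 1 non-teacher decoder stage.
rng = np.random.default_rng(1)
C, H, W = 4, 8, 8
teacher = rng.standard_normal((C, H, W))                         # f^d_1
enc = [rng.standard_normal((4 * 2 ** l, 8 // 2 ** l, 8 // 2 ** l))
       for l in range(2)]
dec = [rng.standard_normal((8, 4, 4))]
Ws = [rng.standard_normal((C, f.shape[0])) for f in enc + dec]
loss = global_distill_loss(enc, dec, Ws, teacher)
```

Gradients through the projection heads flow back into the encoder and decoder during training; the heads themselves are discarded at inference.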

3.2 Local (Progressive) Feature Alignment

Adjacent encoder features $(f^e_{l-1}, f^e_l)$ are channel-matched via a $1 \times 1$ convolution and upsampled to enable MSE alignment:

$$\tilde f^e_{l-1} = \mathrm{Upsample}\bigl(\mathrm{Conv2D}(f^e_l)\bigr), \qquad \tilde f^e_{l-1} \in \mathbb{R}^{C \times \frac H{2^l} \times \frac W{2^l}}$$

and similarly for decoder features. The local distillation loss is:

$$\mathcal L_{\mathrm{local}} = \sum_{l=2}^{M} \bigl[\mathrm{MSE}(\tilde f^e_{l-1}, f^e_{l-1}) + \mathrm{MSE}(\tilde f^d_l, f^d_l)\bigr]$$
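The stepwise alignment over one branch can be sketched as follows, again with nearest-neighbour upsampling and an einsum-based $1 \times 1$ projection as illustrative stand-ins:

```python
import numpy as np

def upsample2x(f):
    """2x nearest-neighbour upsampling of a (c, h, w) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def local_distill_loss(feats, W1x1s):
    """Align each deeper feature f_l with its shallower neighbour f_{l-1}:
    channel-match the deeper map via 1x1 conv, upsample 2x, then MSE."""
    loss = 0.0
    for l in range(1, len(feats)):
        deep, shallow = feats[l], feats[l - 1]
        matched = np.einsum('oc,chw->ohw', W1x1s[l - 1], deep)
        loss += np.mean((upsample2x(matched) - shallow) ** 2)
    return loss

# Toy pyramid: channels double and resolution halves at each stage.
rng = np.random.default_rng(2)
feats = [rng.standard_normal((4, 16, 16)),
         rng.standard_normal((8, 8, 8)),
         rng.standard_normal((16, 4, 4))]
Ws = [rng.standard_normal((4, 8)), rng.standard_normal((8, 16))]
loss = local_distill_loss(feats, Ws)
```

In DSVM-UNet this runs over both the encoder and decoder branches, so the total local term sums $2(M-1)$ pairwise MSEs.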

3.3 Segmentation Loss and Training Objective

For binary segmentation, the loss combines binary cross-entropy (BCE) and Dice loss:

$$\mathcal L_{\mathrm{seg}} = \lambda_1\,\mathcal L_{\mathrm{BCE}} + \lambda_2\,\mathcal L_{\mathrm{Dice}}$$

with multiclass variants substituting cross-entropy for BCE.
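A minimal sketch of the combined segmentation loss on per-pixel probabilities (the smoothing constant `eps` and exact Dice formulation are assumptions; implementations vary):

```python
import numpy as np

def bce_dice_loss(probs, target, lam1=1.0, lam2=1.0, eps=1e-6):
    """Weighted BCE + Dice loss for binary segmentation.
    probs: predicted foreground probabilities in [0, 1]; target: {0, 1}."""
    probs = np.clip(probs, eps, 1 - eps)  # avoid log(0)
    bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)
    return lam1 * bce + lam2 * dice

target = np.array([0.0, 1.0, 1.0, 0.0])
perfect = bce_dice_loss(target.copy(), target)   # near zero
uniform = bce_dice_loss(np.full(4, 0.5), target)  # clearly positive
```

The BCE term penalizes per-pixel miscalibration while the Dice term directly targets region overlap, which is why the combination is standard for imbalanced medical masks.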

The total loss is:

$$\mathcal L_{\mathrm{total}} = \mathcal L_{\mathrm{seg}} + \lambda_g\,\mathcal L_{\mathrm{global}} + \lambda_l\,\mathcal L_{\mathrm{local}}$$

Default weights: $\lambda_g = 1$, $\lambda_l = 0.5$, $\lambda_1 = \lambda_2 = 1$ (Shao et al., 27 Jan 2026).
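The full objective is just the weighted sum with the reported defaults:

```python
def total_loss(l_seg, l_global, l_local, lam_g=1.0, lam_l=0.5):
    """L_total = L_seg + lambda_g * L_global + lambda_l * L_local,
    with the paper's default weights lambda_g = 1, lambda_l = 0.5."""
    return l_seg + lam_g * l_global + lam_l * l_local
```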

4. Integration and Computational Characteristics

During training, feature maps from each VSS block are utilized for self-distillation; projection distillation aligns stage-wise features globally, and local distillation operates between adjacent blocks. All distillation heads and losses are training-only constructs—there is no added cost at inference.

A comparative summary:

Model        Params (M)   FLOPs (G)   Inference Overhead
VM-UNet      27.42        4.11        —
VM-UNetV2    22.77        4.40        —
DSVM-UNet    22.63        3.65        None

DSVM-UNet thus maintains or reduces computational complexity relative to prior VM-UNet variants (Shao et al., 27 Jan 2026).

5. Experimental Evaluation

Experiments were conducted across multiple medical imaging benchmarks:

  • ISIC2017 and ISIC2018 (dermoscopic images): input size $256 \times 256$.
  • Synapse (multi-organ CT): 30 training volumes, $256 \times 256$ slices, eight organ classes.

Metrics included mIoU, DSC, accuracy, specificity, sensitivity, and HD95 (for Synapse). Key segmentation results:

Dataset    Metric        DSVM-UNet   VM-UNetV2
ISIC2017   mIoU          82.57%      82.34%
ISIC2017   DSC           90.62%      90.31%
ISIC2017   Specificity   98.34%      97.67%
ISIC2017   Sensitivity   92.08%      91.89%
ISIC2018   mIoU          81.51%      81.37%
ISIC2018   DSC           90.45%      89.73%
ISIC2018   Accuracy      95.43%      95.06%
Synapse    DSC           81.68%      81.21%
Synapse    HD95 (px)     19.32       20.64

Ablation experiments indicated performance gains for each distillation type and the greatest improvement when both are combined.

6. Implementation Protocols

Training employed AdamW ($\mathrm{lr} = 1\mathrm{e}{-3}$, cosine annealing to $1\mathrm{e}{-5}$ over 300 epochs), batch size 32, PyTorch on an NVIDIA RTX A40, input size $256 \times 256$, and ImageNet-1k pre-trained VMamba-S initialization. All distillation heads and auxiliary loss terms are removed after training; final deployment uses only the segmentation head.
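The reported schedule corresponds to a cosine decay between the two endpoints; a sketch of the per-epoch learning rate (warmup and exact step placement are not specified here, so this is an assumed plain cosine):

```python
import math

def cosine_lr(epoch, total_epochs=300, lr_max=1e-3, lr_min=1e-5):
    """Cosine-annealed learning rate from lr_max down to lr_min,
    matching the reported 1e-3 -> 1e-5 over 300 epochs."""
    t = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

In PyTorch this is typically handled by `torch.optim.lr_scheduler.CosineAnnealingLR` with `T_max=300` and `eta_min=1e-5`.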

7. Limitations and Prospects

DSVM-UNet’s dual self-distillation strengthens intra-network semantic and local feature alignment without requiring more complex architectures. However, the current distillation relies only on MSE over raw feature maps; potential future directions include:

  • Incorporating perceptual or attention-based distillation.
  • Cross-model or multi-task distillation regimes.
  • Soft-label (temperature-scaled) distillation at the logit level.
  • Extension to volumetric (3D) architectures and multi-modal input.
  • Adaptive per-layer or per-task distillation weighting (Shao et al., 27 Jan 2026).

Dual Self-distillation for VM-UNet underscores the potential of leveraging feature-level regularization strategies to achieve segmentation improvements in medical imaging, maintaining computational efficiency and architectural simplicity. The method aligns conceptually with other dual self-distillation strategies in U-shaped networks, such as the volumetric approach in Banerjee et al. (Banerjee et al., 2023), though DSVM-UNet uniquely targets linear-time, Vision Mamba-based 2D UNet architectures.
