Dual Self-Distillation for VM-UNet
- The paper introduces dual self-distillation to align global and local features in VM-UNet, enhancing semantic consistency without increasing inference complexity.
- It employs global projection and local progressive MSE-based losses to supervise encoder and decoder representations for improved feature alignment.
- Experimental results on ISIC and Synapse benchmarks demonstrate that DSVM-UNet achieves superior segmentation metrics while maintaining computational efficiency.
Dual Self-distillation for VM-UNet (DSVM-UNet) augments the VM-UNet architecture—a Vision Mamba-based, U-shaped encoder–decoder network for medical image segmentation—by integrating a dual self-distillation framework focused on global and local feature alignment. Unlike prior work, which predominantly sought accuracy improvements through increasingly complex architectural modifications, DSVM-UNet aims to improve semantic consistency and feature utilization without increasing inference overhead, leveraging self-distillation losses to enhance network training and generalization (Shao et al., 27 Jan 2026).
1. Architectural Background: VM-UNet and Vision-State-Space Blocks
VM-UNet inherits the canonical U-shaped encoder–decoder architecture with skip connections. The input image is partitioned into non-overlapping patches by a patch-embedding layer, which flattens and linearly projects each patch into a token. Both encoder and decoder consist of a stack of stages: the encoder produces feature maps $E_1, \dots, E_L$ at successively coarser resolutions, and the decoder symmetrically produces feature maps $D_1, \dots, D_L$.
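As a minimal sketch of the patch-embedding step (the patch size, embedding dimension, and projection weight below are illustrative placeholders, not the paper's values):

```python
import numpy as np

def patch_embed(image, patch, weight):
    """Split an (H, W, C) image into non-overlapping patch tokens and
    linearly project each flattened patch to an embedding vector.

    weight: (patch*patch*C, D) projection matrix (illustrative)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    # Reshape into a grid of patches, then flatten each patch.
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    tokens = grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return tokens @ weight  # (num_tokens, D)

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8, 3))
W_emb = rng.standard_normal((4 * 4 * 3, 16))
tokens = patch_embed(img, 4, W_emb)
print(tokens.shape)  # (4, 16): an 8x8 image with 4x4 patches yields 4 tokens
```

In the real network the projection is a learned strided convolution; the matrix multiply here is the mathematically equivalent per-patch form.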
The core building block is the Vision-State-Space (VSS) block derived from the Vision Mamba family. VSS layers adapt state-space models (SSMs) such as S4 or Mamba, which propagate an input sequence $x_t$ through a hidden state $h_t$ via the discretized recurrence

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t,$$

where $\bar{A} \in \mathbb{R}^{N \times N}$, $\bar{B} \in \mathbb{R}^{N \times 1}$, and $C \in \mathbb{R}^{1 \times N}$. Spatial feature maps are flattened into 1D sequences, enabling globally aware, linear-time context modeling across 2D images.
Final segmentation logits are produced from the last decoder feature map by a final convolutional projection.
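The discretized SSM recurrence above can be sketched in NumPy as a plain sequential scan (state size and inputs here are arbitrary; real Mamba blocks use input-dependent parameters and a hardware-efficient parallel scan):

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Linear-time recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.

    A_bar: (N, N), B_bar: (N, 1), C: (1, N); x: (T,) scalar input sequence."""
    N = A_bar.shape[0]
    h = np.zeros((N, 1))
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t   # state update
        ys.append(float(C @ h))       # scalar readout
    return np.array(ys)

rng = np.random.default_rng(1)
A = 0.9 * np.eye(3)                   # stable (contractive) state matrix
B = rng.standard_normal((3, 1))
Cm = rng.standard_normal((1, 3))
y = ssm_scan(A, B, Cm, rng.standard_normal(10))
print(y.shape)  # (10,): one output per input token
```

Because the recurrence is linear in `x` and touches each token once, cost grows linearly with sequence length, which is the property VSS blocks exploit for global context over flattened 2D feature maps.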
2. Motivation: Rationale for Dual Self-distillation
Conventional strategies for improving VM-UNet performance focus on increasing architectural depth, width, or inter-stage connectivity. Such strategies frequently result in diminishing returns due to misaligned feature semantics, redundancy, and poor cross-scale feature transfer. Self-distillation—wherein deeper (teacher) layers supervise shallower (student) layers—provides an alternative, architecture-agnostic regularization technique.
DSVM-UNet implements two complementary self-distillation objectives:
- Global (projection) distillation: Enforces global semantic alignment by projecting all intermediate encoder and decoder features to a common space and aligning them to the deepest decoder feature.
- Local (progressive) distillation: Enforces local consistency by aligning representations between neighboring network stages in a stepwise, hierarchical manner.
This approach enables the network to learn representations that are both semantically consistent and detail-sensitive without increasing inference complexity (Shao et al., 27 Jan 2026).
3. Dual Self-distillation Losses and Training Objective
Let $L$ denote the number of encoder and decoder stages, and let $E_i$ and $D_i$ denote their respective feature maps at level $i \in \{1, \dots, L\}$.
3.1 Global (Projection) Feature Alignment
Features at all stages are projected to a uniform spatial size with a linear resize $\mathcal{R}(\cdot)$, then reduced to a common channel count with a $1\times1$ convolution:

$$\hat{F}_i = \mathrm{Conv}_{1\times 1}\big(\mathcal{R}(F_i)\big), \qquad F_i \in \{E_i, D_i\}.$$

Each projected encoder/decoder feature is supervised using mean squared error (MSE) relative to the projected deepest decoder feature $\hat{D}_L$, yielding the global distillation loss:

$$\mathcal{L}_{\text{global}} = \sum_{i=1}^{L} \big\|\hat{E}_i - \hat{D}_L\big\|_2^2 + \sum_{i=1}^{L-1} \big\|\hat{D}_i - \hat{D}_L\big\|_2^2.$$
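A minimal NumPy sketch of this project-and-align step (nearest-neighbor resizing and fixed random projection matrices stand in for the paper's learned heads; the shapes and the choice of last stage as teacher are illustrative):

```python
import numpy as np

def resize_nearest(f, size):
    """Nearest-neighbor resize of a (C, H, W) map to (C, size, size)."""
    C, H, W = f.shape
    ys = np.arange(size) * H // size
    xs = np.arange(size) * W // size
    return f[:, ys][:, :, xs]

def project(f, w, size):
    """Resize, then apply a 1x1 convolution (a per-pixel channel matmul)."""
    r = resize_nearest(f, size)               # (C_in, size, size)
    return np.einsum('oc,chw->ohw', w, r)     # (C_out, size, size)

def global_distill_loss(feats, weights, teacher_idx, size):
    """MSE between every projected feature and the projected teacher."""
    proj = [project(f, w, size) for f, w in zip(feats, weights)]
    teacher = proj[teacher_idx]
    return sum(np.mean((p - teacher) ** 2)
               for i, p in enumerate(proj) if i != teacher_idx)

rng = np.random.default_rng(2)
feats = [rng.standard_normal((c, s, s)) for c, s in [(8, 16), (16, 8), (32, 4)]]
weights = [rng.standard_normal((4, f.shape[0])) * 0.1 for f in feats]
loss = global_distill_loss(feats, weights, teacher_idx=2, size=8)
print(loss >= 0.0)  # a single non-negative scalar
```

The key property is that every stage, regardless of its native resolution and channel width, is compared in one shared projection space.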
3.2 Local (Progressive) Feature Alignment
Adjacent encoder features are channel-matched via a $1\times1$ convolution $\phi(\cdot)$ and spatially upsampled with $\mathcal{U}(\cdot)$ to enable MSE alignment between stage $i{+}1$ and stage $i$:

$$\mathcal{L}_{\text{local}}^{E} = \sum_{i=1}^{L-1} \big\|\mathcal{U}\big(\phi(E_{i+1})\big) - E_i\big\|_2^2,$$

and analogously $\mathcal{L}_{\text{local}}^{D}$ for decoder features. The local distillation loss is:

$$\mathcal{L}_{\text{local}} = \mathcal{L}_{\text{local}}^{E} + \mathcal{L}_{\text{local}}^{D}.$$
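The progressive neighbor alignment can be sketched as follows (nearest-neighbor 2x upsampling and random channel-matching matrices are illustrative stand-ins for the learned components):

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbor 2x upsample of a (C, H, W) map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def local_distill_loss(feats, match_weights):
    """Align each deeper feature with its shallower neighbor: channel-match
    the deeper map via a 1x1 projection, upsample 2x, then take the MSE."""
    loss = 0.0
    for shallow, deep, w in zip(feats[:-1], feats[1:], match_weights):
        matched = np.einsum('oc,chw->ohw', w, deep)  # deep channels -> shallow's
        loss += np.mean((upsample2x(matched) - shallow) ** 2)
    return loss

rng = np.random.default_rng(3)
feats = [rng.standard_normal((c, s, s)) for c, s in [(8, 16), (16, 8), (32, 4)]]
mw = [rng.standard_normal((a.shape[0], b.shape[0])) * 0.1
      for a, b in zip(feats[:-1], feats[1:])]
loss = local_distill_loss(feats, mw)
print(loss >= 0.0)  # a single non-negative scalar
```

Unlike the global loss, each comparison here spans only one resolution step, so gradients propagate stage by stage through the hierarchy.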
3.3 Segmentation Loss and Training Objective
For binary segmentation, the loss combines Binary Cross-Entropy (BCE) and Dice terms,

$$\mathcal{L}_{\text{seg}} = \mathcal{L}_{\text{BCE}} + \mathcal{L}_{\text{Dice}},$$

with multiclass variants substituting cross-entropy for BCE.
The total loss is a weighted sum of the three objectives:

$$\mathcal{L} = \mathcal{L}_{\text{seg}} + \alpha\,\mathcal{L}_{\text{global}} + \beta\,\mathcal{L}_{\text{local}},$$

with the default values of the weights $\alpha$ and $\beta$ as reported in (Shao et al., 27 Jan 2026).
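Putting the pieces together, a NumPy sketch of the combined training objective (the distillation-loss values and the weights `alpha`/`beta` below are placeholders, not the paper's defaults):

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy over probability maps."""
    p = np.clip(pred, eps, 1 - eps)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 minus the Dice overlap coefficient."""
    inter = np.sum(pred * target)
    return 1 - (2 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def total_loss(pred, target, l_global, l_local, alpha, beta):
    """L = L_seg + alpha * L_global + beta * L_local."""
    return (bce(pred, target) + dice_loss(pred, target)
            + alpha * l_global + beta * l_local)

rng = np.random.default_rng(4)
pred = rng.uniform(0.01, 0.99, (32, 32))          # predicted probabilities
target = (rng.uniform(size=(32, 32)) > 0.5).astype(float)  # binary mask
print(total_loss(pred, target, l_global=0.12, l_local=0.05, alpha=0.5, beta=0.5))
```

Because the distillation terms enter only as additive scalars, the segmentation path and its gradients are unchanged when they are dropped at inference.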
4. Integration and Computational Characteristics
During training, feature maps from each VSS block are utilized for self-distillation; projection distillation aligns stage-wise features globally, and local distillation operates between adjacent blocks. All distillation heads and losses are training-only constructs—there is no added cost at inference.
A comparative summary:
| Model | Params (M) | FLOPs (G) | Inference Overhead |
|---|---|---|---|
| VM-UNet | 27.42 | 4.11 | — |
| VM-UNetV2 | 22.77 | 4.40 | — |
| DSVM-UNet | 22.63 | 3.65 | None |
DSVM-UNet thus maintains or reduces computational complexity relative to prior VM-UNet variants (Shao et al., 27 Jan 2026).
5. Experimental Evaluation
Experiments were conducted across multiple medical imaging benchmarks:
- ISIC2017 and ISIC2018 (dermoscopic skin-lesion images).
- Synapse (multi-organ abdominal CT): 30 training volumes, eight organ classes.
Metrics included mIoU, DSC, accuracy, specificity, sensitivity, and HD95 (for Synapse). Key segmentation results:
| Dataset | Metric | DSVM-UNet | VM-UNetV2 |
|---|---|---|---|
| ISIC2017 | mIoU | 82.57% | 82.34% |
| ISIC2017 | DSC | 90.62% | 90.31% |
| ISIC2017 | Specificity | 98.34% | 97.67% |
| ISIC2017 | Sensitivity | 92.08% | 91.89% |
| ISIC2018 | mIoU | 81.51% | 81.37% |
| ISIC2018 | DSC | 90.45% | 89.73% |
| ISIC2018 | Accuracy | 95.43% | 95.06% |
| Synapse | DSC | 81.68% | 81.21% |
| Synapse | HD95 (px) | 19.32 | 20.64 |
Ablation experiments showed performance gains from each distillation type individually, with the largest improvement when both were combined.
6. Implementation Protocols
Training employed the AdamW optimizer with a cosine-annealed learning-rate schedule over 300 epochs, batch size 32, implemented in PyTorch on an NVIDIA RTX A40 GPU, with the encoder initialized from ImageNet-1k pre-trained VMamba-S weights. All auxiliary loss terms and distillation heads are removed after training; final deployment uses only the segmentation head.
7. Limitations and Prospects
DSVM-UNet’s dual self-distillation strengthens intra-network semantic and local feature alignment without requiring a more complex architecture. However, the current distillation uses only MSE on raw feature maps; potential future directions include:
- Incorporating perceptual or attention-based distillation.
- Cross-model or multi-task distillation regimes.
- Soft-label (temperature-scaled) distillation at the logit level.
- Extension to volumetric (3D) architectures and multi-modal input.
- Adaptive per-layer or per-task distillation weighting (Shao et al., 27 Jan 2026).
Dual Self-distillation for VM-UNet underscores the potential of leveraging feature-level regularization strategies to achieve segmentation improvements in medical imaging, maintaining computational efficiency and architectural simplicity. The method aligns conceptually with other dual self-distillation strategies in U-shaped networks, such as the volumetric approach in Banerjee et al. (Banerjee et al., 2023), though DSVM-UNet uniquely targets linear-time, Vision Mamba-based 2D UNet architectures.