Wave-GMS: Efficient Medical Segmentation
- The paper introduces Wave-GMS, a lightweight multi-scale generative model that delivers state-of-the-art segmentation accuracy using only ~2.6M trainable parameters.
- It employs multi-resolution Haar wavelet feature extraction and a frozen Tiny-VAE to ensure efficient training and robust cross-domain performance.
- The approach outperforms larger architectures in key metrics like Dice score and HD95, making it highly practical for deployment in resource-constrained clinical settings.
Wave-GMS is a lightweight multi-scale generative model specifically designed for medical image segmentation in resource-constrained settings. It achieves state-of-the-art segmentation accuracy with exceptionally low memory and computational requirements by integrating multi-resolution wavelet feature extraction, compact latent-space mapping, and a distilled generative backbone. The architecture enables training with large batch sizes on consumer GPUs and exhibits strong generalizability across imaging domains and acquisition protocols, making it highly practical for real-world deployment in healthcare environments (Ahmed et al., 3 Oct 2025).
1. Model Architecture
Wave-GMS consists of three main components:
- Multi-Resolution Encoder: The input image $x$ is decomposed via a multi-level 2D discrete Haar wavelet transform (DWT), producing subbands (LL, LH, HL, HH) at each level; after three wavelet levels, the spatial resolution is reduced by a factor of eight. At each level $\ell$:
$$[\mathrm{LL}_\ell,\ \mathrm{LH}_\ell,\ \mathrm{HL}_\ell,\ \mathrm{HH}_\ell] = \mathrm{DWT}(\mathrm{LL}_{\ell-1}), \qquad \mathrm{LL}_0 = x.$$
Subsequently, features $f_\ell$ are extracted from the subbands at each level, downsampled to a common resolution, and concatenated:
$$z_{\mathrm{ms}} = \mathcal{A}\big(\mathrm{Concat}(f_1, \ldots, f_L)\big),$$
where $\mathcal{A}$ is an aggregation module.
- Latent Space Foundation: A pretrained, frozen Tiny-VAE encoder/decoder (a compact distillation of the SD-VAE) encodes both the image ($x$) and the ground-truth segmentation mask ($y$) into latent representations ($z_x$ and $z_y$, respectively).
- Latent Mapping Model (LMM): This is a lightweight encoder-decoder network (without explicit up/downsampling) comprising a stem convolution layer and four encoder/decoder "ResAttn" blocks (each combining residual units and spatial self-attention). The LMM predicts the segmentation-mask latent $\hat{z}_y$, which is then decoded into an image-space mask $\hat{y} = \mathcal{D}(\hat{z}_y)$ via the frozen Tiny-VAE decoder $\mathcal{D}$ (a minimal structural sketch follows below).
Only the multi-resolution encoder (~1.03M parameters) and LMM (~1.56M parameters) are trainable, totaling ~2.6M parameters. Tiny-VAE encoder and decoder (each ~1.22M parameters) remain frozen.
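For concreteness, the following is a minimal PyTorch-style sketch of this pipeline. The three-level Haar decomposition, the per-level feature extraction with an aggregation module, the stem-plus-four-ResAttn-block LMM, and the frozen Tiny-VAE decode follow the description above; the channel widths, the `encode`/`decode` interface of the Tiny-VAE, the fusion of wavelet and VAE latents by concatenation, and all module and function names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def haar_dwt2d(x):
    """One level of 2D Haar DWT; returns (LL, LH, HL, HH), each at half resolution."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    filt = torch.stack([ll, lh, hl, hh]).unsqueeze(1)        # (4, 1, 2, 2)
    c = x.shape[1]
    filt = filt.to(x).repeat(c, 1, 1, 1)                     # depthwise over channels
    out = F.conv2d(x, filt, stride=2, groups=c)              # (B, 4C, H/2, W/2)
    out = out.view(x.shape[0], c, 4, *out.shape[-2:])
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]


class MultiResolutionEncoder(nn.Module):
    """Three-level Haar pyramid -> per-level conv features -> aggregation module."""
    def __init__(self, in_ch=3, feat_ch=16, out_ch=4, levels=3):
        super().__init__()
        self.levels = levels
        self.per_level = nn.ModuleList(
            nn.Conv2d(4 * in_ch, feat_ch, 3, padding=1) for _ in range(levels))
        self.aggregate = nn.Conv2d(levels * feat_ch, out_ch, 1)  # aggregation module

    def forward(self, x):
        feats, cur = [], x
        for lvl in range(self.levels):
            ll, lh, hl, hh = haar_dwt2d(cur)
            f = self.per_level[lvl](torch.cat([ll, lh, hl, hh], dim=1))
            # bring every level to the coarsest (1/8) resolution before fusion
            f = F.adaptive_avg_pool2d(f, x.shape[-1] // 2 ** self.levels)
            feats.append(f)
            cur = ll                                          # recurse on the LL band
        return self.aggregate(torch.cat(feats, dim=1))        # multi-scale latent z_ms


class ResAttnBlock(nn.Module):
    """Residual conv unit followed by spatial self-attention."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
        self.attn = nn.MultiheadAttention(ch, num_heads=1, batch_first=True)

    def forward(self, x):
        x = x + self.conv(x)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)                      # (B, H*W, C) tokens
        t, _ = self.attn(t, t, t)
        return x + t.transpose(1, 2).view(b, c, h, w)


class LatentMappingModel(nn.Module):
    """Stem conv + four ResAttn blocks; no explicit up/downsampling."""
    def __init__(self, in_ch=8, ch=64, latent_ch=4, blocks=4):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResAttnBlock(ch) for _ in range(blocks)])
        self.head = nn.Conv2d(ch, latent_ch, 1)

    def forward(self, z):
        return self.head(self.blocks(self.stem(z)))


def segment(image, encoder, lmm, tiny_vae):
    """image -> fused latents -> predicted mask latent -> Tiny-VAE decode."""
    z_ms = encoder(image)                          # trainable multi-resolution encoder
    z_img = tiny_vae.encode(image)                 # frozen; placeholder interface
    z_hat = lmm(torch.cat([z_img, z_ms], dim=1))   # fusion by concatenation (assumed)
    # Tiny-VAE weights are frozen (requires_grad=False), but gradients still flow
    # *through* its decoder so the LMM can be supervised in image space.
    return tiny_vae.decode(z_hat), z_hat, z_ms, z_img
```

In this sketch the multi-scale latent and the Tiny-VAE image latent both sit at 1/8 of the input resolution, matching the factor-of-eight downsampling noted above, which is what makes their concatenation (and later alignment) straightforward.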
2. Training Efficiency
Key features driving training efficiency:
- Parameter Compactness: The total trainable parameter count (~2.6M) is vastly smaller than that of existing discriminative and generative architectures, which can run to hundreds of millions of parameters (e.g., SDSeg: 329M, MedSegDiff-V2: 129.4M).
- Memory Footprint: Because the Tiny-VAE is frozen and far smaller than the full SD-VAE, Wave-GMS avoids holding a large pretrained backbone, its gradients, and its optimizer states in GPU memory, enabling training with large batch sizes on 12GB GPUs (such as an RTX 3060).
- Multi-Scale Features: Haar DWT efficiently extracts frequency-localized features without spatial redundancy.
- Loss Functions: Training employs a compound objective (a short sketch is given after this list):
$$\mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + \mathcal{L}_{\mathrm{latent}} + \mathcal{L}_{\mathrm{align}},$$
where $\mathcal{L}_{\mathrm{Dice}}$ is a soft-Dice loss (with deep supervision across decoder layers), $\mathcal{L}_{\mathrm{latent}}$ is a reconstruction loss enforcing latent consistency between $\hat{z}_y$ and $z_y$, and $\mathcal{L}_{\mathrm{align}}$ regularizes alignment between the multi-scale wavelet features and the VAE latents.
- Optimizer: AdamW with cosine annealing learning rate scheduler.
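A minimal sketch of the compound objective and optimizer setup is shown below; the use of MSE for the latent-consistency and alignment terms, the equal weighting of the three terms, and the omission of deep supervision are assumptions made for brevity, and the latent arguments correspond to the outputs of the architecture sketch above.

```python
import torch
import torch.nn.functional as F


def soft_dice_loss(logits, target, eps=1e-6):
    """Soft-Dice on predicted mask probabilities vs. the ground-truth mask."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(-2, -1))
    denom = prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()


def wave_gms_loss(mask_logits, gt_mask, z_hat, z_gt, z_ms, z_img):
    """Compound objective: Dice + latent consistency + multi-scale/VAE alignment.
    Deep supervision over decoder layers is omitted here for brevity."""
    l_dice = soft_dice_loss(mask_logits, gt_mask)
    l_latent = F.mse_loss(z_hat, z_gt)   # predicted vs. ground-truth mask latent
    l_align = F.mse_loss(z_ms, z_img)    # wavelet features vs. VAE image latent
    return l_dice + l_latent + l_align


# Optimizer named in the paper: AdamW with a cosine-annealing LR schedule
# (the learning rate and T_max below are placeholders, not the paper's values).
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
```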
3. Performance Evaluation
Wave-GMS was evaluated on four representative benchmarks:
| Dataset | Dice Score (%) | IoU (%) | HD95 (pixels) |
|---|---|---|---|
| BUS | 90.14 | 82.62 | 5.36 |
| BUSI | 82.31 | 73.42 | 18.46 |
| Kvasir-Instrument | ≈94.00 | – | – |
| HAM10000 | 93.93 | 89.37 | – |
In cross-domain generalizability experiments (i.e., training on BUS, testing on BUSI and vice versa), Wave-GMS consistently outperformed all state-of-the-art baselines, yielding the highest Dice scores and lowest HD95, confirming robustness to variation in data sources and acquisition protocols.
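For reference, the overlap metrics in the table can be computed with the standard definitions below; this is a generic sketch for binary masks, not the evaluation code used in the paper, and HD95 (which requires boundary distance computations, e.g., via MedPy or SciPy) is omitted.

```python
import numpy as np


def dice_iou(pred, gt, eps=1e-7):
    """Dice and IoU for binary masks given as boolean or 0/1 numpy arrays."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou
```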
4. Comparative Analysis
Relative advantages of Wave-GMS include:
- Parameter Efficiency: Models such as U-Net (14–20M), SDSeg (329M), MedSegDiff-V2 (129.4M), and even GMS (Huo et al., 27 Mar 2024; 1.56M trainable but >80M total parameters owing to the full SD-VAE) require far greater resources. Wave-GMS uses only ~2.6M trainable parameters together with the compact Tiny-VAE (~1.22M parameters per encoder/decoder); a generic way to verify such counts is sketched after this list.
- Segmentation Accuracy: Across all metrics (Dice, IoU, HD95), Wave-GMS matches or surpasses the performance of established discriminative (nnUNet, SwinUNet) and generative models (GMS, SDSeg), particularly in cross-domain scenarios.
- Resource Accessibility: The architecture enables training with large batches on affordable GPUs, directly addressing real-world deployment constraints in clinics and hospitals.
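As a sanity check on such parameter budgets, the standard PyTorch idiom below tallies trainable versus total parameters for any module; the module names in the usage comment are placeholders tied to the architecture sketch above.

```python
def count_params(module, trainable_only=True):
    """Sum parameter counts, optionally restricted to trainable (requires_grad) ones."""
    return sum(p.numel() for p in module.parameters()
               if p.requires_grad or not trainable_only)

# e.g. trainable = count_params(encoder) + count_params(lmm)       # expected ~2.6M
#      frozen    = count_params(tiny_vae, trainable_only=False)    # Tiny-VAE, not updated
```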
5. Practical Implications
Wave-GMS offers several operational benefits:
- Equitable Deployment: Its compactness and efficiency permit use in healthcare facilities without high-end compute infrastructure.
- Reduced Overfitting: The small trainable parameter set mitigates overfitting commonly observed in small-data medical contexts.
- Domain Generalization: Superior cross-protocol generalizability supports adaptation to novel imaging datasets without extensive retraining.
- Clinical Workflow Integration: Fast, robust segmentation accelerates diagnosis, treatment planning, and surgical guidance.
6. Future Directions
The paper identifies several research directions:
- 3D Extension: Adapting the architecture for volumetric segmentation (e.g., CT or MRI) to expand its scope.
- Novel Foundation Models: Exploring even more powerful, compact latent-space encoders/decoders for further improvements.
- Latent Alignment Optimization: Investigating advanced alignment losses and attention mechanisms to improve multi-scale–foundation model compatibility, further strengthening domain adaptation.
Conclusion
Wave-GMS integrates a multi-scale Haar DWT encoder, compact frozen generative backbone (Tiny-VAE), and a lightweight latent mapping model, achieving top-tier medical image segmentation accuracy and cross-domain generalization with minimal computational overhead. The design is specifically tailored for real-world healthcare deployment on cost-effective GPU hardware. The approach not only advances the efficiency frontier for medical segmentation networks but also sets the stage for future improvements through 3D extensions and foundation model innovations (Ahmed et al., 3 Oct 2025).