Wave-GMS: Efficient Medical Segmentation
- The paper introduces Wave-GMS, a lightweight multi-scale generative model that delivers state-of-the-art segmentation accuracy using only ~2.6M trainable parameters.
- It employs multi-resolution Haar wavelet feature extraction and a frozen Tiny-VAE to ensure efficient training and robust cross-domain performance.
- The approach outperforms larger architectures in key metrics like Dice score and HD95, making it highly practical for deployment in resource-constrained clinical settings.
Wave-GMS is a lightweight multi-scale generative model specifically designed for medical image segmentation in resource-constrained settings. It achieves state-of-the-art segmentation accuracy with exceptionally low memory and computational requirements by integrating multi-resolution wavelet feature extraction, compact latent-space mapping, and a distilled generative backbone. The architecture enables training with large batch sizes on consumer GPUs and exhibits strong generalizability across imaging domains and acquisition protocols, making it highly practical for real-world deployment in healthcare environments (Ahmed et al., 3 Oct 2025).
1. Model Architecture
Wave-GMS consists of three main components:
- Multi-Resolution Encoder: The input image $x$ is decomposed via a multi-level 2D discrete Haar wavelet transform (DWT), producing subbands (LL, LH, HL, HH) at each level; after three wavelet levels, the spatial resolution is reduced by a factor of eight. At each level $\ell$:
$$[\mathrm{LL}_\ell,\ \mathrm{LH}_\ell,\ \mathrm{HL}_\ell,\ \mathrm{HH}_\ell] = \mathrm{DWT}(\mathrm{LL}_{\ell-1}), \qquad \mathrm{LL}_0 = x.$$
Subsequently, features $f_\ell$ are extracted from the subbands at each level, downsampled to a common resolution, and concatenated:
$$z_{\mathrm{ms}} = \mathcal{A}\big(\mathrm{Concat}(f_1, \ldots, f_L)\big),$$
where $\mathcal{A}$ is an aggregation module.
- Latent Space Foundation: A pretrained, frozen Tiny-VAE encoder/decoder (a compact distillation of the SD-VAE) encodes both the image ($x$) and the ground-truth segmentation mask ($y$) into latent representations ($z_x$ and $z_y$, respectively).
- Latent Mapping Model (LMM): This is a lightweight encoder-decoder network (without explicit up/downsampling) comprising a stem convolution layer and four encoder/decoder "ResAttn" blocks (each combining residual units and spatial self-attention). The LMM predicts the segmentation-mask latent $\hat{z}_y$, which is then decoded into an image-space mask $\hat{y} = \mathcal{D}(\hat{z}_y)$ via the frozen Tiny-VAE decoder $\mathcal{D}$ (a minimal structural sketch follows below).
Only the multi-resolution encoder (~1.03M parameters) and LMM (~1.56M parameters) are trainable, totaling ~2.6M parameters. Tiny-VAE encoder and decoder (each ~1.22M parameters) remain frozen.
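For concreteness, the following is a minimal PyTorch-style sketch of this pipeline. The three-level Haar decomposition, the per-level feature extraction with an aggregation module, the stem-plus-four-ResAttn-block LMM, and the frozen Tiny-VAE decode follow the description above; the channel widths, the `encode`/`decode` interface of the Tiny-VAE, the fusion of wavelet and VAE latents by concatenation, and all module and function names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def haar_dwt2d(x):
    """One level of 2D Haar DWT; returns (LL, LH, HL, HH), each at half resolution."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    filt = torch.stack([ll, lh, hl, hh]).unsqueeze(1)        # (4, 1, 2, 2)
    c = x.shape[1]
    filt = filt.to(x).repeat(c, 1, 1, 1)                     # depthwise over channels
    out = F.conv2d(x, filt, stride=2, groups=c)              # (B, 4C, H/2, W/2)
    out = out.view(x.shape[0], c, 4, *out.shape[-2:])
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]


class MultiResolutionEncoder(nn.Module):
    """Three-level Haar pyramid -> per-level conv features -> aggregation module."""
    def __init__(self, in_ch=3, feat_ch=16, out_ch=4, levels=3):
        super().__init__()
        self.levels = levels
        self.per_level = nn.ModuleList(
            nn.Conv2d(4 * in_ch, feat_ch, 3, padding=1) for _ in range(levels))
        self.aggregate = nn.Conv2d(levels * feat_ch, out_ch, 1)  # aggregation module

    def forward(self, x):
        feats, cur = [], x
        for lvl in range(self.levels):
            ll, lh, hl, hh = haar_dwt2d(cur)
            f = self.per_level[lvl](torch.cat([ll, lh, hl, hh], dim=1))
            # bring every level to the coarsest (1/8) resolution before fusion
            f = F.adaptive_avg_pool2d(f, x.shape[-1] // 2 ** self.levels)
            feats.append(f)
            cur = ll                                          # recurse on the LL band
        return self.aggregate(torch.cat(feats, dim=1))        # multi-scale latent z_ms


class ResAttnBlock(nn.Module):
    """Residual conv unit followed by spatial self-attention."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
        self.attn = nn.MultiheadAttention(ch, num_heads=1, batch_first=True)

    def forward(self, x):
        x = x + self.conv(x)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)                      # (B, H*W, C) tokens
        t, _ = self.attn(t, t, t)
        return x + t.transpose(1, 2).view(b, c, h, w)


class LatentMappingModel(nn.Module):
    """Stem conv + four ResAttn blocks; no explicit up/downsampling."""
    def __init__(self, in_ch=8, ch=64, latent_ch=4, blocks=4):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResAttnBlock(ch) for _ in range(blocks)])
        self.head = nn.Conv2d(ch, latent_ch, 1)

    def forward(self, z):
        return self.head(self.blocks(self.stem(z)))


def segment(image, encoder, lmm, tiny_vae):
    """image -> fused latents -> predicted mask latent -> Tiny-VAE decode."""
    z_ms = encoder(image)                          # trainable multi-resolution encoder
    z_img = tiny_vae.encode(image)                 # frozen; placeholder interface
    z_hat = lmm(torch.cat([z_img, z_ms], dim=1))   # fusion by concatenation (assumed)
    # Tiny-VAE weights are frozen (requires_grad=False), but gradients still flow
    # *through* its decoder so the LMM can be supervised in image space.
    return tiny_vae.decode(z_hat), z_hat, z_ms, z_img
```

In this sketch the multi-scale latent and the Tiny-VAE image latent both sit at 1/8 of the input resolution, matching the factor-of-eight downsampling noted above, which is what makes their concatenation (and later alignment) straightforward.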
2. Training Efficiency
Key features driving training efficiency:
- Parameter Compactness: The total trainable parameter count (~2.6M) is vastly smaller than that of existing discriminative and generative architectures, which can run to hundreds of millions of parameters (e.g., SDSeg: 329M, MedSegDiff-V2: 129.4M).
- Memory Footprint: Because the Tiny-VAE is frozen and far smaller than the full SD-VAE, Wave-GMS avoids holding a large pretrained backbone, its gradients, and its optimizer states in GPU memory, enabling training with large batch sizes on 12GB GPUs (such as an RTX 3060).
- Multi-Scale Features: Haar DWT efficiently extracts frequency-localized features without spatial redundancy.
- Loss Functions: Training employs a compound objective (a short sketch is given after this list):
$$\mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + \mathcal{L}_{\mathrm{latent}} + \mathcal{L}_{\mathrm{align}},$$
where $\mathcal{L}_{\mathrm{Dice}}$ is a soft-Dice loss (with deep supervision across decoder layers), $\mathcal{L}_{\mathrm{latent}}$ is a reconstruction loss enforcing latent consistency between $\hat{z}_y$ and $z_y$, and $\mathcal{L}_{\mathrm{align}}$ regularizes alignment between the multi-scale wavelet features and the VAE latents.
- Optimizer: AdamW with cosine annealing learning rate scheduler.
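A minimal sketch of the compound objective and optimizer setup is shown below; the use of MSE for the latent-consistency and alignment terms, the equal weighting of the three terms, and the omission of deep supervision are assumptions made for brevity, and the latent arguments correspond to the outputs of the architecture sketch above.

```python
import torch
import torch.nn.functional as F


def soft_dice_loss(logits, target, eps=1e-6):
    """Soft-Dice on predicted mask probabilities vs. the ground-truth mask."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(-2, -1))
    denom = prob.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()


def wave_gms_loss(mask_logits, gt_mask, z_hat, z_gt, z_ms, z_img):
    """Compound objective: Dice + latent consistency + multi-scale/VAE alignment.
    Deep supervision over decoder layers is omitted here for brevity."""
    l_dice = soft_dice_loss(mask_logits, gt_mask)
    l_latent = F.mse_loss(z_hat, z_gt)   # predicted vs. ground-truth mask latent
    l_align = F.mse_loss(z_ms, z_img)    # wavelet features vs. VAE image latent
    return l_dice + l_latent + l_align


# Optimizer named in the paper: AdamW with a cosine-annealing LR schedule
# (the learning rate and T_max below are placeholders, not the paper's values).
# optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
```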
3. Performance Evaluation
Wave-GMS was evaluated on four representative benchmarks:
| Dataset | Dice Score (%) | IoU (%) | HD95 (pixels) |
|---|---|---|---|
| BUS | 90.14 | 82.62 | 5.36 |
| BUSI | 82.31 | 73.42 | 18.46 |
| Kvasir-Instrument | ≈94.00 | – | – |
| HAM10000 | 93.93 | 89.37 | – |
In cross-domain generalizability experiments (i.e., training on BUS, testing on BUSI and vice versa), Wave-GMS consistently outperformed all state-of-the-art baselines, yielding the highest Dice scores and lowest HD95, confirming robustness to variation in data sources and acquisition protocols.
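For reference, the overlap metrics in the table can be computed with the standard definitions below; this is a generic sketch for binary masks, not the evaluation code used in the paper, and HD95 (which requires boundary distance computations, e.g., via MedPy or SciPy) is omitted.

```python
import numpy as np


def dice_iou(pred, gt, eps=1e-7):
    """Dice and IoU for binary masks given as boolean or 0/1 numpy arrays."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou
```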
4. Comparative Analysis
Relative advantages of Wave-GMS include:
- Parameter Efficiency: Models such as U-Net (14–20M), SDSeg (329M), MedSegDiff-V2 (129.4M), and even GMS (Huo et al., 27 Mar 2024; 1.56M trainable but >80M total parameters owing to the full SD-VAE) require far greater resources. Wave-GMS uses only ~2.6M trainable parameters together with the compact Tiny-VAE (~1.22M parameters per encoder/decoder); a generic way to verify such counts is sketched after this list.
- Segmentation Accuracy: Across all metrics (Dice, IoU, HD95), Wave-GMS matches or surpasses the performance of established discriminative (nnUNet, SwinUNet) and generative models (GMS, SDSeg), particularly in cross-domain scenarios.
- Resource Accessibility: The architecture enables training with large batches on affordable GPUs, directly addressing real-world deployment constraints in clinics and hospitals.
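As a sanity check on such parameter budgets, the standard PyTorch idiom below tallies trainable versus total parameters for any module; the module names in the usage comment are placeholders tied to the architecture sketch above.

```python
def count_params(module, trainable_only=True):
    """Sum parameter counts, optionally restricted to trainable (requires_grad) ones."""
    return sum(p.numel() for p in module.parameters()
               if p.requires_grad or not trainable_only)

# e.g. trainable = count_params(encoder) + count_params(lmm)       # expected ~2.6M
#      frozen    = count_params(tiny_vae, trainable_only=False)    # Tiny-VAE, not updated
```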
5. Practical Implications
Wave-GMS offers several operational benefits:
- Equitable Deployment: Its compactness and efficiency permit use in healthcare facilities without high-end compute infrastructure.
- Reduced Overfitting: The small trainable parameter set mitigates overfitting commonly observed in small-data medical contexts.
- Domain Generalization: Superior cross-protocol generalizability supports adaptation to novel imaging datasets without extensive retraining.
- Clinical Workflow Integration: Fast, robust segmentation accelerates diagnosis, treatment planning, and surgical guidance.
6. Future Directions
The paper identifies several research directions:
- 3D Extension: Adapting the architecture for volumetric segmentation (e.g., CT or MRI) to expand its scope.
- Novel Foundation Models: Exploring even more powerful, compact latent-space encoders/decoders for further improvements.
- Latent Alignment Optimization: Investigating advanced alignment losses and attention mechanisms to improve multi-scale–foundation model compatibility, further strengthening domain adaptation.
Conclusion
Wave-GMS integrates a multi-scale Haar DWT encoder, compact frozen generative backbone (Tiny-VAE), and a lightweight latent mapping model, achieving top-tier medical image segmentation accuracy and cross-domain generalization with minimal computational overhead. The design is specifically tailored for real-world healthcare deployment on cost-effective GPU hardware. The approach not only advances the efficiency frontier for medical segmentation networks but also sets the stage for future improvements through 3D extensions and foundation model innovations (Ahmed et al., 3 Oct 2025).