SPADE-LDM: 3D Cardiac MRI Synthesis
- SPADE-LDM is a 3D conditional generative framework that synthesizes high-fidelity LGE cardiac MRI volumes from composite semantic masks encoding anatomical labels and tissue clusters.
- Its two-stage architecture combines a 3D convolutional autoencoder with a latent diffusion U-Net enhanced by spatially-adaptive (SPADE) conditioning for precise anatomical structure reproduction.
- Empirical results demonstrate significant improvements in segmentation metrics and anatomical coherence compared to baseline models such as Pix2Pix and SPADE-GAN.
SPADE-LDM is a 3D conditional generative framework for synthesizing late gadolinium-enhanced (LGE) cardiac MRI volumes from composite semantic masks that encode both anatomical labels and tissue clusters. It integrates spatially-adaptive (SPADE) conditioning with latent diffusion modeling (LDM) within a two-stage architecture, targeting high-fidelity, label-conditioned image synthesis to augment scarce medical imaging data, specifically for improving the segmentation of complex cardiac structures such as the left atrial wall and endocardium (Al-Sanaani et al., 8 Jan 2026).
1. Latent Diffusion Modeling in 3D Medical Image Synthesis
SPADE-LDM employs a two-phase latent diffusion process adapted for 3D volumetric medical images. The framework encodes a real MRI volume $x$ into a latent code $z_0$ using a pretrained variational autoencoder (VAE). The forward noising process in latent space generates a Markov chain

$$q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\, \sqrt{1-\beta_t}\, z_{t-1},\, \beta_t I\right),$$

where a cosine noise schedule governs $\beta_t$. The reverse denoising process is performed by a 3D U-Net $\epsilon_\theta$, which estimates the noise vector added at each step, optimizing the standard DDPM loss under semantic mask conditioning $m$:

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{z_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\left\| \epsilon - \epsilon_\theta(z_t, t, m) \right\|_2^2\right].$$
Classifier-free guidance is implemented by randomly replacing $m$ with a null mask $\varnothing$ during 10% of training steps. During inference, the conditional and unconditional noise estimates are linearly combined as $\tilde{\epsilon} = \epsilon_\theta(z_t, t, \varnothing) + w\left(\epsilon_\theta(z_t, t, m) - \epsilon_\theta(z_t, t, \varnothing)\right)$ with a guidance weight of $w = 1.5$.
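The mechanics above can be illustrated with a minimal NumPy sketch. The 3D U-Net is omitted; the schedule helper, array shapes, and step count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

T = 1000  # number of diffusion steps (assumed)

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction alpha_bar(t) under a cosine noise schedule."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def forward_noise(z0, t, rng):
    """Sample z_t ~ q(z_t | z_0) in closed form from the latent z_0."""
    ab = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(ab) * z0 + np.sqrt(1.0 - ab) * eps, eps

def cfg_noise_estimate(eps_cond, eps_uncond, w=1.5):
    """Classifier-free guidance: combine conditional and unconditional estimates."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((8, 4, 4, 4))      # toy 8-channel 3D latent
zt, eps = forward_noise(z0, t=500, rng=rng)

# With w = 1, guidance reduces exactly to the conditional estimate.
e_c, e_u = rng.standard_normal(z0.shape), rng.standard_normal(z0.shape)
assert np.allclose(cfg_noise_estimate(e_c, e_u, w=1.0), e_c)
```

Note that with $w = 1.5 > 1$ the combination extrapolates past the conditional estimate, which is what strengthens adherence to the mask.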
2. SPADE Conditioning for Semantic Mask Control
SPADE-LDM incorporates SPADE (Spatially-Adaptive Denormalization) conditioning at every residual block within the diffusion decoder. The semantic mask $m$ contains one-hot channels for anatomical labels (endo = 1, wall = 2) and unsupervised tissue clusters from intensity-based k-means (with $k$ fixed in the baseline configuration). Each SPADE normalization computes

$$\hat{h}_{c,x,y,z} = \gamma_{c,x,y,z}(m)\,\frac{h_{c,x,y,z} - \mu_c}{\sigma_c} + \beta_{c,x,y,z}(m),$$

where $\gamma$ and $\beta$ are learned by two CNNs from the corresponding label channels at each spatial location, and $\mu_c$, $\sigma_c$ are per-channel feature statistics. This enables spatially-adaptive modulation of decoder features, driving anatomical and textural alignment with the input masks. Empirically, conditioning on composite masks (endo, wall, plus clusters) yields more anatomically coherent context than using only sparse (endo+wall) labels.
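A compact NumPy sketch of the denormalization step follows. The two mask-conditioned CNNs that regress $\gamma$ and $\beta$ are stubbed out as simple affine functions of a toy mask; their exact form is an assumption for illustration.

```python
import numpy as np

def spade_norm(h, gamma, beta, eps=1e-5):
    """Normalize features per channel, then modulate spatially.

    h     : (C, D, H, W) decoder feature map
    gamma : spatially varying scale predicted from the mask (broadcastable to h)
    beta  : spatially varying shift predicted from the mask (broadcastable to h)
    """
    mu = h.mean(axis=(1, 2, 3), keepdims=True)       # per-channel mean
    sigma = h.std(axis=(1, 2, 3), keepdims=True)     # per-channel std
    return gamma * (h - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 2, 2, 2))
mask = rng.integers(0, 2, size=(1, 2, 2, 2)).astype(float)  # toy 1-channel mask
gamma = 1.0 + 0.1 * mask   # stand-ins for the learned mask-conditioned maps
beta = 0.05 * mask
out = spade_norm(h, gamma, beta)
```

The key property is that $\gamma$ and $\beta$ vary voxel-by-voxel with the mask, so each anatomical label can impose its own feature statistics, rather than a single global affine as in plain GroupNorm.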
3. Network Architecture
The framework consists of two sequential stages:
- Stage 1: 3D Convolutional Autoencoder
- Encoder: 4 residual down-sampling blocks (3×3×3 convolutions, GroupNorm, SiLU), each dividing resolution by 2 and doubling channels; final bottleneck has 8 channels.
- Decoder: 4 symmetric up-sampling blocks with 3D convolutions, GroupNorm, SiLU, trilinear upsampling. Each block is preceded by a SPADE block conditioned on the (upsampled) semantic mask.
- Discriminator: PatchGAN with gradient penalty (introduced after 10 epochs).
- Stage 2: Latent Diffusion U-Net
- Inputs: Noised latent $z_t$, a 128-dimensional timestep embedding, and the semantic mask downsampled to the latent resolution.
- U-Net: 4 downsampling/upsampling levels, SPADE-residual blocks, optional self-attention at bottleneck, skip connections; channel widths: 8, 16, 32, 64.
- Output: A 3×3×3 convolution projects features back to the 8-channel latent space, yielding the predicted noise $\hat{\epsilon}$.
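The 128-dimensional timestep embedding listed above can be sketched as the standard sinusoidal embedding used in most DDPM implementations; the paper does not specify its exact form, so this is an assumption.

```python
import numpy as np

def timestep_embedding(t, dim=128, max_period=10000.0):
    """Standard sinusoidal embedding of a scalar diffusion timestep t."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.cos(args), np.sin(args)])

emb = timestep_embedding(500)
assert emb.shape == (128,)
```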
4. Training Protocol and Objectives
Training unfolds in two stages:
- Autoencoder (Stage 1):
- Reconstruction loss:

$$\mathcal{L}_{\text{rec}} = \mathcal{L}_{\text{pix}}(x, \hat{x}) + \lambda_{\text{perc}}\, \mathcal{L}_{\text{perc}}(x, \hat{x}),$$

where $\mathcal{L}_{\text{pix}}$ is a voxel-wise reconstruction term and $\mathcal{L}_{\text{perc}}$ is a perceptual loss computed via a pretrained MedicalNet ResNet-50.
- Adversarial loss: a PatchGAN adversarial objective on reconstructions, with gradient penalty on the discriminator.
- Diffusion Model (Stage 2):
- Denoising loss: see above.
- Shape consistency loss:

$$\mathcal{L}_{\text{shape}} = \mathcal{L}_{\text{seg}}\!\left(S(\hat{x}),\, m_{\text{gt}}\right),$$

where $S$ is a frozen 3D U-Net segmenter, $m_{\text{gt}}$ is the ground-truth mask, and $\mathcal{L}_{\text{seg}}$ is a segmentation loss comparing the two.
- Total loss: the denoising objective plus the weighted shape-consistency term, $\mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda_{\text{shape}}\, \mathcal{L}_{\text{shape}}$.
Training is performed for 200 epochs in each stage (batch size 1), using Adam optimizers with distinct hyperparameters for each subcomponent. The bottleneck has 8 channels with spatial downsampling by a factor of 8. Data augmentation incorporates affine transformations, elastic deformations, intensity noise, gamma shifts, and simulated bias-field effects. All LGE MRI volumes and masks are resampled to a common volumetric grid at 1 mm isotropic resolution.
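The Stage-2 objective can be sketched as follows in NumPy. The soft-Dice form of the segmentation term and the weight `lam_shape` are assumptions for illustration; the paper specifies only that a frozen segmenter supplies the shape-consistency signal.

```python
import numpy as np

def ddpm_loss(eps, eps_pred):
    """Mean-squared error between true and predicted noise."""
    return np.mean((eps - eps_pred) ** 2)

def soft_dice_loss(pred, target, eps=1e-6):
    """1 - Dice overlap between a soft prediction and a binary mask."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def stage2_loss(eps, eps_pred, seg_pred, seg_gt, lam_shape=0.1):
    """Denoising term plus weighted shape-consistency term (weight assumed)."""
    return ddpm_loss(eps, eps_pred) + lam_shape * soft_dice_loss(seg_pred, seg_gt)

rng = np.random.default_rng(0)
e, e_hat = rng.standard_normal((8, 4, 4, 4)), rng.standard_normal((8, 4, 4, 4))
seg = rng.random((2, 4, 4, 4))                      # toy segmenter softmax output
gt = (rng.random((2, 4, 4, 4)) > 0.5).astype(float)  # toy binary ground truth
total = stage2_loss(e, e_hat, seg, gt)

# A perfect noise prediction with a perfect segmentation drives the loss to ~0.
assert abs(stage2_loss(e, e, gt, gt)) < 1e-5
```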
5. Synthetic Image Quality and Downstream Segmentation Performance
SPADE-LDM achieves state-of-the-art synthesis fidelity on LGE MRI benchmarks, demonstrably outperforming both Pix2Pix and SPADE-GAN baselines in FID, MMD, MS-SSIM, and PSNR metrics (see table below):
| Model | FID | MMD | MS-SSIM | PSNR (dB) |
|---|---|---|---|---|
| Pix2Pix | 40.821 | 36.890 | 0.763 | 23.067 |
| SPADE-GAN | 7.652 | 4.433 | 0.811 | 23.542 |
| SPADE-LDM | 4.063 | 2.656 | 0.826 | 24.792 |
When combined with real data, the synthetic LGE volumes yield a statistically significant increase in the Dice score of a 3D U-Net trained for LA cavity segmentation, relative to training on real data alone (one-tailed Wilcoxon test). Qualitative evaluations highlight SPADE-LDM's realistic reproduction of wall thickness (2–3 mm), gadolinium texture, background preservation, and cross-slice anatomical continuity.
6. Ablation Experiments and Analysis
Ablation studies demonstrate that composite semantic masks (endo, wall, and k-means clusters) produce more anatomically complete and coherent context than sparse masks (endo+wall alone). Removing the shape-consistency loss degrades wall fidelity (an approximately 2% rise in FID) and reduces segmentation performance when the synthetic data is used for augmentation. Classifier-free guidance with a weight of 1.5 improves both shape adherence and contrast relative to unconditional or fully conditional sampling. SPADE-based normalization is critical for precise structure preservation at the decoder and diffusion scales; group normalization alone yields suboptimal results.
Model progression across Pix2Pix, SPADE-GAN, and SPADE-LDM reveals increasingly sophisticated capture of both global anatomy and local texture, with SPADE-LDM providing best fidelity in challenging regions such as the thin LA wall and surrounding myocardium.
7. Significance and Future Directions
SPADE-LDM demonstrates that 3D latent diffusion models conditioned with SPADE on composite anatomical masks can generate realistic, semantically controlled medical images, yielding significant gains as training data for segmentation models. The integration of multi-class semantic guidance, advanced loss composition (reconstruction, adversarial, denoising, and anatomical shape), and classifier-free guidance is instrumental for robust high-resolution synthesis. A plausible implication is that similar architectures could be adapted for other anatomical regions or modalities where annotated sample scarcity and morphological complexity limit conventional supervised learning.
The results indicate that SPADE-LDM provides an effective framework for improving the segmentation of under-represented cardiac structures and may serve as a blueprint for future developments in semantically-conditioned, data-efficient 3D generative imaging (Al-Sanaani et al., 8 Jan 2026).