MedDX-FT: Frequency Decoupled Diffusion Model
- The paper shows that FDDM, by integrating a shared-latent VAE with a frequency-decoupled dual-path diffusion process, significantly improves anatomical fidelity and realism in MR-to-CT translation.
- It utilizes a two-stage pipeline with Sobel-based edge detection and Laplacian pyramid fusion to effectively separate and recombine frequency details.
- Quantitative results demonstrate FDDM’s superiority over competing models in FID, SSIM, and PSNR, validating its practical benefits for unsupervised image translation.
MedDX-FT (Frequency Decoupled Diffusion Model, FDDM) is an unsupervised framework for MR-to-CT medical image translation, distinguished by its frequency-decoupled dual-path diffusion process and structural guidance from a VAE-based module. FDDM addresses limitations in prior diffusion-based and generative adversarial models, focusing explicitly on anatomical faithfulness and realism when translating between unpaired MR and CT datasets (Li et al., 2023).
1. Architectural Overview
FDDM employs a two-stage pipeline integrating variational autoencoding and frequency-based conditional denoising. The first stage utilizes a shared-latent-space VAE (in the style of UNIT), receiving an MR image and its Sobel-derived edge map and producing a coarse CT prediction and its corresponding edge map. Key losses include VAE reconstruction, Kullback-Leibler divergence, adversarial (GAN) discrimination, cycle-consistency, and rotation-consistency.
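The Sobel-derived edge map used as structural guidance can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `sobel_edges`, the edge-replicating padding, and the normalization to [0, 1] are all assumptions made for the example.

```python
import numpy as np


def sobel_edges(image: np.ndarray) -> np.ndarray:
    """Return the Sobel gradient magnitude of a 2-D image, scaled to [0, 1].

    Hypothetical helper: padding mode and normalization are illustrative choices.
    """
    kx = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])  # horizontal-gradient kernel
    ky = kx.T                          # vertical-gradient kernel
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    h, w = image.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Correlate with each 3x3 kernel by summing shifted views of the padded image.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    return magnitude / peak if peak > 0 else magnitude
```

In a pipeline like the one described, such an edge map would be computed for each MR slice and passed to the VAE alongside the image; any thresholding of the map is left out of this sketch.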
The second stage introduces a frequency-decoupled diffusion model. The forward diffusion process uses “blue-noise” perturbation, which acts as a low-pass filter on the signal: high-frequency components of the input image are corrupted more heavily than low-frequency ones, yielding the noised image from which reverse diffusion starts. Reverse diffusion then proceeds via two parallel branches:
- Explicit (high-frequency) path: stochastic denoising with random noise
- Implicit (low-frequency) path: deterministic denoising
At each timestep, predictions from both paths are fused using a Laplacian pyramid, separating frequency bands for faithful recombination. The process is detailed via stepwise pseudocode, demonstrating the initialization, edge thresholding, Laplacian pyramid generation, and dual-path updates.
2. Initial Conversion Module
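The initial conversion employs a shared-latent-space VAE variant with the following configuration:
- Inputs: the MR image and its Sobel-derived edge map (unpaired CT slices are used for the CT domain during training)
- Encoders/Decoders: one encoder and one decoder per domain (MR and CT), all operating on the shared latent space
- Variational Posteriors: one per domain, each over the shared latent code
- VAE Losses: Each domain regularizes its latent distribution via KL divergence and enforces reconstruction via a reconstruction penalty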
- Adversarial Losses: Encourage realistic outputs in both domains
- Cycle-consistency Losses: Encourage the MR→CT and CT→MR mappings to invert each other, which discourages mode collapse
- Rotation-consistency Losses: Stabilize spatial structure by rotating inputs and outputs and minimizing the discrepancies
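The combined Stage 1 loss sums all of these terms. After training, inference encodes the MR image and its edge map into the shared latent space and decodes the result with the CT decoder to obtain the coarse CT prediction and its edges.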
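To make the rotation-consistency idea concrete, the sketch below compares "rotate, then translate" against "translate, then rotate" for an arbitrary translator. It assumes an L1 discrepancy and 90-degree rotations; the source does not specify either, and `rotation_consistency_loss` and `translate` are hypothetical names standing in for the Stage 1 model.

```python
from typing import Callable

import numpy as np


def rotation_consistency_loss(
    translate: Callable[[np.ndarray], np.ndarray],
    image: np.ndarray,
    k: int = 1,
) -> float:
    """Mean absolute gap between translating a rotated image and rotating a translated one.

    `translate` is a placeholder for the MR-to-CT mapping; `k` counts 90-degree turns.
    The L1 distance is an assumption made for this illustration.
    """
    rotated_then_translated = translate(np.rot90(image, k))
    translated_then_rotated = np.rot90(translate(image), k)
    return float(np.mean(np.abs(rotated_then_translated - translated_then_rotated)))
```

A translator that commutes with rotation (for example, any purely pixel-wise intensity mapping) gives a loss of zero, while a mapping with a built-in directional bias is penalized.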
3. Frequency-Decoupled Diffusion Process
Forward Diffusion
The blue-noise forward step replaces standard Gaussian diffusion:
- Blue noise concentrates its power at high spatial frequencies, imparting greater noise to high-frequency components
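One common way to obtain noise with this property is to reshape the spectrum of white Gaussian noise so that power grows with spatial frequency. The sketch below does exactly that; it is an assumption-laden illustration, since the exact blue-noise construction used by FDDM is not recoverable from the text here. The function name `blue_noise`, the `exponent` parameter, and the unit-variance normalization are all hypothetical.

```python
import numpy as np


def blue_noise(shape, rng, exponent=1.0):
    """Sample 2-D noise whose power spectrum grows as (radial frequency) ** exponent.

    Illustrative spectral-shaping approach, not necessarily the paper's definition.
    """
    white = rng.standard_normal(shape)
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    # Amplitude weight radius**(exponent/2) gives power proportional to radius**exponent.
    weight = radius ** (exponent / 2.0)
    shaped = np.fft.ifft2(np.fft.fft2(white) * weight).real
    # Rescale to unit standard deviation so it can stand in for Gaussian noise.
    return shaped / shaped.std()


rng = np.random.default_rng(0)
noise = blue_noise((64, 64), rng)
```

Because the weight is zero at zero frequency, the sample has no constant offset, and most of its energy sits in fine detail, which is the behaviour the forward process relies on.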
Dual-path Reverse Diffusion
At each reverse timestep:
- Prediction: Clean estimates via each path
- Fusion via Laplacian pyramid: Extracts and recombines frequency bands from both predictions
- Branch updates: High-frequency (explicit, noisy) and low-frequency (implicit, noiseless)
The diffusion loss is the standard mean-squared-error (MSE) denoising objective.
This decoupling ensures anatomical structure is preserved through the implicit path, while the explicit path refines realistic detail.
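The fusion step can be illustrated with a small Laplacian pyramid that takes the detail bands from the explicit prediction and the coarsest band from the implicit prediction. This is a sketch under stated assumptions: the 2x2 average-pool and nearest-neighbour resampling, the number of levels, and the choice of which bands come from which path are illustrative, and all function names are hypothetical. Image sides must be divisible by `2 ** levels`.

```python
import numpy as np


def downsample(img):
    """Halve resolution with 2x2 average pooling."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(img):
    """Double resolution with nearest-neighbour repetition."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)


def laplacian_pyramid(img, levels):
    """Return [detail_0, ..., detail_{levels-1}, low_frequency_residual]."""
    bands = []
    current = img.astype(np.float64)
    for _ in range(levels):
        smaller = downsample(current)
        bands.append(current - upsample(smaller))  # detail lost by downsampling
        current = smaller
    bands.append(current)
    return bands


def reconstruct(bands):
    """Invert laplacian_pyramid by adding details back from coarse to fine."""
    current = bands[-1]
    for band in reversed(bands[:-1]):
        current = band + upsample(current)
    return current


def fuse(explicit_pred, implicit_pred, levels=2):
    """Combine high-frequency bands of one prediction with the low band of the other."""
    high = laplacian_pyramid(explicit_pred, levels)
    low = laplacian_pyramid(implicit_pred, levels)
    return reconstruct(high[:-1] + [low[-1]])
```

With this construction the fused image inherits its coarse layout from the implicit (deterministic) prediction and its fine texture from the explicit (stochastic) one; fusing a prediction with itself returns it unchanged.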
4. Unpaired Data Training Protocol
FDDM’s architecture obviates the need for paired datasets. Stage 1 uses only unpaired MR and CT slices, learning a bidirectional shared representation. Stage 2 conditions its diffusion solely on generated coarse CT and its edge guidance, with the model trained on CT domain images independently. No paired supervision between MR and CT is applied at any phase.
5. Quantitative Comparisons and Metrics
Performance is evaluated using three core metrics:
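- Fréchet Inception Distance (FID) for realism:
$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$
where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features for real and generated images, respectively.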
- Peak Signal-to-Noise Ratio (PSNR)
- Structural Similarity Index (SSIM)
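Two of these metrics are simple enough to sketch directly. The code below computes PSNR from pixel values and the Fréchet distance between two Gaussians from given feature statistics; extracting Inception features is out of scope, and SSIM is omitted. The function names and the `data_range` parameter are assumptions for illustration, not the evaluation code behind the reported numbers.

```python
import numpy as np


def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in dB; `data_range` is the maximum possible intensity span."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    For FID, the means and covariances would come from Inception features.
    """
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues of S1 S2,
    # which are real and non-negative for positive semi-definite inputs.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)
```

Lower Fréchet distance means the generated feature distribution is closer to the real one, while higher PSNR means smaller pixel-wise error against the reference.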
Results on benchmark datasets (brain MR→CT; pelvis MR→CT) show that FDDM achieves the lowest FID (25.86 for brain, 29.20 for pelvis), surpassing CycleGAN, GcGAN, RegGAN, UNIT, MUNIT, SDEdit, and SynDiff. SSIM and PSNR are likewise matched or improved.
| Method | FID (brain) ↓ | SSIM (brain) ↑ | PSNR (brain) ↑ |
|---|---|---|---|
| CycleGAN | 62.41 | 0.8788 | 36.50 |
| GcGAN | 60.03 | 0.8841 | 36.96 |
| RegGAN | 73.76 | 0.8187 | 35.73 |
| UNIT | 49.71 | 0.8960 | 36.87 |
| MUNIT | 72.85 | 0.8449 | 35.79 |
| SDEdit | 75.99 | 0.7385 | 35.37 |
| SynDiff | 70.54 | 0.8632 | 37.13 |
| FDDM | 25.86 | 0.9144 | 38.08 |
Comparable improvements are reported on the pelvis dataset.
6. Ablation Studies and Module Impacts
Extensive ablation supports the functional contributions of FDDM’s components:
- Rotation-consistency loss in Stage 1 reduces FID from 48.69 (no RC) to 39.82 (with RC) and increases SSIM from 0.9014 to 0.9123.
- Dual-path reverse diffusion shows that omitting the implicit path worsens SSIM, while removing the explicit path degrades FID. Full dual-path yields optimal balance: FID=25.86, SSIM=0.9144, PSNR=38.08.
- Number of forward diffusion steps: Performance peaks at an intermediate value; fewer steps under-correct VAE errors, while more steps erode anatomical integrity.
Qualitative comparisons confirm FDDM’s preservation of anatomical details—such as brain sulci and pelvic bone edges—relative to single-path or GAN/VAE methods.
7. Significance and Implications
FDDM combines three components: shared-latent VAE structural extraction, frequency-adaptive blue-noise diffusion, and dual-path frequency-specific denoising. On unpaired MR→CT data it reports the lowest FID among the compared methods together with competitive SSIM and PSNR, and the ablations indicate that both frequency decoupling and structural guidance contribute to these results. This suggests broader relevance for other unsupervised domain translation problems requiring precise structural preservation (Li et al., 2023).