DDoS-UNet for Dynamic MRI Super-Resolution
- DDoS-UNet is a deep learning architecture that leverages dual-channel inputs to combine spatial and temporal information for high-resolution dynamic MRI reconstruction.
- It employs a modified 3D UNet with trilinear upsampling and 1×1×1 convolutions, effectively reducing artefacts and preserving anatomical consistency.
- Quantitative results show statistically significant improvements over baseline methods, with higher SSIM and PSNR and lower NRMSE under extreme undersampling.
DDoS-UNet is a deep learning architecture specifically designed for super-resolution in dynamic magnetic resonance imaging (MRI), addressing the central limitation posed by the spatio-temporal trade-off inherent to rapid MR data acquisition. Traditional approaches either recover spatial detail at the expense of temporal resolution or treat each temporal frame independently, thus losing temporal coherence. DDoS-UNet introduces a dynamic dual-channel 3D UNet framework that explicitly leverages both spatial and temporal information, yielding enhanced reconstructions of dynamic MRI sequences even under extreme undersampling conditions (Chatterjee et al., 2022).
1. Spatio-Temporal Trade-off in Dynamic MRI
Dynamic MRI seeks to visualize organ motion or physiological changes, necessitating the acquisition of image volumes on sub-second timescales. The main limitation is the spatio-temporal trade-off: acquiring a small percentage of k-space per time frame (e.g., 4% of lines in the phase-encode direction) yields a theoretical acceleration factor of 25 but leads to pronounced spatial blurring due to the loss of high-frequency information. Standard super-resolution approaches based on single-image models process each time point as an independent entity, neglecting inter-frame redundancy and temporal consistency. DDoS-UNet circumvents this by incorporating information from both a high-resolution static “planning” scan and previously reconstructed frames, effectively deploying spatial and temporal priors at each step (Chatterjee et al., 2022).
2. Architecture: Modified 3D UNet with Dual-Channel Input
The DDoS-UNet design builds upon the standard 3D UNet encoder-decoder with skip-connections, introducing two pivotal architectural modifications:
- Dual-channel input: The first convolutional layer receives as input two channels—the current low-resolution (LR) volume (trilinearly upsampled to the high-resolution grid) and a high-resolution prior, which is the static planning scan for the initial time frame and the network’s previous super-resolved (SR) output for subsequent frames.
- Decoder with trilinear upsampling and 1×1×1 convolution: Instead of transposed convolutions, which are prone to checkerboard artefacts, each decoder up-block performs trilinear upsampling by a factor of two, followed by a 1×1×1 convolution and then two standard 3×3×3 convolutions with ReLU activations.
The network comprises three down-sampling and three up-sampling blocks. The down-sampling pathway doubles the number of feature maps at each block (starting at 64); each block comprises two 3×3×3 convolution + ReLU layers followed by an average pooling layer (kernel size 2). The up-sampling pathway reduces the number of feature maps and concatenates the corresponding skip-connections. A final 1×1×1 convolution produces the SR output volume (Chatterjee et al., 2022).
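The dual-channel input and the first two operations of each decoder up-block can be illustrated without a deep learning framework. The NumPy sketch below is illustrative only: the function names, the half-pixel sampling convention, and the random weights are assumptions and are not taken from Chatterjee et al. (2022). It shows that the dual-channel input is a two-channel stack on the HR grid, and that a 1×1×1 convolution is a per-voxel linear mix of channels applied after trilinear upsampling.

```python
import numpy as np

def linear_upsample_axis(x, axis, factor=2):
    """Linearly interpolate x along one axis (half-pixel convention, an assumption)."""
    n = x.shape[axis]
    # Sampling positions of the output grid expressed in input coordinates.
    pos = np.clip((np.arange(n * factor) + 0.5) / factor - 0.5, 0, n - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n - 1)
    w_shape = [1] * x.ndim
    w_shape[axis] = -1
    w = (pos - lo).reshape(w_shape)
    return (1 - w) * np.take(x, lo, axis=axis) + w * np.take(x, hi, axis=axis)

def trilinear_upsample(x, factor=2):
    """x: (C, D, H, W) -> (C, factor*D, factor*H, factor*W), one axis at a time."""
    for axis in (1, 2, 3):
        x = linear_upsample_axis(x, axis, factor)
    return x

def conv1x1x1(x, weight, bias):
    """1x1x1 convolution: mixes channels at every voxel. weight: (C_out, C_in)."""
    return np.einsum("oc,cdhw->odhw", weight, x) + bias[:, None, None, None]

def dual_channel_input(lr_upsampled, hr_prior):
    """Stack the LR volume (already on the HR grid) and the HR prior as two channels."""
    return np.stack([lr_upsampled, hr_prior], axis=0)

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 6, 6))   # toy decoder features (C, D, H, W)
weight = rng.standard_normal((4, 8)) * 0.1     # hypothetical, untrained weights
up = conv1x1x1(trilinear_upsample(features), weight, np.zeros(4))
print(up.shape)  # (4, 8, 12, 12)
```

In a full up-block, this output would be concatenated with the matching skip-connection and passed through the two 3×3×3 convolutions with ReLU activations.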
3. Stepwise Temporal Inference
DDoS-UNet operates in two distinct temporal phases for each sequence:
- Antipasto (Initial) Phase: For the first time-point ($t = 0$), Channel 1 is the LR input and Channel 2 is the static HR planning scan. The network outputs the initial super-resolved frame $\hat{x}_0$.
- Recursive Phase: For subsequent time-points $t \geq 1$, Channel 1 is the current LR volume, and Channel 2 is the SR output from the previous frame ($\hat{x}_{t-1}$). This enables temporal propagation of anatomical detail and consistency throughout the series.
This recursive inference bootstraps the reconstruction at each time step using previously inferred high-resolution information, thus regularizing reconstructions for both spatial fidelity and temporal stability (Chatterjee et al., 2022).
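The two phases amount to a simple loop in which the prior channel is swapped after the first frame. The following is a minimal sketch, assuming a hypothetical `model(lr, prior)` callable that stands in for the trained network and LR inputs that are already upsampled to the HR grid. The toy averaging model and the scalar "volumes" are for illustration only.

```python
def ddos_inference(model, lr_frames, planning_scan):
    """Run the initial and recursive phases over one dynamic series.

    model: callable (lr_volume, prior_volume) -> sr_volume (hypothetical interface).
    """
    sr_frames = []
    prior = planning_scan          # initial phase: static HR planning scan
    for lr in lr_frames:
        sr = model(lr, prior)
        sr_frames.append(sr)
        prior = sr                 # recursive phase: previous SR output
    return sr_frames

# Toy stand-in for the network: average of the two channels.
def toy_model(lr, prior):
    return 0.5 * (lr + prior)

sr_frames = ddos_inference(toy_model, [2.0, 4.0, 6.0], planning_scan=0.0)
print(sr_frames)  # [1.0, 2.5, 4.25]
```

Because each output feeds the next step, frames must be processed in temporal order, and any error in an early frame can propagate to later ones.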
4. Mathematical Formulation and Loss Functions
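In single-frame super-resolution, the mapping is $\hat{x}_t = f_\theta(y_t)$, where $y_t$ is the (upsampled) LR volume at time-point $t$ and $\hat{x}_t$ is the SR estimate. DDoS-UNet extends this to a dual-channel formulation:
$$\hat{x}_t = f_\theta(y_t, p_t), \qquad p_t = \begin{cases} x_{\text{plan}}, & t = 0 \\ \hat{x}_{t-1}, & t \geq 1 \end{cases}$$
where $x_{\text{plan}}$ is the static HR planning scan and $p_t$ is the HR prior supplied on the second channel.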
The network is trained with a perceptual loss. Multi-scale features are extracted using a pre-trained, frozen UNet MSS; the loss is computed as a norm of the differences between predicted and ground-truth features at multiple scales. This penalizes discrepancies in perceptual structure rather than pixel-wise error, encouraging sharper outputs and structural fidelity.
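The structure of such a loss can be sketched as a sum of feature-space distances over scales. The example below is a sketch under stated assumptions: the feature extractor is a toy multi-scale average pooling, not the pre-trained UNet MSS, and the mean absolute difference is an assumed choice of norm. All function names are hypothetical.

```python
import numpy as np

def avg_pool3d(x, k=2):
    """Non-overlapping average pooling of a (D, H, W) volume."""
    d, h, w = (s // k for s in x.shape)
    x = x[: d * k, : h * k, : w * k]
    return x.reshape(d, k, h, k, w, k).mean(axis=(1, 3, 5))

def toy_multiscale_features(x, n_scales=3):
    """Stand-in feature extractor (NOT the frozen UNet MSS): pooled copies of x."""
    feats = [x]
    for _ in range(n_scales - 1):
        feats.append(avg_pool3d(feats[-1]))
    return feats

def perceptual_loss(pred, target, feature_fn=toy_multiscale_features):
    """Sum over scales of the mean absolute feature difference (assumed norm)."""
    return sum(
        np.mean(np.abs(fp - ft))
        for fp, ft in zip(feature_fn(pred), feature_fn(target))
    )

rng = np.random.default_rng(0)
target = rng.random((8, 8, 8))
pred = target + 0.05 * rng.standard_normal((8, 8, 8))
print(perceptual_loss(pred, target))
```

In practice, `feature_fn` would be replaced by the frozen network, whose parameters receive no gradient updates.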
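Evaluation of reconstruction quality utilizes the structural similarity index (SSIM):
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
where $\mu_x$, $\mu_y$ and $\sigma_x$, $\sigma_y$ are local means and standard deviations of the two images, $\sigma_{xy}$ is their local covariance, and $c_1$, $c_2$ are constants for stabilization.
An undersampling fraction $p$ yields a theoretical acceleration factor $R = 1/p$; 4% sampling equates to $R = 25$ (Chatterjee et al., 2022).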
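Both quantities are straightforward to compute. The sketch below evaluates SSIM over a single global window rather than the usual sliding local windows, and uses the conventional constants $c_1 = (0.01L)^2$ and $c_2 = (0.03L)^2$ for data range $L$. These simplifications and the function names are assumptions, not details confirmed by the source.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed from global statistics (single window; a simplification)."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def acceleration_factor(fraction):
    """Theoretical acceleration when only `fraction` of k-space is acquired."""
    return 1.0 / fraction

rng = np.random.default_rng(0)
image = rng.random((32, 32))
noisy = np.clip(image + 0.1 * rng.standard_normal(image.shape), 0, 1)
print(ssim_global(image, image), ssim_global(image, noisy))
for p in (0.10, 0.0625, 0.04):
    print(p, round(acceleration_factor(p), 2))
```

Reported SSIM values are normally averaged over local windows, so this global variant is only suitable for illustrating the formula.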
5. Training Protocol and Data
The training set uses the CHAOS abdominal T1 dataset (40 subjects), “animated” by random elastic deformations into synthetic 25-frame sequences to simulate breathing motion, partitioned into 70% train and 30% validation. Real test data consists of five volunteers, each with a breath-hold static scan and a free-breathing 25-frame dynamic MRI series. All data are retrospectively undersampled in-plane to retain 10%, 6.25%, or 4% of central k-space.
Volumes are trilinearly upsampled prior to input. Training uses batch size 1, the Adam optimizer ( learning rate), and 100 epochs. The perceptual loss module is frozen throughout (Chatterjee et al., 2022).
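Retrospective undersampling of this kind can be simulated by masking k-space around its centre. The sketch below is a sketch under stated assumptions: it works on a single 2D slice and keeps a square central region whose area matches the requested fraction. The mask geometry and the function name are hypothetical, since the exact sampling pattern is not specified here.

```python
import numpy as np

def keep_central_kspace(image, fraction):
    """Keep roughly `fraction` of k-space around the centre of a 2D slice.

    Returns the blurred magnitude image and the fraction actually retained.
    A square central mask is an assumption made for illustration.
    """
    ny, nx = image.shape
    kspace = np.fft.fftshift(np.fft.fft2(image))   # DC moved to the centre
    side = np.sqrt(fraction)                       # per-axis fraction
    hy = max(1, int(round(ny * side / 2)))
    hx = max(1, int(round(nx * side / 2)))
    cy, cx = ny // 2, nx // 2
    mask = np.zeros((ny, nx), dtype=bool)
    mask[cy - hy : cy + hy, cx - hx : cx + hx] = True
    low_res = np.fft.ifft2(np.fft.ifftshift(kspace * mask))
    return np.abs(low_res), float(mask.mean())

rng = np.random.default_rng(0)
slice_2d = rng.random((100, 100))
blurred, retained = keep_central_kspace(slice_2d, 0.04)
print(f"retained fraction: {retained:.4f}")
print(f"std before: {slice_2d.std():.3f}, after: {blurred.std():.3f}")
```

Discarding the outer k-space removes high spatial frequencies, which is why the undersampled image appears blurred while its overall intensity is preserved.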
6. Quantitative Performance and Comparative Analysis
Quantitative evaluation under the most aggressive undersampling (4% of centre k-space, acceleration factor 25) yields:
- DDoS-UNet: SSIM = , PSNR = dB, NRMSE =
- Baselines:
- Trilinear interpolation: SSIM
- Zero-padding: SSIM
- Single-input UNet (static): SSIM
- Single-input UNet (dynamic): SSIM
For less aggressive undersampling (6.25%, 10%), SSIM improves to approximately and , respectively. All improvements over the baselines are statistically significant (Mann–Whitney U, ). DDoS-UNet is computationally efficient, reconstructing volumes at approximately $0.36$ seconds per frame (Chatterjee et al., 2022).
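For reference, PSNR and NRMSE can be computed as in the sketch below. Normalization conventions differ between implementations. Here PSNR uses a fixed data range and NRMSE divides the RMSE by the intensity range of the reference. Both choices and the function names are assumptions, not necessarily those used by Chatterjee et al. (2022).

```python
import numpy as np

def psnr(reference, prediction, data_range=1.0):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    mse = np.mean((reference - prediction) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def nrmse(reference, prediction):
    """RMSE normalized by the reference intensity range (assumed convention)."""
    rmse = np.sqrt(np.mean((reference - prediction) ** 2))
    return rmse / (reference.max() - reference.min())

reference = np.linspace(0.0, 1.0, 101)
prediction = reference + 0.1          # constant error of 0.1
print(round(psnr(reference, prediction), 2))   # 20.0
print(round(nrmse(reference, prediction), 2))  # 0.1
```

Higher PSNR and lower NRMSE both indicate a smaller intensity error relative to the ground truth, whereas SSIM measures structural agreement.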
7. Architectural Innovations and Impact
Key architectural innovations include:
- Dual-channel learning: Forces the network to learn two intertwined representations: a spatial SR mapping and a temporal regularizer that models inter-frame coherence. This is formalized by , where and correspond to spatial and temporal mappings, respectively.
- Trilinear upsampling plus convolution: Eliminates checkerboard artefacts without incurring prohibitive computational burden.
- Whole-volume processing: Ensures anatomically consistent priors even under deformations (e.g., breathing), avoiding the misalignment problems inherent to patch-based techniques.
- Perceptual loss: Regularizes outputs toward structural fidelity and enhanced sharpness, as opposed to pixel-wise minimization.
DDoS-UNet demonstrates that combining a lightweight 3D UNet backbone with explicit high-resolution and temporal priors allows rapid, high-fidelity reconstruction of dynamic MRI, substantially relaxing the conventional spatio-temporal compromise (Chatterjee et al., 2022).