Conditional Diffusion Posterior Alignment for Sparse-View CT Reconstruction

Published 23 Apr 2026 in eess.IV, cs.CV, and cs.LG | (2604.21960v1)

Abstract: Computed Tomography (CT) is a widely used imaging modality in medical and industrial applications. To limit radiation exposure and measurement time, there is a growing interest in sparse-view CT, where the number of projection views is significantly reduced. Deep neural networks have shown great promise in improving reconstruction quality in sparse-view CT, especially generative diffusion models. However, these methods struggle to scale to large 3D volumes due to several reasons: (i) the high memory and computational requirements of 3D models, (ii) the lack of large 3D training datasets, and (iii) the inconsistencies across slices when using 2D models independently on each slice. We overcome these limitations and scale diffusion-based sparse-view CT reconstruction to large 3D volumes by combining conditional diffusion with explicit data consistency. We propose Conditional Diffusion Posterior Alignment (CDPA) to enable scalable 3D sparse-view CT reconstruction. A 2D U-Net diffusion model is conditioned on an initial 3D reconstruction to improve inter-slice consistency, combined with data-consistency alignment to match measured projections. Experiments on synthetic and real Cone Beam CT (CBCT) data show state-of-the-art performance, with ablations that confirm the synergistic effects of the proposed pipeline. Finally, we show that the same principles also strengthen fast denoising U-Nets, yielding near-diffusion quality at a fraction of the computational cost.


Authors (3)

Summary

The paper introduces Conditional Diffusion Posterior Alignment (CDPA), integrating conditional diffusion with data-consistency optimization to enhance sparse-view CT reconstruction.
The paper demonstrates state-of-the-art performance on synthetic and real datasets by robustly reducing artifacts and preserving anatomical details in high-resolution volumes.
The paper also presents an efficient FDK-denoising branch with slice-aware conditioning and uncertainty quantification, offering a practical solution for clinical CT imaging.

Conditional Diffusion Posterior Alignment for Sparse-View CT: A Detailed Examination

Motivation and Problem Context

Sparse-view CT reconstruction is central to reducing both X-ray dose and acquisition time in medical and industrial tomography. Reducing the number of projection views exacerbates the inherent ill-posedness of tomographic reconstruction, particularly in CBCT, where volumetric coupling and severe undersampling produce pronounced streak artifacts. Traditional analytical algorithms (FDK/FBP) and hand-crafted iterative solvers break down in these regimes, either producing structured artifacts or incurring heavy computational overhead while still yielding suboptimal reconstructions. Data-driven methods, particularly deep convolutional networks, have demonstrated artifact suppression capabilities, but scaling learned models to high-resolution 3D remains problematic due to memory, data, and architectural bottlenecks, especially for diffusion models.

Methodological Advances

This work introduces Conditional Diffusion Posterior Alignment (CDPA), which leverages the strengths of diffusion-based generative models while addressing their practical limitations for large 3D CBCT volumes. The framework integrates three synergistic ingredients:

Conditional Diffusion Modeling: A 2D U-Net diffusion model is trained to predict denoising scores conditioned on an initial FDK (or FBP) reconstruction, effectively learning $p(x_t \mid \mathrm{FDK}(y), t)$ and incorporating global volumetric priors via slice conditioning. The FDK reconstruction serves as a context prior that stabilizes conditional generation, promotes inter-slice volumetric consistency, and reduces the hallucinations typical of naive slice-wise diffusion.
Explicit Data-Consistency Alignment: Posterior samples are refined by optimizing the denoised estimate against a data-consistency loss $L_{\text{DC}}$ that penalizes the deviation between the measured projections and the forward projection of the estimate. This is realized via alternating updates in image space followed by resampling to diffusion space, a computationally tractable alternative to end-to-end backpropagation through the diffusion chain.
Slice and Position-Aware Conditioning: Recognizing the spatial variability in CBCT artifacts, slice indices are encoded (e.g., via cross-attention) to enable position-aware artifact removal, improving uniformity of reconstruction quality throughout the volume.
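
The alternation described in the second ingredient can be sketched on a toy linear problem. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a least-squares loss $L_{\text{DC}}(x) = \tfrac{1}{2}\|Ax - y\|_2^2$, a small dense matrix `A` standing in for the cone-beam projector, a noisy vector standing in for the U-Net's denoised estimate, and the standard variance-preserving forward marginal for the resampling step. The names `data_consistency_refine`, `renoise`, and `alpha_bar_t` are hypothetical.

```python
import numpy as np

def data_consistency_refine(x0_hat, A, y, n_steps=50, step_size=None):
    """Gradient descent on the assumed loss 0.5 * ||A x - y||^2, started at x0_hat."""
    if step_size is None:
        # 1 / L, where L = ||A||_2^2 is the Lipschitz constant of the gradient.
        step_size = 1.0 / np.linalg.norm(A, 2) ** 2
    x = x0_hat.copy()
    for _ in range(n_steps):
        x = x - step_size * (A.T @ (A @ x - y))  # gradient is A^T (A x - y)
    return x

def renoise(x0, alpha_bar_t, rng):
    """Map an image-space estimate back to diffusion space at noise level t.

    Assumes the variance-preserving marginal
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    """
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Toy setup: 32 measurements of a 64-pixel "image" (underdetermined, like sparse views).
rng = np.random.default_rng(0)
n_pix, n_meas = 64, 32
A = rng.standard_normal((n_meas, n_pix))    # stand-in for the projection operator
x_true = rng.standard_normal(n_pix)
y = A @ x_true                              # simulated measured projections
x0_hat = x_true + 0.5 * rng.standard_normal(n_pix)  # stand-in for the denoised estimate

x0_dc = data_consistency_refine(x0_hat, A, y)   # image-space alignment
x_t = renoise(x0_dc, alpha_bar_t=0.9, rng=rng)  # resample to diffusion space

res_before = np.linalg.norm(A @ x0_hat - y)
res_after = np.linalg.norm(A @ x0_dc - y)
print(f"projection residual: {res_before:.3f} -> {res_after:.3f}")
```

Because the refinement only touches the denoised estimate, no gradients flow through the sampling chain, which is the tractability argument made above. In a real pipeline the dense matrix would be replaced by a matrix-free projector and its adjoint.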

Besides CDPA, the study revisits FDK-denoising models (modern U-Nets with cross-attention and slice encoding), supplemented by an efficient inference-time data-consistency fine-tuning step that uses gradient descent to enforce measurement consistency in the outputs.

Experimental Evaluation

State-of-the-art performance is established on both synthetic and real CBCT datasets (dental, spine, and walnut), at resolutions up to $501^3$ voxels. Training and inference protocols reflect clinical conditions, using between 20 and 180 uniformly distributed projection views.
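
As a small illustration of uniform view subsampling, the sketch below picks evenly spaced projection indices from a full scan. It uses the 20-of-1200 setting quoted in the Figure 1 caption; the helper name `uniform_view_indices` and the assumption of a full 360-degree rotation with equally spaced acquisition angles are illustrative and not taken from the paper.

```python
import numpy as np

def uniform_view_indices(n_total, n_views):
    """Indices of n_views projections spaced evenly over n_total acquired views."""
    # floor(k * n_total / n_views) for k = 0 .. n_views - 1
    return (np.arange(n_views) * n_total) // n_views

idx = uniform_view_indices(1200, 20)
# Assumed geometry: views cover a full rotation at equal angular increments.
angles_deg = idx * 360.0 / 1200
print(idx[:4], angles_deg[:4])
```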

Key numerical findings include:

CDPA with posterior averaging ($\mu$(CDPA)) achieves the highest PSNR/SSIM across all datasets tested:
- On Walnut (20 views): $30.71 \pm 0.75$ dB (PSNR), $0.815 \pm 0.012$ (SSIM)
- On Dental (20 views): $34.76 \pm 0.67$ dB (PSNR), $0.919 \pm 0.010$ (SSIM)
Fine-tuned FDK-denoiser models with positional encoding nearly match CDPA performance at roughly $18\times$ lower computational cost.
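
For readers unfamiliar with the metric, the following is a minimal sketch of how a PSNR value in dB can be computed for a reconstructed volume. The choice of data range (here the dynamic range of the reference volume) is an assumption; the paper's exact normalization is not stated in this summary, so absolute values may differ from the ones reported above.

```python
import numpy as np

def psnr(reference, estimate, data_range=None):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if data_range is None:
        # Assumed convention: dynamic range of the reference volume.
        data_range = reference.max() - reference.min()
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

# Toy volume with values in [0, 1] and a constant offset of 0.1 as the "error".
ref = np.zeros((8, 8, 8))
ref[2:6, 2:6, 2:6] = 1.0
est = ref + 0.1
value_db = psnr(ref, est)  # MSE = 0.01 and data_range = 1, so 10 * log10(100) dB
```

The "$\pm$" figures in the list are presumably a spread over test volumes; that reading is also an assumption.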

Comparison with recent strong baselines (GAAL, S-STAR Net, DIF-Net) confirms that CDPA outperforms prior supervised and generative techniques (Table 1 in paper).

Qualitative results demonstrate artifact suppression, anatomical detail preservation, and enhanced 3D consistency, most notably in high-resolution walnut scans. Figure 1 shows coronal, axial, and sagittal slices of a test-set walnut reconstructed from 20 views, illustrating the improvement in fine detail.

Figure 1: Slices (coronal, axial, sagittal; top-to-bottom) of walnut reconstructions at $256^3$ resolution from the test dataset using 20 uniformly spaced views from the 1200 available (middle scan).

Additional figures illustrate similar performance on the synthetic dental and spine datasets, as well as progressive improvement at high resolution as the view count increases.

Ablation and Efficiency Analyses

A critical ablation (see Figure 2) dissects the roles of slice conditioning and fine-tuning, affirming that both are necessary to close the performance gap to diffusion models. The FDK-denoising branch is also substantially cheaper to run, which makes it well suited to real-time or resource-limited scenarios.

Figure 2: Left—Runtime comparison; conditional diffusion models require fewer data-consistency steps while FDK-denoising with fine-tuning is dramatically faster. Right—Ablation study establishes essential contributions from slice conditioning and data-consistency fine-tuning for state-of-the-art performance.

Uncertainty Quantification

The study presents an uncertainty quantification (UQ) analysis based on the variance across samples from the diffusion posterior. As shown in Figure 3, the standard deviation of posterior samples correlates strongly with absolute reconstruction error, supporting practical UQ and error flagging in clinical settings.

Figure 3: Correlation between sample STD and reconstruction error for the walnut dataset (20 projections); the strong positive correlation demonstrates the utility of diffusion sample variance as a spatially localized UQ metric.

The ROC curves for detection of high-error voxels (AUC up to 0.961 for the most severe errors) underline the discriminative strength of this UQ strategy.
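
The analysis above can be sketched in a few lines: average posterior samples to get a point estimate, take their per-voxel standard deviation as an uncertainty map, then measure how well that map predicts the error. The example below runs on synthetic data generated in the snippet itself, so the numbers it prints say nothing about the paper's results. The function names, the top-10% definition of "high-error" voxels, and the rank-based AUC estimator are all assumptions made for illustration.

```python
import numpy as np

def posterior_mean_and_std(samples):
    """Per-voxel mean and standard deviation over the leading (sample) axis."""
    samples = np.asarray(samples, dtype=np.float64)
    return samples.mean(axis=0), samples.std(axis=0, ddof=1)

def pearson_r(a, b):
    """Pearson correlation between two maps, flattened."""
    return float(np.corrcoef(np.ravel(a), np.ravel(b))[0, 1])

def roc_auc(scores, labels):
    """AUC from the Mann-Whitney U statistic.

    Ties are not rank-averaged, which is fine for continuous scores.
    """
    scores = np.ravel(scores)
    labels = np.ravel(labels).astype(bool)
    ranks = np.empty(scores.size)
    ranks[np.argsort(scores)] = np.arange(1, scores.size + 1)
    n_pos = int(labels.sum())
    n_neg = scores.size - n_pos
    return float((ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

# Synthetic stand-in for posterior samples: per-voxel noise level varies in space.
rng = np.random.default_rng(0)
shape, n_samples = (16, 16, 16), 8
x_true = rng.standard_normal(shape)
sigma = rng.uniform(0.01, 1.0, size=shape)
samples = x_true + sigma * rng.standard_normal((n_samples, *shape))

mean, std = posterior_mean_and_std(samples)   # mean plays the role of mu(CDPA)
abs_err = np.abs(mean - x_true)
r = pearson_r(std, abs_err)

# Flag the 10% of voxels with the largest error and score the flagging by AUC.
high_error = abs_err > np.quantile(abs_err, 0.9)
auc = roc_auc(std, high_error)
print(f"Pearson r = {r:.3f}, AUC = {auc:.3f}")
```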

Implications, Limitations, and Future Directions

The demonstrated scalability of CDPA—using conditional 2D slice-wise models with explicit data-consistency refinement—makes high-fidelity, high-resolution CBCT reconstruction tractable without extensive 3D networks or prohibitively large training datasets. This pragmatic approach advances practical deployment in clinical and industrial CT, making aggressive dose/time reduction plausible with minimal loss in fidelity.

However, limitations persist. While UQ is promising, diffusion-based posteriors do not constitute the true Bayesian posterior; formal guarantees on calibration and coverage are not provided, especially in the presence of model mismatch or severe undersampling. Advancements may arise from integrating principled uncertainty quantification (e.g., likelihood mixing, MCMC-based Bayesian inference, or equivariant bootstrapping). Additionally, scaling data-consistency refinement and exploring latent diffusion for even larger volumes remain active areas.

Conclusion

Conditional Diffusion Posterior Alignment represents a significant methodological improvement for sparse-view CBCT reconstruction, resolving inter-slice inconsistencies and enforcing explicit measurement consistency, while remaining scalable to high-resolution volumes. The evidence presented demonstrates superior numerical and visual performance relative to established methods, with practical uncertainty quantification and efficient FDK-denoiser-based alternatives. These developments set a new bar for data-driven CT reconstruction under severe measurement constraints and open multiple avenues for future research in robust, uncertainty-aware inverse imaging.
