Conditionally Whitened Diffusion Models
- CW-Diff is a diffusion framework that uses adaptive whitening based on conditional mean and covariance to address non-stationarity and structured noise.
- It adapts both the forward and reverse processes by transforming data into a whitened space, thereby improving sample quality and reducing posterior divergence.
- Applications in time series forecasting, imaging inverse problems, and wireless channel identification demonstrate significant performance gains.
Conditionally Whitened Diffusion Models (CW-Diff) generalize conventional score-based diffusion frameworks by integrating adaptive whitening operations informed by conditional statistics of the data or observations. This enables the diffusion processes to accommodate non-stationarity, heteroscedasticity, and structured noise, offering principled improvements for generative modeling and inverse problem inference in domains with complex dependencies.
1. Theoretical Foundations of Conditional Whitening for Diffusion
The primary goal of CW-Diff is to harmonize the forward noising process and the reverse denoising generative process with the local or conditional structure of the data distribution. Instead of always driving data to a standard Gaussian terminal distribution at the end of the forward process, CW-Diff modifies both the forward SDE/ODE and the reverse score-based sampler such that the noise manifold is mean-zero and isotropic only after a transformation determined by the data’s conditional mean and covariance.
Consider a multivariate setting where the true target distribution possesses conditional mean $\mu_c$ and covariance $\Sigma_c$. The conventional approach yields suboptimal sample quality when these parameters deviate significantly from the zero mean and identity covariance imposed by standard Gaussianization. CW-Diff remedies this by replacing the terminal distribution $\mathcal{N}(0, I)$ with $\mathcal{N}(\hat{\mu}_c, \hat{\Sigma}_c)$, where the estimators $\hat{\mu}_c$ and $\hat{\Sigma}_c$ are learned (e.g., through a Joint Mean–Covariance Estimator (JMCE) (Yang et al., 25 Sep 2025)). The posterior Kullback–Leibler divergence is provably reduced under a sufficient condition on estimation accuracy, stated in terms of the nuclear norm $\|\cdot\|_*$ and the Frobenius norm $\|\cdot\|_F$ of the mean and covariance estimation errors.
Conditional whitening is then formalized by the linear transformation
$$\tilde{x} = \hat{\Sigma}_c^{-1/2}\,(x - \hat{\mu}_c),$$
and the forward SDE is operated in the whitened space. The reverse process and score estimation are defined analogously, with the original scale and mean restored at final sample regeneration via $x = \hat{\Sigma}_c^{1/2}\,\tilde{x} + \hat{\mu}_c$.
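A minimal sketch of these two maps, assuming the conditional statistics are supplied as an estimated mean and a Cholesky factor of the estimated covariance; the names `whiten`, `unwhiten`, `mu_hat`, and `L_hat` are illustrative, not taken from the cited work:

```python
import torch

def whiten(x, mu_hat, L_hat):
    """Map x into the conditionally whitened space.

    mu_hat : estimated conditional mean, shape (d,)
    L_hat  : lower-triangular Cholesky factor of the estimated conditional
             covariance, shape (d, d), so Sigma_hat = L_hat @ L_hat.T
    """
    # Solve L_hat z = (x - mu_hat) rather than forming Sigma_hat^{-1/2} explicitly.
    return torch.linalg.solve_triangular(
        L_hat, (x - mu_hat).unsqueeze(-1), upper=False
    ).squeeze(-1)

def unwhiten(z, mu_hat, L_hat):
    """Map a whitened sample back to the original scale and mean."""
    return L_hat @ z + mu_hat

# Tiny usage example with synthetic conditional statistics.
torch.manual_seed(0)
d = 4
mu_hat = torch.randn(d)
A = torch.randn(d, d)
Sigma_hat = A @ A.T + 0.1 * torch.eye(d)      # well-conditioned SPD covariance
L_hat = torch.linalg.cholesky(Sigma_hat)

x = mu_hat + L_hat @ torch.randn(d)           # a sample with the estimated statistics
z = whiten(x, mu_hat, L_hat)                  # approximately N(0, I) under those statistics
x_rec = unwhiten(z, mu_hat, L_hat)
assert torch.allclose(x, x_rec, atol=1e-5)
```

Cholesky-based whitening is used here as one convenient square-root factorization; any factor $R$ with $RR^\top = \hat{\Sigma}_c$ yields an isotropic whitened distribution.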
2. Parameterization and Modularity via Flexible Forward SDEs
CW-Diff inherits from a broader family of flexible, spatially parameterized forward SDE frameworks (Du et al., 2022). The forward process can be written in the general form
$$\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + G(x_t, t)^{1/2}\,\mathrm{d}W_t,$$
where the Riemannian metric $G(x_t, t)$ adapts the local noising geometry, and anti-symmetric terms $Q(x_t, t)$ can be added to the drift to model mixing via Hamiltonian dynamics. Conditioning can be applied to $G$, $Q$, or the drift $f$, allowing the noising process to adapt both spatially and with respect to side information, such as class labels, observed measurements, or temporal context.
For conditionally whitened settings, the metric $G$ may depend on the conditioning information $c$, with its functional dependence learned via neural networks (e.g., auxiliary encoders). This approach generalizes both fixed-geometry diffusion (VP SDE, critically-damped Langevin, etc.) and structured noise diffusion models (WS-Diff (Alido et al., 15 May 2025)), and ensures theoretical guarantees concerning stationarity and ergodicity.
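A brief sketch of how such a conditioning-dependent metric might be parameterized, assuming a small PyTorch module that maps side information to a positive definite matrix through a learned lower-triangular factor; the class name `ConditionalMetric`, the layer sizes, and the Euler–Maruyama usage below are illustrative assumptions rather than the construction of (Du et al., 2022):

```python
import torch
import torch.nn as nn

class ConditionalMetric(nn.Module):
    """Maps side information c to a positive definite metric G(c) = L L^T + eps*I."""

    def __init__(self, cond_dim: int, data_dim: int, eps: float = 1e-4):
        super().__init__()
        self.data_dim = data_dim
        self.eps = eps
        # Predict the entries of a lower-triangular factor from the condition.
        self.net = nn.Sequential(
            nn.Linear(cond_dim, 128),
            nn.SiLU(),
            nn.Linear(128, data_dim * data_dim),
        )

    def forward(self, c: torch.Tensor) -> torch.Tensor:
        batch = c.shape[0]
        raw = self.net(c).view(batch, self.data_dim, self.data_dim)
        L = torch.tril(raw)                          # lower-triangular factor
        G = L @ L.transpose(-1, -2)                  # PSD by construction
        return G + self.eps * torch.eye(self.data_dim, device=c.device)

# Usage: shape the diffusion term of one Euler-Maruyama forward step (drift omitted).
metric = ConditionalMetric(cond_dim=16, data_dim=8)
c = torch.randn(32, 16)                              # side information
x = torch.randn(32, 8)
dt = 1e-2
G = metric(c)
noise = torch.randn(32, 8, 1)
x_next = x + torch.linalg.cholesky(G).matmul(noise).squeeze(-1) * dt**0.5
```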
3. Algorithmic Templates, Score Approximation, and Conditional Sampling
CW-Diff models benefit from unified algorithmic templates for training and sampling, such as those formalized by the random walk Tweedie-based framework (Park et al., 27 Nov 2024):
- Training (score-matching):
$$\min_\theta\;\mathbb{E}_{t,\,\tilde{x}_0,\,\epsilon}\Big[\lambda(t)\,\big\|D_\theta(\tilde{x}_t, t) - \tilde{x}_0\big\|_2^2\Big], \qquad \tilde{x}_t = \tilde{x}_0 + \sigma_t\,\epsilon,\;\; \epsilon \sim \mathcal{N}(0, I),$$
with $D_\theta$ approximating the Tweedie denoiser $\mathbb{E}[\tilde{x}_0 \mid \tilde{x}_t]$ for whitened data.
- Sampling (conditional update):
$$\tilde{x}_{t-1} = \tilde{x}_t + \gamma_t\,\nabla_{\tilde{x}_t}\log p_t(\tilde{x}_t \mid y) + \sqrt{2\gamma_t}\,z_t, \qquad z_t \sim \mathcal{N}(0, I),$$
where the score is efficiently approximated in the whitened space via Tweedie's formula, $\nabla_{\tilde{x}_t}\log p_t(\tilde{x}_t) \approx \big(D_\theta(\tilde{x}_t, t) - \tilde{x}_t\big)/\sigma_t^2$. Data consistency terms (e.g., the measurement gradient $\nabla_{\tilde{x}_t}\log p(y \mid \tilde{x}_t)$ for linear inverse problems) are incorporated directly, enabling exact conditional sampling when the measurement model is known.
Crucially, training and sampling schedules can be decoupled. Whitening often makes the noise manifold close to isotropic, and the flexibility in schedule selection (e.g., linear, sigmoid, or adaptive) can balance sample quality and convergence speed. Conditional structure is imposed by plugging in empirical or learned mean/covariance statistics, or by using joint score networks for both signal and measurement noise (Stevens et al., 2023).
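A compact sketch of this template under simplifying assumptions (a single noise level per step, a denoiser network `denoiser`, and a linear Gaussian measurement model with matrix `A`); the helper names and step sizes are hypothetical, not the exact procedure of (Park et al., 27 Nov 2024):

```python
import torch

def training_step(denoiser, x0_whitened, sigma_t):
    """Denoising score-matching loss on whitened data (single noise level for brevity)."""
    noise = torch.randn_like(x0_whitened)
    x_t = x0_whitened + sigma_t * noise
    x0_pred = denoiser(x_t, sigma_t)           # approximates the Tweedie denoiser E[x0 | x_t]
    return ((x0_pred - x0_whitened) ** 2).mean()

def conditional_sampling_step(denoiser, x_t, sigma_t, y, A, gamma, meas_std):
    """One Langevin-style update in the whitened space with a data-consistency gradient."""
    # Prior score via Tweedie: grad log p(x_t) ~= (D(x_t) - x_t) / sigma_t^2
    prior_score = (denoiser(x_t, sigma_t) - x_t) / sigma_t**2
    # Likelihood score for a linear Gaussian measurement y = A x + n, n ~ N(0, meas_std^2 I)
    residual = y - x_t @ A.T
    likelihood_score = (residual @ A) / meas_std**2
    score = prior_score + likelihood_score
    return x_t + gamma * score + (2 * gamma) ** 0.5 * torch.randn_like(x_t)
```

In this sketch the noise level `sigma_t` and step size `gamma` would be drawn from whichever (possibly decoupled) training and sampling schedules are chosen, as discussed above.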
4. Empirical Performance, Applications, and Benchmark Results
CW-Diff and related methods have demonstrated superior performance on both generative modeling and inverse problem tasks:
- Time series forecasting: Empirical evaluation on multiple benchmarks (ETTh1, Weather, Solar-Energy, etc.) using CRPS, QICE, ProbCorr, and Conditional FID metrics shows that CW-Diff consistently reduces predictive error, enhances variable correlation modeling, and mitigates distribution shift (Yang et al., 25 Sep 2025).
- Imaging inverse problems: WS-Diff and joint noise–signal diffusion methods outperform conventional baselines in denoising, deblurring, and recovery under structured, non-isotropic noise (Alido et al., 15 May 2025, Stevens et al., 2023). Notably, tailored spectral inductive biases and preconditioned score estimation yield robust priors and improved PSNR and SSIM.
- Wireless channel identification: Conditional diffusion, augmented with transformer architectures conditioned on scenario and noise level, achieves identification accuracy improvements exceeding 10% over conventional classifiers (Li et al., 14 Jun 2025).
- Hybrid frequency and multi-scale generation: Extensions utilizing hybrid frequency representations (e.g., wavelet-Fourier) show that conditional diffusion in spectral domains enhances both global coherence and fine detail synthesis in high-fidelity image generation (Kiruluta et al., 4 Apr 2025).
5. Optimization Strategies and Loss Functions
Optimization is typically performed jointly for the conditional mean, covariance (JMCE), and score networks. The aggregate loss is designed to encourage accurate prediction, statistical consistency, and numerical stability (regularizing the minimum eigenvalue of $\hat{\Sigma}_c$). For conditional whitening, the core loss integrates:
- Mean reconstruction error (squared $\ell_2$ error of the predicted conditional mean $\hat{\mu}_c$)
- Covariance discrepancy (nuclear and Frobenius norms)
- Eigenvalue penalty (keeping $\lambda_{\min}(\hat{\Sigma}_c)$ bounded away from zero)
- Forward consistency loss (especially in WS-Diff)
This composite objective drives the estimators toward satisfying the sufficient condition that guarantees reduced posterior KL divergence. Hybrid training strategies (e.g., two-stage "Mix" training (Du et al., 2022)) further optimize the balance between forward SDE learning and reverse score matching.
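An illustrative sketch of how such a composite objective could be assembled, with the weights `w_mean`, `w_cov`, `w_eig` and the eigenvalue floor as placeholder hyperparameters; the forward consistency term is omitted for brevity:

```python
import torch

def jmce_loss(mu_hat, sigma_hat, mu_target, sigma_target,
              w_mean=1.0, w_cov=1.0, w_eig=0.1, eig_floor=1e-3):
    """Composite loss over predicted conditional statistics (batched d x d covariances)."""
    # Mean reconstruction error.
    mean_err = ((mu_hat - mu_target) ** 2).sum(dim=-1).mean()

    # Covariance discrepancy: nuclear norm plus Frobenius norm of the error.
    cov_diff = sigma_hat - sigma_target
    nuclear = torch.linalg.matrix_norm(cov_diff, ord="nuc").mean()
    frob = torch.linalg.matrix_norm(cov_diff, ord="fro").mean()

    # Eigenvalue penalty: discourage the smallest eigenvalue from collapsing toward zero.
    eigvals = torch.linalg.eigvalsh(sigma_hat)       # ascending order for symmetric matrices
    eig_pen = torch.relu(eig_floor - eigvals[..., 0]).mean()

    return w_mean * mean_err + w_cov * (nuclear + frob) + w_eig * eig_pen
```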
6. Future Directions and Extensions
Current research in CW-Diff highlights several promising areas for further investigation:
- Efficient inference: Combining momentum samplers, predictor–corrector schemes, or integrating continuous normalizing flows for faster reverse-time sampling.
- Non-linear measurement models: Extending conditional whitening and structured score estimation to non-linear forward operators and domains beyond linear inverse problems.
- Domain transfer: Applying CW-Diff to new modalities (e.g., video, remote sensing, medical imaging) by adapting conditional whitening to spatial-temporal and multi-channel correlations.
- Adaptive and multi-scale frequency conditioning: Expanding the hybrid frequency paradigm with alternative transforms or dynamic masking in spectral or spatial domains.
- Enhanced prior incorporation: Learning richer conditional statistics (mean, covariance, higher moments) via transformer-based JMCE modules, adaptive over sliding windows, or in complex autoregressive architectures (Yang et al., 25 Sep 2025).
CW-Diff thus constitutes a flexible, theoretically-grounded framework for generative modeling and inverse problem inference in environments characterized by structured, data-dependent noise and conditional dependencies. Its ability to integrate conditional whitening into both the forward and reverse diffusion processes substantially improves generative fidelity, posterior accuracy, and robustness to distribution shifts.