CSDI: Conditional Score-based Diffusion Imputation

Updated 2 August 2025
  • Conditional Score-based Diffusion Imputation (CSDI) is a framework that uses conditional score-based diffusion models and Transformers to impute missing values in multivariate time series.
  • CSDI demonstrates substantial accuracy improvements and superior uncertainty quantification compared to autoregressive and unconditional generative models in domains like healthcare and environmental data.
  • By leveraging two-dimensional attention mechanisms, CSDI effectively captures both temporal dependencies and cross-feature correlations for robust probabilistic imputation.

Conditional Score-based Diffusion Imputation (CSDI) is a framework for probabilistic imputation of missing values in multivariate time series using conditional score-based diffusion models. Unlike autoregressive or unconditional generative models, CSDI explicitly learns the conditional distribution of missing (unobserved) entries given the observed data, harnessing a self-supervised, Transformer-based architecture and the denoising diffusion probabilistic modeling paradigm. The approach has demonstrated substantial improvements in imputation accuracy and uncertainty quantification across healthcare and environmental datasets, and serves as a basis for further research and adaptation to other domains.

1. Foundations and Motivation

CSDI was developed to address the limitations of traditional autoregressive models and unconditional diffusion models for time series imputation (Tashiro et al., 2021). In autoregressive approaches, missing values are sequentially predicted, which can cause error accumulation and limited exploitation of global dependencies. Score-based diffusion models, which gradually denoise random noise into samples from a target distribution, had shown strong generative capabilities in image and audio domains. CSDI adapts this principle to imputation by explicitly conditioning the reverse diffusion process on all available observed data, thus capturing temporal and cross-feature correlations within highly incomplete time series.

Key properties:

  • Learns the full conditional distribution $p(\mathbf{x}_0^u \mid \mathbf{x}_0^o)$ of the unobserved entries (imputation targets) $\mathbf{x}_0^u$ conditioned on the observed values $\mathbf{x}_0^o$.
  • Avoids error propagation inherent in sequential/autoregressive methods.
  • Enables probabilistic imputation with uncertainty quantification, important for non-random missingness.

2. Model Architecture and Training Procedure

CSDI extends the Denoising Diffusion Probabilistic Model (DDPM) framework to the conditional domain:

  • Forward (Noising) Process:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\left(\mathbf{x}_t;\ \sqrt{\alpha_t}\,\mathbf{x}_0,\ (1 - \alpha_t) I\right)$$

  • Conditional Reverse (Denoising) Process:

$$p_\theta\left(\mathbf{x}_{t-1}^u \mid \mathbf{x}_t^u, \mathbf{x}_0^o\right)$$

with denoising function $\epsilon_\theta(\mathbf{x}_t^u, t \mid \mathbf{x}_0^o)$, leading to

$$\mu_\theta\left(\mathbf{x}_t^u, t \mid \mathbf{x}_0^o\right) = \mu^{\text{DDPM}}\left(\mathbf{x}_t^u, t, \epsilon_\theta(\mathbf{x}_t^u, t \mid \mathbf{x}_0^o)\right)$$

$$\sigma_\theta\left(\mathbf{x}_t^u, t \mid \mathbf{x}_0^o\right) = \sigma^{\text{DDPM}}(\mathbf{x}_t^u, t)$$

  • Training Objective:

A subset of observed entries is randomly masked to serve as imputation targets $\mathbf{x}_0^u$; noise is added, and the model learns to predict that noise given the noised targets and the observed values:

$$\min_\theta\ \mathbb{E}_{\mathbf{x}_0, t, \epsilon}\left[\left\| \epsilon - \epsilon_\theta(\mathbf{x}_t^u, t \mid \mathbf{x}_0^o) \right\|_2^2 \right]$$

This self-supervised approach is analogous to masked language modeling strategies.
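A minimal PyTorch sketch of this training step follows, assuming a batch `x` of shape (B, K, L) with an observation mask `obs_mask`; `denoiser` (the conditional noise-prediction network) and `alpha_bar` (the cumulative noise schedule) are illustrative placeholders, not names from the reference implementation.

```python
import torch

def csdi_training_step(x, obs_mask, denoiser, alpha_bar, num_steps=50):
    """One self-supervised step: mask part of the observed data, noise the
    masked targets, and regress the injected noise.

    x         : (B, K, L) batch of multivariate time series
    obs_mask  : (B, K, L) 1 where a value is actually observed, 0 otherwise
    denoiser  : callable predicting the injected noise from the model input,
                the conditioning mask, and the diffusion step
    alpha_bar : (num_steps,) cumulative product of the noise schedule
    """
    # Randomly split observed entries into conditioning values and imputation targets.
    target_mask = obs_mask * (torch.rand_like(x) < 0.5).float()
    cond_mask = obs_mask - target_mask

    # Sample a diffusion step and noise the targets: x_t = sqrt(a) x_0 + sqrt(1 - a) eps.
    t = torch.randint(0, num_steps, (x.shape[0],))
    a = alpha_bar[t].view(-1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps

    # The network sees clean conditioning values and noisy target values.
    model_input = cond_mask * x + target_mask * x_t
    eps_hat = denoiser(model_input, cond_mask, t)

    # l2 loss on the noise, restricted to the imputation targets.
    return ((eps - eps_hat) ** 2 * target_mask).sum() / target_mask.sum()
```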

  • Conditioning and Correlation Exploitation:
    • A temporal Transformer for capturing dependencies along each feature’s sequence.
    • A feature Transformer for dependencies among features at each time step.

These two-dimensional attention mechanisms enable learning of both temporal and cross-feature structure in the observed data.
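The sketch below illustrates this two-dimensional attention pattern on a hidden state of shape (B, K, L, C), with K features, L time steps, and C channels; the layer choices and sizes are assumptions for illustration, not the reference architecture.

```python
import torch
import torch.nn as nn

class TwoDimensionalAttention(nn.Module):
    """Self-attention along time for each feature, then along features at
    each time step, mirroring the temporal/feature Transformer pair."""

    def __init__(self, channels=64, heads=8):
        super().__init__()
        self.temporal = nn.TransformerEncoderLayer(d_model=channels, nhead=heads, batch_first=True)
        self.feature = nn.TransformerEncoderLayer(d_model=channels, nhead=heads, batch_first=True)

    def forward(self, h):
        B, K, L, C = h.shape
        # Temporal attention: one length-L sequence per (batch, feature) pair.
        h = self.temporal(h.reshape(B * K, L, C)).reshape(B, K, L, C)
        # Feature attention: one length-K sequence per (batch, time step) pair.
        h = h.permute(0, 2, 1, 3).reshape(B * L, K, C)
        h = self.feature(h).reshape(B, L, K, C).permute(0, 2, 1, 3)
        return h

# Example: 16 series, 35 features, 48 time steps, 64 channels.
h = torch.randn(16, 35, 48, 64)
print(TwoDimensionalAttention()(h).shape)  # torch.Size([16, 35, 48, 64])
```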

3. Empirical Performance and Comparative Evaluation

CSDI was benchmarked on high-missingness healthcare time series (PhysioNet Challenge 2012; 4000 patients, 35 variables, 48 time steps, ~80% missing) and environmental air quality data (PM2.5 from 36 Beijing stations, ~13% missing). Key metrics:

  • Probabilistic Imputation: Continuous Ranked Probability Score (CRPS).
  • Deterministic Imputation: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE).

CSDI achieved:

  • 40–65% CRPS reduction over probabilistic baselines (e.g., GP-VAE, multitask GP, V-RIN).
  • 5–20% MAE improvement over deterministic baselines (e.g., BRITS, GLIMA).
  • Superior uncertainty quantification compared to unconditional diffusion approaches, owing to explicit conditional training.

The architecture demonstrated competitive or superior results in both interpolation (irregularly sampled time series) and probabilistic forecasting, e.g., performing comparably to TimeGrad and Transformer MAF on electricity and traffic datasets.
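CRPS can be estimated directly from an ensemble of imputed samples; the sketch below uses the generic sample-based (energy-form) estimator rather than the evaluation code of any particular benchmark.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one scalar target y:
    CRPS(F, y) ~= mean|X - y| - 0.5 * mean|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

# Example: 100 imputed draws for a single missing entry with true value 0.0.
draws = np.random.normal(loc=0.1, scale=1.0, size=100)
print(crps_from_samples(draws, y=0.0))
```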

4. Broader Applications and Extensions

CSDI’s framework naturally generalizes:

  • Interpolation: Arbitrary missing values (including non-aligned time points) can be imputed by leveraging the conditional denoising process.
  • Probabilistic Forecasting: Generating full predictive distributions for future points, offering joint modeling of uncertainty.
  • Portability: While the paper focuses on time series, the conditional score-based diffusion mechanism is adaptable to other modalities (e.g., tabular data with feature tokenization or handling categorical variables via discrete diffusion (Zheng et al., 2022)).

Methodological innovations such as masking-based self-supervised training and two-dimensional attention further facilitate application to settings with complex feature relationships and high-dimensionality.
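As a concrete illustration of the interpolation and forecasting flexibility listed above, the same conditional mechanism covers these settings by changing which entries act as conditioning values versus generation targets; the mask construction below is a sketch with illustrative names and shapes.

```python
import torch

def forecasting_masks(obs_mask, horizon):
    """Recast imputation as forecasting: observations before `horizon`
    condition the model, everything from `horizon` onward is a target.

    obs_mask : (B, K, L) 1 where a value was observed
    horizon  : index of the first time step to forecast
    """
    cond_mask = obs_mask.clone()
    cond_mask[..., horizon:] = 0          # hide the future from the model
    target_mask = 1 - cond_mask           # generate everything not conditioned on
    return cond_mask, target_mask

obs = torch.ones(1, 35, 48)
cond, target = forecasting_masks(obs, horizon=36)  # forecast the final 12 steps
```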

5. Advantages, Limitations, and Theoretical Considerations

Advantages:

  • Direct modeling of conditional distributions eliminates severe error propagation and enables tight uncertainty estimates.
  • Flexible, uncertainty-aware output supports risk-sensitive downstream analytics.
  • Two-dimensional Transformer attention mechanisms effectively capture rich temporal and cross-feature dependencies.

Limitations:

  • Computational cost is typically higher than that of purely deterministic methods, owing to the iterative denoising steps required at sampling time.
  • Sampling efficiency could be further improved (e.g., using DDIM-based solvers).
  • While empirically robust, application to domains with extremely sparse observations may demand additional architectural or training innovations.

Theoretical Aspects:

  • The conditional denoising score matching loss is grounded in the denoising score matching framework, providing theoretical guarantees for learning the true conditional score under suitable modeling assumptions.
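Concretely, because $q(\mathbf{x}_t \mid \mathbf{x}_0)$ is Gaussian, minimizing the noise-prediction loss is (up to a per-step weighting) denoising score matching: with $\mathbf{x}_t^u = \sqrt{\alpha_t}\,\mathbf{x}_0^u + \sqrt{1-\alpha_t}\,\epsilon$,

$$\nabla_{\mathbf{x}_t^u} \log q(\mathbf{x}_t^u \mid \mathbf{x}_0^u) = -\frac{\epsilon}{\sqrt{1-\alpha_t}}, \qquad s_\theta(\mathbf{x}_t^u, t \mid \mathbf{x}_0^o) := -\frac{\epsilon_\theta(\mathbf{x}_t^u, t \mid \mathbf{x}_0^o)}{\sqrt{1-\alpha_t}},$$

so an accurate noise predictor yields an accurate estimate of the conditional score.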

6. Future Directions

CSDI’s modular framework has prompted further research in several directions:

  • Accelerating the Reverse Process: Integrating ODE solvers or advanced step-skipping strategies to reduce sampling time (Tashiro et al., 2021).
  • Downstream Integration: Employing improved imputations to boost predictive performance in classification or regression.
  • Domain Expansion: Applying conditional score-based diffusion to inverse problems (e.g., infinite-dimensional Bayesian inverse problems (Baldassari et al., 2023), mechanics (Dasgupta et al., 2024)), tabular data (Zheng et al., 2022), and multi-modal data.
  • Improved Training Regimes: Alternative self-supervised maskings, hybridization with other generative modeling paradigms, and end-to-end co-training with deterministic base predictors (analogous to recent “two-stage” or residual diffusion methods).

7. Summary Table: CSDI Model Workflow

| Stage | Purpose | Mechanism |
| --- | --- | --- |
| Conditioning | Embed observed data | Zero padding + binary mask |
| Attention encoding | Capture time and feature dependencies | Temporal + feature Transformers |
| Forward diffusion | Add noise to missing/pseudo-missing entries | Gaussian noise schedule |
| Reverse diffusion | Denoise missing entries | Conditional score-based denoising |
| Training loss | Score matching for denoising | $\ell_2$ loss on predicted noise |

This workflow enables direct learning of $p(\mathbf{x}_0^u \mid \mathbf{x}_0^o)$, providing accurate and uncertainty-aware imputation for structured missingness in high-dimensional time series.
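A minimal sampling-loop sketch follows, reusing the mask conventions from the training sketch above; `denoiser`, `alpha`, `alpha_bar`, and `beta` are illustrative placeholders (per-step and cumulative schedule tensors), not names from the reference code.

```python
import torch

@torch.no_grad()
def csdi_impute(x_obs, cond_mask, denoiser, alpha, alpha_bar, beta):
    """Sample missing entries by running the conditional reverse diffusion.

    x_obs     : (B, K, L) observed values (zeros where missing)
    cond_mask : (B, K, L) 1 on observed/conditioning entries
    alpha     : (T,) per-step 1 - beta_t
    alpha_bar : (T,) cumulative product of alpha
    beta      : (T,) noise schedule
    """
    T = len(beta)
    x = torch.randn_like(x_obs)  # targets start from pure noise
    for t in reversed(range(T)):
        # Conditioning entries always stay at their clean observed values.
        model_input = cond_mask * x_obs + (1 - cond_mask) * x
        step = torch.full((x.shape[0],), t, dtype=torch.long)
        eps_hat = denoiser(model_input, cond_mask, step)
        # Standard DDPM posterior mean computed from the predicted noise.
        mean = (x - beta[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alpha[t].sqrt()
        if t > 0:
            sigma = ((1 - alpha_bar[t - 1]) / (1 - alpha_bar[t]) * beta[t]).sqrt()
            x = mean + sigma * torch.randn_like(x)
        else:
            x = mean
    # Keep observed values; return sampled values on the missing entries.
    return cond_mask * x_obs + (1 - cond_mask) * x
```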