CSDI: Conditional Diffusion Imputation

Updated 4 July 2026

CSDI is a probabilistic imputation method that models the conditional distribution of missing entries using iterative reverse diffusion while keeping observed data fixed.
It employs a dual Transformer attention mechanism to capture temporal and feature dependencies, enhancing the accuracy of reconstructions from noisy initializations.
Empirical results demonstrate significant improvements in CRPS and MAE over traditional baselines on healthcare and air quality datasets using both probabilistic and deterministic outputs.

Conditional Score-Based Diffusion Imputation (CSDI) is a probabilistic imputation method for multivariate time series with missing values that directly models the conditional distribution of missing entries given observed entries, $q(\mathbf{x}_0^{ta} \mid \mathbf{x}_0^{co})$, where $\mathbf{x}_0^{ta}$ denotes the imputation targets and $\mathbf{x}_0^{co}$ the conditional observations. Introduced in 2021 for time series imputation, and also applied to interpolation and probabilistic forecasting, it treats missing-value recovery as conditional reverse diffusion: the target region is initialized with noise and iteratively denoised while the observed values remain fixed as conditioning information throughout the generation process (Tashiro et al., 2021).

1. Problem setting and conceptual position

CSDI is formulated for multivariate time series

$\mathbf{X} \in \mathbb{R}^{K \times L},$

where $K$ is the number of features and $L$ is the number of time points. Missingness is represented by a mask

$\mathbf{M}^{time} \in \{0,1\}^{K \times L},$

with $m_{k,l}=1$ if a value is observed and $m_{k,l}=0$ if it is missing. A sample is written as $\{\mathbf{X}, \mathbf{M}^{time}, \mathbf{s}\}$, where $\mathbf{s}$ denotes timestamps. Within this setup, CSDI targets three closely related tasks: standard imputation of arbitrary missing values, interpolation at irregularly sampled times, and forecasting of future time steps (Tashiro et al., 2021).
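The notation above maps onto a small data structure. The following is only an illustrative sketch in plain Python: the class name `Sample`, its field names, and the `missing_rate` helper are hypothetical and do not come from the original implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One sample {X, M, s}; class and field names are illustrative only."""
    values: List[List[float]]  # X, shape K x L (features x time points)
    mask: List[List[int]]      # M, 1 = observed, 0 = missing
    timestamps: List[float]    # s, one timestamp per time point

    def missing_rate(self) -> float:
        """Fraction of entries whose mask value is 0."""
        total = sum(len(row) for row in self.mask)
        observed = sum(sum(row) for row in self.mask)
        return 1.0 - observed / total


sample = Sample(
    values=[[0.5, 0.0, 0.7], [1.2, 1.1, 0.0]],  # missing entries hold a placeholder 0.0
    mask=[[1, 0, 1], [1, 1, 0]],
    timestamps=[0.0, 1.0, 2.5],                 # irregular spacing is allowed
)
```

Here $K=2$ and $L=3$, and two of the six entries are marked missing.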

The method is positioned against several established imputation families. Autoregressive methods process time sequentially and often rely on RNN-like hidden-state dynamics; deterministic methods return a single imputed value; probabilistic latent-variable and Gaussian-process methods model uncertainty but may not exploit temporal-feature interactions as effectively; and naive unconditional diffusion approaches corrupt the observations themselves, weakening conditioning. CSDI is designed to avoid these limitations by explicitly learning the conditional reverse process for the missing region while leaving the observed region uncorrupted and available as context.

A recurrent misconception is that CSDI is simply a diffusion generator with masking added at input time. The original formulation is stricter: it learns the reverse dynamics of the missing part conditioned on the observed part, rather than approximating conditional imputation through an unconditional generative model. This distinction is central to its empirical behavior and to its later reuse as a canonical conditional diffusion backbone.

2. Conditional diffusion formulation

CSDI builds on DDPM-style diffusion. The forward process is

$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \quad q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right),$

with equivalent closed form

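$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\sqrt{\alpha_t}\,\mathbf{x}_0, (1-\alpha_t)\mathbf{I}\right), \quad \alpha_t := \prod_{i=1}^{t} (1-\beta_i).$

Hence,

$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$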
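The closed form of the forward process can be sketched in a few lines. The sketch assumes the standard DDPM identity $\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}$ with $\alpha_t = \prod_{i \le t}(1-\beta_i)$; the function names and the toy schedule are hypothetical.

```python
import math
import random


def cumulative_alphas(betas):
    """Return alpha_t = prod_{i<=t} (1 - beta_i) for t = 1..T."""
    alphas, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas


def q_sample(x0, alpha_t, rng):
    """Draw x_t ~ N(sqrt(alpha_t) * x0, (1 - alpha_t) I) for a flat list x0."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_t) * v + math.sqrt(1.0 - alpha_t) * e
           for v, e in zip(x0, eps)]
    return x_t, eps


betas = [0.1, 0.2, 0.3]            # toy schedule for illustration only
alphas = cumulative_alphas(betas)  # approximately [0.9, 0.72, 0.504]
x_t, eps = q_sample([1.0, -2.0], alphas[-1], random.Random(0))
```

Because the closed form jumps directly from $\mathbf{x}_0$ to any step $t$, training never needs to simulate the chain step by step.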

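The unconditional reverse chain is

$p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \quad \mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$

with

$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\!\left(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_\theta^2(\mathbf{x}_t, t)\,\mathbf{I}\right).$

CSDI replaces this with a conditional reverse process defined only on the imputation target $\mathbf{x}^{ta}$, given the conditional observations $\mathbf{x}_0^{co}$:

$p_\theta(\mathbf{x}_{0:T}^{ta} \mid \mathbf{x}_0^{co}) := p(\mathbf{x}_T^{ta}) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}^{ta} \mid \mathbf{x}_t^{ta}, \mathbf{x}_0^{co}), \quad \mathbf{x}_T^{ta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$

where

$p_\theta(\mathbf{x}_{t-1}^{ta} \mid \mathbf{x}_t^{ta}, \mathbf{x}_0^{co}) := \mathcal{N}\!\left(\boldsymbol{\mu}_\theta(\mathbf{x}_t^{ta}, t \mid \mathbf{x}_0^{co}), \sigma_\theta^2(\mathbf{x}_t^{ta}, t \mid \mathbf{x}_0^{co})\,\mathbf{I}\right).$

The denoiser is therefore a conditional noise predictor

$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t^{ta}, t \mid \mathbf{x}_0^{co}),$

and training uses the standard DDPM noise-prediction objective, restricted to the imputation target:

$\min_\theta \mathcal{L}(\theta) := \min_\theta \mathbb{E}_{\mathbf{x}_0 \sim q(\mathbf{x}_0),\, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\, t} \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^{ta}, t \mid \mathbf{x}_0^{co}) \right\|_2^2.$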
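The restriction of the noise-prediction objective to the imputation target can be illustrated as a masked squared error. This is a minimal sketch: the function name is hypothetical, and averaging over the number of target entries (rather than summing) is an assumption made here for readability.

```python
def masked_noise_loss(eps, eps_pred, target_mask):
    """Squared error between true and predicted noise, averaged over target entries.

    eps, eps_pred, target_mask are K x L nested lists; target_mask is 1 where
    an entry belongs to the imputation target and 0 elsewhere.
    """
    sq_err, count = 0.0, 0
    for e_row, p_row, m_row in zip(eps, eps_pred, target_mask):
        for e, p, m in zip(e_row, p_row, m_row):
            if m == 1:
                sq_err += (e - p) ** 2
                count += 1
    # Guard against an empty target set to avoid division by zero.
    return sq_err / max(count, 1)


loss = masked_noise_loss(
    eps=[[1.0, 2.0], [3.0, 4.0]],
    eps_pred=[[0.0, 2.0], [3.0, 0.0]],
    target_mask=[[1, 0], [0, 1]],
)
```

Only the two masked-in entries contribute, so prediction errors at conditional positions never influence the gradient.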

The conditioning mechanism is operationally simple but methodologically important. Missing values are initialized as noise, whereas observed values are supplied as conditioning input. The network sees the noisy target region $\mathbf{x}_t^{ta}$, the observed context $\mathbf{x}_0^{co}$, and a mask $\mathbf{m}^{co}$ indicating which positions are conditional observations. Observed values are not noised during sampling. This is what makes the method an imputation-specific conditional diffusion model rather than a generic unconditional diffusion model adapted post hoc (Tashiro et al., 2021).
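One way to picture the conditioning input is as two aligned channels, one holding the clean conditional observations and one holding the noisy target, with the mask supplied alongside. The layout below is an assumption for illustration, and the function name is hypothetical.

```python
def build_denoiser_input(observed, noisy, cond_mask):
    """Return [conditional channel, noisy-target channel] as K x L nested lists.

    observed:  clean values (only meaningful where cond_mask == 1)
    noisy:     diffused values at step t (only meaningful where cond_mask == 0)
    cond_mask: 1 for conditional observations, 0 for target positions
    """
    cond_channel = [[m * v for v, m in zip(v_row, m_row)]
                    for v_row, m_row in zip(observed, cond_mask)]
    target_channel = [[(1 - m) * v for v, m in zip(n_row, m_row)]
                      for n_row, m_row in zip(noisy, cond_mask)]
    return [cond_channel, target_channel]


channels = build_denoiser_input(
    observed=[[1.0, 2.0]],
    noisy=[[9.0, 8.0]],
    cond_mask=[[1, 0]],
)
```

The conditional channel never contains noise, which is the practical meaning of keeping the observed region uncorrupted.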

3. Architecture, masking strategy, and implementation

A central architectural choice in CSDI is a 2D attention mechanism that models dependence along both axes of a multivariate time series. The denoiser uses a temporal Transformer layer that attends across time for each feature and a feature Transformer layer that attends across features for each time point. The rationale is explicit: imputation quality improves when a missing value can be inferred from nearby time points, correlated variables, and cross-feature patterns. Ablation results reported in the original study show that removing either temporal or feature attention hurts performance (Tashiro et al., 2021).
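The two-axis attention pattern can be sketched without any deep-learning library. The sketch below is deliberately simplified and is not the original architecture: it uses single-head scaled dot-product attention with identity query, key, and value projections, and all function names are hypothetical.

```python
import math


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Identity Q/K/V projections are used, which is a simplification.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([sum(w / norm * v[c] for w, v in zip(weights, tokens))
                    for c in range(d)])
    return out


def two_axis_attention(h):
    """h[k][l] is a channel vector for feature k at time l; shape is preserved."""
    K, L = len(h), len(h[0])
    # Temporal layer: for each feature k, attend across the L time points.
    h = [self_attention(h[k]) for k in range(K)]
    # Feature layer: for each time point l, attend across the K features.
    cols = [self_attention([h[k][l] for k in range(K)]) for l in range(L)]
    return [[cols[l][k] for l in range(L)] for k in range(K)]


hidden = [[[float(k + l + c) for c in range(4)] for l in range(3)] for k in range(2)]
mixed = two_axis_attention(hidden)  # still 2 features x 3 time points x 4 channels
```

Applying the layers in sequence lets every entry receive information from every other entry in two hops, at lower cost than attending over all $K \times L$ positions jointly.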

The base architecture is inspired by DiffWave. The reported implementation uses $T = 50$ diffusion steps, residual channels of 64, 4 residual layers, 8 attention heads, about 415k parameters, a sinusoidal time embedding of dimension 128, and a feature embedding of dimension 16. Variable-length sequences are handled by zero padding, and linear attention is used for forecasting settings with many features or long sequences.

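The noise schedule is quadratic: $\beta_t = \left(\frac{T-t}{T-1}\sqrt{\beta_1} + \frac{t-1}{T-1}\sqrt{\beta_T}\right)^2$ with $\beta_1 = 0.0001$ and $\beta_T = 0.5$. Training uses Adam, with batch size 16 for imputation, batch size 8 for forecasting, 200 epochs for imputation, and a learning rate that decays from $10^{-3}$ to $10^{-4}$ and $10^{-5}$.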
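The schedule and the stepwise learning-rate decay are easy to express directly. In the sketch below the defaults ($T=50$, $\beta_1=10^{-4}$, $\beta_T=0.5$, base rate $10^{-3}$) follow the setup as understood from the original paper, and the decay points at 75% and 90% of training are an assumption that should be checked against that paper. Function names are hypothetical.

```python
import math


def quadratic_betas(T=50, beta_1=1e-4, beta_T=0.5):
    """Quadratic schedule: linear interpolation in sqrt(beta), then squared."""
    lo, hi = math.sqrt(beta_1), math.sqrt(beta_T)
    return [((T - t) / (T - 1) * lo + (t - 1) / (T - 1) * hi) ** 2
            for t in range(1, T + 1)]


def learning_rate(epoch, total_epochs=200, base_lr=1e-3):
    """Step decay by 10x at two milestones (75% and 90% are assumed here)."""
    first = int(0.75 * total_epochs)
    second = int(0.9 * total_epochs)
    if epoch >= second:
        return base_lr * 0.01
    if epoch >= first:
        return base_lr * 0.1
    return base_lr


betas = quadratic_betas()
```

Interpolating in $\sqrt{\beta}$ keeps early steps very gentle while still reaching a large final noise level within a small number of steps.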

CSDI addresses the absence of ground-truth missing values during training with a self-supervised masking strategy inspired by masked language modeling. An observed training sample is split into conditional observations $\mathbf{x}_0^{co}$ and artificial targets $\mathbf{x}_0^{ta}$, and the model learns to reconstruct the artificially hidden part from the remaining observed part. Four target-choice strategies are described: random masking, historical masking using another sample’s missing pattern, a mix of random and historical masking, and use of the exact test missing pattern when it is known, as in forecasting.
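Two of the target-choice strategies can be sketched as mask operations. This is an illustrative reading of the description above rather than the original code: function names are hypothetical, and the historical variant assumes that entries observed in the current sample but missing in the borrowed pattern become targets.

```python
import random


def split_random(observed_mask, target_ratio, rng):
    """Hide each observed entry with probability target_ratio."""
    cond, target = [], []
    for row in observed_mask:
        c_row, t_row = [], []
        for m in row:
            hide = m == 1 and rng.random() < target_ratio
            t_row.append(1 if hide else 0)
            c_row.append(1 if (m == 1 and not hide) else 0)
        cond.append(c_row)
        target.append(t_row)
    return cond, target


def split_historical(observed_mask, other_mask):
    """Borrow another sample's pattern: keep entries observed in both as context."""
    cond = [[m * o for m, o in zip(m_row, o_row)]
            for m_row, o_row in zip(observed_mask, other_mask)]
    target = [[m * (1 - o) for m, o in zip(m_row, o_row)]
              for m_row, o_row in zip(observed_mask, other_mask)]
    return cond, target


obs = [[1, 1, 0, 1], [0, 1, 1, 1]]
r_cond, r_target = split_random(obs, 0.5, random.Random(0))
h_cond, h_target = split_historical(obs, [[1, 0, 1, 1], [1, 1, 0, 0]])
```

In both cases the conditional and target masks partition the observed entries, so truly missing values are never used as training targets.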

4. Sampling procedure and empirical performance

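At test time, CSDI imputes by fixing the observed values as $\mathbf{x}_0^{co}$, initializing the missing entries as

$\mathbf{x}_T^{ta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$

and then iteratively sampling

$\mathbf{x}_{t-1}^{ta} \sim p_\theta(\mathbf{x}_{t-1}^{ta} \mid \mathbf{x}_t^{ta}, \mathbf{x}_0^{co})$

for $t = T, \ldots, 1$. The final $\mathbf{x}_0^{ta}$ is the imputed result. When a deterministic imputation is required, the reported procedure uses the median of many samples; the main experiments use 100 samples (Tashiro et al., 2021).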
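The sampling loop and the median-based point estimate can be sketched as follows. The sketch assumes the standard DDPM reverse step (posterior mean from the predicted noise, posterior variance $\frac{1-\alpha_{t-1}}{1-\alpha_t}\beta_t$, no noise at the last step); `eps_model` is a hypothetical stand-in for the trained denoiser, and the other names are also illustrative.

```python
import math
import random
import statistics


def impute_once(cond_obs, cond_mask, eps_model, betas, rng):
    """Run one reverse chain and return a K x L imputation."""
    alphas, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        alphas.append(running)
    # Noise is drawn everywhere for simplicity; only target entries are kept.
    x = [[rng.gauss(0.0, 1.0) for _ in row] for row in cond_obs]
    for t in range(len(betas), 0, -1):
        beta, alpha = betas[t - 1], alphas[t - 1]
        eps_hat = eps_model(x, cond_obs, cond_mask, t)
        sigma = 0.0 if t == 1 else math.sqrt((1.0 - alphas[t - 2]) / (1.0 - alpha) * beta)
        x = [[(xv - beta / math.sqrt(1.0 - alpha) * ev) / math.sqrt(1.0 - beta)
              + sigma * rng.gauss(0.0, 1.0)
              for xv, ev in zip(x_row, e_row)]
             for x_row, e_row in zip(x, eps_hat)]
    # Observed entries are copied through unchanged.
    return [[c if m == 1 else xv for c, m, xv in zip(c_row, m_row, x_row)]
            for c_row, m_row, x_row in zip(cond_obs, cond_mask, x)]


def median_imputation(samples):
    """Entry-wise median over a list of K x L imputations."""
    K, L = len(samples[0]), len(samples[0][0])
    return [[statistics.median(s[k][l] for s in samples) for l in range(L)]
            for k in range(K)]


def zero_eps(x, cond_obs, cond_mask, t):
    """Placeholder denoiser that predicts zero noise (for demonstration only)."""
    return [[0.0] * len(row) for row in x]


obs = [[1.5, 0.0], [0.0, -2.0]]
mask = [[1, 0], [0, 1]]
draws = [impute_once(obs, mask, zero_eps, [0.1, 0.2, 0.3], random.Random(i))
         for i in range(5)]
point_estimate = median_imputation(draws)
```

Keeping all draws gives the probabilistic output, while the entry-wise median collapses them into a single deterministic imputation.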

The original imputation experiments use two datasets. The PhysioNet 2012 ICU dataset contains 4000 clinical time series with 35 variables and 48 hourly time steps, with about 80% missingness; because no full ground truth is available, 10%, 50%, and 90% of observed values are hidden as test targets. The air-quality benchmark is Beijing PM2.5 data from 36 stations with 36 consecutive time steps per series and about 13% missingness. Probabilistic evaluation uses CRPS, and deterministic evaluation uses MAE, with RMSE reported in the appendix.
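Both metrics can be computed from a set of imputation samples. The CRPS function below uses the common sample-based estimator $\mathbb{E}|X-y| - \tfrac{1}{2}\mathbb{E}|X-X'|$, which is an assumption made here and not necessarily the approximation used in the original evaluation; function names are hypothetical.

```python
def crps_from_samples(samples, y):
    """Sample-based CRPS estimate for one scalar target y."""
    n = len(samples)
    spread_to_truth = sum(abs(s - y) for s in samples) / n
    spread_within = sum(abs(a - b) for a in samples for b in samples) / (2 * n * n)
    return spread_to_truth - spread_within


def mean_absolute_error(predictions, targets):
    """MAE between point predictions and ground-truth targets."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


crps_value = crps_from_samples([0.0, 2.0], 1.0)
mae_value = mean_absolute_error([1.0, 3.0], [2.0, 3.0])
```

With a single sample the CRPS estimate reduces to the absolute error, which is why the two metrics are directly comparable in scale.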

On probabilistic imputation, CSDI is compared with GP-VAE and an unconditional diffusion baseline on healthcare CRPS at 10%, 50%, and 90% missingness and on air-quality CRPS. The paper summarizes these results as roughly 40–65% CRPS improvement over probabilistic baselines.

On deterministic imputation using the median of 100 samples, healthcare MAE is reported at 10%, 50%, and 90% missingness for CSDI and BRITS, and at 10% missingness for GLIMA. On air quality, CSDI is compared with both BRITS and GLIMA. The reported deterministic gain is a 5–20% MAE reduction over the best deterministic methods.

The same framework also extends to interpolation and forecasting. On irregularly sampled healthcare data, CSDI outperforms Latent ODE and mTANs in interpolation CRPS at 10%, 50%, and 90% missingness. On five forecasting datasets, it is described as competitive with GP-copula, Transformer MAF, TLAE, and TimeGrad, with especially strong results on some datasets such as electricity and traffic. These results are significant because they show that the method’s probabilistic outputs do not preclude strong deterministic point estimates; the same conditional sampling machinery supports both use cases.

5. Extensions, variants, and adjacent developments

Later work frequently treats CSDI as the canonical conditional diffusion approach for imputation and modifies either the conditioning signal, the domain-specific encoder, or the reverse-time sampler. The resulting systems are usually best understood as CSDI-style models rather than wholly unrelated diffusion formulations.

Development	Addition relative to CSDI	Reported implication
TabCSDI (Zheng et al., 2022)	Adapts CSDI to tabular missing-value imputation; removes the temporal transformer layer; studies one-hot encoding, analog bits encoding, and feature tokenization	Best RMSE on 5 out of 7 datasets; feature tokenization gives the best categorical performance on Census
CoFILL (He et al., 8 Jun 2025)	CSDI-style conditional diffusion for spatiotemporal data with a TCN+GCN temporal stream, a DCT frequency stream, and cross-attention fusion	Best performance in 12 out of 15 experimental configurations
LSCD (Fons et al., 20 Jun 2025)	Adds conditioning by a differentiable Lomb–Scargle spectrum and a spectral consistency loss	Improves time-domain accuracy and spectral recovery, especially under heavy missingness and periodic structure
LSSDM (Liang et al., 2024)	Adds a latent-variable reconstruction stage before conditional diffusion refinement	Best overall imputation performance on AQI-36, P12, and PeMS-BAY
MissDDIM (Zhou et al., 5 Aug 2025)	Replaces stochastic DDPM-style sampling with deterministic DDIM sampling on a TabCSDI-style tabular backbone	Lower inference latency and deterministic outputs
Diffusion Transformers for Imputation (Ye et al., 2 Oct 2025)	Replaces the score network with a transformer and provides sample-complexity and confidence-region theory	Makes missing-pattern sensitivity and uncertainty quantification explicit
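The deterministic sampling mentioned in the MissDDIM row can be illustrated with the standard DDIM update at zero stochasticity. This is a generic sketch in the cumulative-$\alpha$ notation of DDPM, not the MissDDIM implementation, and the function name is hypothetical.

```python
import math


def ddim_step(x_t, eps_hat, alpha_t, alpha_prev):
    """Deterministic DDIM update (eta = 0) for a flat list of values.

    alpha_t and alpha_prev are cumulative products of (1 - beta) at the
    current and the next (less noisy) step.
    """
    x_prev = []
    for x, e in zip(x_t, eps_hat):
        # Estimate the clean value implied by the predicted noise.
        x0_hat = (x - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
        # Re-noise the estimate to the next level using the same noise direction.
        x_prev.append(math.sqrt(alpha_prev) * x0_hat + math.sqrt(1.0 - alpha_prev) * e)
    return x_prev


x0, eps = 2.0, 0.5
x_t = [math.sqrt(0.5) * x0 + math.sqrt(0.5) * eps]
x_prev = ddim_step(x_t, [eps], alpha_t=0.5, alpha_prev=0.8)
```

Because no fresh noise is injected, repeated runs from the same starting noise give the same output, and steps can be skipped to reduce latency.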

The CSDI formulation has also been reused outside conventional imputation. In physics-informed vehicle speed trajectory generation, a transformer-based CSDI model is used as a conditional generator over univariate speed sequences with cross-attention or condition injection in each transformer layer and soft physics-informed losses. In that setting, CSDI is evaluated by Wasserstein distance for speed and for acceleration and by a discriminative score, records 0% boundary violations, and outperforms a 1D U-Net diffusion baseline on most reported metrics (Sokolov et al., 4 Feb 2026).

6. Limitations, interpretive issues, and later theoretical framing

The original CSDI paper identifies several practical limitations. Sampling is slower than deterministic imputation methods because diffusion requires many reverse steps. Performance depends on the noise schedule, the mask or target-choice strategy, and the match between training missingness and test missingness. The method is primarily demonstrated on time series, even though later work adapts its conditional principle to tabular and spatiotemporal settings (Tashiro et al., 2021).

Another interpretive issue is taxonomic rather than empirical: not every diffusion-based imputer is a direct CSDI variant. DiffPuter, for example, is framed as an unconditional diffusion model embedded in an EM loop. Its M-step trains a diffusion model on the current completed-data estimate, and its E-step performs conditional sampling by a forward-on-observed and reverse-on-missing procedure. The paper explicitly contrasts this with CSDI’s design as a conditional model by construction and argues that CSDI-like conditional diffusion methods can struggle when training data itself is incomplete (Zhang et al., 2024).
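The forward-on-observed, reverse-on-missing idea can be sketched as a merge applied after each reverse step. The sketch is written in the DDPM notation of this article rather than DiffPuter's own parameterization, and the function name and signature are hypothetical.

```python
import math
import random


def merge_observed_and_missing(x_reverse, x0_observed, observed_mask, alpha_prev, rng):
    """Combine one reverse-step output with forward-noised observations.

    Observed positions are overwritten by a fresh forward sample of the known
    value at noise level alpha_prev; missing positions keep the reverse output.
    """
    merged = []
    for x_rev, x_obs, m in zip(x_reverse, x0_observed, observed_mask):
        if m == 1:
            noise = rng.gauss(0.0, 1.0)
            merged.append(math.sqrt(alpha_prev) * x_obs
                          + math.sqrt(1.0 - alpha_prev) * noise)
        else:
            merged.append(x_rev)
    return merged


merged = merge_observed_and_missing(
    x_reverse=[0.3, -0.7],
    x0_observed=[2.0, 0.0],
    observed_mask=[1, 0],
    alpha_prev=1.0,  # final step: no noise remains on observed entries
    rng=random.Random(0),
)
```

The contrast with CSDI is that conditioning here is imposed during sampling on an unconditional model, whereas CSDI feeds clean observations to a denoiser trained to use them.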

Later theoretical work also sharpens the role of missingness patterns. A transformer-based analysis of conditional diffusion imputation shows that statistical efficiency and confidence-region quality depend strongly on the missing pattern, with clustered missingness producing much worse conditional covariance conditioning than dispersed missingness. The same work proposes mixed-masking training to improve robustness across pattern types and presents this as a way to reduce distribution shift between training and test masks (Ye et al., 2 Oct 2025).
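The difference between dispersed and clustered missingness, and a simple mixture of the two, can be sketched for a single sequence. The mixing rule below (a coin flip per sample) is an assumption for illustration and is not taken from the cited work; all function names are hypothetical.

```python
import random


def dispersed_mask(length, n_missing, rng):
    """Missing positions scattered uniformly at random (0 = missing)."""
    missing = set(rng.sample(range(length), n_missing))
    return [0 if i in missing else 1 for i in range(length)]


def clustered_mask(length, n_missing, rng):
    """Missing positions form one contiguous block (0 = missing)."""
    start = rng.randrange(length - n_missing + 1)
    return [0 if start <= i < start + n_missing else 1 for i in range(length)]


def mixed_mask(length, n_missing, rng, p_clustered=0.5):
    """Pick a clustered or dispersed pattern at random for each training sample."""
    if rng.random() < p_clustered:
        return clustered_mask(length, n_missing, rng)
    return dispersed_mask(length, n_missing, rng)


rng = random.Random(0)
scattered = dispersed_mask(12, 4, rng)
block = clustered_mask(12, 4, rng)
training_masks = [mixed_mask(12, 4, rng) for _ in range(8)]
```

Both patterns hide the same number of entries, but a block leaves its interior far from any observation, which is the intuition behind the weaker conditioning under clustered missingness.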

Taken together, these developments suggest that CSDI is best understood as a conditional diffusion paradigm rather than a single fixed architecture. Its defining commitments are explicit conditioning on observed values, diffusion restricted to the target region, self-supervised masking during training, and probabilistic imputation by iterative denoising. Subsequent research has largely preserved that core while altering the conditioning pathway, the representation space, the sampler, or the theoretical framing.