
CSDI: Conditional Diffusion Imputation

Updated 4 July 2026
  • CSDI is a probabilistic imputation method that models the conditional distribution of missing entries using iterative reverse diffusion while keeping observed data fixed.
  • It employs a dual Transformer attention mechanism to capture temporal and feature dependencies, enhancing the accuracy of reconstructions from noisy initializations.
  • Empirical results demonstrate significant improvements in CRPS and MAE over traditional baselines on healthcare and air quality datasets using both probabilistic and deterministic outputs.

Conditional Score-Based Diffusion Imputation (CSDI) is a probabilistic imputation method for multivariate time series with missing values that directly models the conditional distribution of missing entries given observed entries, $q(\mathbf{x}_0^{\mu} \mid \mathbf{x}_0^{o})$. Introduced in 2021 for time series imputation, and also applied to interpolation and probabilistic forecasting, it treats missing-value recovery as conditional reverse diffusion: the target region is initialized with noise and iteratively denoised while the observed values remain fixed as conditioning information throughout the generation process (Tashiro et al., 2021).

1. Problem setting and conceptual position

CSDI is formulated for multivariate time series

$$\mathbf{X} \in \mathbb{R}^{K \times L},$$

where $K$ is the number of features and $L$ is the number of time points. Missingness is represented by a mask

$$\mathbf{M}^{time} \in \{0,1\}^{K \times L},$$

with $m_{k,l}=1$ if a value is observed and $m_{k,l}=0$ if it is missing. A sample is written as $\{\mathbf{X}, \mathbf{M}^{time}, \mathbf{s}\}$, where $\mathbf{s}$ denotes timestamps. Within this setup, CSDI targets three closely related tasks: standard imputation of arbitrary missing values, interpolation at irregularly sampled times, and forecasting of future time steps (Tashiro et al., 2021).
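As a concrete illustration of this data layout, the following minimal sketch builds one toy sample with a value matrix, an observation mask, and timestamps. The array sizes, the specific missing positions, and the dictionary keys are assumptions made for the example, not part of the original description.

```python
import numpy as np

# Toy sample: K = 3 features at L = 5 time points (all values are made up).
K, L = 3, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(K, L))

# Observation mask: 1 = observed, 0 = missing.
M = np.ones((K, L), dtype=int)
M[0, 2] = 0      # feature 0 is missing at time index 2
M[2, 3:] = 0     # feature 2 is missing at the last two time points

# Timestamps s; irregular spacing is allowed.
s = np.array([0.0, 0.5, 1.0, 2.5, 4.0])

# Missing entries carry no information, so they are zeroed out here.
X_observed = X * M
sample = {"X": X_observed, "M": M, "s": s}

missing_rate = 1.0 - M.mean()
print(f"missing rate: {missing_rate:.2f}")  # prints: missing rate: 0.20
```

Imputation, interpolation, and forecasting then differ only in which zero-mask positions are treated as targets: arbitrary entries, unobserved timestamps, or a trailing block of future steps.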

The method is positioned against several established imputation families. Autoregressive methods process time sequentially and often rely on RNN-like hidden-state dynamics; deterministic methods return a single imputed value; probabilistic latent-variable and Gaussian-process methods model uncertainty but may not exploit temporal-feature interactions as effectively; and naive unconditional diffusion approaches corrupt the observations themselves, weakening conditioning. CSDI is designed to avoid these limitations by explicitly learning the conditional reverse process for the missing region while leaving the observed region uncorrupted and available as context.

A recurrent misconception is that CSDI is simply a diffusion generator with masking added at input time. The original formulation is stricter: it learns the reverse dynamics of the missing part conditioned on the observed part, rather than approximating conditional imputation through an unconditional generative model. This distinction is central to its empirical behavior and to its later reuse as a canonical conditional diffusion backbone.

2. Conditional diffusion formulation

CSDI builds on DDPM-style diffusion. The forward process is

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \quad q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\right),$$

with equivalent closed form

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\left(\sqrt{\alpha_t}\,\mathbf{x}_0, (1-\alpha_t)\mathbf{I}\right), \qquad \alpha_t := \prod_{i=1}^{t} (1-\beta_i).$$

Hence,

$$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$
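The practical value of the closed form is that a noisy state at any step can be drawn in one shot, without simulating the chain step by step. The sketch below assumes the standard DDPM identity (scale the clean data by the square root of the cumulative product of $1-\beta_i$ and add Gaussian noise with the complementary variance). The function name `forward_noise` and the linear schedule used for the demo are hypothetical choices for illustration.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Draw x_t from q(x_t | x_0) directly; t is a 1-based diffusion step."""
    alpha = np.cumprod(1.0 - betas)          # cumulative product of (1 - beta_i)
    a_t = alpha[t - 1]
    eps = rng.standard_normal(x0.shape)      # the noise a denoiser would have to predict
    x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)         # hypothetical schedule, for illustration only
x0 = np.ones((3, 5))

x_early, _ = forward_noise(x0, 1, betas, rng)    # almost identical to x0
x_late, _ = forward_noise(x0, 100, betas, rng)   # signal largely replaced by noise
print(np.abs(x_early - x0).mean(), np.abs(x_late - x0).mean())
```

Returning `eps` alongside `x_t` is a convenience: noise-prediction training needs exactly this pair.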

The unconditional reverse chain is

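$$p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t), \qquad \mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

with

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) := \mathcal{N}\!\left(\boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_\theta(\mathbf{x}_t, t)\mathbf{I}\right).$$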

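CSDI replaces this with a conditional reverse process defined only on the imputation target:

$$p_\theta(\mathbf{x}_{0:T}^{ta} \mid \mathbf{x}_0^{co}) := p(\mathbf{x}_T^{ta}) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}^{ta} \mid \mathbf{x}_t^{ta}, \mathbf{x}_0^{co}), \qquad \mathbf{x}_T^{ta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

where

$$p_\theta(\mathbf{x}_{t-1}^{ta} \mid \mathbf{x}_t^{ta}, \mathbf{x}_0^{co}) := \mathcal{N}\!\left(\boldsymbol{\mu}_\theta(\mathbf{x}_t^{ta}, t \mid \mathbf{x}_0^{co}), \sigma_\theta(\mathbf{x}_t^{ta}, t \mid \mathbf{x}_0^{co})\mathbf{I}\right).$$

Here the superscript $ta$ marks the imputation targets and $co$ marks the conditional observations. The denoiser is therefore a conditional noise predictor

$$\boldsymbol{\epsilon}_\theta(\mathbf{x}_t^{ta}, t \mid \mathbf{x}_0^{co}),$$

and training uses the standard DDPM noise-prediction objective, restricted to the imputation target:

$$\min_\theta \mathcal{L}(\theta) := \mathbb{E}_{\mathbf{x}_0 \sim q(\mathbf{x}_0),\, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\, t} \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t^{ta}, t \mid \mathbf{x}_0^{co}) \right\|_2^2.$$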

The conditioning mechanism is operationally simple but methodologically important. Missing values are initialized as noise, whereas observed values are supplied as conditioning input. The network sees the noisy target region $\mathbf{x}_t^{ta}$, the observed context $\mathbf{x}_0^{co}$, and a mask $\mathbf{m}^{co}$ indicating which positions are conditional observations. Observed values are not noised during sampling. This is what makes the method an imputation-specific conditional diffusion model rather than a generic unconditional diffusion model adapted post hoc (Tashiro et al., 2021).
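A single loss evaluation under this scheme might look like the sketch below. It is a simplification under stated assumptions: `toy_denoiser` is a hypothetical placeholder that always predicts zero noise (a real model would be a neural network), `masked_noise_loss` is an illustrative name, and the schedule values are arbitrary. The sketch shows only two things: that noise is placed on the target region while the conditioning region stays clean, and that the squared error is counted on target positions only.

```python
import numpy as np

def toy_denoiser(noisy_target, cond_obs, cond_mask, t):
    """Hypothetical placeholder for the noise predictor; always predicts zero noise."""
    return np.zeros_like(noisy_target)

def masked_noise_loss(x0, cond_mask, target_mask, t, alpha, rng, denoiser=toy_denoiser):
    """Noise-prediction loss averaged over target positions only (t is 1-based)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha[t - 1]) * x0 + np.sqrt(1.0 - alpha[t - 1]) * eps
    noisy_target = x_t * target_mask     # noise lives only on the target region
    cond_obs = x0 * cond_mask            # observed context is passed in clean
    eps_hat = denoiser(noisy_target, cond_obs, cond_mask, t)
    sq_err = (eps - eps_hat) ** 2 * target_mask
    return sq_err.sum() / max(target_mask.sum(), 1)

rng = np.random.default_rng(0)
alpha = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 100))   # illustrative schedule
x0 = rng.normal(size=(3, 5))
obs_mask = np.ones((3, 5), dtype=int)
target_mask = np.zeros((3, 5), dtype=int)
target_mask[1, 2:4] = 1                  # two entries act as imputation targets
cond_mask = obs_mask - target_mask
loss = masked_noise_loss(x0, cond_mask, target_mask, t=10, alpha=alpha, rng=rng)
print(loss)
```

Because the error is multiplied by `target_mask`, positions used as conditioning contribute nothing to the gradient, which is the sense in which the objective is restricted to the imputation target.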

3. Architecture, masking strategy, and implementation

A central architectural choice in CSDI is a 2D attention mechanism that models dependence along both axes of a multivariate time series. The denoiser uses a temporal Transformer layer that attends across time for each feature and a feature Transformer layer that attends across features for each time point. The rationale is explicit: imputation quality improves when a missing value can be inferred from nearby time points, correlated variables, and cross-feature patterns. Ablation results reported in the original study show that removing either temporal or feature attention hurts performance (Tashiro et al., 2021).
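The two-axis idea can be sketched with plain NumPy. This is a deliberately stripped-down illustration, not the reported architecture: it uses single-head attention with identity projections instead of learned multi-head Transformer layers, and the function names are hypothetical. It only shows how the same attention primitive is applied first along time and then along features.

```python
import numpy as np

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    tokens has shape (..., n, d); information is mixed along the n axis only.
    """
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def two_axis_attention(h):
    """h has shape (K, L, C): K features, L time points, C channels."""
    h = h + self_attention(h)            # temporal pass: each feature attends over L
    h_t = np.swapaxes(h, 0, 1)           # reshape to (L, K, C)
    h_t = h_t + self_attention(h_t)      # feature pass: each time point attends over K
    return np.swapaxes(h_t, 0, 1)        # back to (K, L, C)

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 5, 4))
print(two_axis_attention(h).shape)       # (3, 5, 4)
```

After the temporal pass alone, a feature's representation depends only on its own history; the feature pass is what lets a missing value borrow information from correlated variables at the same time point.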

The base architecture is inspired by DiffWave. The reported implementation uses […] diffusion steps, 64 residual channels, 4 residual layers, 8 attention heads, about 415k parameters, a sinusoidal time embedding of dimension 128, and a feature embedding of dimension 16. Variable-length sequences are handled by zero padding, and linear attention is used for forecasting settings with many features or long sequences.

The noise schedule is quadratic: $\beta_t = […]$, with $\beta_1 = […]$ and $\beta_T = […]$. Training uses Adam, with batch size 16 for imputation, batch size 8 for forecasting, 200 epochs for imputation, and a learning rate that decays from […] to […] and […].
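One common way to realize a quadratic schedule is to interpolate linearly between the square roots of the smallest and largest noise levels and then square the result. The sketch below follows that construction. The function name and the numeric arguments (`num_steps=50`, `beta_start=1e-4`, `beta_end=0.5`) are illustrative assumptions, not values taken from the text above.

```python
import numpy as np

def quadratic_beta_schedule(num_steps, beta_start, beta_end):
    """Interpolate linearly in sqrt(beta) space, then square."""
    return np.linspace(np.sqrt(beta_start), np.sqrt(beta_end), num_steps) ** 2

# Illustrative values only; treat them as assumptions.
betas = quadratic_beta_schedule(num_steps=50, beta_start=1e-4, beta_end=0.5)
alpha = np.cumprod(1.0 - betas)     # cumulative signal retention per step

print(betas[:3], betas[-1])
print(alpha[0], alpha[-1])
```

Compared with a linear schedule over the same endpoints, this construction keeps the early steps very gentle and concentrates most of the corruption in the last steps.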

CSDI addresses the absence of ground-truth missing values during training with a self-supervised masking strategy inspired by masked language modeling. An observed training sample is split into conditional observations $\mathbf{x}_0^{co}$ and artificial targets $\mathbf{x}_0^{ta}$, and the model learns to reconstruct the artificially hidden part from the remaining observed part. Four target-choice strategies are described: random masking, historical masking using another sample's missing pattern, a mix of random and historical masking, and use of the exact test missing pattern when it is known, as in forecasting.
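The first three strategies can be sketched as mask operations on the observation mask. The function names, the uniform draw of the hiding ratio, and the 50/50 choice in the mixed strategy are assumptions made for this illustration; the text above does not fix those details.

```python
import numpy as np

def random_target_mask(obs_mask, rng, max_ratio=1.0):
    """Hide a randomly sized fraction of the observed entries as training targets."""
    ratio = rng.uniform(0.0, max_ratio)
    hide = (rng.uniform(size=obs_mask.shape) < ratio) & (obs_mask == 1)
    target_mask = hide.astype(int)
    cond_mask = obs_mask - target_mask
    return cond_mask, target_mask

def historical_target_mask(obs_mask, other_obs_mask):
    """Hide entries that are observed here but missing in another sample."""
    cond_mask = obs_mask * other_obs_mask
    target_mask = obs_mask - cond_mask
    return cond_mask, target_mask

def mixed_target_mask(obs_mask, other_obs_mask, rng):
    """Pick one of the two strategies at random (equal odds is an assumption)."""
    if rng.uniform() < 0.5:
        return random_target_mask(obs_mask, rng)
    return historical_target_mask(obs_mask, other_obs_mask)

rng = np.random.default_rng(0)
obs = (rng.uniform(size=(4, 6)) < 0.7).astype(int)
other = (rng.uniform(size=(4, 6)) < 0.7).astype(int)
cond, target = mixed_target_mask(obs, other, rng)
print(obs.sum(), cond.sum(), target.sum())
```

In every case the two returned masks partition the observed entries: targets are always drawn from values whose ground truth is known, so the reconstruction error is well defined.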

4. Sampling procedure and empirical performance

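At test time, CSDI imputes by fixing the observed values as $\mathbf{x}_0^{co}$, initializing the missing entries as

$$\mathbf{x}_T^{ta} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

and then iteratively sampling

$$\mathbf{x}_{t-1}^{ta} \sim p_\theta(\mathbf{x}_{t-1}^{ta} \mid \mathbf{x}_t^{ta}, \mathbf{x}_0^{co})$$

for $t = T, \ldots, 1$. The final $\mathbf{x}_0^{ta}$ is the imputed result. When a deterministic imputation is required, the reported procedure uses the median of many samples; the main experiments use 100 samples (Tashiro et al., 2021).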
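The sampling loop and the median reduction can be sketched as follows. This is a sketch under stated assumptions: the update uses the standard DDPM ancestral-sampling mean and variance, `toy_denoiser` is a hypothetical placeholder that predicts zero noise (so the imputed values here are meaningless), and the schedule is arbitrary. What the example does demonstrate is the control flow: noise initialization on missing entries, reverse iteration from the last step to the first, observed entries held fixed, and a per-entry median over repeated draws.

```python
import numpy as np

def toy_denoiser(noisy_target, cond_obs, cond_mask, t):
    """Hypothetical placeholder for a trained noise predictor."""
    return np.zeros_like(noisy_target)

def impute_once(cond_obs, cond_mask, betas, rng, denoiser=toy_denoiser):
    T = len(betas)
    alpha_hat = 1.0 - betas
    alpha = np.cumprod(alpha_hat)
    target_mask = 1 - cond_mask
    x = rng.standard_normal(cond_obs.shape) * target_mask   # start from pure noise
    for t in range(T, 0, -1):                                # t = T, ..., 1
        eps_hat = denoiser(x, cond_obs, cond_mask, t)
        mean = (x - betas[t - 1] / np.sqrt(1.0 - alpha[t - 1]) * eps_hat) / np.sqrt(alpha_hat[t - 1])
        if t > 1:
            sigma = np.sqrt((1.0 - alpha[t - 2]) / (1.0 - alpha[t - 1]) * betas[t - 1])
            x = mean + sigma * rng.standard_normal(x.shape)
        else:
            x = mean                                         # no noise on the final step
        x = x * target_mask                                  # only targets are updated
    return cond_obs * cond_mask + x * target_mask            # observed values untouched

def impute_median(cond_obs, cond_mask, betas, rng, n_samples=100):
    samples = np.stack([impute_once(cond_obs, cond_mask, betas, rng) for _ in range(n_samples)])
    return np.median(samples, axis=0), samples

rng = np.random.default_rng(0)
betas = np.linspace(1e-2, 0.3, 10)                           # illustrative schedule
cond_mask = np.array([[1, 0, 1], [0, 1, 1]])
cond_obs = np.array([[1.0, 0.0, 3.0], [0.0, 5.0, 6.0]])
point_estimate, samples = impute_median(cond_obs, cond_mask, betas, rng, n_samples=7)
print(point_estimate.shape, samples.shape)
```

Keeping the full `samples` array alongside the median is what allows the same run to serve both probabilistic and deterministic evaluation.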

The original imputation experiments use two datasets. The PhysioNet 2012 ICU dataset contains 4000 clinical time series with 35 variables and 48 hourly time steps, with about 80% missingness; because no full ground truth is available, 10%, 50%, and 90% of observed values are hidden as test targets. The air-quality benchmark is Beijing PM2.5 data from 36 stations with 36 consecutive time steps per series and about 13% missingness. Probabilistic evaluation uses CRPS, and deterministic evaluation uses MAE, with RMSE reported in the appendix.
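For readers unfamiliar with the two metrics, the sketch below shows one generic way to compute them from samples. The CRPS estimator uses the well-known expectation form (mean absolute error of the samples minus half the mean absolute difference between pairs of samples); this is an assumption about a reasonable estimator, not a claim about the evaluation code used in the paper, which may discretize or normalize differently. Function names are hypothetical.

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS for one scalar target y: E|X - y| - 0.5 * E|X - X'|."""
    samples = np.asarray(samples, dtype=float)
    term_accuracy = np.mean(np.abs(samples - y))
    term_spread = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term_accuracy - term_spread

def masked_mae(pred, truth, target_mask):
    """Mean absolute error over the hidden test targets only."""
    return (np.abs(pred - truth) * target_mask).sum() / max(target_mask.sum(), 1)

# A point-mass forecast reduces CRPS to absolute error.
print(crps_from_samples([3.0, 3.0], 1.0))    # 2.0
# Spread around the truth is rewarded relative to a confident miss.
print(crps_from_samples([0.0, 2.0], 1.0))    # 0.5

pred = np.array([[1.0, 2.0], [3.0, 4.0]])
truth = np.array([[1.0, 0.0], [3.0, 1.0]])
mask = np.array([[0, 1], [0, 1]])
print(masked_mae(pred, truth, mask))         # 2.5
```

Lower is better for both metrics. CRPS scores the whole predictive distribution, while MAE scores only a point estimate such as the per-entry median.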

On probabilistic imputation, the reported CRPS results for healthcare are […], […], and […] at 10%, 50%, and 90% missingness, respectively; GP-VAE yields […], […], and […], and an unconditional diffusion baseline yields […], […], and […]. On air quality, CSDI attains CRPS […], compared with […] for GP-VAE and […] for unconditional diffusion. The paper summarizes these results as roughly 40–65% CRPS improvement over probabilistic baselines.

On deterministic imputation using the median of 100 samples, the reported healthcare MAE values are […], […], and […] at 10%, 50%, and 90% missingness. BRITS yields […], […], and […], while GLIMA reports […] at 10% missingness. On air quality, CSDI reaches MAE […], compared with […] for BRITS and […] for GLIMA. The reported deterministic gain is a 5–20% MAE reduction over the best deterministic methods.

The same framework also extends to interpolation and forecasting. On irregularly sampled healthcare data, CSDI outperforms Latent ODE and mTANs, with interpolation CRPS […], […], and […] at 10%, 50%, and 90% missingness. On five forecasting datasets, it is described as competitive with GP-copula, Transformer MAF, TLAE, and TimeGrad, with especially strong results on some datasets such as electricity and traffic. These results are significant because they show that the method's probabilistic outputs do not preclude strong deterministic point estimates; the same conditional sampling machinery supports both use cases.

5. Extensions, variants, and adjacent developments

Later work frequently treats CSDI as the canonical conditional diffusion approach for imputation and modifies either the conditioning signal, the domain-specific encoder, or the reverse-time sampler. The resulting systems are usually best understood as CSDI-style models rather than wholly unrelated diffusion formulations.

| Development | Addition relative to CSDI | Reported implication |
|---|---|---|
| TabCSDI (Zheng et al., 2022) | Adapts CSDI to tabular missing-value imputation; removes the temporal transformer layer; studies one-hot encoding, analog bits encoding, and feature tokenization | Best RMSE on 5 out of 7 datasets; feature tokenization gives the best categorical performance on Census |
| CoFILL (He et al., 8 Jun 2025) | CSDI-style conditional diffusion for spatiotemporal data with a TCN+GCN temporal stream, a DCT frequency stream, and cross-attention fusion | Best performance in 12 out of 15 experimental configurations |
| LSCD (Fons et al., 20 Jun 2025) | Adds conditioning by a differentiable Lomb–Scargle spectrum and a spectral consistency loss | Improves time-domain accuracy and spectral recovery, especially under heavy missingness and periodic structure |
| LSSDM (Liang et al., 2024) | Adds a latent-variable reconstruction stage before conditional diffusion refinement | Best overall imputation performance on AQI-36, P12, and PeMS-BAY |
| MissDDIM (Zhou et al., 5 Aug 2025) | Replaces stochastic DDPM-style sampling with deterministic DDIM sampling on a TabCSDI-style tabular backbone | Lower inference latency and deterministic outputs |
| Diffusion Transformers for Imputation (Ye et al., 2 Oct 2025) | Replaces the score network with a transformer and provides sample-complexity and confidence-region theory | Makes missing-pattern sensitivity and uncertainty quantification explicit |

The CSDI formulation has also been reused outside conventional imputation. In physics-informed vehicle speed trajectory generation, a transformer-based CSDI model is used as a conditional generator over univariate speed sequences with cross-attention or condition injection in each transformer layer and soft physics-informed losses. In that setting, CSDI achieves Wasserstein distance […] for speed, […] for acceleration, a discriminative score of […], and 0% boundary violations, outperforming a 1D U-Net diffusion baseline on most reported metrics (Sokolov et al., 4 Feb 2026).

6. Limitations, interpretive issues, and later theoretical framing

The original CSDI paper identifies several practical limitations. Sampling is slower than deterministic imputation methods because diffusion requires many reverse steps. Performance depends on the noise schedule, the mask or target-choice strategy, and the match between training missingness and test missingness. The method is primarily demonstrated on time series, even though later work adapts its conditional principle to tabular and spatiotemporal settings (Tashiro et al., 2021).

Another interpretive issue is taxonomic rather than empirical: not every diffusion-based imputer is a direct CSDI variant. DiffPuter, for example, is framed as an unconditional diffusion model embedded in an EM loop. Its M-step trains a diffusion model on the current completed-data estimate, and its E-step performs conditional sampling by a forward-on-observed and reverse-on-missing procedure. The paper explicitly contrasts this with CSDI’s design as a conditional model by construction and argues that CSDI-like conditional diffusion methods can struggle when training data itself is incomplete (Zhang et al., 2024).

Later theoretical work also sharpens the role of missingness patterns. A transformer-based analysis of conditional diffusion imputation shows that statistical efficiency and confidence-region quality depend strongly on the missing pattern, with clustered missingness producing much worse conditional covariance conditioning than dispersed missingness. The same work proposes mixed-masking training to improve robustness across pattern types and presents this as a way to reduce distribution shift between training and test masks (Ye et al., 2 Oct 2025).

Taken together, these developments suggest that CSDI is best understood as a conditional diffusion paradigm rather than a single fixed architecture. Its defining commitments are explicit conditioning on observed values, diffusion restricted to the target region, self-supervised masking during training, and probabilistic imputation by iterative denoising. Subsequent research has largely preserved that core while altering the conditioning pathway, the representation space, the sampler, or the theoretical framing.
