
Tabular Denoising Diffusion Models

Updated 26 January 2026
  • Tabular Denoising Diffusion Probabilistic Models are generative techniques that employ diffusion processes to synthesize, complete, and impute heterogeneous tabular datasets using both continuous and categorical noise schedules.
  • They utilize a dual Markov chain framework with tailored neural denoisers and dedicated noise processes, outperforming traditional GANs, VAEs, and imputation techniques in fidelity and privacy.
  • These models support synthetic generation, conditional imputation, class-conditional augmentation, and federated training, demonstrating strong performance on privacy, utility, and statistical similarity metrics.

Tabular Denoising Diffusion Probabilistic Models (TabDDPM) constitute a class of generative models that leverage the denoising diffusion probabilistic framework to synthesize, complete, and impute heterogeneous tabular data. Originally adapted from diffusion models in image domains, TabDDPM extends these principles to mixed-type tabular data, handling both continuous and categorical features through dedicated noise processes and neural denoisers. Across a range of recent works, the paradigm has demonstrated superior sample fidelity, utility, and flexibility relative to GANs, VAEs, and traditional oversampling or imputation techniques, particularly in privacy-sensitive and imbalanced data regimes.

1. Mathematical Foundations: Diffusion Processes for Mixed-Type Tabular Data

TabDDPM formalizes the data-generating process as a pair of Markov chains: a fixed forward noising process and a trainable reverse denoising process, applied independently or jointly to heterogeneous tabular features.

For continuous features, the forward corruption at timestep $t$ is defined as:

q(x_t^{\mathrm{num}} \mid x_{t-1}^{\mathrm{num}}) = \mathcal{N}\left(x_t^{\mathrm{num}};\ \sqrt{1-\beta_t}\, x_{t-1}^{\mathrm{num}},\ \beta_t I\right)

where $\beta_t$ is the diffusion noise schedule, typically increasing linearly from $10^{-4}$ to $2\times 10^{-2}$ over $T=500$ or $T=1000$ steps (Kotelnikov et al., 2022, Sattarov et al., 2024). Closed-form marginalization yields:

q(x_t^{\mathrm{num}} \mid x_0^{\mathrm{num}}) = \mathcal{N}\left(x_t^{\mathrm{num}};\ \sqrt{\bar{\alpha}_t}\, x_0^{\mathrm{num}},\ (1-\bar{\alpha}_t) I\right)

with $\bar{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$.
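The closed-form marginalization above takes only a few lines of NumPy. The schedule bounds and step count follow the values quoted above; the function names are illustrative, not taken from the cited codebases:

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule beta_1..beta_T (bounds as quoted above)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample_numeric(x0, t, alpha_bar):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x0, (1 - abar_t) I) in one shot."""
    eps = np.random.randn(*x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is kept as the regression target during training

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)
```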

For each categorical variable (one-hot encoding of $K$ categories):

q(x_t^{(i)} \mid x_{t-1}^{(i)}) = \mathrm{Cat}\Bigl(x_t^{(i)};\ (1-\beta_t)\, x_{t-1}^{(i)} + \frac{\beta_t}{K}\mathbf{1}\Bigr)

so that, at each step, a portion of mass is diffused towards the uniform distribution over categories (Kotelnikov et al., 2022, Ceritli et al., 2023).
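A single forward step of this multinomial process can be sketched as follows; the helper name is hypothetical, and the resampling-to-one-hot step is one common realization of the categorical transition:

```python
import numpy as np

def q_step_categorical(x_prev, beta_t):
    """One forward step: mix the current one-hot vector with the uniform
    distribution over K categories, then resample a one-hot state."""
    K = x_prev.shape[-1]
    probs = (1.0 - beta_t) * x_prev + beta_t / K  # rows still sum to 1
    idx = np.array([np.random.choice(K, p=p) for p in probs.reshape(-1, K)])
    return np.eye(K)[idx].reshape(x_prev.shape)
```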

The reverse generative process is parameterized for numericals as:

p_\theta(x_{t-1}^{\mathrm{num}} \mid x_t^{\mathrm{num}}) = \mathcal{N}\left(x_{t-1}^{\mathrm{num}};\ \mu_\theta(x_t^{\mathrm{num}}, t),\ \beta_t I\right)

with mean

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)

where $\epsilon_\theta$ is a neural network trained to predict the added noise.
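A minimal sketch of one reverse step for the numeric block, plugging the mean formula above into the Gaussian transition; `eps_pred` stands in for the network output $\epsilon_\theta(x_t, t)$:

```python
import numpy as np

def p_step_numeric(x_t, t, eps_pred, betas, alpha_bar):
    """One reverse step x_t -> x_{t-1} given the predicted noise eps_pred."""
    alpha_t = 1.0 - betas[t]
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alpha_t)
    if t == 0:
        return mean  # final step is taken deterministically
    return mean + np.sqrt(betas[t]) * np.random.randn(*x_t.shape)
```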

For categoricals, the model learns to predict either a clean one-hot vector or logits $h_\theta(x_t, t)$, which are passed through a softmax to produce category probabilities for the reverse step:

p_\theta(x_{t-1}^{(i)} \mid x_t^{(i)}) = \mathrm{Cat}\bigl(x_{t-1}^{(i)};\ \mathrm{softmax}(h_\theta^{(i)}(x_t, t))\bigr)

(Kotelnikov et al., 2022, Ceritli et al., 2023, Villaizán-Vallelado et al., 2024).

The unified loss aggregates the mean-squared error (MSE) over numerical features with the Kullback–Leibler (KL) divergence between the true and predicted per-step distributions for categorical features.
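A simplified version of this combined objective, with illustrative helper names and the categorical KL computed directly between per-row distributions:

```python
import numpy as np

def mse_loss(eps_true, eps_pred):
    """Numeric score-matching term: MSE between true and predicted noise."""
    return np.mean((eps_true - eps_pred) ** 2)

def categorical_kl(q_probs, p_probs, eps=1e-12):
    """Mean KL(q || p) between per-row categorical distributions."""
    return np.mean(np.sum(q_probs * (np.log(q_probs + eps) - np.log(p_probs + eps)), axis=-1))

def tabddpm_loss(eps_true, eps_pred, q_cat, p_cat, n_cat_features=1):
    """Numeric MSE plus the categorical KL, normalized by the number of
    categorical columns (one simple realization of the combined loss)."""
    return mse_loss(eps_true, eps_pred) + categorical_kl(q_cat, p_cat) / max(n_cat_features, 1)
```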

2. Architectures for Tabular Denoising Networks

TabDDPM fundamentally employs feedforward multi-layer perceptrons (MLPs) to parameterize the denoising network, with input construction reflecting the heterogeneity of tabular columns.

  • MLP Architecture: Typical implementations use 2–4 hidden layers with 128–1024 units per layer and ReLU or SiLU nonlinearities (Kotelnikov et al., 2022, Sattarov et al., 2024). A time-step embedding (e.g., sinusoidal) is concatenated with or added to the feature vector. Categorical columns are embedded as trainable low-dimensional vectors or one-hot encoded.
  • Advanced Architectures:
    • SEMST-ResNet in SEMRes-DDPM (Zheng et al., 2024) introduces stacked residual blocks combining fully connected layers, multi-head self-attention, and dynamic soft-thresholding, improving denoising efficacy over standard MLPs (peak SNR improvements: 12.1 dB with SEMST-ResNet vs 8.7 dB with MLP).
    • Transformer-based Denoisers: Recent extensions parameterize the denoising network as a transformer or encoder–decoder transformer, capturing inter-feature dependencies and supporting dynamic masking and conditioning (Leung et al., 2024, Villaizán-Vallelado et al., 2024, Wen et al., 2024).
    • Harmonization and Specialized Heads: Certain imputation frameworks (e.g., DiffImpute) leverage harmonization loops and separate denoising heads per feature, supporting batchwise dynamic masking and accelerated sampling (Wen et al., 2024).
  • Categorical Handling: Native multinomial diffusion processes for one-hot categorical features are preferred over integer encoding or post-hoc rounding.
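The sinusoidal time-step embedding used by the MLP denoisers above can be sketched as follows; the dimension and frequency base are conventional choices, not values fixed by the cited works:

```python
import numpy as np

def sinusoidal_time_embedding(t, dim=128):
    """Embed the scalar diffusion timestep t as [sin(t * f_i), cos(t * f_i)]
    over geometrically spaced frequencies f_i, as in standard DDPM setups."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])
```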

3. Training Procedure and Losses

Training optimizes a variational lower bound (ELBO) or, more commonly, a “simple” denoising score-matching loss:

L_t = \mathbb{E}_{x_0, \epsilon, t}\left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]

for numeric variables, and for categoricals:

L_t^{(i)} = \mathbb{E}_{x_0, t}\left[ \mathrm{KL}\bigl( q(x_{t-1}^{(i)} \mid x_t^{(i)}, x_0^{(i)}) \,\|\, p_\theta(x_{t-1}^{(i)} \mid x_t^{(i)}) \bigr) \right]

(Kotelnikov et al., 2022, Ceritli et al., 2023, Sattarov et al., 2024). Timesteps $t$ are sampled uniformly, and networks are trained via Adam or AdamW optimizers.

The total loss is summed across timesteps and feature types, optionally normalized by the number of features/columns.

4. Sampling, Conditioning, and Unified Generation-Imputation

TabDDPM sampling runs the reverse Markov chain from pure noise to data, applying at each step a Gaussian denoising update to the numeric block and a categorical sampling update to each discrete column. Recent models support various conditioning and unified workflows:

  • Synthetic Generation: All features are noised and denoised, producing fully synthetic data (Kotelnikov et al., 2022, Ceritli et al., 2023).
  • Conditional Density Estimation: Transformer-based TabDDPM variants can provide $p(x_0 \mid \{x_i\})$ for arbitrary context sets, supporting scientific inference and uncertainty estimation (Leung et al., 2024). Conditional tokens and dynamic attention allow flexible queries and missing data handling.
  • Imputation: By masking only missing values during the forward process and conditioning on observed cells, models impute missing entries in a manner consistent with the observed distribution (Wen et al., 2024, Villaizán-Vallelado et al., 2024).
  • Oversampling: In imbalanced-class settings, per-class TabDDPMs generate minority-class samples for data augmentation, resulting in improved downstream recall and F1 for rare categories (B et al., 19 Jan 2026, Zheng et al., 2024).
  • Federated Training: FedTabDiff (Sattarov et al., 2024) orchestrates synchronous model updates and federated averaging across multiple local clients, enabling synthetic data generation without centralizing sensitive data.
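One common way to realize the mask-and-condition imputation described above is to re-noise the observed cells to the current timestep at every reverse step, so that only the missing cells are actually generated; a sketch under that assumption (the helper name is hypothetical):

```python
import numpy as np

def impute_step(x_t, x_obs, mask, t, alpha_bar):
    """Blend the reverse-chain sample with forward-noised observed values:
    observed cells (mask == 1) are reset to q(x_t | x_obs) each step, so the
    chain only generates the missing cells (mask == 0)."""
    noised_obs = (np.sqrt(alpha_bar[t]) * x_obs
                  + np.sqrt(1.0 - alpha_bar[t]) * np.random.randn(*x_obs.shape))
    return mask * noised_obs + (1.0 - mask) * x_t
```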

Pseudocode for the reverse generation (sampling) chain follows the alternating schedule of denoising numerics via $\mu_\theta$ and sampling categoricals from the network's predicted logits, as formalized in (Kotelnikov et al., 2022, Ceritli et al., 2023).
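A minimal NumPy sketch of this reverse chain for the numeric block (the categorical branch, omitted here, would sample each column from its softmaxed logits at every step); `denoiser` stands in for the trained noise-prediction network:

```python
import numpy as np

def sample(denoiser, n_rows, n_num, betas, alpha_bar):
    """Run the reverse Markov chain from pure Gaussian noise to data."""
    x = np.random.randn(n_rows, n_num)
    for t in reversed(range(len(betas))):
        eps = denoiser(x, t)                      # predicted noise at step t
        alpha_t = 1.0 - betas[t]
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_t)
        if t > 0:                                 # add noise except at the last step
            x += np.sqrt(betas[t]) * np.random.randn(*x.shape)
    return x
```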

5. Evaluation Metrics and Empirical Performance

TabDDPM models are evaluated on fidelity, utility, privacy, and coverage:

  • Fidelity (Ω): row- and column-wise similarity (Kolmogorov–Smirnov statistic for numerics, total variation distance for categoricals). Example: Ω = 0.590 on Philadelphia (FedTabDiff).
  • Utility (Φ): accuracy of a downstream classifier trained on synthetic data and tested on a real holdout. Example: Φ = 0.837 on Philadelphia (FedTabDiff).
  • Coverage (γ): fraction of real categories covered and alignment of numeric ranges. Example: γ = 0.944 on Philadelphia (FedTabDiff).
  • Privacy (DCR): median distance to the nearest real neighbor (higher is safer). Example: DCR = 3.120 on Diabetes (FedTabDiff).
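The DCR metric above can be computed directly; this sketch assumes plain Euclidean distance on already-preprocessed feature matrices:

```python
import numpy as np

def distance_to_closest_record(synthetic, real):
    """Median Euclidean distance from each synthetic row to its nearest real
    row; higher values suggest samples are not near-copies of real records."""
    dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    return np.median(dists.min(axis=1))
```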

On medical, financial, and cybersecurity tabular benchmarks, TabDDPM outperforms VAEs, GANs (e.g., medGAN, CTGAN), and SMOTE on fidelity, downstream ML efficiency, and often on privacy (DCR), though high fidelity can reduce DCR by producing records close to real observations (Ceritli et al., 2023, Kotelnikov et al., 2022, Sattarov et al., 2024, B et al., 19 Jan 2026). Augmentation and imputation experiments show transformer- and SEMST-ResNet–based TabDDPM variants further improve statistical similarity and downstream ML metrics, with observed average F1, G-mean, and AUC surpassing baselines in imbalanced datasets (Zheng et al., 2024, Wen et al., 2024, Villaizán-Vallelado et al., 2024).

6. Privacy, Federated Learning, and Regularization

Privacy in TabDDPM is addressed empirically and structurally:

  • Empirical Privacy: Privacy is quantified post hoc by DCR; TabDDPM generally achieves higher DCR than interpolation-based and baseline generative models. Membership inference risk (MIR) is also evaluated in some works (Ceritli et al., 2023).
  • Federated TabDDPM: FedTabDiff extends the TabDDPM protocol using a client-server federated learning architecture, aggregating model weights (not data) via weighted FedAvg update over R=1,000 communication rounds (Sattarov et al., 2024).
  • Locality Guarantees: No raw tabular records leave the client, preventing direct privacy violations.
  • Differential Privacy Extensions: Differentially private gradient perturbation, or additional output regularization, is suggested as a future enhancement but has not been integral to core published TabDDPMs (Kotelnikov et al., 2022, Ceritli et al., 2023).
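Per communication round, the weighted FedAvg aggregation used by FedTabDiff reduces to a dataset-size-weighted average of the clients' parameter vectors; a sketch over flattened parameters (the helper name is illustrative):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted federated average: each client's flattened parameter vector
    is weighted by its local dataset size; no raw records are shared."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()                  # normalized weights
    stacked = np.stack(client_weights)            # (n_clients, n_params)
    return (coeffs[:, None] * stacked).sum(axis=0)
```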

7. Extensions, Limitations, and Future Directions

Key extensions to the TabDDPM framework include:

  • Transformer-based Denoisers: Transformer modules as the denoising network (embedding, attention, encoder–decoder transformers) improve the capture of feature dependencies and unify imputation and generation tasks via dynamic masking and conditioning attention (Leung et al., 2024, Villaizán-Vallelado et al., 2024, Wen et al., 2024).
  • Class-Conditional Generation: TabDDPM supports explicit class conditioning, improving minority-class generation and balanced augmentation for imbalanced learning (Ceritli et al., 2023, B et al., 19 Jan 2026).
  • Architectural Enhancements: SEMST-ResNet architectures deliver improved noise removal and synthetic sample quality in imbalanced tasks (Zheng et al., 2024).
  • Unified Imputation/Generation: Dynamic masking in the transformer attention framework enables seamless switching between imputation and data synthesis modalities (Villaizán-Vallelado et al., 2024).
  • Limitations: Sampling is computationally intensive (hundreds to thousands of steps). Direct likelihood evaluation is in general intractable, with calibration assessed via PIT and distributional tests rather than explicit log-likelihoods (Leung et al., 2024). No TabDDPM to date implements formal DP guarantees by default.

Open directions include accelerated samplers (e.g., DDIM), exploiting advanced feature correlations (cross-attention, prior knowledge), integration of rigorous differential privacy during training, and explicit support for ordinal and hierarchical discrete features.


Tabular Denoising Diffusion Probabilistic Models constitute a rapidly evolving, empirically validated paradigm for modeling, completing, and synthesizing high-dimensional, mixed-type tabular datasets. Their design, combining robust diffusion processes, flexible denoising architectures, and support for conditional, federated, and imputation settings, positions TabDDPM at the forefront of state-of-the-art tabular generative modeling (Kotelnikov et al., 2022, Ceritli et al., 2023, Sattarov et al., 2024, Zheng et al., 2024, Leung et al., 2024, Villaizán-Vallelado et al., 2024, Wen et al., 2024, B et al., 19 Jan 2026).
