CtrTab: Diffusion Model for Tabular Data
- CtrTab is a diffusion-based generative model designed to synthesize high-dimensional, sparse tabular data using Laplace-noised control signals.
- It integrates a dual-branch architecture that fuses denoising and control modules to stabilize generation and mitigate the challenges of extreme data sparsity.
- CtrTab achieves over 80% reduction in downstream performance gaps while preserving privacy and accurately reproducing data structure.
CtrTab is a diffusion-based generative model for tabular data synthesis specifically designed to address the challenges posed by high-dimensional and severely data-sparse settings, where the number of features (D) far exceeds the number of samples (N), leaving the data distribution only sparsely supported. In this regime, common in domains such as genomics and healthcare, both classical statistical and modern deep generative models typically fail to produce synthetic tables that support effective downstream analysis. CtrTab introduces a unique condition-controlled diffusion architecture that injects Laplace-noised control signals to stabilize generation, regularize the model, and substantially reduce performance degradation associated with high dimensionality and limited data support (Li et al., 9 Mar 2025).
1. Challenges in High-Dimensional, Sparse Tabular Synthesis
Standard synthetic data generators, including both non-diffusion methods (statistical models like PrivBayes and DP-Synthesizer, as well as GAN-based models such as CTGAN and TVAE) and existing diffusion models (TabDDPM, TabSyn, RelDDPM), display significant limitations under extreme data sparsity. Statistical models impose strong independence assumptions that collapse in large D; GANs frequently suffer mode collapse or instability; and vanilla diffusion approaches degenerate as D increases, often performing worse than simple oversampling (e.g., SMOTE). In these settings, unconditional or classifier-guided diffusion models wander through low-density regions of the high-dimensional sample space, failing to remain anchored to the sparse, high-probability manifold indicated by scarce real data. The key missing component is structural control that guides the model toward regions consistent with the actual data distribution despite the paucity of samples.
2. CtrTab Architecture and Control Mechanism
CtrTab augments a standard diffusion denoiser with an explicit control module, introducing two parallel branches:
- The primary denoising network $\epsilon_\theta$, implemented by a U-Net or TabSyn-style autoencoder, predicts the added Gaussian noise at each timestep.
- The control module encodes a "noisy" version of the original input sample, defined as $C_f = x_0 + \eta$, where $\eta \sim \mathrm{Laplace}(0, b)$. This Laplace noise-perturbed sample acts as side information, steering the denoising process toward regions close to real data.
Both branches share a temporal embedding and fuse representations at two points: input fusion (adding time-embedded linear projections of $x_t$ and $C_f$) and control fusion (combining feature maps via zero-initialized convolutions). Conditioning on $x_0$ combined with Laplace noise is central, as it both enhances sample diversity and regularizes the model's response to noisy guidance.
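A minimal PyTorch sketch of this dual-branch fusion follows. It uses a small MLP as a stand-in for the U-Net/TabSyn backbone, and zero-initialized linear layers in place of zero-initialized convolutions (natural for flat tabular inputs); all layer names and sizes are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

def zero_linear(d_in, d_out):
    """Linear layer initialized to zero so the control pathway starts as a no-op."""
    layer = nn.Linear(d_in, d_out)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

class CtrTabSketch(nn.Module):
    """Dual-branch denoiser sketch: the main branch predicts the Gaussian noise;
    the control branch encodes the Laplace-perturbed record C_f and is fused in
    through zero-initialized projections at two points (input and mid-network)."""
    def __init__(self, d_feat, d_hidden=256):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_hidden))
        self.main_in = nn.Linear(d_feat, d_hidden)     # stands in for the backbone
        self.main_mid = nn.Sequential(nn.SiLU(), nn.Linear(d_hidden, d_hidden))
        self.main_out = nn.Linear(d_hidden, d_feat)
        self.ctrl_in = nn.Linear(d_feat, d_hidden)     # control-branch encoder
        self.ctrl_mid = nn.Sequential(nn.SiLU(), nn.Linear(d_hidden, d_hidden))
        self.fuse_in = zero_linear(d_hidden, d_hidden)   # input fusion
        self.fuse_mid = zero_linear(d_hidden, d_hidden)  # control fusion

    def forward(self, x_t, t, c_f):
        temb = self.time_embed(t.float().view(-1, 1))    # shared temporal embedding
        h = self.main_in(x_t) + temb
        c = self.ctrl_in(c_f) + temb
        h = h + self.fuse_in(c)                                   # input fusion
        h = self.main_mid(h) + self.fuse_mid(self.ctrl_mid(c))    # control fusion
        return self.main_out(h)                                   # predicted noise
```

Because both fusion projections start at zero, a pretrained denoiser's output is initially unchanged when control training begins, and the control signal's influence is learned gradually.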
3. Diffusion Model Formulation
CtrTab adopts the standard discrete-time DDPM formulation, with options for continuous-time SDE-based modeling. In the discrete-time setup:
- Forward process: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$; equivalently, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
- Reverse process with control: $p_\theta(x_{t-1} \mid x_t, C_f) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, C_f),\ \sigma_t^2 I\big)$.
- Training objective: $\mathcal{L} = \mathbb{E}_{x_0, t, \epsilon, \eta}\big[\,\|\epsilon - \epsilon_\theta(x_t, t, C_f)\|_2^2\,\big]$, where $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and $C_f = x_0 + \eta$.
The core theoretical result is that training with zero-mean Laplace noise in $C_f$ of variance $2b^2$ is equivalent to adding a Tikhonov (L₂) regularization term on the denoiser's Jacobian with respect to the control input, penalizing sensitivity to the control signal and thus reducing overfitting.
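The $2b^2$ variance figure, which sets the regularization strength, follows directly from the Laplace density:

$$\operatorname{Var}(\eta) = \int_{-\infty}^{\infty} \eta^2\,\frac{1}{2b}\,e^{-|\eta|/b}\,d\eta = \frac{1}{b}\int_{0}^{\infty} \eta^2\,e^{-\eta/b}\,d\eta = \frac{1}{b}\cdot 2b^3 = 2b^2.$$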
4. Training and Sampling Procedures
CtrTab training involves two phases:
- Denoiser Pretraining (optional):
- Train the denoising network $\epsilon_\theta$ unconditionally until convergence, then freeze its weights.
- Control Module Training:
- For each batch, sample $x_0$, a timestep $t \sim \mathcal{U}\{1, \dots, T\}$, noise $\epsilon \sim \mathcal{N}(0, I)$, and Laplace noise $\eta \sim \mathrm{Laplace}(0, b)$.
- Compute $x_t$ and $C_f$ as above.
- Pass $x_t$ through the main encoder and $C_f$ (plus $t$) through the control branch, fuse the features, and predict $\hat{\epsilon}$.
- Minimize $\|\epsilon - \hat{\epsilon}\|_2^2$ with respect to the control module only.
Pseudocode segments are given below:
```python
# Training loop: only the control module's parameters are updated.
for step in range(1, S + 1):
    x0 = sample_training_data()
    t = sample_uniform(1, T)
    epsilon = sample_normal(0, 1)                 # Gaussian diffusion noise
    eta = sample_laplace(0, b)                    # Laplace control noise
    xt = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * epsilon
    Cf = x0 + eta                                 # noisy control signal
    epsilon_hat = CtrTabDenoiser(xt, t, control=Cf)
    loss = ((epsilon - epsilon_hat) ** 2).sum()
    backpropagate(loss)                           # update control module only
```
```python
# Sampling loop: DDPM reverse process guided by the control signal Cf.
xt = sample_normal(0, 1)                          # x_T ~ N(0, I)
Cf = choose_control()                             # e.g., a Laplace-noised real record
for t in reversed(range(1, T + 1)):
    epsilon_hat = CtrTabDenoiser(xt, t, control=Cf)
    z = sample_normal(0, 1) if t > 1 else 0       # no noise added at the final step
    xt = (xt - beta[t] / sqrt(1 - alpha_bar[t]) * epsilon_hat) / sqrt(alpha[t]) \
         + sqrt(beta[t]) * z
return xt                                         # the synthetic sample x_0
```
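The `choose Cf` step above is left open; one plausible choice, consistent with the training-time construction $C_f = x_0 + \eta$, is to Laplace-perturb a randomly drawn real record. The sketch below assumes `train_data` and the scale `b` are available, and is illustrative rather than the paper's exact procedure:

```python
import numpy as np

# Hypothetical control-signal selection for sampling, mirroring training's
# C_f = x0 + eta with eta ~ Laplace(0, b). All names here are illustrative.
def choose_control(train_data, b, rng=None):
    rng = rng or np.random.default_rng()
    x0 = train_data[rng.integers(len(train_data))]       # random real record
    eta = rng.laplace(loc=0.0, scale=b, size=x0.shape)   # Laplace perturbation
    return x0 + eta
```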
The architecture builds on TabSyn’s score-based U-Net; AdamW optimizer settings, Laplace scale tuning, and training regimes are tailored to data characteristics, with the Laplace scale $b$ tuned per dataset.
5. Experimental Results and Comparative Analysis
CtrTab was evaluated on seven benchmark datasets with feature dimensionality ranging from 32 to 241 and between 2.6K and 7.9K training samples, spanning a wide range of data sparsity levels. Downstream performance was assessed using XGBoost classifiers/regressors trained on synthetic tables, with AUC and F1 for classification and RMSE for regression.
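A minimal sketch of this train-on-synthetic, test-on-real protocol for the classification case, assuming `xgboost` and `scikit-learn`; function and variable names are illustrative:

```python
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def utility_gap(X_real, y_real, X_syn, y_syn, X_test, y_test):
    """AUC gap between a model trained on real data and one trained on
    synthetic data, both evaluated on the same held-out real test set."""
    auc_real = roc_auc_score(
        y_test, XGBClassifier().fit(X_real, y_real).predict_proba(X_test)[:, 1])
    auc_syn = roc_auc_score(
        y_test, XGBClassifier().fit(X_syn, y_syn).predict_proba(X_test)[:, 1])
    return auc_real - auc_syn   # smaller gap => more useful synthetic data
```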
A summary of comparative findings:
| Model | Avg. Classification Gap | Avg. Regression Gap | Privacy (DCR) |
|---|---|---|---|
| SMOTE | ~3% | Hundreds of % | - |
| TabSyn | ~15% | Hundreds of % | - |
| CtrTab | <1% | ~0.5% | ~0.47 |
CtrTab reduces the relative accuracy gap to real data by over 80% compared to the best prior baselines. Privacy, as measured by normalized Distance-to-Closest-Record (DCR), approaches 0.47 (ideal is 0.50), indicating no excessive memorization. Ablation studies show that simple data augmentation, larger models, or doubled training epochs provide only marginal gains compared to the dedicated control module. Laplace noise delivers superior data utility compared to Gaussian or uniform alternatives at comparable DCR levels. Performance is robust across hyperparameter sweeps of the Laplace scale up to a threshold, after which degradation occurs. Attaching the control module to RelDDPM confirms the approach’s transferability.
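Normalized DCR is computed in slightly different ways across papers; one common variant, sketched below as an assumption, scores the fraction of synthetic rows whose nearest real record lies in the training split rather than in a disjoint holdout split, so that 0.50 indicates no memorization advantage:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_score(X_syn, X_train, X_holdout):
    """Fraction of synthetic rows closer to the training split than to the
    holdout split; values near 0.5 suggest no excessive memorization."""
    d_train = NearestNeighbors(n_neighbors=1).fit(X_train).kneighbors(X_syn)[0]
    d_hold = NearestNeighbors(n_neighbors=1).fit(X_holdout).kneighbors(X_syn)[0]
    return float(np.mean(d_train < d_hold))
```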
Qualitative analysis (e.g., heat maps of column correlations) demonstrates that CtrTab more accurately reproduces structural patterns present in the original datasets.
6. Theoretical Properties and Regularization Effects
The addition of Laplace-noised control signals is analytically shown to be equivalent to incorporating an explicit L₂ regularization term on the denoising network's Jacobian with respect to the control input, as described by:

$$\mathbb{E}_{\eta}\big[\|\epsilon - \epsilon_\theta(x_t, t, x_0 + \eta)\|_2^2\big] \;\approx\; \|\epsilon - \epsilon_\theta(x_t, t, x_0)\|_2^2 \;+\; 2b^2\,\big\|\nabla_{C}\,\epsilon_\theta(x_t, t, C)\big|_{C = x_0}\big\|_F^2.$$
This interpretation clarifies how CtrTab discourages overfitting and enhances robustness to spurious variations in high-dimensional data, a property supported by both theoretical proof and empirical results.
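As a quick Monte Carlo sanity check of this equivalence, consider a linear "denoiser" $f(C) = WC$, for which the expected noise-injection penalty is exactly $2b^2\|W\|_F^2$ (a toy illustration, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
D, b, n_mc = 8, 0.3, 200_000
W = rng.normal(size=(D, D))                    # linear "denoiser" f(C) = W @ C
x, y = rng.normal(size=D), rng.normal(size=D)  # control input and target

base = np.sum((y - W @ x) ** 2)                # noise-free loss
eta = rng.laplace(0.0, b, size=(n_mc, D))      # Laplace control noise
noisy = np.mean(np.sum((y - (x + eta) @ W.T) ** 2, axis=1))

print(noisy - base)                            # Monte Carlo penalty estimate
print(2 * b**2 * np.sum(W**2))                 # analytic term: 2 b^2 ||W||_F^2
```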
7. Limitations, Generalizations, and Future Directions
Potential limitations of CtrTab include the need for task-specific tuning of the Laplace scale and the increased computational load of the control branch. Prospective research directions include dynamic or adaptive control-noise schemes, alternative forms of regularization (such as adversarial perturbations), and the integration of domain-specific side information, such as enforced logical constraints. The explicit control-module construction may also extend to other modalities, including image, time-series, or graph diffusion models, particularly in settings where available data are limited but some guiding signal is accessible (Li et al., 9 Mar 2025).