CtrTab: Diffusion Model for Tabular Data
- CtrTab is a diffusion-based generative model designed to synthesize high-dimensional, sparse tabular data using Laplace-noised control signals.
- It integrates a dual-branch architecture that fuses denoising and control modules to stabilize generation and mitigate the challenges of extreme data sparsity.
- CtrTab achieves over 80% reduction in downstream performance gaps while preserving privacy and accurately reproducing data structure.
CtrTab is a diffusion-based generative model for tabular data synthesis specifically designed to address the challenges posed by high-dimensional and severely data-sparse settings, where the number of features (D) far exceeds the number of samples (N), leaving the data distribution only sparsely supported. In this regime, common in domains such as genomics and healthcare, both classical statistical and modern deep generative models typically fail to produce synthetic tables that support effective downstream analysis. CtrTab introduces a unique condition-controlled diffusion architecture that injects Laplace-noised control signals to stabilize generation, regularize the model, and substantially reduce performance degradation associated with high dimensionality and limited data support (Li et al., 9 Mar 2025).
1. Challenges in High-Dimensional, Sparse Tabular Synthesis
Standard synthetic data generators, including both non-diffusion methods (statistical models like PrivBayes and DP-Synthesizer, as well as GAN-based models such as CTGAN and TVAE) and existing diffusion models (TabDDPM, TabSyn, RelDDPM), display significant limitations under extreme data sparsity. Statistical models impose strong independence assumptions that collapse in large D; GANs frequently suffer mode collapse or instability; and vanilla diffusion approaches degenerate as D increases, often performing worse than simple oversampling (e.g., SMOTE). In these settings, unconditional or classifier-guided diffusion models wander through low-density regions of the high-dimensional sample space, failing to remain anchored to the sparse, high-probability manifold indicated by scarce real data. The key missing component is structural control that guides the model toward regions consistent with the actual data distribution despite the paucity of samples.
2. CtrTab Architecture and Control Mechanism
CtrTab augments a standard diffusion denoiser with an explicit control module, introducing two parallel branches:
- The primary denoising network $\epsilon_\theta$, implemented by a U-Net or TabSyn-style autoencoder, predicts the added Gaussian noise at each timestep.
- The control module encodes a "noisy" version of the original input sample, defined as $C_f = x_0 + \eta$, where $\eta \sim \mathrm{Laplace}(0, b)$. This Laplace noise-perturbed sample acts as side information, steering the denoising process toward regions close to real data.
Both branches share a temporal embedding and fuse representations at two points: input fusion (adding time-embedded linear projections of $x_t$ and $C_f$) and control fusion (combining feature maps via zero-initialized convolutions). Conditioning on $x_0$ combined with Laplace noise is central, as it both enhances sample diversity and regularizes the model's response to noisy guidance.
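A minimal PyTorch sketch of this dual-branch fusion follows. It uses a small MLP as a stand-in for the U-Net/TabSyn backbone, and zero-initialized linear layers in place of zero-initialized convolutions (natural for flat tabular inputs); all layer names and sizes are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

def zero_linear(d_in, d_out):
    """Linear layer initialized to zero so the control pathway starts as a no-op."""
    layer = nn.Linear(d_in, d_out)
    nn.init.zeros_(layer.weight)
    nn.init.zeros_(layer.bias)
    return layer

class CtrTabSketch(nn.Module):
    """Dual-branch denoiser sketch: the main branch predicts the Gaussian noise;
    the control branch encodes the Laplace-perturbed record C_f and is fused in
    through zero-initialized projections at two points (input and mid-network)."""
    def __init__(self, d_feat, d_hidden=256):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, d_hidden), nn.SiLU(), nn.Linear(d_hidden, d_hidden))
        self.main_in = nn.Linear(d_feat, d_hidden)     # stands in for the backbone
        self.main_mid = nn.Sequential(nn.SiLU(), nn.Linear(d_hidden, d_hidden))
        self.main_out = nn.Linear(d_hidden, d_feat)
        self.ctrl_in = nn.Linear(d_feat, d_hidden)     # control-branch encoder
        self.ctrl_mid = nn.Sequential(nn.SiLU(), nn.Linear(d_hidden, d_hidden))
        self.fuse_in = zero_linear(d_hidden, d_hidden)   # input fusion
        self.fuse_mid = zero_linear(d_hidden, d_hidden)  # control fusion

    def forward(self, x_t, t, c_f):
        temb = self.time_embed(t.float().view(-1, 1))    # shared temporal embedding
        h = self.main_in(x_t) + temb
        c = self.ctrl_in(c_f) + temb
        h = h + self.fuse_in(c)                                   # input fusion
        h = self.main_mid(h) + self.fuse_mid(self.ctrl_mid(c))    # control fusion
        return self.main_out(h)                                   # predicted noise
```

Because both fusion projections start at zero, a pretrained denoiser's output is initially unchanged when control training begins, and the control signal's influence is learned gradually.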
3. Diffusion Model Formulation
CtrTab adopts the standard discrete-time DDPM formulation, with options for continuous-time SDE-based modeling. In the discrete-time setup:
- Forward process: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$; equivalently, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$.
- Reverse process with control: $p_\theta(x_{t-1} \mid x_t, C_f) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t, C_f),\ \sigma_t^2 I\big)$.
- Training objective: $\mathcal{L} = \mathbb{E}_{x_0, t, \epsilon, \eta}\big[\,\|\epsilon - \epsilon_\theta(x_t, t, C_f)\|_2^2\,\big]$, where $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and $C_f = x_0 + \eta$.
The core theoretical result is that training with zero-mean Laplace noise in $C_f$ of variance $2b^2$ is equivalent to adding a Tikhonov (L₂) regularization term on the denoiser's Jacobian with respect to the control input, penalizing sensitivity to the control signal and thus reducing overfitting.
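The $2b^2$ variance figure, which sets the regularization strength, follows directly from the Laplace density:

$$\operatorname{Var}(\eta) = \int_{-\infty}^{\infty} \eta^2\,\frac{1}{2b}\,e^{-|\eta|/b}\,d\eta = \frac{1}{b}\int_{0}^{\infty} \eta^2\,e^{-\eta/b}\,d\eta = \frac{1}{b}\cdot 2b^3 = 2b^2.$$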
4. Training and Sampling Procedures
CtrTab training involves two phases:
- Denoiser Pretraining (optional):
- Train the denoising network $\epsilon_\theta$ unconditionally until convergence, then freeze its weights.
- Control Module Training:
- For each batch, sample $x_0$, a timestep $t \sim \mathcal{U}\{1, \dots, T\}$, noise $\epsilon \sim \mathcal{N}(0, I)$, and Laplace noise $\eta \sim \mathrm{Laplace}(0, b)$.
- Compute $x_t$ and $C_f$ as above.
- Pass $x_t$ through the main encoder and $C_f$ (plus $t$) through the control branch, fuse the features, and predict $\hat{\epsilon}$.
- Minimize $\|\epsilon - \hat{\epsilon}\|_2^2$ with respect to the control module only.
Pseudocode segments are given below:
```python
# Training loop: only the control module's parameters are updated.
for step in range(1, S + 1):
    x0 = sample_training_data()
    t = sample_uniform(1, T)
    epsilon = sample_normal(0, 1)                 # Gaussian diffusion noise
    eta = sample_laplace(0, b)                    # Laplace control noise
    xt = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * epsilon
    Cf = x0 + eta                                 # noisy control signal
    epsilon_hat = CtrTabDenoiser(xt, t, control=Cf)
    loss = ((epsilon - epsilon_hat) ** 2).sum()
    backpropagate(loss)                           # update control module only
```
```python
# Sampling loop: DDPM reverse process guided by the control signal Cf.
xt = sample_normal(0, 1)                          # x_T ~ N(0, I)
Cf = choose_control()                             # e.g., a Laplace-noised real record
for t in reversed(range(1, T + 1)):
    epsilon_hat = CtrTabDenoiser(xt, t, control=Cf)
    z = sample_normal(0, 1) if t > 1 else 0       # no noise added at the final step
    xt = (xt - beta[t] / sqrt(1 - alpha_bar[t]) * epsilon_hat) / sqrt(alpha[t]) \
         + sqrt(beta[t]) * z
return xt                                         # the synthetic sample x_0
```
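The `choose Cf` step above is left open; one plausible choice, consistent with the training-time construction $C_f = x_0 + \eta$, is to Laplace-perturb a randomly drawn real record. The sketch below assumes `train_data` and the scale `b` are available, and is illustrative rather than the paper's exact procedure:

```python
import numpy as np

# Hypothetical control-signal selection for sampling, mirroring training's
# C_f = x0 + eta with eta ~ Laplace(0, b). All names here are illustrative.
def choose_control(train_data, b, rng=None):
    rng = rng or np.random.default_rng()
    x0 = train_data[rng.integers(len(train_data))]       # random real record
    eta = rng.laplace(loc=0.0, scale=b, size=x0.shape)   # Laplace perturbation
    return x0 + eta
```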
The architecture builds on TabSyn’s score-based U-Net; AdamW optimizer settings, Laplace scale tuning, and training regimes are tailored to data characteristics, with the Laplace scale $b$ tuned per dataset.
5. Experimental Results and Comparative Analysis
CtrTab was evaluated on seven benchmark datasets with feature dimensionality ranging from 32 to 241 and between 2.6K and 7.9K training samples, spanning a wide range of data sparsity levels. Downstream performance was assessed using XGBoost classifiers/regressors trained on synthetic tables, with AUC and F1 for classification and RMSE for regression.
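A minimal sketch of this train-on-synthetic, test-on-real protocol for the classification case, assuming `xgboost` and `scikit-learn`; function and variable names are illustrative:

```python
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def utility_gap(X_real, y_real, X_syn, y_syn, X_test, y_test):
    """AUC gap between a model trained on real data and one trained on
    synthetic data, both evaluated on the same held-out real test set."""
    auc_real = roc_auc_score(
        y_test, XGBClassifier().fit(X_real, y_real).predict_proba(X_test)[:, 1])
    auc_syn = roc_auc_score(
        y_test, XGBClassifier().fit(X_syn, y_syn).predict_proba(X_test)[:, 1])
    return auc_real - auc_syn   # smaller gap => more useful synthetic data
```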
A summary of comparative findings:
| Model | Avg. Classification Gap | Avg. Regression Gap | Privacy (DCR) |
|---|---|---|---|
| SMOTE | ~3% | Hundreds of % | - |
| TabSyn | ~15% | Hundreds of % | - |
| CtrTab | <1% | ~0.5% | ~0.47 |
CtrTab reduces the relative accuracy gap to real data by over 80% compared to the best prior baselines. Privacy, as measured by normalized Distance-to-Closest-Record (DCR), approaches 0.47 (ideal is 0.50), indicating no excessive memorization. Ablation studies show that simple data augmentation, larger models, or doubled training epochs provide only marginal gains compared to the dedicated control module. Laplace noise delivers superior data utility compared to Gaussian or uniform alternatives at comparable DCR levels. Performance is robust across hyperparameter sweeps of the Laplace scale up to a threshold, after which degradation occurs. Attaching the control module to RelDDPM confirms the approach’s transferability.
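Normalized DCR is computed in slightly different ways across papers; one common variant, sketched below as an assumption, scores the fraction of synthetic rows whose nearest real record lies in the training split rather than in a disjoint holdout split, so that 0.50 indicates no memorization advantage:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def dcr_score(X_syn, X_train, X_holdout):
    """Fraction of synthetic rows closer to the training split than to the
    holdout split; values near 0.5 suggest no excessive memorization."""
    d_train = NearestNeighbors(n_neighbors=1).fit(X_train).kneighbors(X_syn)[0]
    d_hold = NearestNeighbors(n_neighbors=1).fit(X_holdout).kneighbors(X_syn)[0]
    return float(np.mean(d_train < d_hold))
```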
Qualitative analysis (e.g., heat maps of column correlations) demonstrates that CtrTab more accurately reproduces structural patterns present in the original datasets.
6. Theoretical Properties and Regularization Effects
The addition of Laplace-noised control signals is analytically shown to be equivalent to incorporating an explicit L₂ regularization term on the denoising network's Jacobian with respect to the control input, as described by:

$$\mathbb{E}_{\eta}\big[\|\epsilon - \epsilon_\theta(x_t, t, x_0 + \eta)\|_2^2\big] \;\approx\; \|\epsilon - \epsilon_\theta(x_t, t, x_0)\|_2^2 \;+\; 2b^2\,\big\|\nabla_{C}\,\epsilon_\theta(x_t, t, C)\big|_{C = x_0}\big\|_F^2.$$
This interpretation clarifies how CtrTab discourages overfitting and enhances robustness to spurious variations in high-dimensional data, a property supported by both theoretical proof and empirical results.
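As a quick Monte Carlo sanity check of this equivalence, consider a linear "denoiser" $f(C) = WC$, for which the expected noise-injection penalty is exactly $2b^2\|W\|_F^2$ (a toy illustration, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
D, b, n_mc = 8, 0.3, 200_000
W = rng.normal(size=(D, D))                    # linear "denoiser" f(C) = W @ C
x, y = rng.normal(size=D), rng.normal(size=D)  # control input and target

base = np.sum((y - W @ x) ** 2)                # noise-free loss
eta = rng.laplace(0.0, b, size=(n_mc, D))      # Laplace control noise
noisy = np.mean(np.sum((y - (x + eta) @ W.T) ** 2, axis=1))

print(noisy - base)                            # Monte Carlo penalty estimate
print(2 * b**2 * np.sum(W**2))                 # analytic term: 2 b^2 ||W||_F^2
```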
7. Limitations, Generalizations, and Future Directions
Potential limitations of CtrTab include the need for task-specific tuning of the Laplace scale and the increased computational load of the control branch. Prospective research directions include dynamic or adaptive control-noise schemes, alternative forms of regularization (such as adversarial perturbations), and the integration of domain-specific side information, such as enforced logical constraints. The explicit control-module construction may also extend to other modalities, including image, time-series, or graph diffusion models, particularly in settings where available data are limited but some guiding signal is accessible (Li et al., 9 Mar 2025).