
Differentially Private Synthetic Data

Updated 5 December 2025
  • Differentially private synthetic data is a set of techniques that generate artificial datasets mimicking original data properties while enforcing strict privacy via parameters like ε and δ.
  • Recent methodologies include marginal-based approaches, GANs, and optimization techniques that integrate noise to preserve statistical utility.
  • Empirical evaluations highlight trade-offs among privacy, accuracy, and fairness, guiding effective deployment in high-dimensional and relational data scenarios.

Differentially private synthetic data refers to algorithms and mechanisms for generating artificial datasets that emulate the statistical properties of the original, sensitive dataset while guaranteeing formal privacy protections under the differential privacy (DP) framework. These mechanisms enable analysts to perform arbitrary downstream processing and modeling without direct exposure to the original data, ensuring that no individual’s data can be reverse-engineered or re-identified beyond a quantifiable risk threshold defined by parameters ε (privacy budget) and δ (failure probability). Differentially private synthetic data methods span a range of algorithmic paradigms, including output perturbation, graphical-model postprocessing, generative adversarial networks, adaptive selection of measurement queries, and optimization- or sampling-based approaches. Recent research delineates advances in distributional fidelity, high-dimensional scalability, statistical inference guarantees, fairness, and practical deployment.

1. Foundational Principles of Differential Privacy in Synthetic Data

Differential privacy formalizes privacy risk by ensuring that the output distribution of any release mechanism, including those generating synthetic datasets, changes only minimally with the inclusion or removal of any single individual from the input data. An algorithm M is said to be (ε, δ)-differentially private if, for all neighboring datasets D and D′ (differing in one record) and all output sets S, the bound

\Pr[M(D)\in S] \le e^{\epsilon}\,\Pr[M(D')\in S] + \delta

holds (Snoke et al., 2018).
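
As a concrete illustration of the definition, the following is a minimal sketch (ours, not from the cited work) of the Laplace mechanism releasing a counting query under (ε, 0)-DP. A count changes by at most 1 when a single record is added or removed, so Laplace noise of scale 1/ε suffices.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(data, predicate, epsilon):
    """Release a counting query under (epsilon, 0)-DP via the Laplace mechanism.

    A count changes by at most 1 when one record is added or removed
    (sensitivity = 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(predicate(row) for row in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)

# Example: privately count records with age >= 40 under epsilon = 0.5.
ages = [23, 45, 31, 62, 58, 39, 47]
print(laplace_count(ages, lambda a: a >= 40, epsilon=0.5))  # true count 4 + noise
```

By post-processing immunity, rounding or clamping the noisy output to [0, n] costs no additional privacy budget.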

Synthetic data generation under DP is more challenging than mere query release, as it must capture the multivariate dependence structure of the data while propagating privacy-preserving noise or randomization throughout the synthesized records. Achieving high analytical utility under these constraints is a core driver of advances in this area.

Key features distinguishing differentially private synthetic data mechanisms include:

  • Global Indistinguishability: The synthetic dataset, taken as a whole, must carry the same (ε, δ) guarantee as a single-query release mechanism.
  • Post-processing Immunity: Any function or transformation applied to the synthetic data cannot worsen the privacy guarantee.
  • Trade-off Navigation: There is an inherent trade-off between privacy (tighter ε, δ) and statistical fidelity or utility.

2. Algorithmic Methodologies

Differentially private synthetic data methods are implemented along several broad paradigms:

2.1 Marginal-based/Graphical Model Approaches

These mechanisms focus on preserving a collection of low-dimensional marginals, from which a graphical model or multiway histogram is fitted. The general approach, as instantiated in winning mechanisms of the NIST Differential Privacy Synthetic Data Challenge, proceeds through:

  1. Selection of Marginals: Using domain knowledge or private selection (via the Exponential Mechanism) to identify which k-way marginals to preserve.
  2. Private Measurement: Applying the Laplace or Gaussian mechanism to each selected marginal, allocating privacy budget via (sequential or parallel) composition rules.
  3. Reconstruction: Solving maximum-likelihood or maximum-entropy optimization to find a graphical model whose marginals best match those noisy measurements. Synthetic records are then sampled from this constructed model (McKenna et al., 2021, Bowen et al., 2019, Zhang et al., 2020).

This class includes methods such as MWEM+PGM, MST (maximum-spanning-tree-based marginal selection), PrivBayes (privatized Bayesian networks), and PrivSyn (automatic dense marginal selection with GUM-based dataset massaging).
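
As a deliberately simplified instance of this measure-then-reconstruct pipeline, the sketch below fixes the graphical model to independent attributes (no edges), privatizes each one-way marginal with the Laplace mechanism, and samples records from the product of the noisy marginals. Function names and the budget split are illustrative assumptions; real mechanisms such as Private-PGM fit richer dependence structures.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_synthetic_independent(columns, epsilon_total, n_synth):
    """Toy measure-then-reconstruct pipeline over categorical columns.

    columns: dict mapping column name -> list of categorical values.
    Each record appears in every one-way marginal, so the budget is
    split across columns by sequential composition.
    """
    eps_col = epsilon_total / len(columns)
    synth = {}
    for name, values in columns.items():
        domain, counts = np.unique(np.asarray(values), return_counts=True)
        noisy = counts + rng.laplace(scale=1.0 / eps_col, size=len(counts))
        probs = np.clip(noisy, 0, None)  # post-processing: clamp to >= 0
        probs = (probs / probs.sum() if probs.sum() > 0
                 else np.full(len(domain), 1.0 / len(domain)))
        synth[name] = rng.choice(domain, size=n_synth, p=probs)
    return synth

data = {"sex": ["F", "M", "F", "F", "M"], "smoker": ["n", "y", "n", "n", "y"]}
print(dp_synthetic_independent(data, epsilon_total=1.0, n_synth=8))
```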

2.2 Exponential Mechanism with Distributional Utility

Mechanisms such as the pMSE Mechanism directly encode a distributional similarity metric—specifically the propensity-mean-squared-error (pMSE), quantifying distinguishability of synthetic from original data by a classifier—into the Exponential Mechanism’s utility function. This yields synthetic datasets that, subject to DP, are as indistinguishable from the original data as possible in the specified metric (Snoke et al., 2018). The workflow involves:

  • Calculating expected pMSE for candidate parameters of a generative model,
  • Calibrating the sensitivity for privacy costs,
  • Sampling parameter settings using the Exponential Mechanism,
  • Synthesizing datasets via draws from the selected parameterization.
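
The sketch below implements only the scoring step: computing the pMSE with a logistic-regression propensity classifier. Feeding its expectation into the Exponential Mechanism's utility, as the pMSE Mechanism does, is omitted, and the function name is ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmse(real, synth):
    """Propensity-score mean-squared error, a distinguishability score.

    Stack real and synthetic rows, fit a classifier to predict which is
    which, and measure how far the propensities stray from the synthetic
    fraction c. pMSE near 0 means the datasets are indistinguishable.
    """
    X = np.vstack([real, synth])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synth))])
    c = len(synth) / len(X)
    p = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]
    return float(np.mean((p - c) ** 2))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
good = rng.normal(size=(500, 3))          # same distribution -> pMSE near 0
bad = rng.normal(loc=2.0, size=(500, 3))  # shifted -> much larger pMSE
print(pmse(real, good), pmse(real, bad))
```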

2.3 Sampling and Optimization-based Constructions

Recent frameworks cast the DP synthetic data generation task as an optimization over a reduced (e.g., subsampled or histogram) domain: statistics of the original data are measured privately on that domain, and synthetic records are then optimized or sampled to match them.

Variants handle both discrete and continuous data domains, with density estimation (histogram or kernel-based) performed in a private manner upstream of the optimization (Bojkovic et al., 6 May 2024).
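
A minimal sketch of the upstream private density-estimation step in the histogram variant, assuming one-dimensional data on a known interval; the downstream optimization of (Bojkovic et al., 6 May 2024) is not shown, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_histogram_sampler(x, epsilon, bins=20, lo=0.0, hi=1.0):
    """Privatize a histogram density estimate on [lo, hi], then return a
    sampler that draws synthetic points uniformly within chosen bins.

    One record affects exactly one bin count, so histogram sensitivity is 1.
    """
    counts, edges = np.histogram(x, bins=bins, range=(lo, hi))
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=bins)
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum()

    def sample(n):
        idx = rng.choice(bins, size=n, p=probs)
        return rng.uniform(edges[idx], edges[idx + 1])  # uniform within bin

    return sample

x = rng.beta(2, 5, size=2000)
sample = dp_histogram_sampler(x, epsilon=1.0)
print(sample(5))
```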

2.4 Partitioning and Space-Partition-based Methods

Hierarchical space partitioning (e.g., private binary trees, KD-trees) is applied to continuous or high-dimensional data. Counts in hierarchical partitions are privatized, and synthetic points are sampled accordingly, achieving near-optimal accuracy rates under the 1-Wasserstein distance (He et al., 2023, Kreačić et al., 2023).
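
The sketch below illustrates the binary-tree idea in one dimension on [0, 1]. For clarity it re-noises the counts along the visited path on every draw; a real mechanism privatizes the whole tree once and reuses it, and the even per-level budget split here is our simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_tree_sample(x, epsilon, depth=4):
    """Draw one synthetic point by walking noisy counts down a binary
    partition of [0, 1]. Budget is split evenly across the levels, since
    each record contributes to one count per level.
    """
    eps_level = epsilon / depth
    lo, hi = 0.0, 1.0
    for _ in range(depth):
        mid = (lo + hi) / 2
        left = np.sum((x >= lo) & (x < mid)) + rng.laplace(scale=1.0 / eps_level)
        right = np.sum((x >= mid) & (x < hi)) + rng.laplace(scale=1.0 / eps_level)
        l, r = max(left, 0.0), max(right, 0.0)
        p_left = l / (l + r) if (l + r) > 0 else 0.5
        lo, hi = (lo, mid) if rng.random() < p_left else (mid, hi)
    return rng.uniform(lo, hi)  # uniform within the final cell

x = rng.beta(2, 5, size=1000)
print([round(dp_tree_sample(x, epsilon=1.0), 3) for _ in range(5)])
```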

2.5 Generative Adversarial and Deep Generative Models

Differential privacy is enforced through gradient perturbation (DP-SGD) or teacher aggregation (PATE) on the discriminator of a GAN. The generator produces synthetic data, while the DP guarantee is tracked using privacy accountants. Conditional models (DP-CGAN, DP-CTGAN) and ensemble augmentations (QUAIL) also appear (Rosenblatt et al., 2020, Torkzadehmahani et al., 2020).
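
A generic DP-SGD update for a logistic-regression discriminator, sketching the gradient-perturbation route: clip each per-example gradient, sum, and add Gaussian noise scaled to the clip bound. The full GAN training loop and the privacy accountant that converts the noise multiplier into (ε, δ) are omitted, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.1):
    """One DP-SGD update on logistic loss with per-example clipping."""
    p = 1.0 / (1.0 + np.exp(-X @ w))                  # sigmoid(Xw)
    grads = (p - y)[:, None] * X                      # per-example gradients
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip)     # clip each row to L2 <= clip
    noisy_sum = grads.sum(axis=0) + rng.normal(scale=noise_mult * clip, size=w.shape)
    return w - lr * noisy_sum / len(X)

X = rng.normal(size=(64, 5))
y = rng.integers(0, 2, size=64).astype(float)
w = np.zeros(5)
for _ in range(10):                                   # a few noisy updates
    w = dp_sgd_step(w, X, y)
print(w)
```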

2.6 Focused Synthesis for Risk Control

Mechanisms like ε-PrivateSMOTE restrict privacy-preserving synthetic generation to high-reidentification-risk cases (e.g., records whose quasi-identifier values form singletons or rare pairs), synthesizing only the at-risk rows via neighbor-based, Laplace-noised interpolation. The remainder of the dataset remains intact, preserving high utility while efficiently reducing linkage risk (Carvalho et al., 2022).
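
A rough sketch of the neighbor-based, noise-added interpolation idea, with illustrative names and parameters; the published ε-PrivateSMOTE differs in detail (e.g., in how at-risk rows are flagged and how noise is calibrated).

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_interpolate_rows(X, at_risk_idx, epsilon, k=3):
    """Replace only flagged high-risk rows with Laplace-noised
    interpolations toward one of their k nearest neighbors; all
    remaining rows are left untouched.
    """
    X_out = X.copy()
    for i in at_risk_idx:
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]        # k nearest neighbors, excluding self
        j = rng.choice(nn)
        t = rng.uniform()                  # random interpolation point
        new = X[i] + t * (X[j] - X[i])
        X_out[i] = new + rng.laplace(scale=1.0 / epsilon, size=X.shape[1])
    return X_out

X = rng.normal(size=(10, 3))
print(noisy_interpolate_rows(X, at_risk_idx=[0, 7], epsilon=1.0))
```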

3. Statistical Guarantees and Empirical Utility

Mechanisms are typically assessed on:

  • Distributional Fidelity: Measured by metrics such as pMSE, MMD, or ℓ₁ error of k-way marginals (see the sketch after this list).
  • Downstream Predictive Utility: TSTR (Train Synthetic, Test Real) AUC or F1 for supervised learning.
  • Statistical Inference Validity: Confidence interval overlap, Type I/II error calibration for hypothesis tests on synthetic data (Perez et al., 20 Mar 2024).
  • Fairness Preservation: Quantification of demographic parity, equalized odds, and subgroup accuracy gaps in downstream models (Bullwinkel et al., 2022, Pereira et al., 2023).
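
As an example of the first metric family, a small helper (ours, not from the cited papers) that computes the ℓ₁ error of a one-way marginal; the k-way extension bins tuples of columns instead of single values.

```python
import numpy as np

def marginal_l1_error(real_col, synth_col):
    """L1 distance between real and synthetic one-way marginals
    (twice the total-variation distance over the shared domain).
    """
    real_col, synth_col = np.asarray(real_col), np.asarray(synth_col)
    domain = np.union1d(np.unique(real_col), np.unique(synth_col))
    dist = lambda col: np.array([(col == v).mean() for v in domain])
    return float(np.abs(dist(real_col) - dist(synth_col)).sum())

print(marginal_l1_error(["a", "a", "b", "c"], ["a", "b", "b", "c"]))  # 0.5
```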

Empirical studies demonstrate that marginal-based mechanisms (Private-PGM, MST, MWEM+PGM) routinely outperform GAN-based methods on tabular data in both utility and fairness, provided that a sufficiently large ε (e.g., ε ≥ 5) is allocated (Pereira et al., 2023).

For statistical testing based on synthetic data, many methods with strong privacy (ε ≤ 1–2) yield severely inflated false-positive (Type I error) rates unless specialized smoothing or noise-aware methods are adopted (Perez et al., 20 Mar 2024).
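
A self-contained simulation (ours, not from the cited paper) of this failure mode: fit a Gaussian to a Laplace-noised mean, draw a large synthetic sample, and run a naive one-sample t-test against the true mean. Because the test treats synthetic points as real evidence and ignores the DP noise, the false-positive rate lands far above the nominal 5%.

```python
import numpy as np
from scipy import stats

def naive_type1_rate(eps=1.0, n=200, n_synth=2000, trials=2000, mu0=0.0):
    """Monte Carlo estimate of Type I error for a naive t-test on
    synthetic data generated from DP summary statistics.
    """
    rng = np.random.default_rng(0)
    rejections = 0
    for _ in range(trials):
        x = rng.normal(mu0, 1.0, size=n)
        # Crude DP release of the mean (data assumed clipped to [-3, 3],
        # giving sensitivity 6/n); the std is released without noise for brevity.
        dp_mean = x.mean() + rng.laplace(scale=6.0 / (n * eps))
        synth = rng.normal(dp_mean, x.std(), size=n_synth)
        if stats.ttest_1samp(synth, mu0).pvalue < 0.05:
            rejections += 1
    return rejections / trials

print(naive_type1_rate())  # far above the nominal 0.05
```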

4. Practical Considerations, Limitations, and Extensions

4.1 Privacy-Utility Trade-offs and Parameter Selection

The privacy parameter ε is the dominant lever. Empirical guidance suggests ε in [1,5] yields practical utility with moderate privacy protection for many mechanisms (Rosenblatt et al., 2020, Bowen et al., 2019, McKenna et al., 2021). Overly stringent ε (<0.5) can lead to synthetic data with poor inferential properties or fairness distortions.

4.2 High-dimensionality and Scalability

Synthetic data mechanisms based on exhaustive marginal release or flat histograms do not scale to high dimensions due to exponential explosion in cell/parameter counts. Graphical model–based selection, hierarchical partitioning, and dimensionality reduction (e.g., DP-PCA followed by synthetic data generation in principal subspaces) are critical for tractable high-dimensional synthesis (Zhang et al., 2020, He et al., 2023, He et al., 2023).
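
A sketch of the DP-PCA step via input perturbation of the covariance matrix, assuming rows are pre-clipped to unit ℓ₂ norm so that one record changes XᵀX by at most 1 in Frobenius norm. The calibration below uses the classic Gaussian-mechanism bound, a conservative and purely illustrative choice rather than the exact method of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_pca(X, epsilon, delta=1e-5, n_components=2):
    """Top principal components from a Gaussian-noised covariance matrix.

    Assumes each row of X has L2 norm <= 1, so the covariance
    C = X^T X / n has sensitivity roughly 1/n in Frobenius norm.
    """
    n, d = X.shape
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon   # Gaussian mechanism
    C = X.T @ X / n
    noise = rng.normal(scale=sigma / n, size=(d, d))
    noise = (noise + noise.T) / 2                           # keep it symmetric
    eigvals, eigvecs = np.linalg.eigh(C + noise)
    return eigvecs[:, ::-1][:, :n_components]               # top eigenvectors

X = rng.normal(size=(500, 6))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)  # clip rows
print(dp_pca(X, epsilon=1.0).shape)  # (6, 2)
```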

4.3 Fairness and Minority Subgroup Representation

DP mechanisms—especially those that perturb or suppress rare group counts—can exacerbate fairness disparities in downstream models. Empirical rebalancing via multi-label undersampling prior to synthesis can restore parity without significant loss in accuracy (Bullwinkel et al., 2022). Marginal-based approaches demonstrate more robust fairness preservation compared to GAN-based methods.

4.4 Synthetic Data for Relational and Vertically Split Data

Emerging work addresses the synthesis of relational databases by iteratively refining links across tables via differentially private projection and rounding, guaranteeing referential integrity and marginal fidelity under joint DP composition (Alimohammadi et al., 29 May 2024). For vertically partitioned public-private features, conditional graphical-model generation given observed public columns yields the best utility, outperforming pretraining or simple public-assisted adaptations (Maddock et al., 15 Apr 2025).

5. Comparative Evaluations and Benchmarks

Extensive empirical and simulation studies have benchmarked leading mechanisms on tabular, continuous, location, medical, and relational datasets using utility and fairness metrics. Major findings include:

  • Private-PGM, MST, and PrivSyn deliver superior accuracy on marginals and downstream modeling tasks in moderate and high dimensions (McKenna et al., 2021, Zhang et al., 2020).
  • GAN-based synthesizers (DP-GAN, PATE-GAN, etc.) yield inferior utility and fairness for tabular data at moderate privacy levels, though they may be competitive on imaging tasks (Bowen et al., 2019, Torkzadehmahani et al., 2020, Rosenblatt et al., 2020).
  • Type I statistical error control under DP is nontrivial: direct test privatization (e.g., DP Mann-Whitney U) or specialized smoothed histogram mechanisms are required for reliable inferential validity at low ε (Perez et al., 20 Mar 2024).
  • Conditioned or focused synthetic data generation (e.g., only modifying high-risk records) can efficiently balance privacy risk and utility (Carvalho et al., 2022).

6. Open Challenges and Future Directions

Outstanding directions for research and deployment include:

  • Scalability: Tightening utility bounds and developing computationally efficient mechanisms for high-dimensional or large-scale relational data (He et al., 2023, Alimohammadi et al., 29 May 2024).
  • Public Data Integration: Leveraging vertical public-private splits for conditional generation and developing joint training objectives to exploit public columns (Maddock et al., 15 Apr 2025).
  • Model Selection under DP: Algorithms for private feature selection, model averaging, and adaptive workload selection to maximize utility within privacy constraints (Zhang et al., 2020, McKenna et al., 2021).
  • Statistical Inference: Certification of inferential validity and coverage for general classes of analyses on synthetic data, beyond first-order moment preservation (Perez et al., 20 Mar 2024).
  • Fairness and Representation: Robust methods that maintain accurate subgroup representation, especially for minority and intersectional populations (Bullwinkel et al., 2022, Pereira et al., 2023).
  • Hybrid and Domain-specific Techniques: Adaptive combination of parametric, nonparametric, deep generative, and optimization-based synthesis tailored for structured, longitudinal, or spatial data (Cunningham et al., 2021, Kreačić et al., 2023).

7. Summary Table: Core Mechanisms and Their Properties

| Mechanism Type | Example Algorithms | High-dimensional scalability | Downstream fairness | Utility guarantee type |
|---|---|---|---|---|
| Marginal-based | Private-PGM, MST, MWEM | High (with treewidth limits) | Robust | Marginal/graphical model |
| Distributional utility | pMSE Mechanism | Moderate (MCMC bottleneck) | Not directly addressed | pMSE-optimal |
| Sampling & optimization | Private Sampling (BSV), B & L | Polynomial for low d | Not directly addressed | Uniform deviation in queries |
| Space partition | PMM, KD-tree | Moderate | Not directly addressed | Wasserstein/MMD |
| Deep generative | DP-GAN, PATE-GAN, DP-CGAN | Low (tabular), high (image) | Poor at low ε | Distributional (weak) |
| Focused risk control | ε-PrivateSMOTE | High (as applied) | Designed for linkage risk | Empirical coverage/utility |
| Relational/vertical | DP RelDB, Conditional JAM | Moderate (with limitations) | Not directly addressed | Cross-table marginals/reconstruction |

In conclusion, differentially private synthetic data research comprises a rapidly evolving domain at the intersection of privacy, statistics, and algorithm design. A spectrum of mechanisms offers trade-offs among analytical fidelity, computational tractability, and privacy risk. Marginal-based and graphical-model approaches provide scalable and robust solutions for tabular data, while more recent innovations address high-dimensional, continuous, and structured relational settings. Methodological advances continue to refine statistical accuracy, fairness, and practical deployment for real-world privacy-sensitive data sharing (Snoke et al., 2018, Carvalho et al., 2022, Boedihardjo et al., 2021, He et al., 2023, Kreačić et al., 2023, Cunningham et al., 2021, Boedihardjo et al., 2021, Bojkovic et al., 6 May 2024, He et al., 2023, Rosenblatt et al., 2020, McKenna et al., 2021, Alimohammadi et al., 29 May 2024, Pereira et al., 2023, Torkzadehmahani et al., 2020, Bowen et al., 2019, Bullwinkel et al., 2022, Perez et al., 20 Mar 2024, Zhang et al., 2020, Bowen et al., 2016, Maddock et al., 15 Apr 2025).
