
Data Synthesis Framework

Updated 28 January 2026
  • Data Synthesis Framework is a structured methodology for generating synthetic data that replicates real-world distributions and dependencies using advanced statistical metrics.
  • It integrates criteria such as Wasserstein distance, Cramér's V, novelty scores, and domain-classifier AUC to benchmark and rank generative models effectively.
  • The modular implementation using Python libraries ensures reproducible evaluations and provides extensibility for applying the framework to various data types.

A Data Synthesis Framework is a structured methodology for generating synthetic data that mimics statistical properties of real data, with rigorously defined criteria for quantifying fidelity and practical workflows for comparing and ranking generative models. Such frameworks are essential for systematically benchmarking, selecting, and deploying synthetic data generators for machine learning, privacy, or data augmentation contexts.

1. Statistical Criteria for Data Synthesis Evaluation

A rigorous data synthesis framework incorporates complementary metrics to assess the representational quality of synthetic datasets, focusing on how well they reproduce distributions, coverage, and dependencies observed in real data. The framework presented by Livieris et al. (Livieris et al., 2024) exemplifies this approach by unifying the following metrics:

  • Univariate-style distances:

    • For continuous features, the framework deploys the 1D Wasserstein distance:

    W(p,q) = \inf_{\gamma \in \Gamma(p,q)} \int |x-y|\, d\gamma(x,y)

    where $p$ and $q$ are the empirical marginals of the real and synthetic data, respectively.

    • For categorical features, Cramér's V, based on the Pearson $\chi^2$ statistic, is used.

  • Novelty score (instance-diversification):

Quantifies the proportion of synthetic records with no close real-data neighbor under the $\ell_\infty$ norm. Specifically:

\text{Novelty} = \frac{\bigl|\{\, s \in D_S : \forall\, r \in D_R,\ \|s - r\|_\infty > \alpha \,\}\bigr|}{|D_S|}

This measures the degree of support coverage and instance-level diversity.

  • Domain-classifier indistinguishability:

Measures the ability of a binary classifier to distinguish between real and synthetic data. The closer the AUC to $0.5$, the better the synthetic data mimics the original joint distribution.

  • Anomaly-detection conformity:

Fits an isolation forest on the real data and applies it to synthetic samples. Low mean anomaly scores indicate that synthetic records lie in high-density regions of the real data manifold.

These metrics jointly span marginal similarity, coverage of the support, joint-distribution fidelity, and conformity to the data manifold.
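The four criteria above map directly onto standard SciPy and scikit-learn primitives. The sketch below is illustrative, not the paper's reference implementation; the function names and the toy data are assumptions:

```python
import numpy as np
from scipy.stats import wasserstein_distance, chi2_contingency
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cramers_v(real_cat, syn_cat):
    """Cramér's V from the Pearson chi-squared statistic on the
    source-vs-category contingency table (0 = identical marginals)."""
    cats = sorted(set(real_cat) | set(syn_cat))
    table = np.array([[list(real_cat).count(c) for c in cats],
                      [list(syn_cat).count(c) for c in cats]])
    chi2_stat = chi2_contingency(table, correction=False)[0]
    k = min(table.shape) - 1
    return float(np.sqrt(chi2_stat / (table.sum() * k))) if k > 0 else 0.0

def novelty_score(D_real, D_syn, alpha):
    """Fraction of synthetic rows whose nearest real row is farther
    than alpha in the l-infinity norm."""
    d = np.abs(D_syn[:, None, :] - D_real[None, :, :]).max(axis=-1)
    return float((d.min(axis=1) > alpha).mean())

def domain_classifier_auc(D_real, D_syn, folds=5, seed=0):
    """Cross-validated ROC-AUC of a real-vs-synthetic classifier;
    values near 0.5 indicate indistinguishable samples."""
    X = np.vstack([D_real, D_syn])
    y = np.r_[np.zeros(len(D_real)), np.ones(len(D_syn))]
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return float(cross_val_score(clf, X, y, cv=folds, scoring="roc_auc").mean())

def anomaly_conformity(D_real, D_syn, seed=0):
    """Fit an isolation forest on real data and score synthetic rows.
    Note: sklearn's score_samples is higher for MORE normal points,
    so higher values here mean better conformity to the real manifold."""
    iso = IsolationForest(random_state=seed).fit(D_real)
    return float(iso.score_samples(D_syn).mean())

# Toy data: one synthetic sample matching the real distribution, one shifted.
rng = np.random.default_rng(0)
real = rng.normal(0, 1, (400, 3))
good = rng.normal(0, 1, (400, 3))   # same distribution as the real data
bad = rng.normal(4, 1, (400, 3))    # badly shifted "synthetic" sample
```

On such data, the well-matched sample scores better than the shifted one on every criterion simultaneously, which is the behavior the framework's joint evaluation relies on.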

2. Non-Parametric Model Ranking and Statistical Testing

Since these metrics are heterogeneous and may operate on different scales, the framework uses a statistical ranking procedure based on the Friedman Aligned-Rank (FAR) transformation, coupled with Finner’s step-down post hoc adjustment, to establish a defensible ordering among competing synthesizers:

  1. FAR Transformation: Each generative model receives a per-test metric score, which is converted into an aligned-rank and aggregated. The FAR statistic is formally given by:

F_{AR} = \frac{(k-1)\Bigl[\sum_{i=1}^{k}\bigl(\sum_{j=1}^{n} R_{ij}\bigr)^2 - \frac{kn^2}{4}(kn+1)^2\Bigr]}{\frac{kn(kn+1)(2kn+1)}{6} - \frac{1}{k}\sum_{j=1}^{n}\bigl(\sum_{i=1}^{k} R_{ij}\bigr)^2}

Under the null hypothesis that all models are equivalent, this statistic asymptotically follows a $\chi^2_{k-1}$ distribution.

  2. Global and Pairwise Testing: If the global test rejects the null hypothesis ($p_\text{global} < \alpha$), the framework applies the Finner post hoc adjustment to identify which models differ significantly, using:

APV_i = \min\Bigl\{1,\ \max_{j \le i}\bigl[1 - (1 - p_{(j)})^{(k-1)/j}\bigr]\Bigr\}

This procedure strictly controls family-wise error rates, producing interpretable, statistically robust rankings.
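Both steps can be implemented directly from the formulas above. The following is a minimal NumPy/SciPy sketch (the function names and the toy score matrix are illustrative, not from the original paper):

```python
import numpy as np
from scipy.stats import chi2, rankdata

def friedman_aligned_test(scores):
    """Friedman Aligned-Rank test on an (n_tests, k_models) score matrix.
    Observations are aligned by subtracting each test's mean, then ranked
    jointly across all k*n values. Returns (F_AR statistic, p-value)."""
    n, k = scores.shape
    aligned = scores - scores.mean(axis=1, keepdims=True)
    R = rankdata(aligned).reshape(n, k)          # joint aligned ranks R_ij
    model_sums = R.sum(axis=0)                   # sum over j of R_ij, per model i
    test_sums = R.sum(axis=1)                    # sum over i of R_ij, per test j
    num = (k - 1) * ((model_sums ** 2).sum() - (k * n ** 2 / 4) * (k * n + 1) ** 2)
    den = k * n * (k * n + 1) * (2 * k * n + 1) / 6 - (test_sums ** 2).sum() / k
    far = num / den
    return far, chi2.sf(far, k - 1)              # asymptotically chi^2, k-1 dof

def finner_adjust(pvals):
    """Finner step-down adjusted p-values for m = k-1 pairwise comparisons."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    adj = np.empty(m)
    running_max = 0.0
    for j, idx in enumerate(np.argsort(p), start=1):
        running_max = max(running_max, 1.0 - (1.0 - p[idx]) ** (m / j))
        adj[idx] = min(1.0, running_max)
    return adj

# Toy example: 10 tests x 3 models, with the third model consistently worse.
rng = np.random.default_rng(1)
scores = rng.normal(0.0, 0.1, (10, 3))
scores[:, 2] += 2.0
far, p_global = friedman_aligned_test(scores)
```

With one model shifted well away from the others, the global test rejects the null at any conventional level, after which `finner_adjust` would be applied to the raw pairwise p-values.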

3. Implementation and Usage Protocol

The framework is implemented in a modular pseudocode pipeline:

def evaluate_synthesizers(D_real, synthesizers, alpha=0.05, novelty_thresh=None):
    # Default the novelty threshold to the significance level, as in the original.
    if novelty_thresh is None:
        novelty_thresh = alpha
    scores = {type(S).__name__: {} for S in synthesizers}
    for S in synthesizers:
        D_syn = S.fit_sample(D_real)
        if not diagnostic_validity(D_syn, D_real):
            continue  # skip synthesizers that fail basic validity diagnostics
        ws = mean_wasserstein_cramerv(D_real, D_syn)
        nov = novelty_score(D_real, D_syn, threshold=novelty_thresh)
        auc = domain_classifier_auc(D_real, D_syn, folds=5)
        anom = isolation_forest_score(D_real, D_syn)
        scores[type(S).__name__] = {'WS': ws, 'Novelty': nov, 'AUC': auc, 'Anom': anom}
    aligned_ranks = far_align_ranks(scores)
    F_stat, p_global = friedman_aligned_test(aligned_ranks)
    posthoc = finner_posthoc(aligned_ranks, alpha) if p_global < alpha else None
    return scores, aligned_ranks, F_stat, p_global, posthoc

Each helper routine is implementable using common Python scientific computing libraries.

4. Empirical Evaluation: Use Cases

The framework has been empirically validated on use cases demonstrating its discriminative power:

  • Travel Review Ratings:

Five synthesizers (Gaussian Copula, GMM, CTGAN, TVAE, CopulaGAN) were compared on 24-dimensional ordinal data. Individual metrics (e.g., lowest Wasserstein distance versus zero novelty) may disagree on the best model, but the FAR/Finner procedure ranks GMM highest overall, a result confirmed by geometric visualization of the synthetic sample support.

  • Obesity Risk Dataset:

On this dataset of mixed categorical and continuous features, the framework identifies CTGAN as the top performer (mean aligned rank ≈ 5.6), with a significant advantage over copula-based models, and visually confirms manifold-level fidelity.

5. Scope, Limitations, and Extension Strategies

Scope

  • The framework, as proposed, evaluates unlabeled tabular synthetic data generators.

Limitations

  • Does not directly address time-series, graph, or image modalities.
  • Absolute magnitude of individual metrics is discarded in favor of robust rank-based comparisons.
  • Task-based utility metrics for labeled data must be incorporated separately.

Extensions

  • Analogous test suites can be constructed for other data modalities (e.g., Fréchet Inception Distance for images).
  • Multivariate or distribution-based tests such as Kolmogorov-Smirnov, Kullback-Leibler, or Maximum Mean Discrepancy can be integrated as additional test columns.
  • Weighted rank aggregation can reflect domain- or application-specific priorities.
  • Task-based utility metrics can be added for supervised or semi-supervised synthetic data evaluation.

A modular architecture facilitates extension: each metric is a functional column, and new tests or weighting schemes are readily accommodated.
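As one concrete example of such an extension, weighted rank aggregation reduces to a single matrix-vector product over the rank columns; the function and the weights below are hypothetical, not part of the original framework:

```python
import numpy as np

def weighted_rank_aggregate(rank_matrix, weights):
    """Combine per-metric rank columns of an (n_models, n_metrics) matrix
    into one aggregate score; lower is better, as with the underlying ranks."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # normalize so the weights sum to 1
    return np.asarray(rank_matrix, dtype=float) @ w

# Hypothetical ranks of three models on two metrics (fidelity, novelty),
# with fidelity weighted more heavily.
ranks = np.array([[1, 3],
                  [2, 1],
                  [3, 2]])
agg = weighted_rank_aggregate(ranks, [0.7, 0.3])   # -> [1.6, 1.7, 2.7]
```

Here the first model wins overall despite ranking last on the second metric, which is exactly the behavior a domain-specific weighting is meant to allow.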

6. Reproducibility and Implementation Guidance

The framework's full evaluation and ranking process is reproducible with open-source statistical and machine learning libraries. Each score is computed with readily accessible packages (SciPy, scikit-learn, PyOD), and the critical FAR/Finner step is based on standard non-parametric hypothesis tests and multiple testing correction literature. By structuring outputs as a scores matrix (models × metrics), the framework lends itself to both programmatic use and methodological extension.

7. Generalization to Other Data Types

While the original scope is unlabeled tabular data, the underlying philosophy applies broadly:

  • For image data: Fréchet Inception Distance (FID) and analogous domain classifiers can replace Wasserstein and AUC.
  • For text: metrics such as BLEU or task-specific evaluation (e.g., for LLM-generated synthetic data).
  • For time-series: Dynamic Time Warping (DTW)-based distances and predictive utility from synthetic forecasting.
  • For each new modality, domain-appropriate feature-level metrics and distributional tests can be slotted into the evaluation matrix. Modality-agnostic model-ranking via non-parametric tests remains applicable.
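For the time-series case, a DTW distance can be written as a self-contained textbook O(nm) dynamic program (a sketch, not an optimized library implementation):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences,
    using absolute difference as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

Unlike pointwise Euclidean distance, DTW tolerates local time warping: `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated sample can be aligned to the same point.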

In summary, a state-of-the-art data synthesis framework for model evaluation combines multiple interpretable quality metrics, leverages robust, non-parametric statistical testing for ranking, and is modular and extensible across domains, offering a scientifically principled way to benchmark synthetic data generation pipelines (Livieris et al., 2024).
