Synthetic Comparison Data
- Synthetic Comparison Data are artificially generated datasets that mimic original data distributions for rigorous evaluation of statistical methods and machine learning algorithms.
- General utility is assessed by metrics like the propensity score mean squared error (pMSE) to evaluate the overall distributional fidelity between synthetic and real data.
- Specific utility metrics, such as confidence interval overlap and standardized coefficient differences, ensure that synthetic data reliably support accurate inferential analyses.
Synthetic comparison data are artificially generated datasets designed for the explicit purpose of benchmarking, validating, or evaluating statistical properties, inference methodologies, or machine learning algorithms. Unlike generic synthetic data, synthetic comparison data are constructed to enable rigorous, quantitative comparisons between the synthetic and original datasets—particularly with respect to their multivariate distributional fidelity, analysis-specific inference consistency, and statistical utility across diverse analytic tasks. Systematic evaluation using synthetic comparison data is foundational for ensuring statistical validity, privacy, and robustness in data-sharing environments where original data access is restricted due to confidentiality or sensitivity constraints.
1. Conceptual Foundations and Utility Domains
Synthetic comparison data serve as statistical surrogates for original datasets, with their utility quantified along two principal axes:
- General Utility: Captures the global, distributional similarity between synthetic and real data. General utility focuses on reproducing the overall multivariate structure, without regard to any specific downstream analysis.
- Specific Utility: Measures how well specific analyses (such as regression coefficients, confidence intervals, or summary estimates) agree when performed on synthetic versus original data. This axis is analysis-specific and targets inferential validity for particular use cases.
The separation between general and specific utility reflects the realization that fidelity to the underlying data-generating process (DGP) does not always translate into inference equivalence for every applied scenario (Snoke et al., 2016). Consequently, robust synthetic comparison data frameworks must address both aspects.
2. General Utility Measures: Theory and Implementation
General utility is most commonly measured using metrics grounded in the ability to distinguish synthetic from real data when both are available. A canonical approach is the propensity score mean squared error (pMSE). The methodology is as follows:
- Stack the original and synthetic datasets, introducing an indicator variable $I_i$ (with $I_i = 0$ for original records and $I_i = 1$ for synthetic records).
- Fit a probabilistic classifier (typically logistic regression, but nonparametric approaches such as CART are effective in high dimension (Raab et al., 2021)) using relevant features to predict $I_i$.
- Compute the predicted propensity scores $\hat{p}_i$ and the overall proportion of synthetic data $c = n_{\mathrm{syn}}/N$ (with $n_{\mathrm{syn}}$ synthetic records among $N$ total stacked records).
- The general utility metric is $\mathrm{pMSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - c)^2$.
Under the correct synthesis (CS) model and a logistic propensity model with $k$ parameters (including the intercept), the null expectations are $E_{\mathrm{null}}[\mathrm{pMSE}] = (k-1)(1-c)^2 c / N$ and $\mathrm{SD}_{\mathrm{null}}[\mathrm{pMSE}] = \sqrt{2(k-1)}\,(1-c)^2 c / N$.
- Standardized versions, such as the pMSE ratio (observed/expected) and standardized pMSE ((observed − expected)/StdDev), contextualize the observed discrepancy relative to the null, flagging deviations due to model misspecification or synthetic data artifacts (Snoke et al., 2016).
This approach provides a theoretically grounded, interpretable measure for assessing whether the joint distribution of the synthetic data aligns with that of the original. pMSE and related scores have been incorporated into widely used synthetic data tools (e.g., the synthpop package (Raab et al., 2021)).
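To make the computation concrete, the following is a minimal Python sketch of the pMSE calculation under the definitions above. It assumes numeric features and uses scikit-learn's logistic regression with low-order interaction terms as the propensity model; the `pmse` helper and its defaults are illustrative choices for this article, not the synthpop implementation.

```python
# Minimal pMSE sketch (illustrative; not the synthpop implementation).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

def pmse(original: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Propensity score MSE with its null expectation under correct synthesis,
    for a logistic propensity model with k parameters (Snoke et al., 2016)."""
    combined = pd.concat([original, synthetic], ignore_index=True)
    indicator = np.r_[np.zeros(len(original)), np.ones(len(synthetic))]  # 1 = synthetic
    N = len(combined)
    c = len(synthetic) / N                        # proportion of synthetic records

    # Numeric columns assumed; encode categorical variables beforehand.
    # Squares and pairwise interactions let the propensity model detect
    # joint-structure mismatches that identical marginals would otherwise hide.
    X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(
        combined.to_numpy(dtype=float))

    # Large C approximates an unpenalized logistic fit.
    clf = LogisticRegression(C=1e6, max_iter=2000).fit(X, indicator)
    p_hat = clf.predict_proba(X)[:, 1]

    observed = float(np.mean((p_hat - c) ** 2))           # pMSE
    k = X.shape[1] + 1                                    # parameters incl. intercept
    expected = (k - 1) * (1 - c) ** 2 * c / N             # null expectation
    sd = (1 - c) ** 2 * c * np.sqrt(2 * (k - 1)) / N      # null standard deviation
    return {"pMSE": observed,
            "pMSE_ratio": observed / expected,
            "standardized_pMSE": (observed - expected) / sd}
```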
3. Specific Utility Measures and Their Role
Specific utility considers the impact of synthetic data on concrete analytic tasks. Two widely adopted metrics are:
- Confidence Interval Overlap (IO):
  $$IO = \frac{1}{2}\left[\frac{\min(U_o, U_s) - \max(L_o, L_s)}{U_o - L_o} + \frac{\min(U_o, U_s) - \max(L_o, L_s)}{U_s - L_s}\right],$$
  where $(L_o, U_o)$ and $(L_s, U_s)$ are the lower and upper confidence interval bounds for the original and synthetic datasets, respectively. IO near 1 signals strong concordance.
- Standardized Difference in Estimates:
  $$\mathrm{StdDiff} = \frac{\hat{\beta}_s - \hat{\beta}_o}{\widehat{\mathrm{SE}}(\hat{\beta}_o)},$$
  where $\hat{\beta}_o$ and $\hat{\beta}_s$ are the point estimates from the original and synthetic data. Small values of $|\mathrm{StdDiff}|$ indicate close inferential agreement.
These metrics extend to contrasts between model coefficients, summary statistics, or other inferential targets. Unlike general utility, specific utility can be strongly influenced by how well the synthetic generation process preserves relationships pertinent to the target analysis. There can be situations—especially when the generating model "bakes in" analyst-specific assumptions—where specific utility remains high even if synthetic data diverges globally from the original distribution (Snoke et al., 2016).
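As a concrete illustration, the short sketch below computes both metrics for a single estimate. The helper names (`ci_overlap`, `std_diff`) and the example numbers are this article's own choices for illustration, not part of any particular package.

```python
# Specific-utility metrics for a single inferential target (illustrative).

def ci_overlap(l_o: float, u_o: float, l_s: float, u_s: float) -> float:
    """Average share of each confidence interval covered by their intersection;
    1 means identical intervals, 0 means no overlap."""
    inter = max(min(u_o, u_s) - max(l_o, l_s), 0.0)   # 0 if the intervals are disjoint
    return 0.5 * (inter / (u_o - l_o) + inter / (u_s - l_s))

def std_diff(beta_o: float, beta_s: float, se_o: float) -> float:
    """Synthetic-minus-original estimate, scaled by the original standard error."""
    return (beta_s - beta_o) / se_o

# Example: a regression coefficient estimated on both datasets.
print(ci_overlap(0.8, 1.4, 0.9, 1.6))   # ≈ 0.77: substantial but imperfect overlap
print(std_diff(1.10, 1.22, 0.15))       # 0.8 original standard errors apart
```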
4. Synthesis of General and Specific Utility: Comparative Evidence
Simulation studies using correlated Gaussian data, as well as applications to survey and census records, reveal the interplay and divergence between general and specific utility measures:
Synthesis Model | General Utility (pMSE ratio) | Specific Utility (CI Overlap, StdDiff) | Observed Concordance |
---|---|---|---|
Correct (fully specified) | ≈1 | Near 1 (high overlap), ≈0 (small diff) | Yes |
Misspecified (ignores structure) | ≫1 | Low overlap, larger differences | Yes |
Model-focused (parametric synthesis) | ≈1 | Sometimes high, sometimes misleading | Conditional |
General and specific utility often agree in ranking synthesis methods, but critical exceptions arise when synthetic data adheres strictly to a subset of relationships (e.g., regression-focused synthesis) while failing in marginal or joint distribution aspects not directly tied to the analyst's model (Snoke et al., 2016). Thus, combined reporting is essential.
5. Application and Recommendations for Data Custodians and Analysts
An integrated evaluation pipeline for synthetic comparison data should:
- Compute general utility (pMSE and scaled variants) as a primary screen for distributional similarity. This captures global quality and reveals structural mismatches.
- Assess specific utility for key analytic targets using metrics such as confidence interval overlap and standardized coefficient differences.
- In complex or high-dimensional settings, employ nonparametric classifiers (e.g., CART) for pMSE calculation to robustly capture interactions and nonlinearities (Raab et al., 2021).
- Use visualization tools (e.g., via synthpop) to identify problematic variables, table cells, or interactions for iterative improvement.
- Treat high general utility combined with poor specific utility (or vice versa) as a strong signal to investigate potential overfitting to analyst assumptions or missed global structure.
Packages such as synthpop implement these measures, providing diagnostics to improve synthetic data quality prior to release.
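A hypothetical end-to-end check along these lines, reusing the `pmse`, `ci_overlap`, and `std_diff` helpers sketched earlier and an ordinary least-squares target analysis via statsmodels, might look as follows; the structure of the report, not the specific function names, is the point.

```python
# Hypothetical evaluation pipeline combining general and specific utility.
import pandas as pd
import statsmodels.api as sm

def evaluate_release(original: pd.DataFrame, synthetic: pd.DataFrame,
                     outcome: str, predictors: list[str]) -> dict:
    # 1. General utility screen: overall distributional similarity.
    general = pmse(original, synthetic)

    # 2. Specific utility: a target regression fitted to both files.
    def fit(df: pd.DataFrame):
        return sm.OLS(df[outcome], sm.add_constant(df[predictors])).fit()
    fit_o, fit_s = fit(original), fit(synthetic)

    specific = {}
    for name in fit_o.params.index:
        l_o, u_o = fit_o.conf_int().loc[name]
        l_s, u_s = fit_s.conf_int().loc[name]
        specific[name] = {
            "ci_overlap": ci_overlap(l_o, u_o, l_s, u_s),
            "std_diff": std_diff(fit_o.params[name], fit_s.params[name],
                                 fit_o.bse[name]),
        }

    # 3. Report both views; divergence between them warrants further diagnostics.
    return {"general": general, "specific": specific}
```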
6. Examples and Empirical Insights
- Simulation: For correlated multivariate Normal data with matched original and synthetic sample sizes, correct synthesis yields pMSE close to its theoretical null expectation, a pMSE ratio ≈ 1, and a standardized pMSE ≈ 0, with both general and specific utility measures agreeing. Misspecified methods (e.g., synthesizing variables independently and ignoring correlations) yield much larger pMSE values and discrepancies in parameter estimates; see the simulation sketch after this list.
- Scottish Health Survey: Comparison of parametric, nonparametric (CART), and random sampling methods revealed that simple random sampling performed poorly on both general and specific utility, while parametric and nonparametric syntheses performed similarly, except for specific regression analyses.
- Historical Census Data: Situations where variables are not fully synthesized can result in high specific utility but poor general utility. For instance, when marginal distributions were visibly distorted (with impossible values), only the general utility metrics flagged the issue (Snoke et al., 2016).
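The simulation point above can be reproduced in miniature. The toy script below (reusing the `pmse` sketch from Section 2, with a correlation of 0.7 and a sample size of 5,000 chosen purely for illustration) contrasts a correct bivariate-Normal synthesizer with a misspecified one that permutes each column independently, destroying the correlation while leaving the marginals untouched.

```python
# Toy contrast between correct and misspecified synthesis (illustrative).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000
cov = [[1.0, 0.7], [0.7, 1.0]]
original = pd.DataFrame(rng.multivariate_normal([0.0, 0.0], cov, size=n),
                        columns=["x1", "x2"])

# Correct synthesis: redraw from the Normal model fitted to the original data.
syn_correct = pd.DataFrame(
    rng.multivariate_normal(original.mean().to_numpy(),
                            original.cov().to_numpy(), size=n),
    columns=["x1", "x2"])

# Misspecified synthesis: permute columns independently, so the marginals are
# exact but the x1-x2 correlation is lost.
syn_indep = pd.DataFrame({col: rng.permutation(original[col].to_numpy())
                          for col in original.columns})

print(pmse(original, syn_correct))  # pMSE ratio expected to sit near 1
print(pmse(original, syn_indep))    # pMSE ratio expected to be well above 1
```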
7. Implications and Best Practices
Synthetic comparison data, assessed using a combination of general and specific utility diagnostics, are necessary to ensure both privacy-preserving releases and scientifically valid analyses. General utility metrics (pMSE and its scaled forms) give robust, interpretable diagnostics for multivariate fidelity and should be used routinely. At the same time, specific utility metrics provide direct evidence of inferential preservation for intended statistical models and policy studies. Neither should be omitted: certain synthesis approaches can mask flaws if only one perspective is evaluated.
The research consensus is that a dual-evaluation approach increases analytical confidence, particularly when results from both metrics concur. Discrepancies should prompt additional diagnostics, possibly including visual inspections and marginal checks. Such practices are now being incorporated into standard synthetic data release protocols and analytic packages, making the rigorous comparison of synthetic data integral to modern applied statistics and data science (Snoke et al., 2016, Raab et al., 2021).
Summary Table: Key Measures for Synthetic Comparison Data
Utility Dimension | Metric | Interpretation | Theoretical Reference |
---|---|---|---|
General | pMSE, pMSE Ratio | Distributional similarity (global) | (Snoke et al., 2016, Raab et al., 2021) |
Specific | CI Overlap, StdDiff | Inference preservation (analysis) | (Snoke et al., 2016) |
Synthetic comparison data underpin reproducibility, utility assurance, and privacy in modern confidential data analysis workflows.