Generating Synthetic Data with Locally Estimated Distributions for Disclosure Control

Published 3 Oct 2022 in stat.CO and stat.ML | (2210.00884v2)

Abstract: Sensitive datasets are often underutilized in research and industry due to privacy concerns, limiting the potential of valuable data-driven insights. Synthetic data generation offers a promising solution to this challenge by balancing privacy protection with data utility. This paper introduces a new approach to mitigate privacy risks associated with outlier observations in synthetic datasets: the Local Resampler (LR). The LR leverages the $k$-nearest neighbors algorithm to generate synthetic data while minimizing disclosure risks by underrepresenting outliers, even when they are not detectable in marginal distributions. Theoretical and empirical analyses demonstrate that the LR effectively mitigates outlier-driven disclosure risks and accurately replicates multimodal and skewed distributions, as well as distributions with non-convex support. Because the LR is semiparametric, it carries a low computational burden and works efficiently even with small samples. By parameterizing the balance between privacy risks and data utility, this approach promotes broader access to sensitive datasets for research.
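
To make the idea of resampling from locally estimated distributions concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not specify the form of the local distribution, so this sketch assumes a multivariate Gaussian fitted to each observation's $k$ nearest neighbors under Euclidean distance. The function name `local_resample` and its parameters are hypothetical.

```python
import numpy as np


def local_resample(X, k, rng=None):
    """Draw one synthetic row per original row of X.

    ASSUMPTION (not from the abstract): each synthetic row is sampled from a
    Gaussian whose mean and covariance are estimated on the k nearest
    neighbors of the corresponding original row.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= number of rows")

    # Pairwise squared Euclidean distances, shape (n, n).
    diff = X[:, None, :] - X[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)

    # Indices of the k nearest rows; each row counts as its own neighbor.
    nn = np.argsort(dist2, axis=1)[:, :k]

    synth = np.empty_like(X)
    for i in range(n):
        neigh = X[nn[i]]
        mu = neigh.mean(axis=0)
        # atleast_2d keeps the covariance a matrix when d == 1.
        cov = np.atleast_2d(np.cov(neigh, rowvar=False))
        synth[i] = rng.multivariate_normal(mu, cov)
    return synth


# Usage: 200 points from a skewed two-dimensional distribution.
X = np.random.default_rng(0).lognormal(size=(200, 2))
S = local_resample(X, k=15, rng=1)
```

In this sketch, an isolated observation's neighborhood reaches into denser regions, so its local mean is pulled away from the outlying value. This is one plausible way outliers could end up underrepresented, and a larger `k` strengthens the effect at the cost of more smoothing.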