Winning the NIST Contest: A Scalable and General Approach to Differentially Private Synthetic Data
This essay provides a technical overview of the paper "Winning the NIST Contest: A scalable and general approach to differentially private synthetic data," focusing on its approach to differentially private synthetic data generation. The method is significant within data privacy, particularly in contexts where released synthetic data must preserve individual confidentiality under the formal guarantees of differential privacy (DP).
Summary of the Paper
The authors present a framework for generating differentially private synthetic data in three steps: (1) select a collection of low-dimensional marginals, (2) measure those marginals with a noise-addition mechanism, and (3) generate synthetic data whose marginals closely match the noisy measurements. Central to this approach is Private-PGM, a post-processing method that estimates a high-dimensional data distribution from noisy marginal measurements. The framework is instantiated in two mechanisms, NIST-MST and MST, the former being the winning entry of the 2018 NIST differential privacy synthetic data competition.
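To make the select-measure-generate paradigm concrete, the toy sketch below runs all three steps on a tiny categorical dataset, "selecting" only the one-way marginals and sampling each attribute independently. It is a minimal illustration under those simplifying assumptions, not the paper's code; NIST-MST and MST select richer two- and three-way marginals and use Private-PGM for generation. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 1000 records over three categorical attributes (names illustrative).
domain = {"age_band": 4, "sex": 2, "income_band": 5}
data = {col: rng.integers(0, card, 1000) for col, card in domain.items()}

sigma = 10.0  # noise scale; in the paper this is calibrated to the privacy budget

# (1) Selection is trivial in this toy: every 1-way marginal.
synthetic = {}
for col, card in domain.items():
    counts = np.bincount(data[col], minlength=card).astype(float)  # (2) measure the marginal
    noisy = counts + rng.normal(0, sigma, card)                    #     ...with Gaussian noise
    probs = np.clip(noisy, 0, None)                                # (3) post-process into a
    probs /= probs.sum()                                           #     valid distribution
    synthetic[col] = rng.choice(card, 1000, p=probs)               #     and sample records
```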
Technical Approach
Selection of Marginals
Choosing appropriate marginals is essential for the synthetic data to retain the important statistical properties of the original dataset. The selection can be informed by domain knowledge or automated. In the NIST competition, the authors used mutual information to identify highly correlated attribute pairs and extended the selection to higher dimensions in a tree-structured manner, which keeps graphical-model inference efficient.
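The sketch below illustrates the flavor of this selection step under simplifying assumptions: it scores attribute pairs by empirical mutual information and keeps a maximum spanning tree of the resulting pair graph, so the selected pairwise marginals form a tree. The function names are illustrative, and the privatization of this step (needed in MST) is omitted here.

```python
import itertools
import numpy as np

def mutual_information(x, y, nx, ny):
    """Empirical mutual information (in nats) between two integer columns."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x, y), 1)
    joint /= joint.sum()
    px, py = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def maximum_spanning_tree(columns, sizes):
    """Kruskal's algorithm on MI-weighted edges, heaviest edges first."""
    names = list(columns)
    edges = sorted(
        ((mutual_information(columns[a], columns[b], sizes[a], sizes[b]), a, b)
         for a, b in itertools.combinations(names, 2)),
        reverse=True)
    parent = {c: c for c in names}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c
    tree = []
    for _, a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:              # adding this edge keeps the selection acyclic
            parent[ra] = rb
            tree.append((a, b))
    return tree
```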
Measurement with the Gaussian Mechanism
Once selected, marginals are measured using the Gaussian mechanism, a standard differential privacy primitive that adds noise calibrated to the sensitivity of the marginal queries, not of the data itself. Because any single record affects a given marginal by a bounded amount, the noise required for a fixed privacy level is modest, so the noisy marginals remain faithful to the true ones while privacy loss stays controlled.
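A minimal sketch of the measurement step follows, assuming add/remove-one-record adjacency, under which a single marginal query has L2 sensitivity 1 (one record changes exactly one cell count by 1). The zCDP guarantee shown, rho = Delta^2 / (2 sigma^2), is the standard accounting for the Gaussian mechanism (Bun and Steinke, 2016) and matches the style of accounting the paper uses; the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def measure_marginal(col_a, col_b, size_a, size_b, sigma):
    """Return a noisy, flattened 2-way marginal of two integer columns."""
    counts = np.zeros((size_a, size_b))
    np.add.at(counts, (col_a, col_b), 1)
    return counts.ravel() + rng.normal(0, sigma, size_a * size_b)

def zcdp_of_gaussian(sigma, l2_sensitivity=1.0):
    """rho-zCDP guarantee of the Gaussian mechanism (Bun & Steinke, 2016)."""
    return l2_sensitivity ** 2 / (2 * sigma ** 2)
```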
Generating Synthetic Data with Private-PGM
Private-PGM takes the noisy measurements and estimates a data distribution that best explains them. By representing the distribution as a probabilistic graphical model and using belief-propagation-style message passing for inference, Private-PGM scales to high-dimensional domains without ever materializing the full joint distribution. Synthetic data sampled from the estimated model replicates the measured marginals while inheriting the privacy guarantees of the measurements, since post-processing cannot degrade differential privacy.
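The sketch below is a deliberately simplified stand-in for this estimation step: it fits a probability vector over the full (tiny) domain by mirror descent so that its marginals match the noisy measurements in a least-squares sense. Private-PGM solves essentially this optimization, but over a factored graphical-model representation that never materializes the full joint; this toy does materialize it, so it only runs on very small domains.

```python
import itertools
import numpy as np

def marginal_matrix(domain_sizes, attrs):
    """0/1 matrix mapping a joint distribution to its marginal over `attrs`."""
    cells = list(itertools.product(*[range(s) for s in domain_sizes]))
    out_shape = [domain_sizes[a] for a in attrs]
    M = np.zeros((int(np.prod(out_shape)), len(cells)))
    for j, cell in enumerate(cells):
        idx = np.ravel_multi_index([cell[a] for a in attrs], out_shape)
        M[idx, j] = 1.0
    return M

def estimate(domain_sizes, measurements, iters=2000, step=0.1):
    """measurements: list of (attrs, noisy_marginal) pairs, marginals normalized to sum to 1."""
    n_cells = int(np.prod(domain_sizes))
    p = np.full(n_cells, 1.0 / n_cells)                 # uniform starting distribution
    mats = [(marginal_matrix(domain_sizes, a), y) for a, y in measurements]
    for _ in range(iters):
        grad = sum(M.T @ (M @ p - y) for M, y in mats)  # gradient of the squared loss
        p *= np.exp(-step * grad)                       # mirror descent step
        p /= p.sum()                                    # stay on the probability simplex
    return p
```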
Performance and Evaluation
The NIST-MST mechanism was evaluated under the competition's scoring criteria, which graded the accuracy of the synthetic data's marginals and of high-order conjunction queries, both reflecting essential distributional characteristics. Further empirical evaluations showed that the method generates high-utility data across a range of privacy budgets.
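As an illustration of marginal-based scoring, the sketch below computes the average L1 distance between the normalized 3-way marginals of the original and synthetic data (0 means identical, 2 means disjoint). The competition's exact scoring formula differed in its details; this is an assumed, simplified proxy.

```python
import itertools
import numpy as np

def marginal(data, attrs, sizes):
    """Normalized contingency table of `data` (dict of int columns) over `attrs`."""
    shape = [sizes[a] for a in attrs]
    counts = np.zeros(shape)
    np.add.at(counts, tuple(data[a] for a in attrs), 1)
    return counts / counts.sum()

def avg_threeway_l1(real, synth, sizes):
    """Average L1 error over all 3-way marginals; lower is better."""
    names = list(sizes)
    errs = [np.abs(marginal(real, t, sizes) - marginal(synth, t, sizes)).sum()
            for t in itertools.combinations(names, 3)]
    return float(np.mean(errs))
```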
Extensions and Practical Implications
The paper also introduces MST, a successor to NIST-MST that operates without relying on a public provisional dataset. MST spends part of its privacy budget on the selection step itself, privately choosing which statistics of the data to preserve, which broadens the method's applicability to scenarios lacking pre-existing public data.
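A minimal sketch of private selection via the exponential mechanism follows, the tool MST applies when choosing pairwise marginals without public data. The particular score values and their sensitivity are assumptions of this illustration; in MST the scores reflect how strongly correlated each candidate pair of attributes is.

```python
import numpy as np

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Sample index i with probability proportional to exp(eps * s_i / (2 * Delta))."""
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2 * sensitivity)
    logits -= logits.max()        # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(scores), p=probs)
```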
The implications of this research extend to fields such as healthcare and finance, where sensitive information is pervasive and both data utility and privacy are required. With Private-PGM at its core, the approach scales well and navigates the privacy-utility tradeoff effectively, suggesting significant potential for broader applications of synthetic data generation.
Conclusion
The research provides a scalable, general approach to differentially private synthetic data that naturally integrates state-of-the-art techniques, from mutual-information-based selection to graphical-model inference. By advancing methods for balancing privacy and data utility, the authors lay groundwork for future scholarly and practical applications of synthetic data generation under privacy constraints. This work serves as a robust foundation upon which future developments can build, including adaptive and domain-specific enhancements in differential privacy applications.