
A Unified Framework for Quantifying Privacy Risk in Synthetic Data (2211.10459v1)

Published 18 Nov 2022 in cs.CR

Abstract: Synthetic data is often presented as a method for sharing sensitive information in a privacy-preserving manner by reproducing the global statistical properties of the original data without disclosing sensitive information about any individual. In practice, as with other anonymization methods, privacy risks cannot be entirely eliminated. The residual privacy risks need instead to be ex-post assessed. We present Anonymeter, a statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets. We equip this framework with attack-based evaluations for the singling out, linkability, and inference risks, the three key indicators of factual anonymization according to the European General Data Protection Regulation (GDPR). To the best of our knowledge, we are the first to introduce a coherent and legally aligned evaluation of these three privacy risks for synthetic data, and to design privacy attacks which model directly the singling out and linkability risks. We demonstrate the effectiveness of our methods by conducting an extensive set of experiments that measure the privacy risks of data with deliberately inserted privacy leakages, and of synthetic data generated with and without differential privacy. Our results highlight that the three privacy risks reported by our framework scale linearly with the amount of privacy leakage in the data. Furthermore, we observe that synthetic data exhibits the lowest vulnerability against linkability, indicating one-to-one relationships between real and synthetic data records are not preserved. Finally, we demonstrate quantitatively that Anonymeter outperforms existing synthetic data privacy evaluation frameworks both in terms of detecting privacy leaks, as well as computation speed. To contribute to a privacy-conscious usage of synthetic data, we open source Anonymeter at https://github.com/statice/anonymeter.

A Unified Framework for Quantifying Privacy Risk in Synthetic Data

The paper "A Unified Framework for Quantifying Privacy Risk in Synthetic Data" presents a method for assessing the privacy risks that remain in synthetic datasets, with a view to compliance with data protection regulations such as the GDPR. The authors introduce Anonymeter, a statistical framework that provides empirical tools for evaluating three risks in synthetic tabular data: singling out, linkability, and inference. According to the GDPR, these three risks are the key indicators of whether a dataset is factually anonymized.

Overview of the Framework

The framework uses attack-based evaluations to identify and quantify privacy risks in synthetic datasets. Because the assessment is statistical and runs after the data has been generated (ex post), it measures the residual risk of a concrete synthetic dataset rather than a property of the generator. The authors propose attacks for singling out, linkability, and inference that model these privacy risks directly, rather than relying on a proxy such as membership inference.
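To make the notion of singling out concrete, the following is a minimal sketch, not the paper's implementation. It assumes that an attacker's guess is a predicate over record attributes, and that the guess succeeds when the predicate matches exactly one record in the original data. The records, the predicates, and the helper name `singles_out` are all hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Predicate = Callable[[Record], bool]


def singles_out(predicate: Predicate, records: List[Record]) -> bool:
    """Return True when exactly one record satisfies the predicate."""
    return sum(1 for record in records if predicate(record)) == 1


# Hypothetical original records (made up for illustration only).
original = [
    {"age": 34, "zip": "10115"},
    {"age": 34, "zip": "10117"},
    {"age": 71, "zip": "10115"},
]

# Hypothetical guesses an attacker might derive from the synthetic data.
guesses = [
    lambda r: r["age"] == 71,        # matches exactly one record
    lambda r: r["age"] == 34,        # matches two records
    lambda r: r["zip"] == "99999",   # matches no record
]

# Count how many guesses isolate a single original record.
n_success = sum(singles_out(guess, original) for guess in guesses)
```

Only the first guess isolates a single record, so it is the only one counted as a success. In a real evaluation the guesses would be built from the synthetic data alone and scored against the original data.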

Methodological Implementation

The framework is structured as a modular system, comprising three phases:

  1. Attack Phase: Three attacks are executed: a main privacy attack (uses the synthetic data to make guesses about records in the original data), a naive attack (random guessing), and a control attack (estimates the baseline privacy risk on records from a control dataset).
  2. Evaluation Phase: The outcomes of these attacks are evaluated to determine the success rate, denoting the attack's ability to correctly infer or link original data records.
  3. Risk Quantification Phase: Finally, the framework calculates privacy risk scores from the attack success rates, using the difference between the success rates of the main and control attacks to isolate the risk that is specific to the original data.
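The three phases above can be sketched in a few lines. This is a hedged illustration, not the library's API: the function names are hypothetical, the counts are made up, and the normalization (the excess of the main attack's success rate over the control attack's, divided by the maximum possible excess) is an assumption about how the difference between the two rates is turned into a score.

```python
def success_rate(n_success: int, n_attacks: int) -> float:
    """Fraction of attack guesses that were correct."""
    if n_attacks <= 0:
        raise ValueError("n_attacks must be positive")
    return n_success / n_attacks


def privacy_risk(main_rate: float, control_rate: float) -> float:
    """Excess success of the main attack over the control attack.

    Assumed normalization: divide by (1 - control_rate) so that the
    score lies in [0, 1]; negative differences are clipped to 0.
    """
    if control_rate >= 1.0:
        return 0.0
    return max(0.0, (main_rate - control_rate) / (1.0 - control_rate))


# Made-up counts for illustration only.
main = success_rate(30, 100)     # guesses informed by the synthetic data
naive = success_rate(5, 100)     # random guessing
control = success_rate(10, 100)  # same attack against control records

risk = privacy_risk(main, control)

# An attack that does no better than random guessing is too weak to
# support a meaningful risk estimate.
attack_is_informative = main > naive
```

Subtracting the control rate is the step that separates what an attacker learns about the population in general from what the synthetic data leaks about the specific records it was generated from.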

Experimental Results

The authors run experiments on three datasets (Adults, Texas Hospital Discharge, and the 1940 US Census data) and generate synthetic data with the CTGAN and DPCTGAN models. The results indicate that the measured privacy risks scale linearly with the amount of deliberately inserted privacy leakage and are lowest for data synthesized with differential privacy. Of the three risks, synthetic data is least vulnerable to linkability, which indicates that one-to-one correspondences between real and synthetic records are not preserved.

Notably, Anonymeter outperforms existing frameworks such as SDR both in detecting privacy leaks and in computation speed. It measures privacy leakage without requiring extensive computational resources or the repeated model training that previous methods depend on.

Implications and Future Directions

By providing an open-source library for privacy risk assessment, the framework enables practitioners to comprehensively address privacy concerns while utilizing synthetic data for practical applications. The modularity of the framework suggests that it can be expanded with additional privacy metrics or further refined attacks. Future research could explore integrating broader facets such as fairness and bias analysis in synthetic datasets to facilitate a holistic assessment of their trustworthiness.

In conclusion, Anonymeter contributes to privacy-preserving data analysis by giving researchers and practitioners tools to weigh utility against privacy, which supports informed decisions about deploying synthetic datasets for real-world analytics and sharing. As privacy regulations evolve, frameworks like this one will help establish trust and compliance in data processing practices.

Authors (4)
  1. Matteo Giomi (14 papers)
  2. Franziska Boenisch (40 papers)
  3. Christoph Wehmeyer (4 papers)
  4. Borbála Tasnádi (1 paper)
Citations (40)