
Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown (2401.18024v2)

Published 31 Jan 2024 in cs.CR

Abstract: Differential privacy (DP) is increasingly used to protect the release of hierarchical, tabular population data, such as census data. A common approach for implementing DP in this setting is to release noisy responses to a predefined set of queries. For example, this is the approach of the TopDown algorithm used by the US Census Bureau. Such methods have an important shortcoming: they cannot answer queries for which they were not optimized. An appealing alternative is to generate DP synthetic data, which is drawn from some generating distribution. Like the TopDown method, synthetic data can also be optimized to answer specific queries, while also allowing the data user to later submit arbitrary queries over the synthetic population data. To our knowledge, there has not been a head-to-head empirical comparison of these approaches. This study conducts such a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity, in-distribution vs. out-of-distribution queries, and privacy guarantees. Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated; for instance, in our experiments, TopDown achieved at least $20\times$ lower error on counting queries than the leading synthetic data method at the same privacy budget. Our findings suggest guidelines for practitioners and the synthetic data research community.
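The query-release approach mentioned in the abstract, "release noisy responses to a predefined set of queries", can be illustrated with a minimal sketch. This is not the TopDown algorithm, nor any method evaluated in the paper. It is the textbook Laplace mechanism applied to a few counting queries, with the privacy budget split evenly across queries. The records, the query names, and the helper functions `laplace_noise` and `noisy_counts` are all hypothetical and exist only for illustration.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(records, queries, epsilon, rng):
    """Answer each counting query with Laplace noise.

    Assumes add/remove-one-record neighbors, so each counting query has
    sensitivity 1, and splits epsilon evenly (sequential composition).
    """
    per_query_eps = epsilon / len(queries)
    scale = 1.0 / per_query_eps
    answers = {}
    for name, predicate in queries.items():
        true_count = sum(1 for r in records if predicate(r))
        answers[name] = true_count + laplace_noise(scale, rng)
    return answers


# Hypothetical toy population (not data from the paper).
records = [
    {"age": 34, "region": "A"},
    {"age": 71, "region": "A"},
    {"age": 15, "region": "B"},
    {"age": 48, "region": "B"},
]

# The predefined query workload: only these counts are ever released.
queries = {
    "total": lambda r: True,
    "region_A": lambda r: r["region"] == "A",
    "adults": lambda r: r["age"] >= 18,
}

rng = random.Random(0)
released = noisy_counts(records, queries, epsilon=1.0, rng=rng)
for name, value in released.items():
    print(name, round(value, 2))
```

A query outside the predefined workload (for example, adults in region B) has no released answer in this setup. That is the limitation the abstract attributes to such methods, and the one synthetic data is meant to address.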

Authors (4)
  1. Aadyaa Maddi (3 papers)
  2. Swadhin Routray (1 paper)
  3. Alexander Goldberg (6 papers)
  4. Giulia Fanti (55 papers)