
A Comparison of SynDiffix Multi-table versus Single-table Synthetic Data (2403.08463v1)

Published 13 Mar 2024 in cs.CR

Abstract: SynDiffix is a new open-source tool for structured data synthesis. It has anonymization features that allow it to generate multiple synthetic tables while maintaining strong anonymity. Compared to the more common single-table approach, the multi-table approach leads to more accurate data, since only the features of interest for a given analysis need to be synthesized. This paper compares SynDiffix with 15 other commercial and academic synthetic data techniques using the SDNIST analysis framework, which we modified to accommodate multi-table synthetic data. The results show that SynDiffix is many times more accurate than other approaches for low-dimension tables, but somewhat worse than the best single-table techniques for high-dimension tables.
