A Comparison of SynDiffix Multi-table versus Single-table Synthetic Data (2403.08463v1)
Published 13 Mar 2024 in cs.CR
Abstract: SynDiffix is a new open-source tool for structured data synthesis. It has anonymization features that allow it to generate multiple synthetic tables while maintaining strong anonymity. Compared to the more common single-table approach, multi-table leads to more accurate data, since only the features of interest for a given analysis need be synthesized. This paper compares SynDiffix with 15 other commercial and academic synthetic data techniques using the SDNIST analysis framework, modified by us to accommodate multi-table synthetic data. The results show that SynDiffix is many times more accurate than other approaches for low-dimension tables, but somewhat worse than the best single-table techniques for high-dimension tables.