Privacy-Enhanced Database Synthesis for Benchmark Publishing (Technical Report) (2405.01312v2)

Published 2 May 2024 in cs.DB and cs.CR

Abstract: Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads. As a result, there is increasing momentum toward creating databases that incorporate real-world user data to more accurately mirror business environments. However, privacy concerns deter users from directly sharing their data, underscoring the importance of creating synthesized databases for benchmarking that also prioritize privacy protection. Differential privacy (DP)-based data synthesis has become a key method for safeguarding privacy when sharing data, but the focus has largely been on minimizing errors in aggregate queries or downstream ML tasks, with less attention given to benchmarking factors like query runtime performance. This paper delves into differentially private database synthesis specifically for benchmark publishing scenarios, aiming to produce a synthetic database whose benchmarking factors closely resemble those of the original data. Introducing PrivBench, an innovative synthesis framework based on sum-product networks (SPNs), we support the synthesis of high-quality benchmark databases that maintain fidelity in both data distribution and query runtime performance while preserving privacy. We validate that PrivBench can ensure database-level DP even when generating multi-relation databases with complex reference relationships. Our extensive experiments show that PrivBench efficiently synthesizes data that maintains privacy and excels in both data distribution similarity and query runtime similarity.

