Sketches-based join size estimation under local differential privacy (2405.11419v1)
Abstract: Join size estimation on sensitive data poses a risk of privacy leakage. Local differential privacy (LDP) is a solution to preserve privacy while collecting sensitive data, but it introduces significant noise when dealing with sensitive join attributes that have large domains. Employing probabilistic structures such as sketches is a way to handle large domains, but it leads to hash-collision errors. To achieve accurate estimations, it is necessary to reduce both the noise error and hash-collision error. To tackle the noise error caused by protecting sensitive join values with large domains, we introduce a novel algorithm called LDPJoinSketch for sketch-based join size estimation under LDP. Additionally, to address the inherent hash-collision errors in sketches under LDP, we propose an enhanced method called LDPJoinSketch+. It utilizes a frequency-aware perturbation mechanism that effectively separates high-frequency and low-frequency items without compromising privacy. The proposed methods satisfy LDP, and the estimation error is bounded. Experimental results show that our method outperforms existing methods, effectively enhancing the accuracy of join size estimation under LDP.
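To make the sketch side of the abstract concrete, the following is a minimal, non-private illustration of sketch-based join size estimation in the fast-AGMS (Count Sketch) style. It is not the authors' LDPJoinSketch or LDPJoinSketch+: it omits the LDP perturbation and the frequency-aware mechanism entirely. The function names, the `rows`/`cols` defaults, and the use of `hashlib.blake2b` as a stand-in hash family are all assumptions made for illustration.

```python
import hashlib
from statistics import median


def _hash(value, row, salt, modulus):
    """Deterministically hash (salt, row, value) into range(modulus).

    Assumption: blake2b stands in for the pairwise / 4-wise independent
    hash families that sketch analyses usually require.
    """
    key = f"{salt}:{row}:{value}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % modulus


def build_sketch(values, rows=5, cols=64):
    """Build a rows x cols signed-counter sketch of a join attribute column."""
    sketch = [[0] * cols for _ in range(rows)]
    for v in values:
        for j in range(rows):
            bucket = _hash(v, j, "bucket", cols)             # which counter to update
            sign = 1 if _hash(v, j, "sign", 2) == 1 else -1  # random +1 / -1 sign
            sketch[j][bucket] += sign
    return sketch


def estimate_join_size(sketch_a, sketch_b):
    """Estimate |A join B| as the median over rows of the row-wise inner product.

    Both sketches must have been built with the same rows/cols settings.
    """
    per_row = [
        sum(a * b for a, b in zip(row_a, row_b))
        for row_a, row_b in zip(sketch_a, sketch_b)
    ]
    return median(per_row)


if __name__ == "__main__":
    table_a = [1, 1, 2, 3, 3, 3]
    table_b = [1, 3, 3, 4]
    # Exact join size is 2*1 + 3*2 = 8; the estimate can deviate from this
    # whenever distinct values collide in the same bucket.
    print(estimate_join_size(build_sketch(table_a), build_sketch(table_b)))
```

When two distinct join values land in the same bucket, their signed counts mix, which is the hash-collision error the abstract refers to. An LDP variant would additionally perturb each user's contribution before aggregation, adding the noise error on top.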