
Unmasking Vulnerabilities: Cardinality Sketches under Adaptive Inputs (2405.17780v1)

Published 28 May 2024 in cs.DS

Abstract: Cardinality sketches are popular data structures that enhance the efficiency of working with large data sets. The sketches are randomized representations of sets that are only of logarithmic size but can support set merges and approximate cardinality (i.e., distinct count) queries. When queries are not adaptive, that is, they do not depend on preceding query responses, the design provides strong guarantees of correctly answering a number of queries exponential in the sketch size $k$. In this work, we investigate the performance of cardinality sketches in adaptive settings and unveil inherent vulnerabilities. We design an attack against the "standard" estimators that constructs an adversarial input by post-processing responses to a set of simple non-adaptive queries of size linear in the sketch size $k$. Empirically, our attack used only $4k$ queries with the widely used HyperLogLog (HLL++) sketch [28, 39]. The simple attack technique suggests it can be effective with post-processed natural workloads. Finally and importantly, we demonstrate that the vulnerability is inherent, as any estimator applied to known sketch structures can be attacked using a number of queries that is quadratic in $k$, matching a generic upper bound.
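To make the terms in the abstract concrete, the following is a minimal sketch of one classic cardinality sketch, a k-mins sketch with a standard inverse-sum estimator. It is an illustration only, not the HLL++ implementation the paper attacks and not the authors' code. The class name `KMinsSketch`, the helper `_exp_hash`, the `seed` parameter, and the use of SHA-256 as the hash are all assumptions made for this example.

```python
import hashlib
import math


def _exp_hash(item: str, i: int, seed: str) -> float:
    """Hypothetical helper: map (seed, i, item) to an Exp(1)-distributed value."""
    digest = hashlib.sha256(f"{seed}:{i}:{item}".encode()).digest()
    # Map the first 8 bytes to a uniform value strictly inside (0, 1).
    u = (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)
    return -math.log(u)


class KMinsSketch:
    """Illustrative k-mins sketch: keeps the minimum hash per hash function."""

    def __init__(self, k: int, seed: str = "demo"):
        self.k = k
        self.seed = seed
        self.mins = [math.inf] * k

    def add(self, item: str) -> None:
        for i in range(self.k):
            h = _exp_hash(item, i, self.seed)
            if h < self.mins[i]:
                self.mins[i] = h

    def merge(self, other: "KMinsSketch") -> "KMinsSketch":
        # Sketches are mergeable only if they share the same randomness.
        if (self.k, self.seed) != (other.k, other.seed):
            raise ValueError("sketches must share k and seed")
        out = KMinsSketch(self.k, self.seed)
        out.mins = [min(a, b) for a, b in zip(self.mins, other.mins)]
        return out

    def estimate(self) -> float:
        total = sum(self.mins)
        if math.isinf(total):
            return 0.0  # empty sketch
        # Each minimum is Exp(n) for n distinct items, so (k-1)/sum estimates n.
        return (self.k - 1) / total


def sketch_of(items, k=64):
    s = KMinsSketch(k)
    for x in items:
        s.add(str(x))
    return s


a = sketch_of(range(0, 600))
b = sketch_of(range(300, 900))
merged = a.merge(b)  # represents the union, which has 900 distinct items
print(round(merged.estimate()))
```

The merged sketch is identical to the sketch of the union, which is what makes set merges possible. The accuracy guarantee relies on the items being chosen independently of the hash randomness; an adaptive adversary who learns from earlier estimates can select items that break this assumption, which is the setting the abstract studies.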

References (63)
  1. Analyzing graph structure via linear measurements. In Proceedings of the 2012 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 459–467, 2012. doi: 10.1137/1.9781611973099.40. URL https://epubs.siam.org/doi/abs/10.1137/1.9781611973099.40.
  2. Apache Software Foundation. DataSketches, Accessed: 2024. URL https://datasketches.apache.org. Apache Software Foundation Documentation.
  3. Synthesizing robust adversarial examples. In International conference on machine learning, pages 284–293. PMLR, 2018.
  4. A framework for adversarial streaming via differential privacy and difference estimators. CoRR, abs/2107.14527, 2021.
  5. Counting distinct elements in a data stream. In RANDOM. ACM, 2002.
  6. Algorithmic stability for adaptive data analysis. SIAM J. Comput., 50(3), 2021. doi: 10.1137/16M1103646. URL https://doi.org/10.1137/16M1103646.
  7. Dynamic algorithms against an adaptive adversary: generic constructions and lower bounds. page 1671–1684, 2022. doi: 10.1145/3519935.3520064. URL https://doi.org/10.1145/3519935.3520064.
  8. Adversarially robust streaming via dense-sparse trade-offs. CoRR, abs/2109.03785, 2021a.
  9. A framework for adversarially robust streaming algorithms. SIGMOD Rec., 50(1):6–13, 2021b.
  10. Collusion-secure fingerprinting for digital data. IEEE Trans. Inf. Theory, 44(5):1897–1905, 1998. doi: 10.1109/18.705568. URL https://doi.org/10.1109/18.705568.
  11. Distinct elements in streams: An algorithm for the (text) book. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi: 10.4230/LIPICS.ESA.2022.34. URL https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2022.34.
  12. On adaptive distance estimation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  13. H. Chernoff. A measure of the asymptotic efficiency for test of a hypothesis based on the sum of observations. Annals of Math. Statistics, 23:493–509, 1952.
  14. E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55:441–453, 1997.
  15. E. Cohen. All-distances sketches, revisited: HIP estimators for massive graphs analysis. TKDE, 2015. URL http://arxiv.org/abs/1306.3284.
  16. Edith Cohen. Min-Hash Sketches, pages 1–7. Springer US, Boston, MA, 2008. ISBN 978-3-642-27848-8. doi: 10.1007/978-3-642-27848-8_573-1. URL https://doi.org/10.1007/978-3-642-27848-8_573-1.
  17. Edith Cohen. Stream sampling framework and application for frequency cap statistics. ACM Trans. Algorithms, 14(4):52:1–52:40, 2018. ISSN 1549-6325. doi: 10.1145/3234338.
  18. Edith Cohen. Sampling big ideas in query optimization. In Floris Geerts, Hung Q. Ngo, and Stavros Sintos, editors, Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2023, Seattle, WA, USA, June 18-23, 2023, pages 361–371. ACM, 2023. doi: 10.1145/3584372.3589935. URL https://doi.org/10.1145/3584372.3589935.
  19. Sampling sketches for concave sublinear functions of frequencies. In NeurIPS, 2019.
  20. On the robustness of countsketch to adaptive inputs. In ICML, volume 162 of Proceedings of Machine Learning Research, pages 4112–4140. PMLR, 2022a.
  21. Tricking the hashing trick: A tight lower bound on the robustness of CountSketch to adaptive inputs. arXiv:2207.00956, 2022b. doi: 10.48550/ARXIV.2207.00956. URL https://arxiv.org/abs/2207.00956.
  22. Cardinality estimators do not preserve privacy. In Privacy Enhancing Technologies Symposium, volume 19, 2019. doi: https://doi.org/10.2478/popets-2019-0018. URL http://arxiv.org/abs/1808.05879.
  23. M. Durand and P. Flajolet. Loglog counting of large cardinalities (extended abstract). In ESA, 2003.
  24. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
  25. Preserving statistical validity in adaptive data analysis. In STOC, pages 117–126. ACM, 2015a.
  26. Preserving statistical validity in adaptive data analysis. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC ’15, page 117–126, New York, NY, USA, 2015b. Association for Computing Machinery. ISBN 9781450335362. doi: 10.1145/2746539.2746580. URL https://doi.org/10.1145/2746539.2746580.
  27. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31:182–209, 1985.
  28. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AofA). DMTCS, 2007a.
  29. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. Discrete mathematics & theoretical computer science, (Proceedings), 2007b.
  30. David A. Freedman. A note on screening regression equations. The American Statistician, 37(2):152–155, 1983. doi: 10.1080/00031305.1983.10482729. URL https://www.tandfonline.com/doi/abs/10.1080/00031305.1983.10482729.
  31. Sumit Ganguly. Counting distinct items over update streams. Theoretical Computer Science, 378(3):211–222, 2007. ISSN 0304-3975. doi: https://doi.org/10.1016/j.tcs.2007.02.031. URL https://www.sciencedirect.com/science/article/pii/S0304397507001223. Algorithms and Computation.
  32. Minimum cut in O(m log^2 n) time. In ICALP, volume 168 of LIPIcs, pages 57:1–57:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  33. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  34. Google Cloud. BigQuery Documentation: Approximate Aggregate Functions, Accessed: 2024. URL https://cloud.google.com/bigquery/docs/reference/standard-sql/approximate_aggregate_functions. Google Cloud Documentation.
  35. Decremental SSSP in weighted digraphs: Faster and against an adaptive adversary. In Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’20, page 2542–2561, USA, 2020. Society for Industrial and Applied Mathematics.
  36. M. Hardt and J. Ullman. Preventing false discovery in interactive data analysis is hard. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pages 454–463. IEEE Computer Society, 2014. doi: 10.1109/FOCS.2014.55. URL https://doi.ieeecomputersociety.org/10.1109/FOCS.2014.55.
  37. How robust are linear sketches to adaptive inputs? In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’13, page 121–130, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450320290. doi: 10.1145/2488608.2488624. URL https://doi.org/10.1145/2488608.2488624.
  38. Adversarially robust streaming algorithms via differential privacy. In Annual Conference on Advances in Neural Information Processing Systems (NeurIPS), 2020.
  39. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In EDBT, 2013.
  40. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symposium on Theory of Computing, pages 604–613. ACM, 1998.
  41. John P. A. Ioannidis. Why most published research findings are false. PLoS Med, (2):8, 2005.
  42. Svante Janson. Tail bounds for sums of geometric and exponential variables, 2017a. URL https://arxiv.org/abs/1709.08157.
  43. Svante Janson. Tail bounds for sums of geometric and exponential variables, 2017b. URL https://arxiv.org/abs/1709.08157.
  44. Towards optimal moment estimation in streaming and distributed models. ACM Trans. Algorithms, 19(3), jun 2023. ISSN 1549-6325. doi: 10.1145/3596494. URL https://doi.org/10.1145/3596494.
  45. An optimal algorithm for the distinct elements problem. In PODS, 2010.
  46. Counting distinct elements under person-level differential privacy. CoRR, abs/2308.12947, 2023. doi: 10.48550/ARXIV.2308.12947. URL https://doi.org/10.48550/arXiv.2308.12947.
  47. D. E. Knuth. The Art of Computer Programming, Vol 2, Seminumerical Algorithms. Addison-Wesley, 2nd edition, 1998.
  48. Privacy-preserving secure cardinality and frequency estimation. Technical report, Google, LLC, 2020.
  49. Model selection bias and Freedman’s paradox. Annals of the Institute of Statistical Mathematics, 62(1):117, 2009.
  50. Sketching in adversarial environments. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, STOC ’08, page 651–660, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605580470. doi: 10.1145/1374376.1374471. URL https://doi.org/10.1145/1374376.1374471.
  51. On deterministic sketching and streaming for sparse recovery and norm estimation. Lin. Alg. Appl., 441:152–167, January 2014. Preliminary version in RANDOM 2012.
  52. Efficient differentially private F0 linear sketching. In Ke Yi and Zhewei Wei, editors, 24th International Conference on Database Theory (ICDT 2021), volume 186 of Leibniz International Proceedings in Informatics (LIPIcs), pages 18:1–18:19, Dagstuhl, Germany, 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. ISBN 978-3-95977-179-5. doi: 10.4230/LIPIcs.ICDT.2021.18. URL https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2021.18.
  53. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519, 2017.
  54. Hyperloglog: Exponentially bad in adversarial settings. Cryptology ePrint Archive, Paper 2021/1139, 2021. URL https://eprint.iacr.org/2021/1139.
  55. Information theoretic limits of cardinality estimation: Fisher meets shannon. In Samir Khuller and Virginia Vassilevska Williams, editors, STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, June 21-25, 2021, pages 556–569. ACM, 2021. doi: 10.1145/3406325.3451032. URL https://doi.org/10.1145/3406325.3451032.
  56. Security of hyperloglog (HLL) cardinality estimation: Vulnerabilities and protection. IEEE Commun. Lett., 24(5):976–980, 2020. doi: 10.1109/LCOMM.2020.2972895. URL https://doi.org/10.1109/LCOMM.2020.2972895.
  57. B. Rosén. Asymptotic theory for order sampling. J. Statistical Planning and Inference, 62(2):135–158, 1997.
  58. An on-line edge-deletion problem. J. ACM, 28(1):1–4, jan 1981. ISSN 0004-5411. doi: 10.1145/322234.322235. URL https://doi.org/10.1145/322234.322235.
  59. The flajolet-martin sketch itself preserves differential privacy: Private counting with minimal space. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 19561–19572. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/e3019767b1b23f82883c9850356b71d6-Paper.pdf.
  60. Interactive fingerprinting codes and the hardness of preventing false discovery. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1588–1628, Paris, France, 03–06 Jul 2015. PMLR. URL https://proceedings.mlr.press/v40/Steinke15.html.
  61. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  62. David Wajc. Rounding Dynamic Matchings against an Adaptive Adversary. Association for Computing Machinery, New York, NY, USA, 2020. URL https://doi.org/10.1145/3357713.3384258.
  63. Tight bounds for adversarially robust streams and sliding windows via difference estimators. In Proceedings of the 62nd IEEE Annual Symposium on Foundations of Computer Science (FOCS), 2021.