Maintaining $k$-MinHash Signatures over Fully-Dynamic Data Streams with Recovery (2407.21614v1)

Published 31 Jul 2024 in cs.DS

Abstract: We consider the task of performing Jaccard similarity queries over a large collection of items that are dynamically updated according to a streaming input model. An item here is a subset of a large universe $U$ of elements. A well-studied approach to this important data mining problem is to design fast similarity data sketches. In this paper, we focus on global solutions for this problem, i.e., a single data structure that can answer both Similarity Estimation and All-Candidate Pairs queries while also dynamically managing an arbitrary, online sequence of element insertions and deletions received as input. We introduce and provide an in-depth analysis of a dynamic, buffered version of the well-known $k$-MinHash sketch. This buffered version better manages critical update operations, thus significantly reducing the number of times the sketch needs to be rebuilt from scratch using expensive recovery queries. We prove that the buffered $k$-MinHash uses $O(k \log |U|)$ memory words per subset and that its amortized update time per insertion/deletion is $O(k \log |U|)$ with high probability. Moreover, our data structure can return the $k$-MinHash signature of any subset in $O(k)$ time, and this signature is exactly the same signature that would be computed from scratch (and thus the quality of the signature is the same as the one guaranteed by the static $k$-MinHash). Analytical and experimental comparisons with the other state-of-the-art global solutions for this problem, given in [Bury et al., WSDM'18], show that the buffered $k$-MinHash is competitive in a wide and relevant range of the online input parameters.
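For orientation, the static $k$-MinHash sketch that the abstract builds on can be sketched as follows. This is a minimal illustration and not the paper's buffered data structure: the function names, the integer encoding of the elements of $U$, and the linear hash family `(a*x + b) mod PRIME` are all assumptions made for the example (a linear family is not truly min-wise independent). The code also shows why deletions are the critical updates: an insertion only needs a per-coordinate minimum, while deleting an element that attains the minimum for some hash function leaves that coordinate unrecoverable from the signature alone.

```python
import random

# Assumption for illustration: elements of the universe U are integers in [0, PRIME).
PRIME = (1 << 61) - 1  # a Mersenne prime used as the hash modulus


def make_hash_functions(k, seed=0):
    """Draw k hypothetical linear hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, hash_params):
    """Static k-MinHash: for each hash function, keep the minimum hash value."""
    items = list(items)
    if not items:
        raise ValueError("signature of an empty set is undefined in this sketch")
    return [min((a * x + b) % PRIME for x in items) for a, b in hash_params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of coordinates on which the two signatures agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def insert(signature, x, hash_params):
    """Insertions are easy: take the coordinate-wise minimum with h_i(x)."""
    return [min(s, (a * x + b) % PRIME) for s, (a, b) in zip(signature, hash_params)]


def coordinates_broken_by_delete(signature, x, hash_params):
    """Coordinates whose minimum is attained by x.

    After deleting x, these coordinates cannot be repaired from the signature
    alone; without extra state they require going back to the data.
    """
    return [i for i, (s, (a, b)) in enumerate(zip(signature, hash_params))
            if s == (a * x + b) % PRIME]
```

A possible usage: build `params = make_hash_functions(128)`, compute signatures for two sets with `minhash_signature`, and compare them with `estimate_jaccard`. Calling `coordinates_broken_by_delete` before a deletion shows how many of the $k$ coordinates would need to be rebuilt, which is the cost that a buffered variant aims to incur less often.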

References (38)
  1. Group formation in large social networks: membership, growth, and evolution. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, page 44–54, New York, NY, USA, 2006. Association for Computing Machinery.
  2. Modern information retrieval, volume 463. ACM press New York, 1999.
  3. Foundations of data science. Cambridge University Press, 2020.
  4. Andrei Z Broder. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21–29. IEEE, 1997.
  5. Andrei Z Broder. Identifying and filtering near-duplicate documents. In Annual symposium on combinatorial pattern matching, pages 1–10. Springer, 2000.
  6. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630–659, 2000.
  7. Similarity search for dynamic data streams. IEEE Transactions on Knowledge and Data Engineering, 32:2241–2253, 2020.
  8. Real-time community detection in full social networks on a laptop. PLOS ONE, 13(1):1–37, 01 2018.
  9. Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pages 380–388, 2002.
  10. Finding interesting associations without support pruning. In Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073), pages 489–500, 2000.
  11. Summarizing data using bottom-k sketches. In ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, 2007.
  12. Introduction to Algorithms. The MIT Press, 2nd edition, 2001.
  13. Fast similarity sketching. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 663–671. IEEE, 2017.
  14. Amazon DynamoDB: A scalable, predictably performant, and fully managed NoSQL database service. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 1037–1048, Carlsbad, CA, July 2022. USENIX Association.
  15. Exponential time improvement for min-wise based algorithms. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 57–66. SIAM, 2011.
  16. dk-min-wise independent family of hash functions. Journal of Computer and System Sciences, 84:171–184, 2017.
  17. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory Comput., 8:321–350, 2012.
  18. Online aggregation. In ACM SIGMOD Conference, 1997.
  19. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604–613, 1998.
  20. A fast sketch method for mining user similarities over fully dynamic graph streams. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1682–1685, 2019.
  21. A fast sketch method for mining user similarities over fully dynamic graph streams. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1682–1685. IEEE, 2019.
  22. A comprehensive study of elastic search. Journal of Research in Science and Engineering, 2022.
  23. Information retrieval on the web. ACM computing surveys (CSUR), 32(2):144–173, 2000.
  24. Google’s PageRank and beyond: The science of search engine rankings. Princeton university press, 2006.
  25. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6, 11 2008.
  26. Mining of massive data sets. Cambridge university press, 2020.
  27. One permutation hashing. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
  28. Udi Manber et al. Finding similar files in a large file system. In Usenix winter, volume 94, pages 1–10, 1994.
  29. Anf: a fast and scalable tool for data mining in massive graphs. Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002.
  30. A similarity-based approach for efficient large quasi-clique detection. In Proceedings of the ACM on Web Conference 2024, pages 401–409, 2024.
  31. Data mining using client/server systems. Journal of Systems and Information Technology, 4:72–82, 2000.
  32. Nicola Prezza. Algorithms for massive data – lecture notes, 2024.
  33. The power of simple tabulation hashing. J. ACM, 59(3), jun 2012.
  34. Robert Endre Tarjan. Amortized computational complexity. SIAM Journal on Algebraic Discrete Methods, 6(2):306–318, 1985.
  35. Mikkel Thorup. Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, STOC ’13, page 371–380, New York, NY, USA, 2013. Association for Computing Machinery.
  36. Multi-resolution odd sketch for mining extended jaccard similarity of dynamic streaming sets. IEEE Transactions on Network Science and Engineering, pages 1–15, 2023.
  37. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, MDS ’12, New York, NY, USA, 2012. Association for Computing Machinery.
  38. Albert L. Zobrist. A new hashing method with application for game playing. ICGA Journal, 13:69–73, 1990.
