
Pb-Hash: Partitioned b-bit Hashing (2306.15944v1)

Published 28 Jun 2023 in cs.LG, cs.DS, and cs.IR

Abstract: Many hashing algorithms, including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS), generate integers of $B$ bits. With $k$ hashes for each data vector, the storage would be $B\times k$ bits; and when used for large-scale learning, the model size would be $2^B\times k$, which can be expensive. A standard strategy is to use only the lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of hashes. In this study, we propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b\times m = B$. Correspondingly, the model size becomes $m\times 2^b \times k$, which can be substantially smaller than the original $2^B\times k$. Our theoretical analysis reveals that partitioning the hash values into $m$ chunks reduces accuracy. In other words, using $m$ chunks of $B/m$ bits would not be as accurate as directly using $B$ bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for, e.g., $m=2\sim 4$. In some regions, Pb-Hash still works well even for $m$ much larger than 4. We expect Pb-Hash to be a good addition to the family of hashing methods/applications and to benefit industrial practitioners. We verify the effectiveness of Pb-Hash in machine learning tasks, for linear SVM models as well as deep learning models. Since the hashed data are essentially categorical (ID) features, we follow the standard practice of using embedding tables for each hash. With Pb-Hash, we need to design an effective strategy to combine the $m$ embeddings. Our study provides an empirical evaluation of four pooling schemes: concatenation, max pooling, mean pooling, and product pooling. There is no definitive answer as to which pooling scheme is always better, and we leave that for future study.
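The partitioning idea and the model-size comparison in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`partition_hash`, `model_size_full`, `model_size_pb`, `pool`) are hypothetical, the sketch assumes $m$ divides $B$ exactly, it takes chunks from the lowest bits upward, and it assumes the max, mean, and product pooling schemes act element-wise on $m$ embedding vectors of equal length.

```python
import math


def partition_hash(h, B, m):
    """Split a B-bit hash value h into m chunks of b = B // m bits each.

    Assumption for this sketch: m divides B, and chunk 0 holds the lowest b bits.
    """
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B")
    b = B // m
    mask = (1 << b) - 1  # keeps the lowest b bits
    return [(h >> (i * b)) & mask for i in range(m)]


def model_size_full(B, k):
    """Model size when each of the k hashes is used with all B bits: 2^B * k."""
    return (2 ** B) * k


def model_size_pb(B, m, k):
    """Model size with m chunks of b = B // m bits per hash: m * 2^b * k."""
    b = B // m
    return m * (2 ** b) * k


def pool(embeddings, scheme):
    """Combine m embedding vectors (lists of floats of equal length).

    The element-wise definitions below are an assumption of this sketch.
    """
    if scheme == "concat":
        return [x for vec in embeddings for x in vec]
    columns = list(zip(*embeddings))  # one tuple per embedding dimension
    if scheme == "max":
        return [max(col) for col in columns]
    if scheme == "mean":
        return [sum(col) / len(col) for col in columns]
    if scheme == "product":
        return [math.prod(col) for col in columns]
    raise ValueError(f"unknown pooling scheme: {scheme}")


if __name__ == "__main__":
    # A 12-bit hash split into m = 3 chunks of b = 4 bits.
    print(partition_hash(0b101101100011, B=12, m=3))
    # Model sizes for B = 16, k = 128, with and without m = 2 chunks.
    print(model_size_full(16, 128), model_size_pb(16, 2, 128))
```

With $B = 16$ and $k = 128$, the sketch gives $2^{16}\times 128 = 8{,}388{,}608$ entries for the full hash and $2\times 2^{8}\times 128 = 65{,}536$ entries for $m = 2$, which is the kind of reduction the abstract describes.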

