Xling: A Learned Filter Framework for Accelerating High-Dimensional Approximate Similarity Join (2402.13397v1)
Abstract: Similarity join finds all pairs of points whose distance is within a given threshold. Many similarity join methods have been proposed, but they are usually inefficient in high-dimensional spaces because of the curse of dimensionality and data-unawareness. We investigate the possibility of using metric space Bloom filters (MSBF), a family of data structures that check whether a query point has neighbors in a multi-dimensional space, to speed up similarity join. However, applying MSBF to similarity join raises several challenges, including excessive information loss, data-unawareness, and hard constraints on the distance metric. In this paper, we propose Xling, a generic framework for building a learning-based metric space filter with any existing regression model, which aims to accurately predict whether a query point has enough neighbors. The framework provides a suite of optimization strategies that further improve the prediction quality of the learning model, and it has demonstrated significantly higher prediction quality than existing MSBF. We also propose XJoin, one of the first filter-based similarity join methods, built on Xling. By predicting and skipping queries without enough neighbors, XJoin effectively reduces unnecessary neighbor searching and thereby achieves a remarkable acceleration. Benefiting from the generalization capability of deep learning models, XJoin can be easily transferred to a new dataset (with a similar distribution) without re-training. Furthermore, Xling is not limited to XJoin; it acts as a flexible plugin that can be inserted into any loop-based similarity join method for a speedup.
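The filter-then-search idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`range_search`, `filtered_join`, `fit_linear_counter`), the threshold parameter `tau`, the brute-force range search, and the toy linear regressor are all assumptions chosen to keep the example self-contained. In particular, the abstract mentions deep learning models, whereas this sketch substitutes a two-feature least-squares fit.

```python
import numpy as np

def range_search(q, data, eps):
    """Indices of data points within Euclidean distance eps of q (brute force)."""
    return np.flatnonzero(np.linalg.norm(data - q, axis=1) <= eps)

def filtered_join(queries, data, eps, predict_count, tau=1):
    """Loop-based join that skips queries predicted to have fewer than tau neighbors."""
    pairs = []
    for i, q in enumerate(queries):
        if predict_count(q) < tau:
            continue  # the filter predicts too few neighbors: skip the search
        pairs.extend((i, int(j)) for j in range_search(q, data, eps))
    return pairs

def fit_linear_counter(train_queries, data, eps):
    """Toy regressor (an assumption, not the paper's model):
    least squares on the features [1, distance to the data centroid]."""
    centroid = data.mean(axis=0)

    def feats(X):
        return np.column_stack([np.ones(len(X)),
                                np.linalg.norm(X - centroid, axis=1)])

    # Ground-truth neighbor counts for the training queries.
    targets = np.array([len(range_search(q, data, eps)) for q in train_queries])
    w, *_ = np.linalg.lstsq(feats(train_queries), targets, rcond=None)
    return lambda q: float((feats(q[None, :]) @ w)[0])

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 16))
queries = rng.normal(size=(100, 16))
train_queries = rng.normal(size=(200, 16))
eps = 4.0

predict = fit_linear_counter(train_queries, data, eps)
approx = filtered_join(queries, data, eps, predict, tau=5)
# A predictor that never filters reproduces the plain loop-based join.
exact = filtered_join(queries, data, eps, lambda q: np.inf, tau=1)
recall = len(approx) / max(len(exact), 1)
```

Because every emitted pair is still verified by the range search, the filtered result is always a subset of the exact join. The filter trades recall for skipped searches, and the quality of that trade depends on how well the regressor predicts neighbor counts.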