LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries (2403.07331v3)
Abstract: With the proliferation of spatio-textual data, Top-k KNN spatial keyword queries (TkQs), which return a list of objects based on a ranking function that considers both spatial and textual relevance, have found many real-life applications. Many indexes have been developed to handle TkQs efficiently, but the effectiveness of TkQs remains limited. To improve effectiveness, several deep learning models have recently been proposed, but they suffer from severe efficiency issues, and no efficient indexes have been specifically designed to accelerate the top-k search process for these models. To tackle these issues, we consider embedding based spatial keyword queries, which capture the semantic meaning of query keywords and object descriptions in two separate embeddings to evaluate textual relevance. Although various models can be used to generate these embeddings, no indexes have been specifically designed for such queries. To fill this gap, we propose LIST, a novel machine learning based Approximate Nearest Neighbor Search index that Learns to Index the Spatio-Textual data. LIST utilizes a new learning-to-cluster technique to group relevant queries and objects together while separating irrelevant queries and objects. There are two key challenges in building an effective and efficient index: the absence of high-quality labels and unbalanced clustering results. We develop a novel pseudo-label generation technique to address both challenges. Additionally, we introduce a learning based spatial relevance model that can be integrated with various text relevance models to form a lightweight yet effective relevance model for reranking objects retrieved by LIST.
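To make the query type concrete, the following is a minimal brute-force sketch of an embedding based spatial keyword query, not the LIST index itself. It assumes a fixed weighted sum of a linear distance-based spatial score and cosine similarity between embeddings; the abstract instead describes a learned spatial relevance model and a learned clustering index. The names `topk_query`, `alpha`, and `max_dist`, and the toy objects, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topk_query(query_loc, query_emb, objects, k=2, alpha=0.5, max_dist=10.0):
    """Rank objects by alpha * spatial relevance + (1 - alpha) * textual relevance.

    `objects` maps an object id to a (location, embedding) pair.
    The weighting scheme is an illustrative assumption, not the paper's model.
    """
    scored = []
    for obj_id, (loc, emb) in objects.items():
        dist = math.dist(query_loc, loc)
        # Linear spatial relevance: 1 at the query location, 0 at max_dist or beyond.
        spatial_rel = 1.0 - min(dist / max_dist, 1.0)
        # Textual relevance from the two separate embeddings.
        text_rel = cosine(query_emb, emb)
        scored.append((alpha * spatial_rel + (1 - alpha) * text_rel, obj_id))
    scored.sort(key=lambda pair: -pair[0])
    return [obj_id for _, obj_id in scored[:k]]


# Toy data: two-dimensional locations and two-dimensional "embeddings".
objects = {
    "cafe_near": ((0.0, 0.0), (1.0, 0.0)),
    "gym_near": ((1.0, 0.0), (0.0, 1.0)),
    "cafe_far": ((10.0, 0.0), (1.0, 0.0)),
}
result = topk_query((0.0, 0.0), (1.0, 0.0), objects, k=2)
print(result)
```

Scanning every object like this costs time linear in the dataset size per query, which is the inefficiency that an approximate nearest neighbor index such as the one described above aims to avoid by searching only the clusters relevant to the query.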
Authors:
- Ziqi Yin
- Shanshan Feng
- Shang Liu
- Gao Cong
- Yew Soon Ong
- Bin Cui