
A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search (2404.11731v1)

Published 17 Apr 2024 in cs.IR

Abstract: A critical piece of the modern information retrieval puzzle is approximate nearest neighbor search. Its objective is to return a set of $k$ data points that are closest to a query point, with its accuracy measured by the proportion of exact nearest neighbors captured in the returned set. One popular approach to this question is clustering: The indexing algorithm partitions data points into non-overlapping subsets and represents each partition by a point such as its centroid. The query processing algorithm first identifies the nearest clusters -- a process known as routing -- then performs a nearest neighbor search over those clusters only. In this work, we make a simple observation: The routing function solves a ranking problem. Its quality can therefore be assessed with a ranking metric, making the function amenable to learning-to-rank. Interestingly, ground-truth is often freely available: Given a query distribution in a top-$k$ configuration, the ground-truth is the set of clusters that contain the exact top-$k$ vectors. We develop this insight and apply it to Maximum Inner Product Search (MIPS). As we demonstrate empirically on various datasets, learning a simple linear function consistently improves the accuracy of clustering-based MIPS.
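The pipeline described in the abstract can be sketched as follows. This is a toy illustration on synthetic data, using a plain k-means partitioner and the standard centroid routing rule, not the paper's implementation; the paper's learning-to-rank contribution would replace the fixed scoring `centroids @ q` with a learned linear function, trained so that the clusters containing a query's exact top-$k$ vectors are ranked highest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: n points in d dimensions, searched by inner product (MIPS).
n, d, n_clusters, k = 2000, 16, 20, 10
data = rng.normal(size=(n, d))
queries = rng.normal(size=(50, d))

# --- Indexing: partition the points with a few rounds of k-means ---
centroids = data[rng.choice(n, n_clusters, replace=False)]
for _ in range(10):
    # Assign each point to its nearest centroid (squared Euclidean).
    assign = np.argmin(((data[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(n_clusters):
        if (assign == c).any():
            centroids[c] = data[assign == c].mean(axis=0)

# --- Query processing: route to the ell clusters whose centroids score
# highest against the query, then search only those clusters ---
def search(q, ell):
    scores = centroids @ q                    # routing function (fixed, not learned)
    probe = np.argsort(-scores)[:ell]         # top-ell clusters to visit
    cand = np.flatnonzero(np.isin(assign, probe))
    top = cand[np.argsort(-(data[cand] @ q))[:k]]
    return set(top.tolist())

# Accuracy = proportion of the exact top-k captured, averaged over queries.
def recall(ell):
    total = 0.0
    for q in queries:
        exact = set(np.argsort(-(data @ q))[:k].tolist())
        total += len(search(q, ell) & exact) / k
    return total / len(queries)

print(f"recall@{k}, probing 3 clusters:  {recall(3):.3f}")
print(f"recall@{k}, probing all clusters: {recall(n_clusters):.3f}")
```

Probing all clusters recovers the exact top-$k$ (recall 1.0); probing fewer trades accuracy for speed, and a better routing function shifts that trade-off in the searcher's favor. The ground-truth labels for learning the router come for free here: for each training query, mark `assign[i]` positive for every `i` in the exact top-$k$ set.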

Authors (4)
  1. Thomas Vecchiato
  2. Claudio Lucchese
  3. Franco Maria Nardini
  4. Sebastian Bruch
Citations (2)
