Fast Exact Retrieval for Nearest-neighbor Lookup (FERN) (2405.04435v1)
Abstract: Exact nearest neighbor search is a computationally intensive process, and even its simpler sibling, vector retrieval, can be computationally complex. This is exacerbated when the vectors have high dimension $d$ relative to the number of vectors, $N$, in the database. Exact nearest neighbor retrieval has been generally acknowledged to be an $O(Nd)$ problem with no sub-linear solutions. Attention has instead shifted towards Approximate Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or even logarithmic time complexities. However, if our intuition from binary search problems (e.g. $d=1$ vector retrieval) carries, there ought to be a way to retrieve vectors from an organized representation without brute-forcing our way to a solution. For low dimensions (e.g. $d=2$ or $d=3$), \texttt{kd-trees} provide an $O(d\log N)$ algorithm for retrieval. Unfortunately, in practice the algorithm deteriorates rapidly to an $O(dN)$ solution at high dimensions (e.g. $d=128$). We propose a novel algorithm for logarithmic-time Fast Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by \texttt{kd-trees}. The algorithm achieves $O(d\log N)$ lookup with 100\% recall on 10 million $d=128$ uniformly randomly generated vectors.\footnote{Code available at https://github.com/RichardZhu123/ferns}
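The abstract does not describe FERN's internals, but it frames the problem around the \texttt{kd-tree} baseline. The minimal Python sketch below is not the paper's implementation; it illustrates that baseline, assuming an illustrative 10,000-vector dataset rather than the paper's 10-million-vector benchmark, with hypothetical names (`KDNode`, `build_kdtree`, `nearest`): exact lookup via axis-aligned median splits, which is fast at low $d$ but whose pruning collapses toward brute force at high $d$.

```python
# Minimal sketch (NOT the paper's FERN implementation) of the kd-tree baseline
# the abstract refers to: exact nearest-neighbor lookup that is fast at low d
# but degrades toward an O(dN) scan as d grows. Dataset sizes are illustrative.
import numpy as np


class KDNode:
    __slots__ = ("point", "axis", "left", "right")

    def __init__(self, point, axis, left, right):
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build_kdtree(points, depth=0):
    """Recursively split on the median of a cycling coordinate axis."""
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return KDNode(
        point=points[mid],
        axis=axis,
        left=build_kdtree(points[:mid], depth + 1),
        right=build_kdtree(points[mid + 1:], depth + 1),
    )


def nearest(node, query, best=None, best_dist=np.inf):
    """Exact nearest neighbor: descend greedily, then backtrack only when the
    splitting hyperplane could still hide a closer point. In high dimensions
    this test prunes almost nothing, which is the degradation the abstract
    describes."""
    if node is None:
        return best, best_dist
    dist = np.linalg.norm(query - node.point)
    if dist < best_dist:
        best, best_dist = node.point, dist
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best, best_dist = nearest(near, query, best, best_dist)
    if abs(diff) < best_dist:  # hyperplane closer than current best: must check
        best, best_dist = nearest(far, query, best, best_dist)
    return best, best_dist


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (2, 128):  # low vs. high dimension
        data = rng.random((10_000, d))
        query = rng.random(d)
        tree = build_kdtree(data)
        point, dist = nearest(tree, query)
        brute = data[np.linalg.norm(data - query, axis=1).argmin()]
        assert np.allclose(point, brute)  # exact retrieval, 100% recall
        print(f"d={d}: nearest distance {dist:.4f}")
```

At $d=2$ the hyperplane test prunes most far subtrees, giving roughly logarithmic lookups; at $d=128$ nearly every branch must be visited, which is the in-practice $O(dN)$ behavior the abstract cites and that FERN is designed to avoid.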