
Fast Exact Retrieval for Nearest-neighbor Lookup (FERN) (2405.04435v1)

Published 7 May 2024 in cs.CL and cs.DS

Abstract: Exact nearest neighbor search is a computationally intensive process, and even its simpler sibling -- vector retrieval -- can be computationally complex. This is exacerbated when retrieving vectors with high dimension $d$ relative to the number of vectors $N$ in the database. Exact nearest neighbor retrieval has generally been acknowledged to be an $O(Nd)$ problem with no sub-linear solutions. Attention has instead shifted towards Approximate Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or even logarithmic time complexities. However, if our intuition from binary search problems (e.g. $d=1$ vector retrieval) carries, there ought to be a way to retrieve an organized representation of vectors without brute-forcing our way to a solution. For low dimensions (e.g. $d=2$ or $d=3$), kd-trees provide an $O(d\log N)$ algorithm for retrieval. Unfortunately, in practice the algorithm deteriorates rapidly to an $O(dN)$ solution at high dimensions (e.g. $d=128$). We propose a novel algorithm for logarithmic Fast Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by kd-trees. The algorithm achieves $O(d\log N)$ look-up with 100% recall on 10 million $d=128$ uniformly randomly generated vectors. (Code available at https://github.com/RichardZhu123/ferns)


Summary

  • The paper introduces FERN, which uses a hyperplane-based binary tree to enable fast, exact nearest-neighbor lookup in high-dimensional spaces.
  • It maintains near-logarithmic lookup times by dynamically balancing nodes during insertion and utilizing efficient backtracking.
  • Experimental results on 10 million 128-dimensional vectors on an Intel Xeon CPU demonstrate near-logarithmic lookup times with 100% recall.

Exploring FERN: A Novel Algorithm for Fast Exact Retrieval in High-Dimensional Vector Spaces

Introduction to Vector Retrieval Challenges

In the world of data science, efficiently retrieving high-dimensional vectors is a key challenge, particularly in applications involving search engines, transformers, and large language models. Traditional structures like hashmaps or binary trees begin to falter as dimensionality and dataset size grow, because their lookup cost scales poorly with the dimension (d) and the database size (N). A brute-force exact search, sketched below, costs O(Nd) per query.
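
To make that baseline concrete, here is a minimal brute-force scan in Python (using NumPy; an illustrative sketch, not the paper's code) showing where the O(Nd) cost comes from: every query must visit all N vectors across all d coordinates.

```python
import numpy as np

def brute_force_nn(query: np.ndarray, database: np.ndarray) -> int:
    """Exact nearest neighbor by exhaustive scan: O(Nd) work per query."""
    diffs = database - query                      # (N, d) broadcasted differences
    dists = np.einsum("nd,nd->n", diffs, diffs)   # squared L2 norm of each row
    return int(np.argmin(dists))                  # index of the closest vector

# Toy usage with hypothetical sizes: N = 10_000 vectors, d = 128.
rng = np.random.default_rng(0)
db = rng.random((10_000, 128))
q = rng.random(128)
print(brute_force_nn(q, db))
```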

Previous Approaches

Several strategies have been developed to handle vector retrieval more effectively:

  • Locality Sensitive Hashing and k-means: These bucketing approaches group nearby vectors together, simplifying the search space. However, they are approximate and struggle with edge cases where queries lie close to cluster boundaries.
  • Graph-Based Methods: Techniques like Navigable Small World graphs represent the database as a graph, shortening the paths needed to traverse it. Despite these improvements, such methods are approximate and still face challenges at high dimensionality.
  • Divide and Conquer: While kd-trees are efficient in low dimensions, their performance degrades as the dimension grows, approaching a linear scan in the worst case (see the kd-tree sketch after this list).
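
The kd-tree failure mode is easy to observe with an off-the-shelf exact implementation. The snippet below (a sketch using SciPy's cKDTree, not the paper's code) builds the same index at low and high dimension; at d=3 queries are fast, while at d=128 the exact query must backtrack through most of the tree and approaches a full scan.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Low dimension (d = 3): exact kd-tree queries stay close to O(d log N).
tree_low = cKDTree(rng.random((100_000, 3)))
dist, idx = tree_low.query(rng.random(3), k=1)    # exact nearest neighbor

# High dimension (d = 128): the same exact query has to backtrack through
# most of the tree, so its cost drifts toward a full O(dN) scan in practice.
tree_high = cKDTree(rng.random((100_000, 128)))
dist, idx = tree_high.query(rng.random(128), k=1)
```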

Introduction of FERN

Fast Exact Retrieval for Nearest-neighbor lookup (FERN) is a new algorithmic approach tailored to overcome the limitations of existing methods. FERN retains the logarithmic time complexity of kd-trees but adapts more robustly to high dimensions. Here’s how FERN works:

  1. Binary Tree Structure: At its core, FERN uses a binary tree in which each node's two children define a dividing hyperplane. The structure maintains the invariant that all vectors in a subtree lie on one side of their parent node's hyperplane.
  2. Logarithmic Lookup Time: Thanks to this tree structure, lookup remains logarithmic in N, akin to traversing the height of a balanced binary tree, with one O(d) hyperplane test per level.
  3. Dynamic Node Linking: Nodes hold not only vector data but also pointers to their parent and children, allowing dynamic updates and efficient backtracking (a minimal sketch of this structure follows the list).
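
The summary does not pin down the exact construction, but the description suggests a structure along the following lines. This is a hedged Python sketch: the Node fields mirror the description above, while the routing rule, treating the hyperplane as the perpendicular bisector of the two children's vectors so that the side test reduces to "which child is closer?", is an assumption consistent with it.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    vector: np.ndarray
    parent: Optional["Node"] = None
    left: Optional["Node"] = None    # subtree on one side of the hyperplane
    right: Optional["Node"] = None   # subtree on the other side

def side(node: Node, query: np.ndarray) -> str:
    """Route a query past a node that has two children.

    Assumption: the node's hyperplane is the perpendicular bisector of its
    children's vectors, so "which side?" is just "which child is closer?":
    a single O(d) comparison per tree level.
    """
    d_left = np.linalg.norm(query - node.left.vector)
    d_right = np.linalg.norm(query - node.right.vector)
    return "left" if d_left <= d_right else "right"
```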

Practical Implementation

The implementation of FERN involves two main operations: insertion and lookup.

  • Insertion: Vectors are added by descending the tree according to their proximity to existing nodes, keeping the tree balanced and each vector correctly positioned relative to the hyperplanes above it.
  • Lookup: Retrieval navigates the tree from the root, progressively narrowing the search space until the target vector or its nearest neighbor is found. Both operations are sketched below.
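
Continuing the hedged sketch above (Node and side() as defined earlier), both operations reduce to one O(d) hyperplane test per level. The paper's balancing and backtracking details are not specified in this summary, so they are noted only in comments.

```python
import numpy as np
# Node and side() are defined in the structure sketch above.

def insert(root: Node, vector: np.ndarray) -> None:
    """Descend toward the closer child at each level; attach as a new leaf.

    Omitted here: the paper's dynamic balancing step, which keeps the
    tree height (and hence lookup cost) near log N.
    """
    node = root
    while node.left is not None and node.right is not None:
        node = node.left if side(node, vector) == "left" else node.right
    leaf = Node(vector=vector, parent=node)
    if node.left is None:
        node.left = leaf
    else:
        node.right = leaf

def lookup(root: Node, query: np.ndarray) -> Node:
    """Greedy descent: O(log N) levels, each an O(d) hyperplane test.

    Omitted here: the exact variant's backtracking via parent pointers,
    used when the query is too close to a hyperplane to rule out the
    far side; this sketch shows the greedy path only.
    """
    node, best, best_dist = root, root, np.linalg.norm(query - root.vector)
    while node.left is not None and node.right is not None:
        node = node.left if side(node, query) == "left" else node.right
        dist = np.linalg.norm(query - node.vector)
        if dist < best_dist:
            best, best_dist = node, dist
    return best
```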

Experimental Validation

Using an Intel Xeon Platinum 8380 CPU, FERN was benchmarked on 10 million uniformly random vectors with d=128. Lookup times remained near-logarithmic in N with 100% recall, without requiring further optimization, a considerable result given that kd-trees degrade toward linear scans at this dimensionality.

Discussion and Future Work

While FERN performs well on uniformly distributed vectors, its efficiency under arbitrary insertion orders or adversarial input data remains to be explored. Another open question is whether the method can transition from exact lookup to approximate nearest-neighbor search without significant performance loss.

Conclusion

FERN is a notable step toward exact, sub-linear vector retrieval in high-dimensional spaces. Its spatially aware data structure delivers both efficiency and exactness on the benchmark tested, making it a promising tool for future research and for data-heavy applications. As with any new method, broader testing, optimization, and adaptation will be key to unlocking its full potential.