
Fast Exact Retrieval for Nearest-neighbor Lookup (FERN) (2405.04435v1)

Published 7 May 2024 in cs.CL and cs.DS

Abstract: Exact nearest neighbor search is a computationally intensive process, and even its simpler sibling -- vector retrieval -- can be computationally complex. This is exacerbated when retrieving vectors with high dimension $d$ relative to the number of vectors $N$ in the database. Exact nearest neighbor retrieval has generally been acknowledged to be an $O(Nd)$ problem with no sub-linear solutions. Attention has instead shifted towards Approximate Nearest-Neighbor (ANN) retrieval techniques, many of which have sub-linear or even logarithmic time complexities. However, if our intuition from binary search problems (e.g. $d=1$ vector retrieval) carries, there ought to be a way to retrieve an organized representation of vectors without brute-forcing our way to a solution. For low dimensions (e.g. $d=2$ or $d=3$), kd-trees provide an $O(d\log N)$ algorithm for retrieval. Unfortunately, in practice the algorithm deteriorates rapidly to an $O(dN)$ solution at high dimensions (e.g. $d=128$). We propose a novel algorithm for logarithmic Fast Exact Retrieval for Nearest-neighbor lookup (FERN), inspired by kd-trees. The algorithm achieves $O(d\log N)$ look-up with 100% recall on 10 million $d=128$ uniformly randomly generated vectors. (Code available at https://github.com/RichardZhu123/ferns)


Summary

  • The paper introduces FERN, which uses a hyperplane-based binary tree to enable fast, exact nearest-neighbor lookup in high-dimensional spaces.
  • It maintains near-logarithmic lookup times by dynamically balancing nodes during insertion and utilizing efficient backtracking.
  • Experimental results on 10 million 128-dimensional vectors on an Intel Xeon CPU demonstrate near-logarithmic lookup times with 100% recall.

Exploring FERN: A Novel Algorithm for Fast Exact Retrieval in High-Dimensional Vector Spaces

Introduction to Vector Retrieval Challenges

In the world of data science, efficiently retrieving high-dimensional vectors is a key challenge, particularly in applications involving search engines, transformers, and large language models. Traditional structures like hashmaps or binary trees begin to falter as dimensionality and dataset size grow, because their lookup cost scales poorly with the dimension (d) and the database size (N). A brute-force exact search, sketched below, costs O(Nd) per query.
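
To make that baseline concrete, here is a minimal brute-force scan in Python (using NumPy; an illustrative sketch, not the paper's code) showing where the O(Nd) cost comes from: every query must visit all N vectors across all d coordinates.

```python
import numpy as np

def brute_force_nn(query: np.ndarray, database: np.ndarray) -> int:
    """Exact nearest neighbor by exhaustive scan: O(Nd) work per query."""
    diffs = database - query                      # (N, d) broadcasted differences
    dists = np.einsum("nd,nd->n", diffs, diffs)   # squared L2 norm of each row
    return int(np.argmin(dists))                  # index of the closest vector

# Toy usage with hypothetical sizes: N = 10_000 vectors, d = 128.
rng = np.random.default_rng(0)
db = rng.random((10_000, 128))
q = rng.random(128)
print(brute_force_nn(q, db))
```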

Previous Approaches

Several strategies have been developed to handle vector retrieval more effectively:

  • Locality Sensitive Hashing and k-means: These bucketing approaches group nearby vectors together, simplifying the search space. However, they are approximate and struggle with edge cases where queries lie close to cluster boundaries.
  • Graph-Based Methods: Techniques like Navigable Small World graphs represent the database as a graph, shortening the paths needed to traverse it. Despite these improvements, such methods are approximate and still face challenges at high dimensionality.
  • Divide and Conquer: While kd-trees are efficient in low dimensions, their performance degrades as the dimension grows, approaching a linear scan in the worst case (see the kd-tree sketch after this list).
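
The kd-tree failure mode is easy to observe with an off-the-shelf exact implementation. The snippet below (a sketch using SciPy's cKDTree, not the paper's code) builds the same index at low and high dimension; at d=3 queries are fast, while at d=128 the exact query must backtrack through most of the tree and approaches a full scan.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Low dimension (d = 3): exact kd-tree queries stay close to O(d log N).
tree_low = cKDTree(rng.random((100_000, 3)))
dist, idx = tree_low.query(rng.random(3), k=1)    # exact nearest neighbor

# High dimension (d = 128): the same exact query has to backtrack through
# most of the tree, so its cost drifts toward a full O(dN) scan in practice.
tree_high = cKDTree(rng.random((100_000, 128)))
dist, idx = tree_high.query(rng.random(128), k=1)
```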

Introduction of FERN

Fast Exact Retrieval for Nearest-neighbor lookup (FERN) is a new algorithmic approach tailored to overcome the limitations of existing methods. FERN retains the logarithmic time complexity of kd-trees but adapts more robustly to high dimensions. Here’s how FERN works:

  1. Binary Tree Structure: At its core, FERN uses a binary tree in which each node's two children define a dividing hyperplane. The structure maintains the invariant that all vectors in a subtree lie on one side of their parent node's hyperplane.
  2. Logarithmic Lookup Time: Thanks to this tree structure, lookup remains logarithmic in N, akin to traversing the height of a balanced binary tree, with one O(d) hyperplane test per level.
  3. Dynamic Node Linking: Nodes hold not only vector data but also pointers to their parent and children, allowing dynamic updates and efficient backtracking (a minimal sketch of this structure follows the list).
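
The summary does not pin down the exact construction, but the description suggests a structure along the following lines. This is a hedged Python sketch: the Node fields mirror the description above, while the routing rule, treating the hyperplane as the perpendicular bisector of the two children's vectors so that the side test reduces to "which child is closer?", is an assumption consistent with it.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    vector: np.ndarray
    parent: Optional["Node"] = None
    left: Optional["Node"] = None    # subtree on one side of the hyperplane
    right: Optional["Node"] = None   # subtree on the other side

def side(node: Node, query: np.ndarray) -> str:
    """Route a query past a node that has two children.

    Assumption: the node's hyperplane is the perpendicular bisector of its
    children's vectors, so "which side?" is just "which child is closer?":
    a single O(d) comparison per tree level.
    """
    d_left = np.linalg.norm(query - node.left.vector)
    d_right = np.linalg.norm(query - node.right.vector)
    return "left" if d_left <= d_right else "right"
```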

Practical Implementation

The implementation of FERN involves two main operations: insertion and lookup.

  • Insertion: Vectors are added by descending the tree according to their proximity to existing nodes, keeping the tree balanced and each vector correctly positioned relative to the hyperplanes above it.
  • Lookup: Retrieval navigates the tree from the root, progressively narrowing the search space until the target vector or its nearest neighbor is found. Both operations are sketched below.
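
Continuing the hedged sketch above (Node and side() as defined earlier), both operations reduce to one O(d) hyperplane test per level. The paper's balancing and backtracking details are not specified in this summary, so they are noted only in comments.

```python
import numpy as np
# Node and side() are defined in the structure sketch above.

def insert(root: Node, vector: np.ndarray) -> None:
    """Descend toward the closer child at each level; attach as a new leaf.

    Omitted here: the paper's dynamic balancing step, which keeps the
    tree height (and hence lookup cost) near log N.
    """
    node = root
    while node.left is not None and node.right is not None:
        node = node.left if side(node, vector) == "left" else node.right
    leaf = Node(vector=vector, parent=node)
    if node.left is None:
        node.left = leaf
    else:
        node.right = leaf

def lookup(root: Node, query: np.ndarray) -> Node:
    """Greedy descent: O(log N) levels, each an O(d) hyperplane test.

    Omitted here: the exact variant's backtracking via parent pointers,
    used when the query is too close to a hyperplane to rule out the
    far side; this sketch shows the greedy path only.
    """
    node, best, best_dist = root, root, np.linalg.norm(query - root.vector)
    while node.left is not None and node.right is not None:
        node = node.left if side(node, query) == "left" else node.right
        dist = np.linalg.norm(query - node.vector)
        if dist < best_dist:
            best, best_dist = node, dist
    return best
```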

Experimental Validation

Using an Intel Xeon Platinum 8380 CPU, FERN was benchmarked on 10 million uniformly random vectors with d=128. Lookup times remained near-logarithmic in N with 100% recall, without requiring further optimization, a considerable result given that kd-trees degrade toward linear scans at this dimensionality.

Discussion and Future Work

While FERN performs well on uniformly distributed vectors, its efficiency under arbitrary insertion orders or adversarial input data remains to be explored. Another open question is whether the method can transition from exact lookup to approximate nearest-neighbor search without significant performance loss.

Conclusion

FERN is a notable step toward exact, sub-linear vector retrieval in high-dimensional spaces. Its spatially aware data structure delivers both efficiency and exactness on the benchmark tested, making it a promising tool for future research and for data-heavy applications. As with any new method, broader testing, optimization, and adaptation will be key to unlocking its full potential.