
Bayesian Locality Sensitive Hashing for Fast Similarity Search (1110.1328v3)

Published 6 Oct 2011 in cs.DB, cs.AI, cs.DS, and cs.IR

Abstract: Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a certain user-specified threshold. Locality-sensitive hashing (LSH) based methods have become a very popular approach for this problem. However, most such methods only use LSH for the first phase of similarity search - i.e. efficient indexing for candidate generation. In this paper, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search - performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. BayesLSH is able to quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, both in terms of accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and does not require any manual setting of the number of hashes to use for similarity estimation, unlike standard approaches. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2x-20x for a wide variety of datasets.

Citations (160)

Summary

  • The paper introduces BayesLSH, demonstrating that applying Bayesian inference to LSH enhances candidate verification and similarity estimation while maintaining recall rates above 97%.
  • It offers a tunable trade-off between accuracy and speed, pruning false positives after examining only a fraction of the hashes and achieving speedups of up to 20x.
  • The approach offers probabilistic guarantees with tunable parameters, bridging Bayesian methods and LSH to improve similarity search across diverse metrics and data types.

This paper presents a novel approach, termed BayesLSH, to enhance locality-sensitive hashing (LSH)-based methods for fast similarity search. The authors situate their work within the field of all-pairs similarity search: identifying all pairs of objects whose similarity exceeds a user-specified threshold. While traditional LSH methods handle the candidate generation phase efficiently, they typically leave the subsequent phases, candidate verification and similarity estimation, to exact or naively estimated computation. BayesLSH improves on this through principled Bayesian inference.

Bayesian Algorithm for LSH

The core innovation of BayesLSH lies in its application of Bayesian inference to LSH-based candidate verification and similarity estimation. The approach enables probabilistic reasoning about the likelihood that candidate pairs exceed the threshold without the exhaustive computation of exact similarities. BayesLSH employs a flexible, tunable trade-off between accuracy and computational speed using parameters that control recall and estimation error. Notably, this method does not require manual tuning of the number of hashes, a significant advantage over conventional LSH techniques.
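One plausible formalization of these tunable parameters, consistent with the guarantees described in the abstract (the exact notation here is ours, not quoted from the paper): let $S$ be the unknown similarity of a candidate pair, $t$ the user threshold, and $M(n)$ the number of agreeing hashes among the first $n$ examined. Pruning and early stopping can then be phrased as posterior conditions:

```latex
% Candidate pruning: discard a pair as soon as the posterior
% probability that its similarity exceeds the threshold t
% falls below a recall parameter \epsilon:
\Pr\!\left[\, S \ge t \mid M(n) = m \,\right] < \epsilon

% Similarity estimation: stop examining hashes once the estimate
% \hat{s} is, with probability at least 1 - \gamma, within \delta
% of the true similarity S:
\Pr\!\left[\, |S - \hat{s}| \ge \delta \mid M(n) = m \,\right] < \gamma
```

Because both conditions are checked incrementally as hashes are compared, the number of hashes used adapts per pair rather than being fixed in advance, which is the source of the "no manual tuning of the number of hashes" property.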

BayesLSH introduces probabilistic guarantees for output quality, ensuring recall rates are above a certain threshold while maintaining accuracy within specified bounds. The procedure offers considerable reductions in computational overhead by pruning false positives early in the hash comparison process—often after examining just a fraction of hashes—and providing controlled similarity estimation.
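The early-pruning idea can be sketched concretely. For minhash, each hash agrees between two objects with probability equal to their Jaccard similarity, so under a uniform Beta(1, 1) prior the posterior after observing m agreements in n hashes is Beta(m+1, n-m+1), whose upper tail has a closed form via the binomial CDF. The sketch below is illustrative only, assuming minhash signatures and a uniform prior; the function and parameter names (`tail_prob`, `prune`, `batch`) are ours, not the paper's:

```python
from math import comb

def tail_prob(m: int, n: int, t: float) -> float:
    """P(S > t | m matching hashes out of n), uniform Beta(1,1) prior.

    The posterior on similarity S is Beta(m+1, n-m+1); its upper tail
    equals the binomial CDF P(Binomial(n+1, t) <= m).
    """
    return sum(comb(n + 1, j) * t**j * (1 - t) ** (n + 1 - j)
               for j in range(m + 1))

def prune(sig_a, sig_b, t=0.7, eps=0.05, batch=8):
    """Compare minhash signatures in batches; give up on a candidate
    pair as soon as the posterior probability that its similarity
    exceeds t drops below eps."""
    m = 0
    for n, (ha, hb) in enumerate(zip(sig_a, sig_b), start=1):
        m += (ha == hb)
        if n % batch == 0 and tail_prob(m, n, t) < eps:
            return None                      # pruned: almost surely below t
    return (m + 1) / (len(sig_a) + 2)        # posterior-mean similarity
```

A pair whose signatures disagree on the first few batches is discarded after inspecting only a handful of hashes, which is where the bulk of the speedup comes from; truly similar pairs survive to a full (or concentrated) estimate.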

Experimental Results and Implications

The authors validate their approach on top of two state-of-the-art candidate generation algorithms, AllPairs and standard LSH, across multiple datasets. BayesLSH and its variant BayesLSH-Lite demonstrate substantial speedups, ranging from 2x to 20x over baseline methods. These improvements stem primarily from BayesLSH's effective pruning, which quickly dismisses false positives and accelerates similarity computation without sacrificing recall, consistently maintaining recall rates above 97%. Furthermore, BayesLSH's probabilistic framework yields more consistent similarity estimates, avoiding the large number of high-error estimates that traditional LSH approaches produce at lower similarity thresholds.

Theoretical Contributions and Future Directions

This paper's theoretical contributions lie in bridging Bayesian inference with LSH, offering a robust framework for similarity search that is adaptable to various metrics, including Jaccard and Cosine similarities. By addressing both binary and real-valued vectors, BayesLSH expands the applicability of similarity search across diverse domains.
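The two similarity measures correspond to two standard LSH families: minhash, whose collision probability equals Jaccard similarity on sets, and signed random projections, whose collision probability is $1 - \theta/\pi$ for the angle $\theta$ between real-valued vectors (and hence a monotone function of cosine similarity). A minimal stdlib-only sketch of both families, with function names of our own choosing:

```python
import random

def minhash_sig(s, n_hashes, seed=0):
    """Minhash signature of a set s: for sets A, B and each hash,
    P[h(A) == h(B)] = Jaccard(A, B)."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(n_hashes)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def srp_sig(v, n_hashes, seed=0):
    """Signed-random-projection signature of a real vector v: each hash
    is the sign of a dot product with a shared Gaussian vector, so
    P[h(u) == h(v)] = 1 - angle(u, v) / pi.

    Vectors must share the same dimension and seed so that the same
    random hyperplanes are drawn for both."""
    rng = random.Random(seed)
    return [sum(c * rng.gauss(0.0, 1.0) for c in v) >= 0.0
            for _ in range(n_hashes)]
```

The fraction of agreeing positions in two signatures is then an unbiased estimate of the respective collision probability, which is exactly the quantity BayesLSH reasons about in its posterior updates.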

Looking forward, the exploration of BayesLSH's applicability to learned similarity metrics presents a promising direction. Additionally, adapting Bayesian approaches to candidate pruning in nearest neighbor retrieval within Euclidean spaces could further broaden its utility.

In conclusion, BayesLSH offers a measured but significant advance in similarity search methodology, striking a careful balance between performance and accuracy, with broad implications for data mining and machine learning applications.