
Hashing for Similarity Search: A Survey (1408.2927v1)

Published 13 Aug 2014 in cs.DS, cs.CV, and cs.DB

Abstract: Similarity search (nearest neighbor search) is the problem of retrieving, from a large database, the data items whose distances to a query item are the smallest. Various methods have been developed to address this problem, and recently a lot of effort has been devoted to approximate search. In this paper, we present a survey on one of the main solutions, hashing, which has been widely studied since the pioneering work on locality sensitive hashing. We divide the hashing algorithms into two main categories: locality sensitive hashing, which designs hash functions without exploring the data distribution, and learning to hash, which learns hash functions according to the data distribution. We review them from various aspects, including hash function design, the distance measure, and the search scheme in the hash coding space.

Authors (4)
  1. Jingdong Wang (236 papers)
  2. Heng Tao Shen (117 papers)
  3. Jingkuan Song (115 papers)
  4. Jianqiu Ji (1 paper)
Citations (548)

Summary

  • The paper presents a unified survey of hashing techniques that efficiently tackle the computational challenges of approximate nearest neighbor search in high-dimensional spaces.
  • It categorizes methods into locality-sensitive hashing and learning-to-hash, detailing algorithmic designs for managing collision probabilities and semantic similarity.
  • The work highlights future directions for scalable and adaptive hashing solutions in real-time similarity search applications.

Hashing for Similarity Search: A Survey

The paper "Hashing for Similarity Search: A Survey" by Jingdong Wang et al. provides a comprehensive examination of hashing methodologies applied to approximate nearest neighbor (ANN) search problems. This work categorizes the hashing algorithms into two primary groups: locality-sensitive hashing (LSH) and learning-to-hash (LTH) techniques. Both methodologies address the computational challenges associated with similarity searches within large-scale, high-dimensional datasets.

Locality Sensitive Hashing (LSH)

Locality-sensitive hashing, a probabilistic method, reduces the complexity of similarity search by mapping high-dimensional data into hash buckets such that similar data points have a higher probability of being hashed to the same bucket, enabling efficient retrieval. LSH is explicitly designed with the (R, c)-near neighbor problem in mind: if some database point lies within distance R of the query, the algorithm must, with high probability, return a point within distance cR, so the goal is to locate neighbors within an approximate range rather than to find exact matches.
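
To make the collision property concrete, here is a minimal, self-contained sketch (our illustration, not code from the paper) using sign-random-projection hashing; the fraction of hash bits on which two points agree approaches 1 - θ/π, where θ is the angle between them, so nearby points collide far more often than unrelated ones.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

def hash_bits(x, hyperplanes):
    """One hash bit per random hyperplane: which side of it x falls on."""
    return (hyperplanes @ x > 0).astype(np.uint8)

# Many independent bits, so the agreement rate estimates the collision probability.
hyperplanes = rng.standard_normal((1000, dim))

x = rng.standard_normal(dim)
near = x + 0.1 * rng.standard_normal(dim)  # slight perturbation of x
far = rng.standard_normal(dim)             # unrelated random point

print("near collision rate:", (hash_bits(x, hyperplanes) == hash_bits(near, hyperplanes)).mean())
print("far collision rate: ", (hash_bits(x, hyperplanes) == hash_bits(far, hyperplanes)).mean())
```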

The paper details various implementations of LSH for different metric spaces; the first and third families below are sketched in code after the list:

  • Euclidean space methodologies, including p-stable distributions and Leech lattice LSH.
  • Angular-based distance measures utilizing random projections.
  • Jaccard coefficient analysis for set-based similarity metrics.
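
As referenced above, here are hedged sketches of two of these families: the 2-stable (Gaussian) instance of the p-stable construction for Euclidean distance, and MinHash for the Jaccard coefficient. Parameter choices (bucket width w, number of permutations, universe size) are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def pstable_hash(x, a, b, w):
    """Euclidean (2-stable) LSH: h(x) = floor((a.x + b) / w)."""
    return int(np.floor((a @ x + b) / w))

dim, w = 32, 4.0
a = rng.standard_normal(dim)  # 2-stable (Gaussian) projection vector
b = rng.uniform(0, w)         # random offset in [0, w)
x = rng.standard_normal(dim)
print("Euclidean LSH bucket:", pstable_hash(x, a, b, w))

def minhash_signature(item_set, num_perm=64, universe=10_000, seed=2):
    """MinHash: two signatures agree at a position with probability equal
    to the Jaccard coefficient of the underlying sets."""
    rng = np.random.default_rng(seed)  # same seed -> same permutations for all sets
    sig = []
    for _ in range(num_perm):
        perm = rng.permutation(universe)
        sig.append(min(perm[i] for i in item_set))
    return sig

s1, s2 = {1, 2, 3, 4, 5}, {3, 4, 5, 6, 7}
m1, m2 = minhash_signature(s1), minhash_signature(s2)
agree = sum(u == v for u, v in zip(m1, m2)) / len(m1)
print("estimated Jaccard:", agree, "true Jaccard:", len(s1 & s2) / len(s1 | s2))
```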

An essential practical aspect of LSH is the balance between hash-table lookup and distance-computation strategies: concatenating more hash bits per table key reduces spurious collisions, maintaining several independent tables raises the probability that genuinely nearby points collide in at least one bucket, and the retrieved candidates are then re-ranked by exact distance. A sketch of this multi-table scheme follows.
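
The sketch below is a hedged illustration of that standard scheme (the table count L, key length K, and sign-random-projection keys are our choices, not the paper's):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
dim, K, L = 32, 8, 10

# L independent tables, each keyed by K concatenated sign-random-projection bits.
projections = [rng.standard_normal((K, dim)) for _ in range(L)]
tables = [defaultdict(list) for _ in range(L)]

def key(x, P):
    return tuple((P @ x > 0).astype(np.uint8))

data = rng.standard_normal((1000, dim))
for idx, x in enumerate(data):
    for P, table in zip(projections, tables):
        table[key(x, P)].append(idx)

def query(q, topk=5):
    # Collect colliding candidates from every table, then re-rank the
    # (hopefully small) candidate set by exact Euclidean distance.
    candidates = set()
    for P, table in zip(projections, tables):
        candidates.update(table.get(key(q, P), []))
    return sorted(candidates, key=lambda i: np.linalg.norm(data[i] - q))[:topk]

print(query(data[0] + 0.05 * rng.standard_normal(dim)))
```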

Learning to Hash (LTH)

Distinct from LSH, learning-to-hash methods are predominantly data-driven and often leverage machine learning techniques to derive optimal hash functions based on the data distribution. The key differentiator is that LTH seeks to encode semantic similarity between data points into concise binary representations.
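
One reason compact binary codes matter (an illustration, not the paper's code): once items are encoded as short bit strings, comparing two items reduces to a Hamming distance, a single XOR plus a popcount on packed integers.

```python
code_a = 0b1011_0110_0101_1001  # toy 16-bit hash codes for two items
code_b = 0b1011_0010_0111_1001

# Differing bits = XOR, then count set bits (int.bit_count needs Python >= 3.10).
print("Hamming distance:", (code_a ^ code_b).bit_count())
```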

Highlighted approaches include:

  • Spectral hashing, which aligns binary codes with graph Laplacians to retain similarities.
  • Iterative quantization and isotropic hashing, which optimize the quantization error or the variance distribution across hash bits (iterative quantization is sketched in code after this list).
  • Angular quantization, which preserves cosine similarities through careful binarization.
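
As noted above, here is a minimal sketch of iterative quantization (ITQ): project the data onto its top principal directions, then alternate between binarizing the projections and rotating them to reduce quantization error (the rotation update is the orthogonal Procrustes solution). Code length and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

def itq(X, n_bits=16, n_iter=50):
    X = X - X.mean(axis=0)                    # center the data
    # PCA projection onto the top n_bits principal directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V = X @ Vt[:n_bits].T                     # (n, n_bits) projected data
    # Random orthogonal initialization of the rotation R.
    R, _ = np.linalg.qr(rng.standard_normal((n_bits, n_bits)))
    for _ in range(n_iter):
        B = np.sign(V @ R)                    # fix R, update the binary codes
        U, _, Wt = np.linalg.svd(V.T @ B)     # fix B, update R (Procrustes)
        R = U @ Wt
    return np.sign(V @ R) > 0, Vt[:n_bits], R # codes and the learned maps

codes, pca, rotation = itq(rng.standard_normal((500, 128)))
print(codes.shape)  # (500, 16) boolean hash codes
```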

Learning-to-hash strategies frequently integrate supervised or semi-supervised learning paradigms, emphasizing coding consistency and balance, thus addressing challenges like maintaining uniform code distributions and minimizing quantization error.

Theoretical and Practical Implications

The robust survey of methods in this paper offers numerous insights into potential avenues for scaling ANN search efficiency and highlights trade-offs inherent in designing hashing functions. By combining insights from both LSH’s probabilistic guarantees and LTH’s learning capabilities, future developments can potentially further balance computational cost and retrieval accuracy. There is significant potential in areas such as cross-modal retrieval systems or applications needing real-time similarity search.

Future Directions

As similarity search continues to be a crucial component in data-intensive applications like image retrieval and natural language processing, further innovation in hashing algorithms is imperative. Emerging trends suggest opportunities in the development of computationally efficient methods for large-scale data processing, including scalable hash function learning and adaptive query strategies.

In conclusion, hashing remains a pivotal technique for ANN and similarity search, and this paper serves as a meticulous guideline for researchers and practitioners seeking to deepen their understanding or apply hashing in sophisticated data environments.