
Density Sensitive Hashing (1205.2930v1)

Published 14 May 2012 in cs.IR and cs.LG

Abstract: Nearest neighbor search is a fundamental problem in research fields such as machine learning, data mining, and pattern recognition. Recently, hashing-based approaches, e.g., Locality Sensitive Hashing (LSH), have proven effective for scalable high-dimensional nearest neighbor search. Many hashing algorithms have their theoretical roots in random projection. Because these algorithms generate the hash tables (projections) randomly, a large number of hash tables (i.e., long codewords) is required to achieve both high precision and recall. To address this limitation, we propose a novel hashing algorithm called *Density Sensitive Hashing* (DSH). DSH can be regarded as an extension of LSH. By exploring the geometric structure of the data, DSH avoids purely random projection selection and instead uses the projective functions that best agree with the distribution of the data. Extensive experimental results on real-world data sets show that the proposed method outperforms state-of-the-art hashing approaches.

Citations (198)

Summary

  • The paper introduces a density-aware hashing method that utilizes k-means clustering to capture intrinsic data structures.
  • It replaces random projections with geometry-based bisecting hyperplanes to enhance both precision and computational efficiency.
  • Empirical results on datasets like GIST1M and SIFT1M confirm its superior performance over conventional LSH techniques.

Density Sensitive Hashing: An Enhanced Approach to High-Dimensional Nearest Neighbor Search

The paper presents a novel hashing algorithm, Density Sensitive Hashing (DSH), devised to tackle high-dimensional nearest neighbor (NN) search, a challenge prevalent in machine learning, data mining, and pattern recognition. Unlike conventional methods such as Locality Sensitive Hashing (LSH), which rely on random projections, DSH leverages the geometric structure of the data to select projections, improving both precision and computational efficiency.

Overview of the Problem and Current Solutions

Finding approximate nearest neighbors (ANN) efficiently in high-dimensional spaces is hard because of the curse of dimensionality, which degrades the performance of traditional tree-based search structures. Hashing-based approaches, particularly those employing LSH, have been introduced as scalable solutions: they encode data as binary codewords generated from random projections while preserving pairwise similarity. The Johnson-Lindenstrauss lemma implies that roughly O(log n / ε²) random projections are needed to preserve pairwise distances within a (1 ± ε) factor, so purely random projections force long codes with significant storage and compute demands.
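
To make the baseline concrete, random-hyperplane LSH comes down to a few lines of NumPy. This is a minimal illustrative sketch, not the paper's code; `lsh_codes` and its parameters are our own names:

```python
import numpy as np

def lsh_codes(X, n_bits, seed=0):
    """Random-hyperplane LSH: one bit per random Gaussian direction.

    Points separated by a small angle agree on most bits, so Hamming
    distance between codes approximates angular distance.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_bits))  # random projection directions
    return (X @ W > 0).astype(np.uint8)            # sign of each projection

# Toy usage: encode 1000 random 128-dim vectors into 64-bit codes.
X = np.random.default_rng(1).standard_normal((1000, 128))
codes = lsh_codes(X, n_bits=64)
```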

Learning-based approaches, including PCA Hashing, Spectral Hashing, and Anchor Graph Hashing, address these drawbacks by exploiting the structure and affinities of the data. However, they often underperform at longer code lengths because of their reliance on spectral properties and an implicit assumption of uniform data distribution.
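
For contrast, a minimal PCA Hashing variant (again our own sketch, not reference code) swaps the random directions for principal directions learned from the data:

```python
import numpy as np

def pcah_codes(X, n_bits):
    """Minimal PCA Hashing: one bit per leading principal direction.

    The data is mean-centered, projected onto the top n_bits principal
    components, and each projected coordinate is thresholded at zero.
    """
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (Xc @ Vt[:n_bits].T > 0).astype(np.uint8)
```

Because the variance-heavy directions are consumed greedily, the later bits carry little information, which is one reason such methods degrade at longer code lengths.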

Density Sensitive Hashing: Methodology

Quantization through k-means: DSH first applies k-means to partition the dataset, capturing its geometric density; the resulting clusters inform the subsequent projection generation.
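
A sketch of this first stage using scikit-learn's k-means (the function name `density_centers` and the parameter `n_groups` are ours, not the paper's notation):

```python
from sklearn.cluster import KMeans

def density_centers(X, n_groups, seed=0):
    """Stage 1 of DSH as described above: summarize the data's density
    with k-means cluster centers. Each center stands in for a dense
    region; the centers, not the raw points, drive projection generation.
    """
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_, km.labels_
```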

Density-Sensitive Projection: Instead of drawing projections at random, DSH generates candidate projections from the bisecting hyperplanes between nearby cluster centers (r-adjacent groups), ensuring projections align with the data's intrinsic structure.
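
One reading of this step, as a sketch under our interpretation: for nearby centers mu_i and mu_j, the bisecting hyperplane has normal w = mu_i − mu_j and passes through their midpoint, giving threshold t = w · (mu_i + mu_j) / 2. Here `r` mirrors the r-adjacency described above:

```python
import numpy as np

def bisecting_hyperplanes(centers, r=3):
    """Candidate projections from perpendicular bisectors of nearby centers."""
    k = len(centers)
    dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    W, T, seen = [], [], set()
    for i in range(k):
        for j in np.argsort(dists[i])[1:r + 1]:   # r nearest other centers
            key = (min(i, j), max(i, j))
            if key in seen:                        # skip duplicate pairs
                continue
            seen.add(key)
            w = centers[i] - centers[j]            # normal to the bisector
            W.append(w)
            T.append(w @ (centers[i] + centers[j]) / 2.0)  # midpoint threshold
    return np.array(W), np.array(T)
```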

Projection Selection via Entropy Maximization: DSH further refines the projection set by employing an entropy-based selection process, prioritizing projections that provide maximum information gain through balanced data partitions.
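
A simplified sketch of the selection criterion (the paper computes entropy over cluster masses rather than raw points, but the principle of favoring balanced splits is the same):

```python
import numpy as np

def select_by_entropy(X, W, T, n_bits):
    """Keep the n_bits candidate projections whose splits are most balanced.

    A projection sending a fraction p of the data to one side has binary
    entropy H(p) = -p*log2(p) - (1-p)*log2(1-p), which peaks at p = 0.5,
    so ranking by H(p) favors informative, balanced bits.
    """
    p = ((X @ W.T - T) > 0).mean(axis=0)          # fraction on positive side
    p = np.clip(p, 1e-12, 1 - 1e-12)              # avoid log(0)
    H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    keep = np.argsort(H)[::-1][:n_bits]           # highest-entropy projections
    return W[keep], T[keep]
```

Chaining the three sketches, the final code for a point x is sign(w · x − t) over the selected projections, e.g. `codes = ((X @ W.T - T) > 0).astype(np.uint8)`.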

Experimental Validation

The paper provides an empirical evaluation across datasets such as GIST1M, Flickr1M, and SIFT1M, comparing against state-of-the-art methods including LSH, Kernelized LSH (KLSH), PCA Hashing (PCAH), and Spectral Hashing (SpH). DSH demonstrates superior performance, particularly in high-dimensional spaces, illustrating the scalability and effectiveness of its geometrically informed approach.

Implications and Future Work

DSH's integration of geometric insight into the hashing process suits applications in large-scale data retrieval and computer vision, offering a scalable solution without the prohibitive computational costs usually associated with high dimensionality. The explicit connection between density-aware quantization and projection selection sets a precedent for future research into efficient, adaptive similarity search methods.

Looking forward, enhancements could include more advanced clustering techniques for finer-grained data partitioning, adaptive parameter tuning, and integration with deep learning frameworks to further improve retrieval. Studying DSH's robustness across data distributions and its adaptability to dynamic datasets could broaden its applicability and effectiveness in real-world AI systems.