Sketch-Based LSH for Dynamic Sets
- The paper introduces a dynamic sketch-based LSH framework that supports both insertions and deletions for efficient nearest-neighbor search.
- It leverages multi-level bucketed bit-sampling and linear ℓ0 sketches to approximate similarity (e.g., Jaccard similarity) in streaming and dynamic settings.
- The method achieves sublinear memory use while delivering high recall rates (e.g., 80% for top-20 Jaccard neighbors on social networks), demonstrating its robustness and efficiency.
Sketch-based Locality Sensitive Hashing (LSH) for dynamic sets addresses the problem of efficient similarity search and exact nearest-neighbor retrieval in large-scale datasets where both insertions and deletions of elements (items, features) are permitted. This paradigm synthesizes advances in online sketching algorithms, compressed sensing, and the design of dynamic LSH functions, enabling sublinear-memory data structures for the near neighbor problem and robust similarity search in streaming and fully dynamic models (Coleman et al., 2019, Bury et al., 2016).
1. Problem Formulation and Dynamic Setting
Similarity search in set-based or vector datasets frequently leverages LSH, particularly for measures such as Jaccard similarity. The canonical setting considers a dataset of points or sets that arrives as a one-pass data stream or as a fully dynamic stream, the latter supporting both insertions and deletions. The objective is to construct a compact sketch of the dataset such that, given a query, one can efficiently estimate set similarities or report the exact nearest neighbors with high probability while using memory sublinear in the dataset size (Coleman et al., 2019, Bury et al., 2016).
In the context of dynamic streams, updates arrive as tuples (i, j, δ), denoting an increment or decrement of the j-th coordinate for the i-th user or set. The fundamental challenge is to maintain sketch structures under arbitrary additions and deletions, a requirement not satisfied by classical min-hash or LSH schemes designed for insertion-only streams (Bury et al., 2016).
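The update model can be illustrated with a minimal sketch that keeps exact per-user frequency vectors, which is the state the sketches in the following sections are meant to compress. The names `apply_update`, `support`, and `vectors` are hypothetical, and integer user and coordinate ids are assumed.

```python
from collections import defaultdict

def apply_update(vectors, i, j, delta):
    """Apply one stream update (i, j, delta) to user i's frequency vector."""
    row = vectors[i]
    row[j] = row.get(j, 0) + delta
    if row[j] == 0:
        # Drop zero entries so the stored support matches the current set.
        del row[j]

def support(vectors, i):
    """Current set of user i: the coordinates with nonzero frequency."""
    return set(vectors[i])

vectors = defaultdict(dict)
# Illustrative stream: user 0 inserts 7 and 9, user 1 inserts 7, user 0 deletes 7.
stream = [(0, 7, +1), (0, 9, +1), (1, 7, +1), (0, 7, -1)]
for i, j, delta in stream:
    apply_update(vectors, i, j, delta)
```

After the stream is processed, user 0 holds only coordinate 9, since the deletion cancels the earlier insertion of 7.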
2. Sketch Structures for Approximate Similarity in Dynamic Streams
Rational set similarities, including Jaccard, admit linear sketch representations under the dynamic model. For each user , two linear sketches are maintained: one for the union and one for the symmetric difference (i.e., and , where 0). These sketches are updated in 1 time per operation and achieve a 2 multiplicative approximation to the rational set similarity or its associated metric distance (Bury et al., 2016).
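To see why the union and the symmetric difference suffice for Jaccard similarity, the following sketch computes it from the number of nonzero coordinates of the sum and the difference of two 0/1 characteristic vectors. It uses exact counts as a stand-in for the linear sketches: in a sketch-based version an ℓ0 estimator would replace the hypothetical helper `l0`, and linearity would let the sketch of a sum or difference be formed from the two per-user sketches. The function names are illustrative only.

```python
def l0(vec):
    """Number of nonzero coordinates (exact stand-in for an l0 sketch estimate)."""
    return sum(1 for v in vec.values() if v != 0)

def combine(a, b, sign):
    """Return a + sign * b for sparse vectors stored as dicts."""
    out = dict(a)
    for j, v in b.items():
        out[j] = out.get(j, 0) + sign * v
    return out

def jaccard_from_l0(a, b):
    """Jaccard similarity of two sets given as 0/1 characteristic vectors."""
    union = l0(combine(a, b, +1))      # |A union B| = ||a + b||_0
    sym_diff = l0(combine(a, b, -1))   # |A symmetric-difference B| = ||a - b||_0
    return 1.0 if union == 0 else 1.0 - sym_diff / union

a = {1: 1, 2: 1, 3: 1}   # set {1, 2, 3}
b = {2: 1, 3: 1, 4: 1}   # set {2, 3, 4}
similarity = jaccard_from_l0(a, b)
```

Here the union has 4 elements and the symmetric difference has 2, so the result equals the Jaccard similarity 2/4 = 0.5. The identity relies on the vectors being 0/1; with general multiplicities the difference no longer counts the symmetric difference.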
For sublinear memory sketches with nearest-neighbor support, the dataset is compressed to size 3 bits for 4, determined by query-specific stability parameters. The sketch consists of short integer arrays updated in a streaming fashion and allows for the recovery of nearest neighbors in queries satisfying stability conditions (Coleman et al., 2019).
3. Dynamic LSH Constructions
The main technical contribution in dynamic LSH is the ability to process deletions without scanning the support of the underlying set. This is achieved via a bucketed bit-sampling approach:
- Each coordinate 5 is sampled independently at random granularity (level 6 for a hash 7), creating multi-level fingerprints.
- For each set, a hash table 8 of size 9 is maintained per level, counting how many sampled coordinates hit each bucket. Insertions and deletions update these counters accordingly.
- The LSH fingerprint is defined as the index of the first nonzero bucket ("min-bucket") at each level.
- Candidate generation exploits multi-level hashes and min-bucket fingerprints, analogous to Indyk–Motwani LSH for other metrics (Bury et al., 2016).
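The steps above can be sketched as follows. This is an illustration under stated assumptions, not the construction from the paper: the class name `BucketedBitSampler`, the rule that a coordinate is sampled at level l with probability about 2^-l, the universal hash family, and the default numbers of levels and buckets are all choices made here for the example.

```python
import random

PRIME = (1 << 61) - 1  # modulus for the universal hashes (assumed choice)

class BucketedBitSampler:
    def __init__(self, levels=4, buckets=16, seed=0):
        rng = random.Random(seed)
        self.levels, self.buckets = levels, buckets
        # One hash decides how deep a coordinate is sampled; one hash per level picks its bucket.
        self.level_hash = (rng.randrange(1, PRIME), rng.randrange(PRIME))
        self.bucket_hash = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                            for _ in range(levels)]
        self.tables = [[0] * buckets for _ in range(levels)]

    @staticmethod
    def _h(params, j):
        a, b = params
        return (a * j + b) % PRIME

    def _max_level(self, j):
        # Count trailing zero bits, capped at the deepest level.
        v = self._h(self.level_hash, j)
        lvl = 0
        while lvl < self.levels - 1 and (v >> lvl) & 1 == 0:
            lvl += 1
        return lvl

    def update(self, j, delta):
        """Insert (delta=+1) or delete (delta=-1) coordinate j by adjusting counters."""
        for lvl in range(self._max_level(j) + 1):
            bucket = self._h(self.bucket_hash[lvl], j) % self.buckets
            self.tables[lvl][bucket] += delta

    def fingerprint(self):
        """Per level, the index of the first nonzero bucket (None if the level is empty)."""
        return tuple(next((b for b, c in enumerate(t) if c != 0), None)
                     for t in self.tables)

s = BucketedBitSampler(seed=42)
for j in (3, 5, 8):
    s.update(j, +1)
before = s.fingerprint()
s.update(100, +1)   # insert a new coordinate ...
s.update(100, -1)   # ... and delete it again
after = s.fingerprint()
```

Because deletions only decrement counters, the fingerprint after the insert and delete pair matches the one before it, without any scan of the set. Two sets are comparable only if their samplers are built with the same seed, so that they share hash functions.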
The collision probability between two sets at level 0 reflects the (sampled) similarity 1, which closely tracks 2 for well-chosen parameters. Amplification across bands and levels reduces error and concentrates sensitivity.
Table 1 summarizes the key aspects of the dynamic LSH construction.
| Structure | Update Time | Space Complexity |
|---|---|---|
| Dynamic LSH (bucketed) | 3 | 4 per user |
| ℓ0-sketches | 6 | 7 |
4. Sublinear-Memory Sketching via LSH and Compressed Sensing
The RACE (Repeated ACE) estimator enables unbiased estimation of LSH kernel sums in the streaming model. With a family 8 of LSH functions and amplification parameter 9, each input 0 is mapped to 1 with collision probability 2. R independent arrays count occurrences, and at query time, 3 estimates kernel densities robustly (Coleman et al., 2019).
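A minimal sketch of this counting scheme for sets under Jaccard similarity is given below. It assumes MinHash as the LSH family, a universal hash with a fixed prime modulus, and a final rehash of the concatenated hash values into a fixed array width; the class name `Race` and all default parameters are hypothetical. The rehash can add a small upward bias from accidental index collisions, which this illustration does not correct.

```python
import random

PRIME = (1 << 61) - 1  # modulus for the universal hashes (assumed choice)

class Race:
    def __init__(self, reps=50, width=64, p=2, seed=0):
        rng = random.Random(seed)
        self.width = width
        # Each repetition owns p MinHash functions, each given by (a, b).
        self.params = [[(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(p)]
                       for _ in range(reps)]
        self.arrays = [[0] * width for _ in range(reps)]

    def _index(self, rep, items):
        # Concatenate p MinHash values, then rehash the tuple into the array range.
        # hash() of a tuple of ints is deterministic across Python runs.
        mins = tuple(min((a * x + b) % PRIME for x in items) for a, b in self.params[rep])
        return hash(mins) % self.width

    def update(self, items, delta=+1):
        """Add (delta=+1) or remove (delta=-1) one non-empty set of integer items."""
        for r, arr in enumerate(self.arrays):
            arr[self._index(r, items)] += delta

    def query(self, items):
        """Average, over repetitions, of the counter that the query hashes to."""
        total = sum(arr[self._index(r, items)] for r, arr in enumerate(self.arrays))
        return total / len(self.arrays)

race = Race(seed=1)
q = {1, 2, 3, 4}
far = {100, 200, 300}       # disjoint from q, so its MinHash collision probability is 0
for _ in range(3):
    race.update(q)
race.update(far)
estimate = race.query(q)
for _ in range(3):
    race.update(q, -1)      # deletions simply decrement the same counters
estimate_after_delete = race.query(q)
```

Identical sets always share an index, so the three copies of `q` contribute exactly 3 to every counter the query reads; the disjoint set can only contribute through an accidental rehash collision.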
To recover the set of top 4 nearest neighbors, the target vector 5 is nearly 6-sparse and can be sketched further via a Count-Min-Sketch measurement 7. Median-of-means estimators per row give pointwise estimates, with guarantees on ordering between the 8-th and 9-th kernels. These techniques allow sublinear-memory retrieval for queries exhibiting a sufficient collision gap 0.
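The recovery step can be illustrated with a standard Count-Min Sketch applied to a nearly sparse nonnegative vector. This sketch uses the classic minimum-over-rows point query rather than the median-of-means estimator described above, and the class name `CountMin`, the helper `top_k`, the table dimensions, and the synthetic vector are assumptions made for the example.

```python
import random

PRIME = (1 << 61) - 1  # modulus for the universal hashes (assumed choice)

class CountMin:
    def __init__(self, rows=5, cols=64, seed=0):
        rng = random.Random(seed)
        self.cols = cols
        self.params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(rows)]
        self.table = [[0.0] * cols for _ in range(rows)]

    def _col(self, row, i):
        a, b = self.params[row]
        return ((a * i + b) % PRIME) % self.cols

    def add(self, i, value):
        """Linear update: add value to coordinate i in every row."""
        for r in range(len(self.table)):
            self.table[r][self._col(r, i)] += value

    def estimate(self, i):
        """Point query; never underestimates when all added values are nonnegative."""
        return min(self.table[r][self._col(r, i)] for r in range(len(self.table)))

def top_k(cms, n, k):
    """Indices in range(n) with the k largest estimated values."""
    return sorted(range(n), key=cms.estimate, reverse=True)[:k]

n = 200
heavy = {10, 50, 120}
# Nearly 3-sparse vector: three large entries plus small background mass.
signal = [1.0 if i in heavy else 0.001 for i in range(n)]
cms = CountMin(rows=5, cols=64, seed=7)
for i, value in enumerate(signal):
    cms.add(i, value)
recovered = top_k(cms, n, k=3)
```

The table holds 5 x 64 counters instead of 200 entries. A small entry is ranked above a large one only if it shares a bucket with some large entry in every row, which becomes unlikely as rows are added.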
The sketch update per point is 1; query time is 2 for 3, and space requirements are 4.
5. Theoretical Guarantees and Performance Analysis
For nearest-neighbor retrieval, let 5 denote the 6-th largest LSH collision probability for query 7, and 8 the collision gap. Choosing 9, the memory exponent
0
ensures sublinear space when 1 (Coleman et al., 2019). The algorithm returns the true nearest neighbors of 2 with probability at least 3 and provides additive error 4 (for 5).
Experiments on social network graphs (Google Plus, Twitter, Slashdot; each 6 nodes) demonstrate that the RACE–CMS sketch reaches 80% recall for top-20 Jaccard neighbors at 7 similarity with only 8 of the original memory, outperforming random projections (9). Map-based sparse storage further reduces the RACE array footprint by up to 0 (Coleman et al., 2019).
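For reference, recall at k as used in such evaluations can be computed as below. This is a generic helper, not code from the cited experiments; the function name and the toy id lists are illustrative.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the true top-k neighbors that appear among the first k retrieved ids."""
    truth = set(ground_truth[:k])
    if not truth:
        return 0.0
    return len(truth & set(retrieved[:k])) / len(truth)

true_neighbors = list(range(20))                   # ids 0..19 are the true top-20
returned = list(range(16)) + [100, 101, 102, 103]  # 16 of the 20 are found
score = recall_at_k(returned, true_neighbors, k=20)
```

With 16 of the 20 true neighbors returned, the score is 0.8.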
6. Distinctions from Insertion-only Min-Hash and Related Methodologies
Classical min-hashing is not directly updatable in the dynamic setting because deletion of the minimum element requires a full rescan to determine the next smallest value. The bucketed bit-sampling approach provides dynamic support by maintaining per-level counters for all sampled coordinates and reconstructing min-buckets after arbitrary deletions in 1 (or amortized 2) time (Bury et al., 2016).
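The deletion problem for classical min-hashing can be made concrete with a minimal sketch. The class `InsertOnlyMinHash` and its hash parameters are hypothetical; the `delete` method takes the remaining set as an explicit argument precisely because the stored minimum alone does not contain enough information to recover the next-smallest value.

```python
class InsertOnlyMinHash:
    """Stores only the current minimum hash value of the inserted elements."""

    def __init__(self, a=2654435761, b=12345, m=(1 << 61) - 1):
        self.a, self.b, self.m = a, b, m
        self.min_value = None

    def _h(self, x):
        return (self.a * x + self.b) % self.m

    def insert(self, x):
        # Constant-time update: keep the smaller of the old and new hash value.
        v = self._h(x)
        if self.min_value is None or v < self.min_value:
            self.min_value = v

    def delete(self, x, remaining):
        # If x held the minimum, the sketch cannot tell what comes next:
        # every remaining element has to be rehashed and scanned.
        if self._h(x) == self.min_value:
            self.min_value = min((self._h(y) for y in remaining), default=None)

mh = InsertOnlyMinHash()
for x in (5, 1, 9):
    mh.insert(x)
mh.delete(1, remaining={5, 9})   # removes the current minimum, forcing a rescan
```

Counter-based schemes avoid the rescan because a deletion only decrements counters that were incremented on insertion.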
Sketch-based LSH for dynamic sets further differs from traditional LSH in its integration of linear sketching (e.g., ℓ0 sketches) for rational set similarities and in its explicit use of compressed-sensing recovery for sparse LSH kernel signals, allowing for improved space–accuracy tradeoffs in streaming and near neighbor search.
7. Applications, Limitations, and Implications
Sketch-based LSH for dynamic sets is applicable in large-scale recommendation systems, social network friend suggestion, and streaming analytics where both additions and deletions must be handled without explicit data reconstruction. The approach supports exact 4-nearest-neighbor recovery under “stable” queries for which the collision gap 5 is sufficiently separated from 1; when most data lie within a 6-distance shell, sublinear memory is not attainable.
A plausible implication is that, for real-world graphs and set systems exhibiting natural stability, these methods provide orders-of-magnitude improvement in compression and update/query efficiency, while the integration of dynamic LSH supports robust operation in continuously updating environments.
References: (Coleman et al., 2019, Bury et al., 2016)