
Sketch-Based LSH for Dynamic Sets

Updated 9 April 2026
  • The paper introduces a dynamic sketch-based LSH framework that supports both insertions and deletions for efficient nearest-neighbor search.
  • It leverages multi-level bucketed bit-sampling and linear ℓ0 sketches to approximate similarity (e.g., Jaccard similarity) in streaming and dynamic settings.
  • The method achieves sublinear memory use while delivering high recall rates (e.g., 80% for top-20 Jaccard neighbors on social networks), demonstrating its robustness and efficiency.

Sketch-based Locality Sensitive Hashing (LSH) for dynamic sets addresses the problem of efficient similarity search and exact nearest-neighbor retrieval in large-scale datasets where both insertions and deletions of elements (items, features) are permitted. This paradigm synthesizes advances in online sketching algorithms, compressed sensing, and the design of dynamic LSH functions, enabling sublinear-memory data structures for the near neighbor problem and robust similarity search in streaming and fully dynamic models (Coleman et al., 2019, Bury et al., 2016).

1. Problem Formulation and Dynamic Setting

Similarity search in set-based or vector datasets frequently leverages LSH, particularly for measures such as Jaccard similarity. The canonical setting considers a dataset $\mathcal{D} = \{x_1,\dots,x_N\} \subset \mathbb{R}^d$ or subsets $A, B \subset U$ under a one-pass data stream or fully dynamic stream, supporting both insertions and deletions. The objective is to construct a compact sketch $\mathcal{S}$ of the dataset such that, given a query $q$, one can efficiently estimate set similarities or report the exact $v$ nearest neighbors with high probability, while using $o(N)$ memory (Coleman et al., 2019, Bury et al., 2016).

In the context of dynamic streams, updates arrive as tuples (i, j, δ), denoting an increment or decrement of the j-th coordinate for the i-th user or set. The fundamental challenge is to maintain sketch structures under arbitrary additions and deletions, a requirement not satisfied by classical min-hash or LSH schemes designed for insertion-only streams (Bury et al., 2016).
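To make the update model concrete, the following minimal sketch applies a stream of (i, j, δ) tuples to explicitly stored per-user vectors. It is only a reference model that uses memory linear in the data, not a sketch from either cited paper; the names `apply_stream` and `support` are hypothetical.

```python
from collections import defaultdict

def apply_stream(updates):
    """Apply (i, j, delta) updates to explicitly stored per-user vectors.

    Reference model only: it stores every nonzero coordinate, so it is
    not sublinear in memory.
    """
    vectors = defaultdict(lambda: defaultdict(int))
    for i, j, delta in updates:
        vectors[i][j] += delta
        if vectors[i][j] == 0:
            del vectors[i][j]  # coordinate deleted from user i's set
    return vectors

def support(vectors, i):
    """Return the set of nonzero coordinates of user i."""
    return set(vectors.get(i, {}).keys())

# Example stream: user 0 inserts items 3 and 5, then deletes item 5.
updates = [(0, 3, +1), (0, 5, +1), (1, 3, +1), (0, 5, -1)]
vectors = apply_stream(updates)
print(support(vectors, 0), support(vectors, 1))
```

Any dynamic sketch has to reproduce the effect of the final `-1` update without access to the full vector, which is where insertion-only schemes break down.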

2. Sketch Structures for Approximate Similarity in Dynamic Streams

Rational set similarities, including Jaccard, admit linear sketch representations under the dynamic model. For each user $i$, two linear $\ell_0$ sketches are maintained: one for the union and one for the symmetric difference (i.e., $|A^i \cup A^j| = \ell_0(a^i + a^j)$ and $|A^i \Delta A^j| = \ell_0(a^i - a^j)$, where $a^i$ denotes the indicator vector of $A^i$). These sketches are updated in […] time per operation and achieve a […] multiplicative approximation to the rational set similarity or its associated metric distance (Bury et al., 2016).
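The two identities can be checked on a small example. The snippet below computes $\ell_0$ exactly on explicit indicator vectors (a real implementation would replace `l0` with a linear $\ell_0$ sketch) and derives Jaccard similarity as $(|A \cup B| - |A \Delta B|) / |A \cup B|$. The function names and the convention for two empty sets are assumptions made for illustration.

```python
def l0(vec):
    """Exact number of nonzero coordinates (stand-in for an l0 sketch)."""
    return sum(1 for x in vec if x != 0)

def indicator(s, universe_size):
    """Indicator vector of a set of integers in range(universe_size)."""
    v = [0] * universe_size
    for j in s:
        v[j] = 1
    return v

def jaccard_from_l0(a, b):
    """Jaccard similarity from l0 of the sum and of the difference."""
    union = l0([x + y for x, y in zip(a, b)])      # |A ∪ B|
    sym_diff = l0([x - y for x, y in zip(a, b)])   # |A Δ B|
    if union == 0:
        return 1.0  # assumed convention for two empty sets
    return (union - sym_diff) / union

A, B = {0, 1, 2, 3}, {2, 3, 4}
a, b = indicator(A, 6), indicator(B, 6)
print(jaccard_from_l0(a, b))  # |A ∩ B| / |A ∪ B| = 2 / 5
```

Because both quantities are functions of linear combinations of $a$ and $b$, any linear sketch of each vector can be combined after the fact, which is what makes the approach compatible with deletions.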

For sublinear memory sketches with nearest-neighbor support, the dataset is compressed to size […] bits for […], determined by query-specific stability parameters. The sketch consists of short integer arrays updated in a streaming fashion and allows for the recovery of nearest neighbors in queries satisfying stability conditions (Coleman et al., 2019).

3. Dynamic LSH Constructions

The main technical contribution in dynamic LSH is the ability to process deletions without scanning the support of the underlying set. This is achieved via a bucketed bit-sampling approach:

  • Each coordinate […] is sampled independently at random granularity (level […] for a hash […]), creating multi-level fingerprints.
  • For each set, a hash table […] of size […] is maintained per level, counting how many sampled coordinates hit each bucket. Insertions and deletions update these counters accordingly.
  • The LSH fingerprint is defined as the index of the first nonzero bucket ("min-bucket") at each level.
  • Candidate generation exploits multi-level hashes and min-bucket fingerprints, analogous to Indyk–Motwani LSH for other metrics (Bury et al., 2016).

The collision probability between two sets at level […] reflects the (sampled) similarity […], which closely tracks […] for well-chosen parameters. Amplification across bands and levels reduces error and concentrates sensitivity.
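The bucketed construction can be sketched for a single level as follows. This is an illustrative simplification, not the exact scheme of Bury et al.: the sampling rule (keep a coordinate at level ℓ when a hash value is divisible by 2^ℓ), the universal hash family, and the class name `MinBucketFingerprint` are all assumptions. Sets being compared must share the same seed so that they use the same hash functions.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used for the assumed universal hashes

class MinBucketFingerprint:
    """Single-level bucket counters with a min-bucket fingerprint."""

    def __init__(self, num_buckets, level, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.level = level
        # One hash decides sampling, another assigns a bucket.
        self.a_s, self.b_s = rng.randrange(1, P), rng.randrange(P)
        self.a_b, self.b_b = rng.randrange(1, P), rng.randrange(P)
        self.counts = [0] * num_buckets

    def _sampled(self, j):
        # Assumed rule: keep coordinate j at this level with rate 2**-level.
        return ((self.a_s * j + self.b_s) % P) % (1 << self.level) == 0

    def _bucket(self, j):
        return ((self.a_b * j + self.b_b) % P) % self.num_buckets

    def update(self, j, delta):
        """Insert (delta=+1) or delete (delta=-1) coordinate j."""
        if self._sampled(j):
            self.counts[self._bucket(j)] += delta

    def fingerprint(self):
        """Index of the first nonzero bucket, or None if all are empty."""
        for idx, c in enumerate(self.counts):
            if c != 0:
                return idx
        return None

s = MinBucketFingerprint(num_buckets=16, level=0, seed=1)
for j in (3, 7, 11):
    s.update(j, +1)
s.update(7, -1)  # deletion only touches one counter, no rescan of the set
print(s.fingerprint())
```

The design point is that a deletion decrements one counter, and the fingerprint is recomputed by scanning the bucket array rather than the underlying set.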

Table 1 summarizes the key aspects of the dynamic LSH construction.

| Structure | Update Time | Space Complexity |
|---|---|---|
| Dynamic LSH (bucketed) | […] | […] per user |
| $\ell_0$-sketches | […] | […] |

4. Sublinear-Memory Sketching via LSH and Compressed Sensing

The RACE (Repeated ACE) estimator enables unbiased estimation of LSH kernel sums in the streaming model. With a family […] of LSH functions and amplification parameter […], each input […] is mapped to […] with collision probability […]. $R$ independent arrays count occurrences, and at query time, […] estimates kernel densities robustly (Coleman et al., 2019).
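A minimal sketch of the counting idea, under several assumptions: items are non-empty sets of non-negative integers, the LSH family is a min-hash surrogate built from linear hashes, `p` hashes are concatenated per row, and the query estimate is the plain mean over rows. The class name `RaceSketch` and all parameters are hypothetical and are not taken from Coleman et al.

```python
import random

P = (1 << 61) - 1  # Mersenne prime for the assumed linear hashes

class RaceSketch:
    """R rows of counters, each indexed by a concatenated min-hash."""

    def __init__(self, rows, width, p, seed=0):
        rng = random.Random(seed)
        self.rows, self.width = rows, width
        # p (a, b) pairs per row, each defining one min-hash surrogate.
        self.params = [
            [(rng.randrange(1, P), rng.randrange(P)) for _ in range(p)]
            for _ in range(rows)
        ]
        self.counts = [[0] * width for _ in range(rows)]

    def _index(self, r, item):
        # Concatenate p min-hash values, then rehash into the row width.
        sig = tuple(min((a * e + b) % P for e in item)
                    for a, b in self.params[r])
        return hash(sig) % self.width

    def update(self, item, delta=1):
        """Insert (delta=+1) or delete (delta=-1) one set."""
        for r in range(self.rows):
            self.counts[r][self._index(r, item)] += delta

    def query(self, q):
        """Mean count at the query's bucket across rows."""
        total = sum(self.counts[r][self._index(r, q)]
                    for r in range(self.rows))
        return total / self.rows

sk = RaceSketch(rows=50, width=64, p=2, seed=0)
for item in ({1, 2, 3}, {1, 2, 3, 4}, {10, 11, 12}):
    sk.update(item, +1)
print(sk.query({1, 2, 3}))
```

An item always collides with itself in every row, so a query equal to a stored item returns at least 1. Rehashing into `width` buckets adds extra collisions, so this simplified estimate is biased upward relative to the kernel sum.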

To recover the set of top $v$ nearest neighbors, the target vector […] is nearly $v$-sparse and can be sketched further via a Count-Min-Sketch measurement […]. Median-of-means estimators per row give pointwise estimates, with guarantees on ordering between the $v$-th and $(v+1)$-th kernels. These techniques allow sublinear-memory retrieval for queries exhibiting a sufficient collision gap […].
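The recovery step can be illustrated with a generic Count-Min Sketch over an explicitly given, nearly sparse, non-negative vector. Two simplifications are assumed here: the vector is available directly rather than as RACE estimates, and the point estimate is the standard minimum over rows instead of the median-of-means estimator described above. The names `CountMinSketch` and `top_v` are hypothetical.

```python
import random

P = (1 << 61) - 1  # Mersenne prime for the assumed linear hashes

class CountMinSketch:
    """Count-Min Sketch over coordinates 0..n-1 of a non-negative vector."""

    def __init__(self, depth, width, seed=0):
        rng = random.Random(seed)
        self.depth, self.width = depth, width
        self.params = [(rng.randrange(1, P), rng.randrange(P))
                       for _ in range(depth)]
        self.table = [[0.0] * width for _ in range(depth)]

    def _col(self, r, i):
        a, b = self.params[r]
        return ((a * i + b) % P) % self.width

    def add(self, i, value):
        """Add value to coordinate i in every row."""
        for r in range(self.depth):
            self.table[r][self._col(r, i)] += value

    def estimate(self, i):
        """Minimum over rows; never below the true value if values >= 0."""
        return min(self.table[r][self._col(r, i)]
                   for r in range(self.depth))

def top_v(cms, n, v):
    """Indices of the v largest estimated coordinates."""
    return sorted(range(n), key=cms.estimate, reverse=True)[:v]

# A vector with three large entries and zeros elsewhere.
n = 200
s = [0.0] * n
s[7], s[42], s[150] = 0.9, 0.8, 0.85
cms = CountMinSketch(depth=5, width=50, seed=0)
for i, value in enumerate(s):
    cms.add(i, value)
print(top_v(cms, n, 3))
```

The table has `depth * width` cells regardless of `n`, which is the source of the memory saving when only a few coordinates are large.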

The sketch update per point is […]; query time is […] for […], and space requirements are […].

5. Theoretical Guarantees and Performance Analysis

For nearest-neighbor retrieval, let […] denote the $v$-th largest LSH collision probability for query $q$, and […] the collision gap. Choosing […], the memory exponent

[…]

ensures sublinear space when […] (Coleman et al., 2019). The algorithm returns the true nearest neighbors of $q$ with probability at least […] and provides additive error […] (for […]).

Experiments on social network graphs (Google Plus, Twitter, Slashdot; each […] nodes) demonstrate that the RACE–CMS sketch reaches 80% recall for top-20 Jaccard neighbors at […] similarity with only […] of the original memory, outperforming random projections ([…]). Map-based sparse storage further reduces the RACE array footprint by up to […] (Coleman et al., 2019).

6. Comparison with Classical Min-Hash and LSH

Classical min-hashing is not directly updatable in the dynamic setting because deletion of the minimum element requires a full rescan to determine the next smallest value. The bucketed bit-sampling approach provides dynamic support by maintaining per-level counters for all sampled coordinates and reconstructing min-buckets after arbitrary deletions in […] (or amortized […]) time (Bury et al., 2016).
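The deletion problem for classical min-hash can be seen in a few lines. The hash function below is a hypothetical stand-in (multiplication by an odd constant modulo 2^32, which is a bijection on 32-bit integers); the point is only that a sketch storing the single minimum value cannot be repaired after that element is deleted without revisiting the remaining set.

```python
def h(x):
    """Hypothetical hash: bijective on 32-bit integers (odd multiplier)."""
    return (2654435761 * x + 12345) % (2 ** 32)

elements = {5, 17, 42}

# Classical min-hash keeps only the smallest hash value.
min_value = min(h(e) for e in elements)
min_elem = next(e for e in elements if h(e) == min_value)

# Deleting the minimizing element leaves the stored value stale.
elements.remove(min_elem)

# The only way to repair it is a rescan of everything that remains.
rescanned = min(h(e) for e in elements)
print(min_value, rescanned)
```

Counter-based structures avoid the rescan because each deletion is a local decrement and the minimum is read off the counters.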

Sketch-based LSH for dynamic sets further differs from traditional LSH in its integration of linear sketching (e.g., $\ell_0$ sketches) for rational set similarities and in its explicit use of compressed-sensing recovery for sparse LSH kernel signals, allowing for improved space–accuracy tradeoffs in streaming and near neighbor search.

7. Applications, Limitations, and Implications

Sketch-based LSH for dynamic sets is applicable in large-scale recommendation systems, social network friend suggestion, and streaming analytics where both additions and deletions must be handled without explicit data reconstruction. The approach supports exact $v$-nearest-neighbor recovery under “stable” queries for which the collision gap […] is sufficiently separated from 1; when most data lie within a […]-distance shell, sublinear memory is not attainable.

A plausible implication is that, for real-world graphs and set systems exhibiting natural stability, these methods provide orders-of-magnitude improvement in compression and update/query efficiency, while the integration of dynamic LSH supports robust operation in continuously updating environments.

References: (Coleman et al., 2019, Bury et al., 2016)
