Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
119 tokens/sec
GPT-4o
56 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

RAMBO: Repeated And Merged BloOm Filter for Ultra-fast Multiple Set Membership Testing (MSMT) on Large-Scale Data (1910.02611v2)

Published 7 Oct 2019 in cs.DS and cs.IR

Abstract: Multiple Set Membership Testing (MSMT) is a well-known problem in a variety of search and query applications. Given a dataset of K different sets and a query q, it aims to find all of the sets containing the query. Trivially, an MSMT instance can be reduced to K membership testing instances, each with the same q, leading to O(K) query time with a simple array of Bloom Filters. We propose a data-structure called RAMBO (Repeated And Merged BloOm Filter) that achieves O(\sqrt{K} log K) query time in expectation with an additional worst-case memory cost factor of O(log K) beyond the array of Bloom Filters. Due to this, RAMBO is a very fast and accurate data-structure. Apart from being embarrassingly parallel, supporting cheap updates for streaming inputs, zero false-negative rate, and low false-positive rate, RAMBO beats the state-of-the-art approaches for genome indexing methods: COBS (Compact bit-sliced signature index), Sequence Bloom Trees (a Bloofi based implementation), HowDeSBT, SSBT, and document indexing methods like BitFunnel. The proposed data-structure is simply a count-min sketch type arrangement of a membership testing utility (Bloom Filter in our case). It indexes k-grams and provides an approximate membership testing based search utility. The simplicity of the algorithm and embarrassingly parallel architecture allows us to index a 170 TB genome dataset in a mere 14 hours on a cluster of 100 nodes while competing methods require weeks.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (7)
  1. Gaurav Gupta (44 papers)
  2. Minghao Yan (8 papers)
  3. Benjamin Coleman (21 papers)
  4. R. A. Leo Elworth (2 papers)
  5. Tharun Medini (11 papers)
  6. Todd Treangen (3 papers)
  7. Anshumali Shrivastava (102 papers)
Citations (5)

Summary

We haven't generated a summary for this paper yet.

Youtube Logo Streamline Icon: https://streamlinehq.com