
Fast and scalable minimal perfect hashing for massive key sets (1702.03154v2)

Published 10 Feb 2017 in cs.DS

Abstract: Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of $10^{10}$ elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality $10^{12}$. Source code: https://github.com/rizkg/BBHash

Citations (90)

Summary

  • The paper revisits a simple MPHF construction based on cascading bit arrays and shows it sharply reduces construction time and memory for massive key sets.
  • A tunable parameter γ sets the bit-array sizes, trading the size of the final structure against construction and query speed.
  • Experimental results show the method builds an MPHF for 10^10 keys in under seven minutes with 8 threads and 5 GB of memory, and scales to 10^12 keys with efficient parallelization.

Overview of "Fast and scalable minimal perfect hashing for massive key sets"

The paper "Fast and scalable minimal perfect hashing for massive key sets" by Limasset et al. presents a novel algorithm for constructing Minimal Perfect Hash Functions (MPHFs) with enhanced efficiency in terms of space and time complexity. MPHFs are essential in applications requiring index structures with minimal space overhead and constant query time, such as in bioinformatics and networking.

Algorithmic Contributions

The authors revisit a simple hashing strategy that uses bit arrays to track collisions during construction, and refine it to be competitive with state-of-the-art approaches, particularly on datasets comprising billions or trillions of keys. A key feature of their method is the tuning parameter γ, which sets the size of each bit array relative to the number of keys it must place: larger values reduce collisions, speeding up construction and queries at the cost of a larger structure.

Their algorithm keeps memory usage low by updating a bit array in place during construction, without partitioning the input data, which supports scaling to very large inputs. Keys that collide at one level are rehashed, with a fresh hash function, into a smaller bit array at the next level, and a query returns the rank of the key's set bit across the concatenated arrays. Construction therefore requires only marginally more space than the final MPHF itself, and parallelization is achieved by distributing keys across multiple threads.
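To make the cascade concrete, here is a minimal C++ sketch of the construction and query procedures described above. It is an illustration under simplifying assumptions, not the actual BBhash code: the seeded hashing built on std::hash, the linear rank scan, and the absence of a level cap with a fallback structure are all simplifications.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Minimal sketch of the cascading bit-array MPHF idea. Illustrative only:
// the real BBhash uses stronger seeded hash functions, packed bit vectors
// with precomputed rank counters, multithreading, and a fallback hash table
// for keys still unplaced after a maximum number of levels.
struct MphfSketch {
    double gamma = 2.0;                     // space vs. speed trade-off
    std::vector<std::vector<bool>> levels;  // final bit array per level

    // Hypothetical per-level seeded hash derived from std::hash.
    static uint64_t hashAt(const std::string& key, uint64_t level) {
        uint64_t h = std::hash<std::string>{}(key) + level * 0x9E3779B97F4A7C15ULL;
        h ^= h >> 33; h *= 0xFF51AFD7ED558CCDULL; h ^= h >> 33;
        return h;
    }

    void build(std::vector<std::string> keys) {
        for (uint64_t lvl = 0; !keys.empty(); ++lvl) {
            size_t m = std::max<size_t>(1, size_t(gamma * keys.size()));
            std::vector<bool> once(m, false), collide(m, false);
            for (const auto& k : keys) {        // mark positions; detect collisions
                size_t p = hashAt(k, lvl) % m;
                if (once[p]) collide[p] = true; else once[p] = true;
            }
            std::vector<std::string> next;
            for (const auto& k : keys)          // colliding keys cascade down
                if (collide[hashAt(k, lvl) % m]) next.push_back(k);
            for (size_t i = 0; i < m; ++i)      // keep bits hit by exactly one key
                once[i] = once[i] && !collide[i];
            levels.push_back(std::move(once));
            keys = std::move(next);
        }
    }

    // Returns a unique value in [0, n) for any key from the build set.
    uint64_t query(const std::string& key) const {
        uint64_t rank = 0;
        for (uint64_t lvl = 0; lvl < levels.size(); ++lvl) {
            const auto& bits = levels[lvl];
            size_t p = hashAt(key, lvl) % bits.size();
            if (bits[p]) {
                // Linear scan for clarity; BBhash answers this rank query
                // in constant time with precomputed counters.
                for (size_t i = 0; i < p; ++i) rank += bits[i];
                return rank;
            }
            for (bool b : bits) rank += b;      // skip keys placed at this level
        }
        return UINT64_MAX;                      // key was not in the build set
    }
};
```

With γ = 1 the per-level arrays are as small as possible but many keys cascade to deeper levels; raising γ shortens the cascade and speeds up construction and queries at the cost of more bits per key, which matches the trade-off the paper attributes to γ.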

Performance Evaluation

The presented implementation, BBhash, is written in C++ and processes large datasets efficiently. Experimental results show that an MPHF for $10^{10}$ keys can be built in under seven minutes using a modest amount of memory (5 GB), with a resulting storage cost of 3.7 bits per key. This surpasses previous implementations in construction time, memory usage, and scalability; the authors also report a successful run on $10^{12}$ keys, which took approximately 36 hours and 637 GB of RAM.

Comparative Analysis

The empirical studies compare against competing methods including CHD and EMPHF. The implementation from this paper demonstrates significantly lower construction memory overhead and faster construction times. The resulting MPHFs are slightly larger than those of competitors, a reasonable compromise for the pronounced efficiency gains during construction.

Practical and Theoretical Implications

The method is particularly advantageous for large-scale applications where traditional approaches falter under excessive resource demands. By significantly reducing the memory and disk footprint, this approach enables practical deployment in scenarios involving massive key sets. The work shows that MPHF construction can be optimized with a focus on simplicity and effectiveness, hinting at broader applications in big-data contexts.

Future Directions

Future research could focus on further reducing the space requirements of MPHFs derived from this method, for instance by integrating more sophisticated hash-function selection. Additionally, exploring hybrid approaches that dynamically adjust the γ parameter could optimize performance across diverse use cases. The scalability and efficiency of this strategy pave the way for further research in large-scale data indexing and retrieval.

In conclusion, the algorithm proposed by Limasset et al. represents a significant stride in handling extensive key sets efficiently, advancing the state of the art in MPHF construction. Their solution enables researchers and practitioners to undertake substantial data-processing tasks with improved performance, notably in fields with rapid data growth.
