Rabin–Karp Rolling Hashing
- Rabin–Karp rolling hashing is a technique that uses a polynomial hash function updated in constant time as a sliding window moves through the string.
- It carefully selects parameters like base and modulus to optimize collision resistance, injectivity, and performance for tasks such as document fingerprinting and cryptographic applications.
- Its efficiency (average O(n+m) time) and reversible design facilitate high-throughput analytics and integration in zero-knowledge, privacy-preserving proofs.
The Rabin–Karp rolling hash is a fundamental primitive for efficient string matching, $n$-gram indexing, document fingerprinting, and privacy-preserving protocols. Its hallmark is a polynomial hash function that can be updated in constant time as a fixed-length sliding window moves through a string, enabling rapid substring detection and comparison. The method supports both classical implementations for high-throughput analytics and advanced cryptographic applications such as zk-SNARK–based privacy-preserving string matching. The polynomial structure also admits reversible (injective) algorithmic constructions and directly affects collision properties and statistical independence.
1. Mathematical Definition and Rolling Update Rule
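Let $s = s_1 s_2 \cdots s_m$ be a string of length $m$ over an alphabet $\Sigma$ that is injectively mapped to integers, and fix a base $b$ and a prime modulus $q$ (other symbols for both appear in the literature). The Rabin–Karp polynomial hash of $s$ is

$$H(s) = \left( \sum_{j=1}^{m} s_j \, b^{m-j} \right) \bmod q$$

or, equivalently, in Horner form,

$$H(s) = \bigl( (\cdots((s_1 b + s_2)\, b + s_3) \cdots)\, b + s_m \bigr) \bmod q.$$

For a window of length $m$ at position $i$ in a longer string $T = t_1 t_2 \cdots t_n$, the hash $H_i$ is computed over $t_i t_{i+1} \cdots t_{i+m-1}$.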
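The rolling update rule exploits the overlap between consecutive windows. Writing $H_i$ for the hash of the length-$m$ window that starts at text position $i$,

$$H_{i+1} = \bigl( (H_i - t_i \, b^{m-1})\, b + t_{i+m} \bigr) \bmod q,$$

where:
- The outgoing high-order symbol $t_i$, weighted by $b^{m-1}$, is subtracted and removed,
- The resulting hash is multiplied by $b$ to shift digit positions,
- The incoming symbol $t_{i+m}$ is added at the lower-order position,
- The final sum is reduced modulo $q$ to keep the values bounded.
Precomputing $b^{m-1} \bmod q$ allows the subtraction step to be computed as $H_i - t_i \cdot (b^{m-1} \bmod q)$ with a single multiplication, improving efficiency (Glück et al., 2022, Li et al., 20 May 2025, 0705.4676).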
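The definition and the update rule can be checked against each other with a short sketch. The parameter values ($b = 257$, $q = 10^9 + 7$) and the helper names `poly_hash` and `roll` are illustrative assumptions, not values or identifiers taken from the cited works.

```python
def poly_hash(symbols, b, q):
    """Horner evaluation of sum(symbols[j] * b**(m-1-j)) mod q."""
    h = 0
    for c in symbols:
        h = (h * b + c) % q
    return h


def roll(h, outgoing, incoming, top, b, q):
    """Slide the window by one symbol; `top` must equal b**(m-1) % q."""
    # Python's % already returns a value in [0, q) for positive q.
    return ((h - outgoing * top) * b + incoming) % q


# Assumed, illustrative parameters (both primes).
b, q = 257, 1_000_000_007
text = [ord(ch) for ch in "abracadabra"]
m = 4
top = pow(b, m - 1, q)  # precomputed once

# Hash every window by rolling ...
h = poly_hash(text[:m], b, q)
rolled = [h]
for i in range(len(text) - m):
    h = roll(h, text[i], text[i + m], top, b, q)
    rolled.append(h)

# ... and by recomputing from scratch; the two lists should agree.
direct = [poly_hash(text[i:i + m], b, q) for i in range(len(text) - m + 1)]
print(rolled == direct)
```

Each call to `roll` costs a constant number of arithmetic operations, whereas each from-scratch hash costs one step per symbol in the window.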
2. Parameterization, Collision Behavior, and Injectivity
Parameter choice is central to the method's collision resistance, update efficiency, and reversibility:
- Base $b$: Typically a small prime slightly larger than the alphabet size (for ASCII, a prime just above the largest character code). Larger bases distribute hash values more widely but incur identical modular multiplication costs.
- Modulus $q$: Preference is given to a large prime (for example, one close to the machine word size, or one matching the native field of a cryptosystem). The collision probability falls roughly in inverse proportion to $q$, so a sufficiently large modulus renders collisions negligible even for long texts.
- Coprimality: $b$ and $q$ are required to satisfy $\gcd(b, q) = 1$, ensuring that modular multiplication by $b$ is invertible and the rolling hash update is injective, which is a necessity for reversible computation (Glück et al., 2022).
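The effect of the modulus on collisions can be seen in a small experiment. The sketch below hashes every three-letter lowercase word and counts distinct hash values under a small and a large prime modulus; the base and both moduli are assumed values chosen only for illustration.

```python
from itertools import product
from math import gcd


def poly_hash(symbols, b, q):
    """Horner evaluation of the polynomial hash modulo q."""
    h = 0
    for c in symbols:
        h = (h * b + c) % q
    return h


def distinct_hashes(b, q, length=3):
    """Count distinct hash values over all lowercase words of a given length."""
    codes = range(ord("a"), ord("z") + 1)
    return len({poly_hash(w, b, q) for w in product(codes, repeat=length)})


B = 257                                   # assumed base
SMALL_Q, LARGE_Q = 101, 1_000_000_007     # assumed prime moduli

total = 26 ** 3
small = distinct_hashes(B, SMALL_Q)   # at most 101 values: many collisions
large = distinct_hashes(B, LARGE_Q)   # no reduction occurs for these inputs
print(total, small, large, gcd(B, SMALL_Q), gcd(B, LARGE_Q))
```

With the small modulus, 17,576 words are forced into at most 101 buckets. With the large modulus, every three-symbol value stays below $q$, so the hash is simply a base-257 number and no two words collide.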
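For reversible implementations, the update

$$H_{i+1} = \bigl( (H_i - t_i \, b^{m-1})\, b + t_{i+m} \bigr) \bmod q$$

is injective under
- $0 \le H_i < q$,
- $0 < b < q$ and $\gcd(b, q) = 1$,
- $0 \le t_i,\, t_{i+m} < q$.
Every modular arithmetic operation (addition, subtraction, and multiplication modulo $q$) is injective in its primary argument under these conditions. Reversibility is achieved without extra space or information loss via in-place, bijective updates (Glück et al., 2022).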
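One way to see the injectivity concretely is to write the inverse of the update. The sketch below is not the construction of Glück et al.; it is a plain Python illustration that assumes the outgoing and incoming symbols are still available (as they are when the text itself is kept) and uses the modular inverse of the base, which exists only when the base and modulus are coprime. The names `roll_forward` and `roll_backward` and the parameter values are hypothetical.

```python
def roll_forward(h, outgoing, incoming, top, b, q):
    """Rolling update: drop `outgoing`, shift by b, append `incoming`."""
    return ((h - outgoing * top) * b + incoming) % q


def roll_backward(h_next, outgoing, incoming, top, b, q):
    """Undo roll_forward, recovering the previous hash value."""
    b_inv = pow(b, -1, q)  # modular inverse; requires gcd(b, q) == 1 (Python 3.8+)
    return ((h_next - incoming) * b_inv + outgoing * top) % q


# Assumed, illustrative parameters.
b, q, m = 257, 1_000_000_007, 4
top = pow(b, m - 1, q)
text = [ord(ch) for ch in "abracadabra"]

# Start from the hash of the first window.
h = 0
for c in text[:m]:
    h = (h * b + c) % q

# Every forward step can be undone exactly.
round_trips = []
for i in range(len(text) - m):
    h_next = roll_forward(h, text[i], text[i + m], top, b, q)
    round_trips.append(roll_backward(h_next, text[i], text[i + m], top, b, q) == h)
    h = h_next
print(all(round_trips))
```

If $\gcd(b, q) \ne 1$, `pow(b, -1, q)` raises `ValueError`, which mirrors the loss of injectivity described above.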
3. Algorithmic Implementation and Complexity
A standard Rabin–Karp matcher operates in two phases:
- Initialization: Hashes for the pattern and the first text window are computed in $O(m)$ time. The only value precomputed for the hash update is $b^{m-1} \bmod q$.
- Sliding: For each shift $i$, the rolling hash update is performed in $O(1)$ time using the update rule above. Matches on hash equality are validated through a character-wise check to eliminate false positives due to hash collisions. The overall time complexity is $O(n + m)$ in expectation. In the worst (collision-saturated) case, complexity degrades to $O(nm)$ (Li et al., 20 May 2025, 0705.4676).
Pseudocode for the rolling hash update (with precomputed `pow_b_m1` $= b^{m-1} \bmod q$):

```
Algorithm SlideHash(old_hash, outgoing_char, incoming_char)
    temp ← old_hash − outgoing_char⋅pow_b_m1
    temp ← temp mod q                      // ensure non-negative
    new_hash ← (temp⋅b + incoming_char) mod q
    return new_hash
```
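The two phases can be combined into a complete matcher. The following is a minimal Python sketch rather than a reference implementation; the function name `rabin_karp` and the default parameters are assumptions made for illustration.

```python
def rabin_karp(text, pattern, b=257, q=1_000_000_007):
    """Return all start indices where `pattern` occurs in `text`.

    b and q are assumed, illustrative defaults (both prime).
    An empty pattern is treated as having no matches (a design choice).
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    top = pow(b, m - 1, q)  # b^(m-1) mod q, precomputed once

    # Initialization: hash the pattern and the first window in O(m).
    hp = ht = 0
    for i in range(m):
        hp = (hp * b + ord(pattern[i])) % q
        ht = (ht * b + ord(text[i])) % q

    matches = []
    for i in range(n - m + 1):
        # Verify on hash equality to rule out collisions.
        if hp == ht and text[i:i + m] == pattern:
            matches.append(i)
        # Slide: O(1) update to the hash of the next window.
        if i + m < n:
            ht = ((ht - ord(text[i]) * top) * b + ord(text[i + m])) % q
    return matches


print(rabin_karp("abracadabra", "abra"))
```

The explicit comparison `text[i:i + m] == pattern` is what keeps the result correct regardless of parameter choice; the hash only decides how often that comparison runs.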
4. Statistical Independence and Limitations
No recursive (rolling) hash scheme, including the Rabin–Karp family, can achieve more than pairwise independence. Moreover, even randomized Karp–Rabin hashes can fail to be pairwise independent: distinct $n$-grams may collide at rates above the uniform baseline. Only specialized polynomial schemes, notably over $\mathrm{GF}(2)[x]$ with an irreducible polynomial $p(x)$ and independent random symbol mappings, are provably pairwise independent, with a per-update cost that grows with the hash width for $L$-bit hashes (0705.4676).
Key results:
- No 3-wise independence: For any recursive hash family, triple overlapping windows cannot be made fully independent due to the deterministic link introduced by the rolling recurrence.
- Uniformity only: Randomized Karp–Rabin schemes can be tuned to produce uniform output, but pairwise (let alone $k$-wise) independence is unattainable except via computation from scratch with $O(n)$ cost (0705.4676).
Polynomial-based approaches over $\mathrm{GF}(2)[x]$:
- General (irreducible) polynomials yield pairwise independence without bit discarding but require either a number of shift operations that grows with $n$ or exponential memory for buffered operations.
- Cyclic polynomials ($x^L + 1$): Roughly twice as fast as the general scheme, but require discarding $n - 1$ hash bits to recover pairwise independence. The remaining bits then serve as a pairwise independent hash (0705.4676).
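A cyclic-polynomial rolling hash replaces multiplication by the base with a one-bit rotation and addition with XOR. The sketch below is a minimal illustration under stated assumptions: a 32-bit hash width, a seeded pseudo-random byte-to-value table, and dropping the lowest $n - 1$ bits as the discarded consecutive bits. The helper names are hypothetical.

```python
import random

L = 32                      # assumed hash width in bits
MASK = (1 << L) - 1


def rotl(x, r):
    """Rotate an L-bit value left by r positions."""
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK


def make_table(seed=0):
    """Random L-bit value per byte (seeded here only for reproducibility)."""
    rng = random.Random(seed)
    return [rng.getrandbits(L) for _ in range(256)]


def cyclic_hash(window, table):
    """From-scratch hash: XOR of table values, rotated by position."""
    h = 0
    for c in window:
        h = rotl(h, 1) ^ table[c]
    return h


def cyclic_roll(h, outgoing, incoming, n, table):
    """O(1) update: rotate, cancel the outgoing symbol, add the incoming one."""
    return rotl(h, 1) ^ rotl(table[outgoing], n) ^ table[incoming]


def drop_bits(h, n):
    """Discard n - 1 consecutive bits (assumption: the lowest ones)."""
    return h >> (n - 1)


table = make_table()
data = b"abracadabra"
n = 4

h = cyclic_hash(data[:n], table)
rolled = [h]
for i in range(len(data) - n):
    h = cyclic_roll(h, data[i], data[i + n], n, table)
    rolled.append(h)

direct = [cyclic_hash(data[i:i + n], table) for i in range(len(data) - n + 1)]
print(rolled == direct, [drop_bits(v, n) for v in rolled][:2])
```

Because rotation distributes over XOR, the outgoing symbol's contribution after the shift is exactly `rotl(table[outgoing], n)`, which is why a single XOR cancels it.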
5. Practical Applications: High-Throughput Matching and Cryptographic Protocols
The Rabin–Karp rolling hash is foundational in substring search and $n$-gram analytics. It is deployed in applications such as plagiarism detection, document fingerprinting, and frequency analysis due to its $O(1)$ update and low false-positive rate in well-chosen parameter regimes. In cryptographic and privacy-preserving regimes, Rabin–Karp enables efficient zero-knowledge proofs for substring inclusion without text or pattern disclosure, as in zk-SNARK-based constructions (Li et al., 20 May 2025). In such protocols:
- The rolling hash function is instantiated within arithmetic circuits native to the proving system, with parameters chosen to match the underlying field, obviating explicit modular reduction.
- All internal hash values remain secret, and only compact, verifiable assertions of substring presence are revealed, ensuring both privacy and computational efficiency.
The rolling hash update integrates into circuit design with constraint complexity linear in the input length, and succinct zero-knowledge proofs are generated for pattern matching, a method that scales efficiently to large input sizes (Li et al., 20 May 2025).
6. Performance Benchmarks and Memory-Throughput Trade-Offs
Comparative benchmarking yields empirical insights:
- Randomized Karp–Rabin (ID37): Fastest for uniformity-only tasks (e.g., up to 0.3 s for $100$M $n$-gram hashes), but has higher collision rates in large-scale tests; not pairwise independent.
- Polynomial (General, GF(2) with irreducible $p(x)$): Throughput of $3.5$–$4$ million updates per second; pairwise independent; higher memory cost if buffering is used.
- Polynomial (Cyclic): Throughput of $6$–$8$ million updates per second; requires hash-bit dropping to achieve pairwise independence; memory-efficient (no buffer required).
- Non-recursive 3-wise independent schemes: Cost is $O(n)$ per update; becomes less practical as window size increases.
Memory overhead is dominated by the symbol-to-random-value tables; the cost of exponential buffering arises only for certain polynomial schemes that require additional storage to keep update complexity minimal (0705.4676).
| Hash Family | Independence | Per-update Speed | Memory Cost |
|---|---|---|---|
| Randomized Karp–Rabin | Uniform only | Fastest | Small |
| Polynomial, General | Pairwise | Moderate | Buffer (optional) |
| Polynomial, Cyclic | Pairwise* | Fastest | Small (no buffer) |
| Non-recursive 3-wise | 3-wise | Slower ($O(n)$ per update) | Large for table |
*After discarding $n - 1$ bits.
7. Worked Example and Integration with Reversible and Privacy-Preserving Algorithms
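For illustrative purposes, consider the text “abracadabra” and pattern “abra” (so $m = 4$) with a fixed base $b$ and prime modulus $q$:
- Precompute: $b^{m-1} \bmod q = b^{3} \bmod q$.
- $H(\text{“abra”})$ is calculated recursively using Horner's rule.
- First window (“abra”): its hash equals the pattern hash, and the character-wise check confirms the match directly.
- Next window (“brac”): compute $(H(\text{“abra”}) - \text{‘a’} \cdot b^{3}) \bmod q$, then multiply by $b$ and add $\text{‘c’}$, again modulo $q$.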
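A numeric trace of these steps is sketched below under assumed parameters $b = 256$ and $q = 101$, chosen here only because they give small, hand-checkable numbers; they are not necessarily the values intended in the example above. Symbols are mapped to integers by their ASCII codes.

```python
# Assumed, illustrative parameters (not taken from the source example).
b, q = 256, 101
text, pattern = "abracadabra", "abra"
m = len(pattern)


def poly_hash(s):
    """Horner evaluation of the polynomial hash of s modulo q."""
    h = 0
    for ch in s:
        h = (h * b + ord(ch)) % q
    return h


top = pow(b, m - 1, q)                     # 256^3 mod 101 = 5
h_pattern = poly_hash(pattern)             # H("abra") = 10
h_first = poly_hash(text[:m])              # first window "abra": also 10

# Slide to the next window "brac".
stripped = (h_first - ord(text[0]) * top) % q      # (10 - 97*5) mod 101 = 30
h_next = (stripped * b + ord(text[m])) % q         # (30*256 + 99) mod 101 = 2

print(top, h_pattern, h_first, stripped, h_next)   # prints: 5 10 10 30 2
```

The rolled value for “brac” agrees with hashing that window from scratch, and it differs from the pattern hash, so no character-wise check is needed at this shift.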
In reversible algorithm design, the injectivity of the modular steps allows for a clean construction that needs no auxiliary space and performs no irreversible state updates. Within zk-SNARK circuits, the parameters ($b$, $q$, $m$) are hardcoded, the input strings are private, and only the compiled proof of substring presence is revealed (Li et al., 20 May 2025, Glück et al., 2022).