Rabin–Karp Rolling Hashing

Updated 3 February 2026

Rabin–Karp rolling hashing is a string-matching technique built on a polynomial hash that is updated in constant time as a sliding window moves through the string.
The choice of base and modulus governs collision resistance, injectivity, and performance in tasks such as document fingerprinting and cryptographic applications.
Its expected O(n+m) running time and reversible update support high-throughput analytics and integration into zero-knowledge, privacy-preserving proofs.

The Rabin–Karp rolling hash is a fundamental primitive for efficient string matching, $n$-gram indexing, document fingerprinting, and privacy-preserving protocols. Its hallmark is a polynomial hash function that can be updated in constant time as a fixed-length sliding window moves through a string, enabling rapid substring detection and comparison. The method supports both classical implementations for high-throughput analytics and advanced cryptographic applications such as zk-SNARK–based privacy-preserving string matching. The polynomial-based structure also allows for reversible (injective) algorithmic constructions and directly affects collision properties and statistical independence.

1. Mathematical Definition and Rolling Update Rule

Let $S$ be a string of length $m$ over an alphabet that is injectively mapped to integers, and fix a base $b$ (denoted $d$ in some literature) and a prime modulus $q$ (or $p$). The Rabin–Karp polynomial hash of $S$ is

$H(S) = \sum_{j=0}^{m-1} S[j]\,b^{m-1-j}\;\bmod q$

or, equivalently,

$H(S) = (S[0]\,b^{m-1} + S[1]\,b^{m-2} + \cdots + S[m-1])\bmod q$

For a window of length $m$ at position $i$ in a longer string $T$, the hash $H_i$ is computed over $T[i..i+m-1]$.

The rolling update rule exploits the overlap between consecutive windows:

$H_{i+1} = [\, b\cdot(H_i - T[i]\,b^{m-1}\bmod q) + T[i+m]\,]\bmod q$

where:

The outgoing high-order symbol $T[i]$ weighted by $b^{m-1}$ is subtracted and removed,
The resulting hash is multiplied by $b$ to shift digit positions,
The incoming symbol $T[i+m]$ is added at the lower-order position,
The final sum is reduced modulo $q$ to keep the values bounded.

Precomputing $h = b^{m-1}\bmod q$ once lets each update form the subtracted term as $T[i]\cdot h\bmod q$ without repeated exponentiation (Glück et al., 2022, Li et al., 20 May 2025, 0705.4676).
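As a minimal sketch of the definition and update rule above, the following Python computes a window hash from scratch and then rolls it forward one position at a time. The function names `poly_hash` and `roll`, the use of `ord` as the symbol-to-integer mapping, and the parameter values in the usage lines are illustrative assumptions, not taken from the cited papers.

```python
def poly_hash(s: str, b: int, q: int) -> int:
    """Polynomial hash of s: sum of s[j] * b^(m-1-j) mod q, via Horner's rule."""
    value = 0
    for ch in s:
        value = (value * b + ord(ch)) % q
    return value


def roll(h_i: int, outgoing: str, incoming: str, h: int, b: int, q: int) -> int:
    """Advance the window hash by one position; h must equal b^(m-1) mod q."""
    # Python's % returns a non-negative result for a positive modulus,
    # so no separate correction is needed after the subtraction.
    return (b * (h_i - ord(outgoing) * h) + ord(incoming)) % q


# Usage: roll across a text and compare with direct recomputation.
T, m, b, q = "abracadabra", 4, 257, 2**31 - 1   # assumed example parameters
h = pow(b, m - 1, q)                             # precomputed once
H = poly_hash(T[:m], b, q)
for i in range(len(T) - m):
    H = roll(H, T[i], T[i + m], h, b, q)
    assert H == poly_hash(T[i + 1:i + 1 + m], b, q)
```

Each call to `roll` does a constant number of modular operations, which is the $O(1)$ update the section describes.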

2. Parameterization, Collision Behavior, and Injectivity

Parameter choice is central to the method's collision resistance, update efficiency, and reversibility:

Base $b$: Typically a small prime slightly larger than the alphabet size (e.g., $b=257$ for ASCII). Larger bases distribute hash values more widely but incur identical modular multiplication costs.
Modulus $q$: A large prime is preferred (e.g., the Mersenne prime $q=2^{31}-1$, or the native field modulus of the cryptosystem in use). Collision probability behaves as $O(n/q)$, so $q$ on the order of $2^{60}$ (for instance the prime $2^{61}-1$) renders collisions negligible for texts of length up to $10^{12}$.
Coprimality: $b$ and $q$ must satisfy $\gcd(b,q)=1$, ensuring that multiplication by $b$ modulo $q$ is invertible and the rolling hash update is injective, a necessity for reversible computation (Glück et al., 2022).
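
To make the $O(n/q)$ estimate for the modulus concrete, take it at face value with a constant factor of 1 (an assumption made only for illustration) and compare the two modulus sizes mentioned above for a text of length $n = 10^{12}$:

$\frac{n}{q}\Big|_{q=2^{61}-1} = \frac{10^{12}}{2^{61}-1} \approx \frac{10^{12}}{2.31\times 10^{18}} \approx 4.3\times 10^{-7}, \qquad \frac{n}{q}\Big|_{q=2^{31}-1} = \frac{10^{12}}{2^{31}-1} \approx \frac{10^{12}}{2.15\times 10^{9}} \approx 466$

The second value exceeds 1, so under this estimate a 31-bit modulus gives no useful bound at that text length, while a 61-bit modulus keeps the estimate well below one in a million.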

For reversible implementations, the update

$\phi(x) = [\, b\,(x -_q\,T[i]\,h) +_q\, T[i+m]\,]\bmod q$

is injective under

$0\leq x < q$,
$0 < b < q$ and $\gcd(b,q)=1$,
$0 \leq T[i], T[i+m] < b$.

Every modular arithmetic operation ($-_q$, $\cdot_q$, $+_q$) is injective in its primary argument under these conditions. Reversibility is achieved without extra space or information loss via in-place, bijective updates (Glück et al., 2022).
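
The injectivity claim can be illustrated by writing the inverse of $\phi$ explicitly. The sketch below is not the reversible-language construction of Glück et al.; it is a plain Python illustration, with hypothetical function names `roll_forward` and `roll_backward`, showing that the update can be undone from the new hash and the two boundary symbols alone. It relies on `pow(b, -1, q)` (Python 3.8+), which raises `ValueError` when $\gcd(b,q)\neq 1$.

```python
def roll_forward(x: int, out_sym: int, in_sym: int, h: int, b: int, q: int) -> int:
    """phi(x) = b * (x - out_sym * h) + in_sym  (mod q)."""
    return (b * (x - out_sym * h) + in_sym) % q


def roll_backward(y: int, out_sym: int, in_sym: int, h: int, b: int, q: int) -> int:
    """Inverse of roll_forward: undo the addition, the multiplication, the subtraction."""
    b_inv = pow(b, -1, q)          # modular inverse; exists only if gcd(b, q) == 1
    return ((y - in_sym) * b_inv + out_sym * h) % q


# Usage with small assumed parameters: phi permutes {0, ..., q-1}.
b, q, m = 7, 101, 4
h = pow(b, m - 1, q)
out_sym, in_sym = 3, 5             # symbols satisfy 0 <= symbol < b
images = {roll_forward(x, out_sym, in_sym, h, b, q) for x in range(q)}
assert len(images) == q            # no two inputs share an output
```

Because the inverse needs only $T[i]$ and $T[i+m]$, both of which remain available in the text, no history of earlier hash values has to be stored.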

3. Algorithmic Implementation and Complexity

A standard Rabin–Karp matcher operates in two phases:

Initialization: Hashes for the pattern and the first text window are computed in $O(m)$. The only value precomputed for the update is $b^{m-1}\bmod q$.
Sliding: For each shift $i$, the rolling hash update is performed in $O(1)$ time, with explicit computation as above. Matches on hash equality are validated through an $O(m)$ character-wise check to eliminate false positives due to hash collisions. The overall time complexity is $O(n+m)$ in expectation. In the worst (collision-saturated) case, complexity degrades to $O(nm)$ (Li et al., 20 May 2025, 0705.4676).

Pseudocode for rolling hash update (with precomputed `pow_b_m1` $= b^{m-1}\bmod q$):

Algorithm SlideHash(old_hash, outgoing_char, incoming_char)
    temp ← old_hash − outgoing_char⋅pow_b_m1
    temp ← temp mod q        // ensure non-negative
    new_hash ← (temp⋅b + incoming_char) mod q
    return new_hash

Memory overhead is $O(1)$, excluding storage of input strings (Li et al., 20 May 2025).
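
Putting the two phases together, a matcher might look like the following Python sketch. The function name `rabin_karp_search`, the default parameters, and the use of `ord` for symbol codes are assumptions made for illustration; the structure follows the initialization and sliding phases described above, including the character-wise check on every hash hit.

```python
def rabin_karp_search(text: str, pattern: str, b: int = 257, q: int = 2**61 - 1) -> list:
    """Return all start positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Initialization: O(m) hashes plus the single precomputed power.
    pow_b_m1 = pow(b, m - 1, q)
    pattern_hash = window_hash = 0
    for j in range(m):
        pattern_hash = (pattern_hash * b + ord(pattern[j])) % q
        window_hash = (window_hash * b + ord(text[j])) % q

    matches = []
    for i in range(n - m + 1):
        # A hash hit is only a candidate; verify to rule out collisions.
        if window_hash == pattern_hash and text[i:i + m] == pattern:
            matches.append(i)
        # Sliding: O(1) update, skipped after the last window.
        if i + m < n:
            temp = (window_hash - ord(text[i]) * pow_b_m1) % q
            window_hash = (temp * b + ord(text[i + m])) % q
    return matches


print(rabin_karp_search("abracadabra", "abra"))   # [0, 7]
```

Apart from the output list, the matcher keeps only a handful of integers, which matches the $O(1)$ memory overhead stated above. Choosing a very small modulus such as `q=101` still gives correct results because of the verification step, at the price of more spurious hash hits.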

4. Statistical Independence and Limitations

No recursive (rolling) hash scheme, including the Rabin–Karp family, can achieve more than pairwise independence. Specifically, even randomized Karp–Rabin hashes fail to be pairwise independent for $n\geq 2$: distinct $n$-grams may collide at rates above the uniform baseline. Only specialized polynomial schemes, notably over $GF(2)[x]/p(x)$ with irreducible $p(x)$ and via independent random symbol mappings, are provably pairwise independent, with cost $O(L)$ per update for $L$-bit hashes (0705.4676).

Key results:

No 3-wise independence: For any recursive hash family, triple overlapping windows cannot be made fully independent due to the deterministic link introduced by the rolling recurrence.
Uniformity only: Randomized Karp–Rabin schemes can be tuned to produce uniform output, but pairwise (let alone $k$-wise) independence is unattainable except via computation from scratch with $O(n)$ cost (0705.4676).

Polynomial-based approaches over $GF(2)$ :

General (irreducible) polynomials yield pairwise independence without bit discarding but require either $O(n)$ shifts or $O(2^n)$ memory for buffered operations.
Cyclic polynomials ($p(x)=x^L+1$): Roughly twice as fast as general polynomials, but $n-1$ hash bits must be discarded to recover pairwise independence. The remaining $L-n+1$ bits then serve as a pairwise independent hash (0705.4676).
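
The cyclic-polynomial scheme can be sketched with bit rotations, since multiplying by $x$ modulo $x^L+1$ over $GF(2)$ is a left rotation of an $L$-bit word. In the Python sketch below, all function names, the seeded `random.Random` table, and the choice to drop the low-order $n-1$ bits (rather than some other block of $n-1$ bits) are assumptions made for illustration.

```python
import random


def rotl(x: int, r: int, L: int) -> int:
    """Rotate the L-bit integer x left by r positions."""
    r %= L
    return ((x << r) | (x >> (L - r))) & ((1 << L) - 1)


def make_symbol_table(alphabet: str, L: int, seed: int = 0) -> dict:
    """Independent random L-bit value per symbol (the random symbol mapping)."""
    rng = random.Random(seed)
    return {c: rng.getrandbits(L) for c in alphabet}


def cyclic_hash(gram: str, table: dict, L: int) -> int:
    """Hash an n-gram from scratch: rotate, then XOR in each symbol's value."""
    x = 0
    for c in gram:
        x = rotl(x, 1, L) ^ table[c]
    return x


def cyclic_roll(x: int, outgoing: str, incoming: str, n: int, table: dict, L: int) -> int:
    """Slide the n-gram window by one symbol."""
    # After the rotation the outgoing symbol's value has been rotated n times,
    # so XOR-ing rotl(table[outgoing], n) cancels it.
    return rotl(x, 1, L) ^ rotl(table[outgoing], n, L) ^ table[incoming]


def drop_bits(x: int, n: int) -> int:
    """Discard n-1 bits (here the low-order ones), keeping L-n+1 bits."""
    return x >> (n - 1)


# Usage with assumed parameters.
text, n, L = "abracadabra", 4, 32
table = make_symbol_table("abcdr", L)
x = cyclic_hash(text[:n], table, L)
for i in range(len(text) - n):
    x = cyclic_roll(x, text[i], text[i + n], n, table, L)
    assert x == cyclic_hash(text[i + 1:i + 1 + n], table, L)
```

No modular multiplication is involved, only shifts and XORs, which is consistent with the speed advantage noted for the cyclic variant.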

5. Practical Applications: High-Throughput Matching and Cryptographic Protocols

The Rabin–Karp rolling hash is foundational in substring search and $n$-gram analytics. It is deployed in applications such as plagiarism detection, document fingerprinting, and frequency analysis due to its $O(1)$ update and low false-positive rate in well-chosen parameter regimes. In cryptographic and privacy-preserving regimes, Rabin–Karp enables efficient zero-knowledge proofs for substring inclusion without text or pattern disclosure, as in zk-SNARK-based constructions (Li et al., 20 May 2025). In such protocols:

The rolling hash function is instantiated within arithmetic circuits native to the proving system, with parameters chosen to match the underlying field, obviating explicit modular reduction.
All internal hash values remain secret, and only compact, verifiable assertions of substring presence are revealed, ensuring both privacy and computational efficiency.

The rolling hash update integrates into circuit design with linear constraint complexity in $n$, with succinct, zero-knowledge proofs generated for pattern matching, a method that scales efficiently to large input sizes (Li et al., 20 May 2025).
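
The linear constraint count can be illustrated outside any proving system. The Python sketch below is not a zk-SNARK and does not reproduce the circuit layout of Li et al.; it only checks, in ordinary field arithmetic, the one-equation-per-slide relations that such a circuit would have to enforce. The prime `P`, the function names, and the base used in the usage lines are stand-in assumptions; a real system would use its own scalar field as the modulus.

```python
P = 2**61 - 1   # stand-in prime field (assumption); not a real SNARK field


def window_hashes(syms: list, m: int, b: int, p: int = P) -> list:
    """Witness values: hashes of every length-m window, produced by rolling."""
    top = pow(b, m - 1, p)
    h = 0
    for s in syms[:m]:
        h = (h * b + s) % p
    hashes = [h]
    for i in range(len(syms) - m):
        h = (b * (h - syms[i] * top) + syms[i + m]) % p
        hashes.append(h)
    return hashes


def rolling_constraints_hold(syms: list, hashes: list, m: int, b: int, p: int = P) -> bool:
    """Check the boundary hash and one rolling relation per slide (n - m of them)."""
    if len(hashes) != len(syms) - m + 1:
        return False
    top = pow(b, m - 1, p)
    first = 0
    for s in syms[:m]:
        first = (first * b + s) % p
    if hashes[0] != first:
        return False
    return all(
        (hashes[i + 1] - b * (hashes[i] - syms[i] * top) - syms[i + m]) % p == 0
        for i in range(len(syms) - m)
    )


# Usage with assumed inputs.
syms = [ord(c) for c in "abracadabra"]
witness = window_hashes(syms, m=4, b=257)
assert rolling_constraints_hold(syms, witness, m=4, b=257)
```

The number of rolling relations grows as $n-m$, so the check is linear in the text length, in line with the constraint complexity stated above.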

6. Performance Benchmarks and Memory-Throughput Trade-Offs

Comparative benchmarking yields empirical insights:

Randomized Karp–Rabin (ID37): Fastest for uniformity-only tasks (e.g., up to 0.3 s for $100$M $n$-gram hashes), but has higher collision rates in large-scale tests; not pairwise independent.
Polynomial (General, GF(2) with irreducible $p$): Throughput of $3.5$–$4$ million updates per second; pairwise independent; higher memory cost if buffering is used.
Polynomial (Cyclic): Throughput of $6$–$8$ million updates per second; requires hash-bit dropping to achieve pairwise independence; memory-efficient (no buffer required).
Non-recursive 3-wise independent schemes: Cost is $O(n)$ per update; becomes less practical as window size $n$ increases.

Memory overhead is dominated by symbol-to-random-value tables of size $O(L|\Sigma|)$; the cost of exponential buffering arises only for certain polynomial schemes that require $O(2^n)$ storage to keep update complexity minimal (0705.4676).

Hash Family	Independence	Per-update Speed	Memory Cost
Randomized Karp–Rabin	Uniform only	Fastest	Small
Polynomial, General	Pairwise	Moderate	$O(L|\Sigma|)$, plus $O(2^n)$ if buffered (optional)
Polynomial, Cyclic	Pairwise*	Fast	$O(L|\Sigma|)$
Non-recursive 3-wise	3-wise	Slower ($O(n)$)	Large for table

*After discarding $n-1$ bits.

7. Worked Example and Integration with Reversible and Privacy-Preserving Algorithms

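For illustrative purposes, consider the text “abracadabra” and pattern “abra” with $b=256$, $q=101$, $m=4$ (so $b \bmod q = 54$):

Precompute: $b^{3}\bmod 101 = 54^{3}\bmod 101 = 5$.
$h(\mathrm{"abra"})$ is calculated by Horner's rule over the character codes $97, 98, 114, 97$, giving intermediate values $97, 84, 4$ and a final value of $10$.
First window: the text hash equals the pattern hash ($10$), and the character-wise check confirms a match at position 0.
Next window (“brac”): Remove the outgoing symbol, $(10 - 97 \cdot 5) \bmod 101 = 30$, then shift and add the incoming symbol, $(30 \cdot 54 + 99) \bmod 101 = 2$.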
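
Arithmetic of this kind is easy to get wrong by hand, so a short script that recomputes each quantity is a useful check. The Python sketch below uses the example's parameters; the variable names are illustrative assumptions.

```python
b, q, m = 256, 101, 4
text = "abracadabra"

h = pow(b, m - 1, q)                          # b^(m-1) mod q

window_hash = 0
for ch in text[:m]:                           # Horner's rule over the first window
    window_hash = (window_hash * b + ord(ch)) % q

# Slide by one position: remove the outgoing symbol, shift, add the incoming one.
temp = (window_hash - ord(text[0]) * h) % q
next_hash = (temp * b + ord(text[m])) % q

print("b^(m-1) mod q   =", h)
print("hash of", repr(text[:m]), "=", window_hash)
print("after removal   =", temp)
print("hash of", repr(text[1:m + 1]), "=", next_hash)
```

Reducing $b$ modulo $q$ before multiplying, as in the hand calculation, gives the same residues as using $b=256$ directly.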

In reversible algorithm design, the injectivity of the modular steps allows a clean construction that needs no extra space and no irreversible state updates. Within zk-SNARK circuits, the parameters ($b$, $q$, $b^{m-1}\bmod q$) are hardcoded, the input strings are private, and only the compiled proof of substring presence is revealed (Li et al., 20 May 2025, Glück et al., 2022).

References

Glück et al. (2022). Reversible Programming: A Case Study of Two String-Matching Algorithms.

Li et al. (2025). Zk-SNARK for String Match.

arXiv:0705.4676 (2007). Recursive n-gram hashing is pairwise independent, at best.