
Rabin–Karp Rolling Hashing

Updated 3 February 2026
  • Rabin–Karp rolling hashing is a technique that uses a polynomial hash function updated in constant time as a sliding window moves through the string.
  • It carefully selects parameters like base and modulus to optimize collision resistance, injectivity, and performance for tasks such as document fingerprinting and cryptographic applications.
  • Its efficiency (average O(n+m) time) and reversible design facilitate high-throughput analytics and integration in zero-knowledge, privacy-preserving proofs.

The Rabin–Karp rolling hash is a fundamental primitive for efficient string-matching, $n$-gram indexing, document fingerprinting, and privacy-preserving protocols. Its hallmark is a polynomial hash function that can be updated in constant time as a fixed-length sliding window moves through a string, enabling rapid substring detection and comparison. The method supports both classical implementations for high-throughput analytics and advanced cryptographic applications such as zk-SNARK–based privacy-preserving string matching. The polynomial-based structure also allows for reversible (injective) algorithmic constructions and has direct impact on collision properties and statistical independence.

1. Mathematical Definition and Rolling Update Rule

Let $S$ be a string of length $m$ over an alphabet that is injectively mapped to integers, and fix a base $b$ (often denoted $d$ in some literature) and a prime modulus $q$ (or $p$). The Rabin–Karp polynomial hash of $S$ is

$$H(S) = \sum_{j=0}^{m-1} S[j]\,b^{m-1-j} \bmod q$$

or, equivalently,

$$H(S) = \bigl(S[0]\,b^{m-1} + S[1]\,b^{m-2} + \cdots + S[m-1]\bigr) \bmod q$$

For a window of length $m$ at position $i$ in a longer string $T$, the hash $H_i$ is computed over $T[i..i+m-1]$.

The rolling update rule exploits the overlap between consecutive windows:

$$H_{i+1} = \bigl[\, b\cdot\bigl(H_i - T[i]\,b^{m-1} \bmod q\bigr) + T[i+m] \,\bigr] \bmod q$$

where:

  • The outgoing high-order symbol $T[i]$, weighted by $b^{m-1}$, is subtracted and removed,
  • The resulting hash is multiplied by $b$ to shift digit positions,
  • The incoming symbol $T[i+m]$ is added at the lower-order position,
  • The final sum is reduced modulo $q$ to keep the values bounded.

Precomputing $h = b^{m-1} \bmod q$ once lets the subtracted term in each update be computed as $T[i]\cdot h \bmod q$ with a single multiplication, improving efficiency (Glück et al., 2022, Li et al., 20 May 2025, 0705.4676).
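The definition and the update rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `poly_hash` and `roll`, the parameter name `h_top`, and the choices $b=257$, $q=2^{61}-1$ are assumptions made for the example.

```python
def poly_hash(s: bytes, b: int, q: int) -> int:
    """Direct O(m) evaluation of H(S) using Horner's rule."""
    h = 0
    for c in s:
        h = (h * b + c) % q
    return h


def roll(h_i: int, outgoing: int, incoming: int, b: int, q: int, h_top: int) -> int:
    """One O(1) rolling step; h_top must equal b^(m-1) mod q."""
    # Python's % always returns a value in [0, q), so no extra fix-up is
    # needed when the subtraction goes negative.
    return ((h_i - outgoing * h_top) * b + incoming) % q


# Illustrative parameters (assumptions, not prescribed by the text).
B, Q = 257, (1 << 61) - 1
text, m = b"the quick brown fox", 5
h_top = pow(B, m - 1, Q)  # precomputed b^(m-1) mod q

h = poly_hash(text[:m], B, Q)
for i in range(len(text) - m):
    h = roll(h, text[i], text[i + m], B, Q, h_top)
    # The rolled value agrees with a from-scratch hash of the next window.
    assert h == poly_hash(text[i + 1 : i + 1 + m], B, Q)
```

In languages without arbitrary-precision integers, the product `outgoing * h_top` and the multiplication by `b` would need a wide intermediate type or an explicit modular multiplication.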

2. Parameterization, Collision Behavior, and Injectivity

Parameter choice is central to the method's collision resistance, update efficiency, and reversibility:

  • Base $b$: Typically a small prime slightly larger than the alphabet (e.g., $b=257$ for ASCII). Larger bases distribute hash values more widely but incur identical modular multiplication costs.
  • Modulus $q$: Preference is given to a large prime (e.g., $q\approx 2^{31}-1$ or matching the native field of cryptosystems). Collision probability behaves as $O(n/q)$, so $q$ on the order of $2^{60}$ or $2^{61}-1$ renders collisions negligible for texts of length up to $10^{12}$.
  • Coprimality: $b$ and $q$ are required to satisfy $\gcd(b,q)=1$, ensuring that modular multiplication is invertible and the rolling hash update is injective, a necessity for reversible computation (Glück et al., 2022).
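To get a feel for the $O(n/q)$ estimate, the ratio can simply be evaluated for candidate moduli. The sketch below drops the hidden constant, so it is an order-of-magnitude guide only; the helper name `collision_bound` and the specific moduli are assumptions for illustration.

```python
import math


def collision_bound(n: int, q: int) -> float:
    """Crude n/q estimate of the false-match probability (constant omitted)."""
    return n / q


b = 257
n = 10**12  # text length used in the discussion above
for label, q in [("2^31 - 1", 2**31 - 1), ("2^61 - 1", 2**61 - 1)]:
    # Both moduli are Mersenne primes, so gcd(b, q) = 1 holds for b = 257.
    assert math.gcd(b, q) == 1
    print(f"q = {label}: n/q ~ {collision_bound(n, q):.3g}")
```

For a text of this length, the 31-bit modulus gives a ratio above 1, so the estimate says nothing useful, while the 61-bit modulus keeps it below $10^{-6}$.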

For reversible implementations, the update

$$\phi(x) = \bigl[\, b\,(x -_q T[i]\,h) +_q T[i+m] \,\bigr] \bmod q$$

is injective under

  • $0 \leq x < q$,
  • $0 < b < q$ and $\gcd(b,q)=1$,
  • $0 \leq T[i], T[i+m] < b$.

Every modular arithmetic operation ($-_q$, $\cdot_q$, $+_q$) is injective in its primary argument under these conditions. Reversibility is achieved without extra space or information loss via in-place, bijective updates (Glück et al., 2022).
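One way to see the injectivity concretely is to write down an explicit inverse of the update. The sketch below is an illustration in ordinary Python, not the reversible-language construction of the cited work; the names `forward` and `backward` are hypothetical, and the inverse assumes both the outgoing and incoming symbols are still available from the text.

```python
def forward(x: int, out_sym: int, in_sym: int, b: int, q: int, h: int) -> int:
    """phi(x): slide the window one position to the right."""
    return (b * (x - out_sym * h) + in_sym) % q


def backward(y: int, out_sym: int, in_sym: int, b: int, q: int, h: int) -> int:
    """Inverse of phi: recover the previous window's hash from y."""
    b_inv = pow(b, -1, q)  # exists because gcd(b, q) = 1 (Python 3.8+)
    return ((y - in_sym) * b_inv + out_sym * h) % q


# Illustrative parameters (assumptions): b = 257, q = 2^61 - 1, window m = 4.
b, q, m = 257, (1 << 61) - 1, 4
h = pow(b, m - 1, q)

x = 123456789
y = forward(x, out_sym=97, in_sym=99, b=b, q=q, h=h)
assert backward(y, out_sym=97, in_sym=99, b=b, q=q, h=h) == x
```

If $\gcd(b,q) \neq 1$, `pow(b, -1, q)` raises `ValueError`, which mirrors the failure of invertibility in that case.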

3. Algorithmic Implementation and Complexity

A standard Rabin–Karp matcher operates in two phases:

  1. Initialization: Hashes for the pattern and the first text window are computed in $O(m)$. For hash update, the only precomputed value is $b^{m-1}\bmod q$.
  2. Sliding: For each shift $i$, the rolling hash update is performed in $O(1)$ time, with explicit computation as above. Matches on hash equality are validated through an $O(m)$ character-wise check to eliminate false positives due to hash collisions. The overall time complexity is $O(n+m)$ in expectation. In the worst (collision-saturated) case, complexity degrades to $O(nm)$ (Li et al., 20 May 2025, 0705.4676).

Pseudocode for rolling hash update (with `pow_b_m1` holding the precomputed $b^{m-1}\bmod q$):

Algorithm SlideHash(old_hash, outgoing_char, incoming_char)
    temp ← old_hash − outgoing_char⋅pow_b_m1
    temp ← temp mod q        // ensure non-negative
    new_hash ← (temp⋅b + incoming_char) mod q
    return new_hash

Memory overhead is $O(1)$, excluding storage of input strings (Li et al., 20 May 2025).
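Putting the two phases together, a complete matcher might look like the following sketch. The function name `rabin_karp` and the default parameters $b=257$, $q=2^{61}-1$ are assumptions chosen for illustration; inputs are taken as `bytes` so that symbols are already integers below $b$.

```python
def rabin_karp(text: bytes, pattern: bytes,
               b: int = 257, q: int = (1 << 61) - 1) -> list[int]:
    """Return every index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # Phase 1: O(m) initialization.
    h_top = pow(b, m - 1, q)  # b^(m-1) mod q
    hp = ht = 0
    for j in range(m):
        hp = (hp * b + pattern[j]) % q
        ht = (ht * b + text[j]) % q

    # Phase 2: slide the window, O(1) per shift plus verification on a hit.
    matches = []
    for i in range(n - m + 1):
        # The character-wise check removes false positives from collisions.
        if ht == hp and text[i:i + m] == pattern:
            matches.append(i)
        if i + m < n:
            ht = ((ht - text[i] * h_top) * b + text[i + m]) % q
    return matches


print(rabin_karp(b"abracadabra", b"abra"))  # [0, 7]
```

Because every hash hit is verified, the result stays correct even with a deliberately tiny modulus such as $q=101$; only the running time suffers as collisions become frequent.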

4. Statistical Independence and Limitations

No recursive (rolling) hash scheme, including the Rabin–Karp family, can achieve more than pairwise independence. Specifically, even randomized Karp–Rabin hashes fail to be pairwise independent for $n\geq 2$: distinct $n$-grams may collide at rates above the uniform baseline. Only specialized polynomial schemes, notably over $GF(2)[x]/p(x)$ with irreducible $p(x)$ and via independent random symbol mappings, are provably pairwise independent, with cost $O(L)$ per update for $L$-bit hashes (0705.4676).

Key results:

  • No 3-wise independence: For any recursive hash family, triple overlapping windows cannot be made fully independent due to the deterministic link introduced by the rolling recurrence.
  • Uniformity only: Randomized Karp–Rabin schemes can be tuned to produce uniform output, but pairwise (let alone $k$-wise) independence is unattainable except via computation from scratch with $O(n)$ cost (0705.4676).

Polynomial-based approaches over $GF(2)$:

  • General (irreducible) polynomials yield pairwise independence without bit discarding but require either $O(n)$ shifts or $O(2^n)$ memory for buffered operations.
  • Cyclic polynomials ($p(x)=x^L+1$): Roughly twice as fast as general polynomials, but require discarding $n-1$ hash bits to recover pairwise independence. The remaining $L-n+1$ bits then serve as a pairwise independent hash (0705.4676).
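The cyclic variant can be sketched with bit rotations, since multiplying by $x$ modulo $x^L+1$ over $GF(2)$ is a rotate-left by one bit. Everything concrete below is an assumption for illustration: $L=32$, the seeded table from `make_table`, the function names, and the choice to drop the lowest $n-1$ bits (one possible run of consecutive bits).

```python
import random

L = 32  # hash width in bits (illustrative assumption)
MASK = (1 << L) - 1


def rotl(x: int, r: int) -> int:
    """Rotate an L-bit value left by r bits."""
    r %= L
    return ((x << r) | (x >> (L - r))) & MASK


def make_table(seed: int = 0) -> list[int]:
    """Random L-bit value per byte symbol (the random symbol mapping)."""
    rng = random.Random(seed)
    return [rng.getrandbits(L) for _ in range(256)]


def cyclic_hash(window: bytes, table: list[int]) -> int:
    """From-scratch hash: XOR of table values rotated by position."""
    h = 0
    for c in window:
        h = rotl(h, 1) ^ table[c]
    return h


def cyclic_roll(h: int, outgoing: int, incoming: int, n: int, table: list[int]) -> int:
    """O(1) update for a window of n symbols."""
    return rotl(h, 1) ^ rotl(table[outgoing], n) ^ table[incoming]


def pairwise_bits(h: int, n: int) -> int:
    """Drop n - 1 bits (here the lowest ones), keeping L - n + 1 bits."""
    return h >> (n - 1)


table = make_table()
text, n = b"abracadabra", 4
h = cyclic_hash(text[:n], table)
for i in range(len(text) - n):
    h = cyclic_roll(h, text[i], text[i + n], n, table)
    assert h == cyclic_hash(text[i + 1 : i + 1 + n], table)
```

The update uses only rotations and XORs, which is why no buffer or multiplication table is needed beyond the symbol mapping itself.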

5. Practical Applications: High-Throughput Matching and Cryptographic Protocols

The Rabin–Karp rolling hash is foundational in substring search and $n$-gram analytics. It is deployed in applications such as plagiarism detection, document fingerprinting, and frequency analysis due to its $O(1)$ update and low false-positive rate in well-chosen parameter regimes. In cryptographic and privacy-preserving regimes, Rabin–Karp enables efficient zero-knowledge proofs for substring inclusion without text or pattern disclosure, as in zk-SNARK-based constructions (Li et al., 20 May 2025). In such protocols:

  • The rolling hash function is instantiated within arithmetic circuits native to the proving system, with parameters chosen to match the underlying field, obviating explicit modular reduction.
  • All internal hash values remain secret, and only compact, verifiable assertions of substring presence are revealed, ensuring both privacy and computational efficiency.

The rolling hash update integrates into circuit design with linear constraint complexity in $n$, with succinct, zero-knowledge proofs generated for pattern matching; the method scales efficiently to large input sizes (Li et al., 20 May 2025).

6. Performance Benchmarks and Memory-Throughput Trade-Offs

Comparative benchmarking yields empirical insights:

  • Randomized Karp–Rabin (ID37): Fastest for uniformity-only tasks (e.g., up to 0.3 s for $100$M $n$-gram hashes), but has higher collision rates in large-scale tests; not pairwise independent.
  • Polynomial (General, $GF(2)$ with irreducible $p$): Throughput of $3.5$–$4$ million updates per second; pairwise independent; higher memory cost if buffering is used.
  • Polynomial (Cyclic): Throughput of $6$–$8$ million updates per second; requires hash-bit dropping to achieve pairwise independence; memory-efficient (no buffer required).
  • Non-recursive 3-wise independent schemes: Cost is $O(n)$ per update; becomes less practical as window size $n$ increases.

Memory overhead is dominated by symbol-to-random-value tables of size $O(L|\Sigma|)$; the cost of exponential buffering arises only for certain polynomial schemes that require $O(2^n)$ storage to keep update complexity minimal (0705.4676).

| Hash Family | Independence | Per-update Speed | Memory Cost |
|---|---|---|---|
| Randomized Karp–Rabin | Uniform only | Fastest | Small |
| Polynomial, General | Pairwise | Moderate | $O(L\lvert\Sigma\rvert)+O(2^n)$ (optional) |
| Polynomial, Cyclic | Pairwise* | Fastest | $O(L\lvert\Sigma\rvert)$ |
| Non-recursive 3-wise | $k$-wise | Slower ($O(n)$) | Large for table |

*After discarding $n-1$ bits.

7. Worked Example and Integration with Reversible and Privacy-Preserving Algorithms

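For illustrative purposes, consider the text “abracadabra” and pattern “abra” with $b=256$, $q=101$, $m=4$, using ASCII codes for the symbols (so $b \bmod q = 54$):

  • Precompute: $b^{3}\bmod 101 = 54^{3}\bmod 101 = 5$.
  • $h(\text{"abra"})$ is calculated by Horner's rule, with intermediate values $97 \to 84 \to 4 \to 10$, giving $h = 10$.
  • First window ("abra"): its hash is also $10$, and the character-wise check confirms a match at position 0.
  • Next window ("brac"): Compute $(10 - 97 \cdot 5) \bmod 101 = 30$, then $(30 \cdot 54 + 99) \bmod 101 = 2$. Since $2 \neq 10$, the window is rejected without a character comparison.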
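The arithmetic in this example is easy to recompute from the stated parameters. The short script below is a sketch for checking each quantity; the variable names (`h_top`, `stripped`, and so on) are assumptions introduced for illustration, and symbols are taken as ASCII codes.

```python
b, q, m = 256, 101, 4
text, pattern = b"abracadabra", b"abra"


def poly_hash(s: bytes) -> int:
    """Horner evaluation of the polynomial hash with the parameters above."""
    h = 0
    for c in s:
        h = (h * b + c) % q
    return h


h_top = pow(b, m - 1, q)        # b^(m-1) mod q
hp = poly_hash(pattern)         # pattern hash
h0 = poly_hash(text[:m])        # hash of the first window
# Rolling step to the second window: remove text[0], shift, add text[m].
stripped = (h0 - text[0] * h_top) % q
h1 = (stripped * (b % q) + text[m]) % q

print("b^(m-1) mod q =", h_top)
print("pattern hash  =", hp)
print("window 0 hash =", h0)
print("after removal =", stripped)
print("window 1 hash =", h1)

# The rolled value agrees with hashing the second window from scratch.
assert h1 == poly_hash(text[1 : 1 + m])
```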

In reversible algorithm design, the injectivity of the modular steps allows for clean, space-free construction without irreversible state updates. Within zk-SNARK circuits, parameters ($b$, $q$, $b^{m-1}\bmod q$) are hardcoded, the input strings are private, and only the compiled proof of substring presence is revealed (Li et al., 20 May 2025, Glück et al., 2022).
