Optimal byte partitioning for Parachute string fingerprints

Given a list of UTF-8-encoded strings used by Parachute’s string-fingerprint scheme, construct a partition of the byte space [0, 255] into k clusters that is optimal for the BytePartitioning objective. The resulting pbw-bit fingerprints, which set the cluster bit of every byte present in a string, should contain the minimal number of ones, thereby minimizing false positives when translating SQL LIKE predicates.

Background

Parachute supports high-cardinality string columns by encoding each string into a compact pbw-bit fingerprint: the UTF-8 byte space [0, 255] is partitioned into k clusters (where k = pbw), and a bit is set for every cluster that contains a byte appearing in the string. This enables translating SQL LIKE patterns into subset checks on these fingerprints while avoiding false negatives.
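The encoding and the subset check described above can be sketched in a few lines of Python. This is a minimal illustration, not Parachute's implementation: the function names are hypothetical, the round-robin rule is assumed to be `byte % k`, and the LIKE pattern is assumed to be already reduced to its literal text (wildcards stripped).

```python
from typing import List


def round_robin_partition(k: int) -> List[int]:
    """Assumed round-robin rule: byte value b goes to cluster b % k."""
    return [b % k for b in range(256)]


def fingerprint(s: str, cluster_of: List[int]) -> int:
    """Return a k-bit fingerprint as an int.

    Bit c is set if at least one UTF-8 byte of s belongs to cluster c.
    """
    fp = 0
    for b in s.encode("utf-8"):
        fp |= 1 << cluster_of[b]
    return fp


def may_match(pattern_literal: str, fp: int, cluster_of: List[int]) -> bool:
    """Subset check: all cluster bits of the pattern must be set in fp.

    A False result proves the string cannot contain the literal;
    a True result may still be a false positive.
    """
    pattern_fp = fingerprint(pattern_literal, cluster_of)
    return pattern_fp & ~fp == 0


cluster_of = round_robin_partition(16)  # k = pbw = 16, chosen for illustration
fp = fingerprint("parachute", cluster_of)

print(may_match("chute", fp, cluster_of))  # True: a real substring always passes
print(may_match("xyz", fp, cluster_of))    # False: pruned by the fingerprint
print(may_match("x", fp, cluster_of))      # True: false positive, 'x' and 'h' share a cluster
```

The last call shows where false positives come from: under this assumed partition, `x` (120) and `h` (104) both land in cluster 8, so a string containing `h` cannot be ruled out for a pattern containing `x`.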

Selectivity depends on how the bytes are partitioned into clusters: the more cluster bits a string’s bytes set, the more LIKE patterns pass the subset check against its fingerprint, so fingerprints with many ones lose pruning power. The authors formalize this as the BytePartitioning problem: given a corpus of strings and a cluster count k, partition [0, 255] into k clusters to minimize the number of ones in the fingerprints. They currently use a simple round-robin assignment, which assumes a uniform distribution over the bytes, and explicitly defer finding the optimal partitioning to future work.
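To make the objective concrete, the sketch below evaluates the number of ones a given partition produces over a corpus. It only scores candidate partitions and does not solve the optimization problem. The function name `total_ones`, the toy corpus, the `byte % k` round-robin rule, and the hand-built `grouped` partition are all assumptions introduced for illustration.

```python
from typing import Iterable, List


def total_ones(strings: Iterable[str], cluster_of: List[int]) -> int:
    """BytePartitioning cost: total number of set bits over all fingerprints.

    For each string this is the number of distinct clusters hit by its bytes.
    """
    total = 0
    for s in strings:
        clusters_hit = {cluster_of[b] for b in s.encode("utf-8")}
        total += len(clusters_hit)
    return total


corpus = ["abab", "cdcd", "abba"]  # toy corpus, not from the paper
k = 2

# Assumed round-robin rule: byte value b goes to cluster b % k.
round_robin = [b % k for b in range(256)]

# Hypothetical alternative: put bytes that co-occur into the same cluster.
grouped = [0] * 256
grouped[ord("c")] = 1
grouped[ord("d")] = 1

print(total_ones(corpus, round_robin))  # 6: every string hits both clusters
print(total_ones(corpus, grouped))      # 3: every string hits one cluster
```

On this toy input, round-robin splits `a`/`b` and `c`/`d` across the two clusters, so every fingerprint has two ones, whereas the data-aware grouping halves the cost. This is the kind of gap an optimal partitioning would aim to exploit.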

References

Given a list of [0, 255]-valued strings and a parameter k, partition the byte space into k clusters such that the number of ones in the fingerprints is minimized. We leave it as future work to build these partitions optimally. Currently, we assume a uniform distribution over the bytes, i.e., we use a Round-Robin strategy to build the partitions.

Parachute: Single-Pass Bi-Directional Information Passing (2506.13670 - Stoian et al., 16 Jun 2025) in Section 3.2, Supporting String Columns (Byte Partitions), Definition (BytePartitioning)