Optimal byte partitioning for Parachute string fingerprints
Construct an optimal partition of the byte space [0, 255] into k clusters for the BytePartitioning objective. The input is a list of UTF-8-encoded strings used by Parachute's string-fingerprint scheme. Each pbw-bit fingerprint sets the cluster bit of every byte present in its string, and the partition should minimize the total number of ones across all fingerprints, thereby minimizing false positives when translating SQL LIKE predicates.
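To make the objective concrete, here is a minimal Python sketch of how the cost of a given partition could be evaluated. It assumes one fingerprint bit per cluster (so the fingerprint width equals k) and represents a partition as a 256-entry table mapping each byte to its cluster index. The names `fingerprint`, `total_ones`, and `cluster_of` are hypothetical and not taken from Parachute.

```python
from typing import Iterable, Sequence


def fingerprint(s: str, cluster_of: Sequence[int]) -> int:
    """Return a bitmask with bit c set if any UTF-8 byte of s lies in cluster c.

    cluster_of is an assumed representation: a 256-entry table giving
    the cluster index of each byte value.
    """
    fp = 0
    for b in s.encode("utf-8"):
        fp |= 1 << cluster_of[b]
    return fp


def total_ones(strings: Iterable[str], cluster_of: Sequence[int]) -> int:
    """Objective value: number of set bits summed over all fingerprints."""
    return sum(bin(fingerprint(s, cluster_of)).count("1") for s in strings)
```

Under this reading, a partition is better when bytes that tend to occur together in the same strings share a cluster, because co-occurring bytes then set one bit instead of several.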
References
Given a list of [0, 255]-valued strings and a parameter k, partition the byte space into k clusters such that the number of ones in the fingerprints is minimized. We leave it as future work to build these partitions optimally. Currently, we assume a uniform distribution over the bytes, i.e., we use a Round-Robin strategy to build the partitions.
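The Round-Robin baseline mentioned above could look roughly like the following Python sketch. It assumes that "Round-Robin" means assigning byte b to cluster b mod k; the excerpt does not spell out the exact assignment, so treat this as one plausible reading. The helper names `round_robin_partition` and `cluster_sizes` are hypothetical.

```python
from typing import List, Sequence


def round_robin_partition(k: int) -> List[int]:
    """Assign byte b to cluster b % k (assumed meaning of Round-Robin).

    Returns a 256-entry table mapping each byte value to a cluster index.
    """
    if not 1 <= k <= 256:
        raise ValueError("k must be between 1 and 256")
    return [b % k for b in range(256)]


def cluster_sizes(cluster_of: Sequence[int], k: int) -> List[int]:
    """Count how many byte values fall into each of the k clusters."""
    sizes = [0] * k
    for c in cluster_of:
        sizes[c] += 1
    return sizes
```

This assignment ignores the input strings entirely: cluster sizes differ by at most one byte, which is balanced only if all bytes are equally likely. Real text is far from uniform over bytes, which is why a data-dependent partition is posed as the open problem.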