Pattern Masking for Dictionary Matching (2006.16137v2)
Abstract: In the Pattern Masking for Dictionary Matching (PMDM) problem, we are given a dictionary $\mathcal{D}$ of $d$ strings, each of length $\ell$, a query string $q$ of length $\ell$, and a positive integer $z$, and we are asked to compute a smallest set $K\subseteq\{1,\ldots,\ell\}$, so that if $q[i]$, for all $i\in K$, is replaced by a wildcard, then $q$ matches at least $z$ strings from $\mathcal{D}$. The PMDM problem lies at the heart of two important applications featured in large-scale real-world systems: record linkage of databases that contain sensitive information, and query term dropping. In both applications, solving PMDM allows for providing data utility guarantees as opposed to existing approaches. We first show, through a reduction from the well-known $k$-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We present a data structure for PMDM that answers queries over $\mathcal{D}$ in time $\mathcal{O}(2^{\ell/2}(2^{\ell/2}+\tau)\ell)$ and requires space $\mathcal{O}(2^{\ell}d^2/\tau^2+2^{\ell/2}d)$, for any parameter $\tau\in[1,d]$. We also approach the problem from a more practical perspective. We show an $\mathcal{O}((d\ell)^{k/3}+d\ell)$-time and $\mathcal{O}(d\ell)$-space algorithm for PMDM if $k=|K|=\mathcal{O}(1)$. We generalize our exact algorithm to mask multiple query strings simultaneously. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., SODA 2017]. This gives a polynomial-time $\mathcal{O}(d^{1/4+\epsilon})$-approximation algorithm for PMDM, which is tight under plausible complexity conjectures.
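To make the problem statement concrete, the following is a minimal brute-force sketch of PMDM in Python. It is not the data structure or the exact algorithm from the paper: it simply tries masks in order of increasing size and therefore takes exponential time in $\ell$. The function names `matches` and `pmdm_bruteforce` are hypothetical, and positions are 0-indexed here, whereas the abstract uses $1,\ldots,\ell$.

```python
from itertools import combinations


def matches(masked, q, s):
    """Return True if q, with the positions in `masked` wildcarded, matches s."""
    return all(q[i] == s[i] for i in range(len(q)) if i not in masked)


def pmdm_bruteforce(dictionary, q, z):
    """Return a smallest set K of 0-indexed positions such that masking q at K
    makes q match at least z dictionary strings, or None if no such K exists.

    Illustrative only: enumerates all subsets by increasing size.
    """
    ell = len(q)
    for k in range(ell + 1):                    # try smaller masks first
        for positions in combinations(range(ell), k):
            masked = set(positions)
            hits = sum(1 for s in dictionary if matches(masked, q, s))
            if hits >= z:
                return masked
    return None                                 # happens only when z > len(dictionary)


if __name__ == "__main__":
    D = ["abc", "abd", "xbc", "aaa"]
    print(pmdm_bruteforce(D, "abc", 2))   # one wildcard suffices
    print(pmdm_bruteforce(D, "abc", 3))   # two wildcards are needed
```

In this toy dictionary, masking the first position of `abc` already yields two matches (`abc` and `xbc`), while reaching three matches requires masking the first and last positions.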
- IBM Synthetic Data Generator for Itemsets and Sequences. https://github.com/zakimjz/IBMGenerator, September 2020.
- North Carolina Voter Registration database (dataset ncvoter_Statewide.zip). https://dl.ncsbe.gov/?prefix=data/, September 2020.
- Secure critical data with Oracle Data Safe (white paper). https://www.oracle.com/a/tech/docs/dbsec/data-safe/wp-security-data-safe.pdf, September 2020.
- If the current clique algorithms are optimal, so is Valiant’s parser. SIAM Journal on Computing, 47(6):2527–2555, 2018. doi:10.1137/16M1061771.
- Data structure lower bounds for document indexing problems. In 43rd International Colloquium on Automata, Languages and Programming (ICALP 2016), volume 55 of LIPIcs, pages 93:1–93:15, 2016. doi:10.4230/LIPIcs.ICALP.2016.93.
- Efficient string matching: An aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975. doi:10.1145/360825.360855.
- A refined laser method and faster matrix multiplication. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, pages 522–539. SIAM, 2021. doi:10.1137/1.9781611976465.32.
- Benny Applebaum. Pseudorandom generators with long stretch and low locality from random local one-way functions. SIAM Journal on Computing, 42(5):2008–2037, 2013. doi:10.1137/120884857.
- An efficient polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. Journal of Combinatorial Optimization, 13(3):243–262, 2007. doi:10.1007/s10878-006-9029-1.
- Estimating confidence for query revision models, U.S. Patent US7617205B2 (granted to Google), 2009.
- How well do automated linking methods perform? Lessons from U.S. historical data. NBER Working Papers 24019, National Bureau of Economic Research, Inc, 2017. doi:10.3386/w24019.
- Masking patterns in sequences: A new class of motif discovery with don’t cares. Theoretical Computer Science, 410(43):4327–4340, 2009. doi:10.1016/j.tcs.2009.07.014.
- Djamal Belazzougui. Faster and space-optimal edit distance "1" dictionary. In 20th Annual Symposium on Combinatorial Pattern Matching (CPM 2009), volume 5577 of Lecture Notes in Computer Science, pages 154–167. Springer, 2009. doi:10.1007/978-3-642-02441-2_14.
- Compressed string dictionary search with edit distance one. Algorithmica, 74(3):1099–1122, 2016. doi:10.1007/s00453-015-9990-0.
- String indexing for patterns with wildcards. Theory of Computing Systems, 55(1):41–60, 2014. doi:10.1007/s00224-013-9498-4.
- Lower bounds for high dimensional nearest neighbor search and related problems. In 31st ACM Symposium on Theory of Computing (STOC 1999), pages 312–321, 1999. doi:10.1145/301250.301330.
- Improved bounds for dictionary look-up with one error. Information Processing Letters, 75(1-2):57–59, 2000. doi:10.1016/S0020-0190(00)00079-X.
- The complexity of satisfiability of small depth circuits. In Parameterized and Exact Computation, 4th International Workshop (IWPEC 2009), volume 5917 of Lecture Notes in Computer Science, pages 75–85. Springer, 2009. doi:10.1007/978-3-642-11269-0_6.
- Compressed indexes for approximate string matching. Algorithmica, 58(2):263–281, 2010. doi:10.1007/s00453-008-9263-2.
- Pattern Masking for Dictionary Matching. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021), volume 212, pages 65:1–65:19, Dagstuhl, Germany, 2021. doi:10.4230/LIPIcs.ISAAC.2021.65.
- New algorithms for subset query, partial match, orthogonal range searching, and related problems. In 29th International Colloquium on Automata, Languages and Programming (ICALP 2002), pages 451–462, 2002. doi:10.1007/3-540-45465-9_39.
- Strong computational lower bounds via parameterized complexity. Journal of Computer and System Sciences, 72(8):1346–1367, 2006. doi:10.1016/j.jcss.2006.04.007.
- The densest k-subhypergraph problem. SIAM Journal on Discrete Mathematics, 32(2):1458–1477, 2018. doi:10.1137/16M1096402.
- Minimizing the union: Tight approximations for small set bipartite vertex expansion. In 28th ACM-SIAM Symposium on Discrete Algorithms (SODA 2017), pages 881–899, 2017. doi:10.1137/1.9781611974782.56.
- Peter Christen. Data Matching – Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer, Heidelberg, 2012. doi:10.1007/978-3-642-31164-2.
- Automatic discovery of abnormal values in large textual databases. J. Data and Information Quality, 7(1–2), 2016. doi:10.1145/2889311.
- Linking Sensitive Data. Springer, Heidelberg, 2020. doi:10.1007/978-3-030-59706-1.
- Lower bounds for text indexing with mismatches and differences. In 30th ACM-SIAM Symposium on Discrete Algorithms (SODA 2019), pages 1146–1164, 2019. doi:10.1137/1.9781611975482.70.
- Dictionary matching and indexing with errors and don’t cares. In 36th ACM Symposium on Theory of Computing (STOC 2004), pages 91–100, 2004. doi:10.1145/1007352.1007374.
- Data masking techniques for nosql database security: A systematic review. In 2017 IEEE International Conference on Big Data (BigData 2017), pages 4467–4473, 2017. doi:10.1109/BigData.2017.8258486.
- Parameterized Algorithms. Springer, 2015. doi:10.1007/978-3-319-21275-3.
- Efficient mining of closed repetitive gapped subsequences from a sequence database. In 25th IEEE International Conference on Data Engineering (ICDE), pages 1024–1035, 2009. doi:10.1109/ICDE.2009.104.
- Composite bloom filters for secure record linkage. IEEE Transactions on Knowledge and Data Engineering, 26(12):2956–2968, 2014. doi:10.1109/TKDE.2013.91.
- Suffix tree characterization of maximal motifs in biological sequences. Theor. Comput. Sci., 410(43):4391–4401, 2009. doi:10.1016/J.TCS.2009.07.020.
- Storing a sparse table with $O(1)$ worst case access time. Journal of the ACM, 31(3):538–544, 1984. doi:10.1145/828.1884.
- Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv., 42(4):14:1–14:53, 2010. doi:10.1145/1749603.1749605.
- François Le Gall. Powers of tensors and fast matrix multiplication. In International Symposium on Symbolic and Algebraic Computation, ISSAC 2014, pages 296–303. ACM, 2014. doi:10.1145/2608628.2608664.
- Weighted ancestors in suffix trees. In Algorithms - 22th Annual European Symposium (ESA 2014), volume 8737 of Lecture Notes in Computer Science, pages 455–466. Springer, 2014. doi:10.1007/978-3-662-44777-2_38.
- Efficient query rewrite for structured web queries. In 20th ACM International Conference on Information and Knowledge Management (CIKM 2011), pages 2417–2420, 2011. doi:10.1145/2063576.2063981.
- Motif trie: An efficient text index for pattern discovery with don’t cares. Theoretical Computer Science, 710:74–87, 2018. doi:10.1016/j.tcs.2017.04.012.
- MADMX: A strategy for maximal dense motif extraction. J. Comput. Biol., 18(4):535–545, 2011. doi:10.1089/CMB.2010.0177.
- Johan Håstad. Clique is hard to approximate within $n^{1-\epsilon}$. Acta Mathematica, 182:105–142, 1999. doi:10.1007/BF02392825.
- Data quality and record linkage techniques. Springer, 2007.
- General algorithms for mining closed flexible patterns under various equivalence relations. In Machine Learning and Knowledge Discovery in Databases, pages 435–450, 2012. doi:10.1007/978-3-642-33486-3_28.
- A survey of top-k query processing techniques in relational database systems. ACM Computing Surveys, 40(4), 2008. doi:10.1145/1391729.1391730.
- On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367–375, 2001. doi:10.1006/jcss.2000.1727.
- Cell-probe lower bounds for the partial match problem. Journal of Computer and System Sciences, 69(3):435–447, 2004. doi:10.1016/j.jcss.2004.04.006.
- Summarizing and linking electronic health records. Distributed and Parallel Databases, pages 1–40, 2019. doi:10.1007/s10619-019-07263-0.
- Richard M. Karp. Reducibility among combinatorial problems. In 50 Years of Integer Programming 1958-2008 - From the Early Years to the State-of-the-Art, pages 219–241. Springer, 2010. doi:10.1007/978-3-540-68279-0_8.
- The Multiple-Choice Knapsack Problem, pages 317–347. Springer Berlin Heidelberg, 2004. doi:10.1007/978-3-540-24777-7_11.
- Technical perspective: Toward building entity matching management systems. SIGMOD Record, 47(1):33–40, 2018. doi:10.1145/3277006.3277015.
- Privacy preserving interactive record linkage (PPIRL). Journal of the American Medical Informatics Association, 21(2):212–220, 2014. doi:10.1136/amiajnl-2013-002165.
- Enhancing privacy through an interactive on-demand incremental information disclosure interface: Applying privacy-by-design to record linkage. In Fifteenth USENIX Conference on Usable Privacy and Security, pages 175–189, 2019. doi:10.5555/3361476.3361489.
- Systems and methods for generating search query rewrites, U.S. Patent US10108712B2 (granted to ebay), 2018.
- Less space: Indexing for queries with wildcards. Theoretical Computer Science, 557:120–127, 2014. doi:10.1016/j.tcs.2014.09.003.
- Space-efficient string indexing for wildcard pattern matching. In 31st Symposium on Theoretical Aspects of Computer Science (STACS 2014), pages 506–517, 2014. doi:10.4230/LIPIcs.STACS.2014.506.
- Tight hardness for shortest cycles and paths in sparse graphs. In 29th ACM-SIAM Symposium on Discrete Algorithms (SODA 2018), pages 1236–1252, 2018. doi:10.1137/1.9781611975031.80.
- On data structures and asymmetric communication complexity. Journal of Computer and System Sciences, 57(1):37–49, 1998. doi:10.1006/jcss.1998.1577.
- Blocking and filtering techniques for entity resolution: A survey. ACM Computing Surveys, 53(2), 2020. doi:10.1145/3377455.
- A basis of tiling motifs for generating repeated patterns and its complexity for higher quorum. In 28th International Symposium on Mathematical Foundations of Computer Science 2003 (MFCS), volume 2747 of Lecture Notes in Computer Science, pages 622–631. Springer, 2003. doi:10.1007/978-3-540-45138-9_56.
- Bases of motifs for generating repeated patterns with wild cards. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2(1):40–50, 2005. doi:10.1109/TCBB.2005.5.
- Mihai Pǎtraşcu. Unifying the landscape of cell-probe lower bounds. SIAM Journal on Computing, 40(3):827–847, 2011. doi:10.1137/09075336X.
- Higher lower bounds for near-neighbor and further rich problems. SIAM Journal on Computing, 39(2):730–741, 2009. doi:10.1137/070684859.
- Balancing privacy and information disclosure in interactive record linkage with visual masking. In ACM Conference on Human Factors in Computing Systems (CHI 2018), 2018. doi:10.1145/3173574.3173900.
- Ronald L. Rivest. Partial-match retrieval algorithms. SIAM Journal on Computing, 5(1):19–50, 1976. doi:10.1137/0205003.
- Pierangela Samarati. Protecting respondents’ identities in microdata release. IEEE Trans. Knowl. Data Eng., 13(6):1010–1027, 2001. doi:10.1109/69.971193.
- Generalizing data to provide anonymity when disclosing information (abstract). In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 1998), page 188. Association for Computing Machinery, 1998. doi:10.1145/275487.275508.
- Protecting privacy when disclosing information: $k$-anonymity and its enforcement through generalization and suppression. Technical report, Computer Science Laboratory, SRI International, 1998.
- A data masking technique for data warehouses. In 15th International Database Engineering and Applications Symposium (IDEAS 2011), pages 61–69, 2011. doi:10.1145/2076623.2076632.
- Latanya Sweeney. Computational disclosure control: a primer on data privacy protection. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2001. URL: http://hdl.handle.net/1721.1/8589.
- Query rewrite for null and low search results in ecommerce. In SIGIR Workshop On eCommerce, volume 2311 of CEUR Workshop Proceedings, 2017.
- Yufei Tao. Entity matching with active monotone classification. In 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2018), pages 49–62, 2018. doi:10.1145/3196959.3196984.
- Scalable privacy-preserving record linkage for multiple databases. In 23rd ACM International Conference on Information and Knowledge Management (CIKM 2014), pages 1795–1798, 2014. doi:10.1145/2661829.2661875.
- Privacy-preserving record linkage for Big Data: Current approaches and research challenges. In Handbook of Big Data Technologies, pages 851–895. Springer, 2017. doi:10.1007/978-3-319-49340-4.
- Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1–11. IEEE Computer Society, 1973. doi:10.1109/SWAT.1973.13.
- Virginia Vassilevska Williams. On some fine-grained questions in algorithms and complexity. In 2018 International Congress of Mathematicians (ICM), pages 3447–3487, 2019. doi:10.1142/9789813272880_0188.
- Dictionary look-up with one error. Journal of Algorithms, 25(1):194–202, 1997. doi:10.1006/jagm.1997.0875.
- David Zuckerman. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory of Computing, 3(1):103–128, 2007. doi:10.4086/toc.2007.v003a006.