String Sanitization Under Edit Distance: Improved and Generalized (2007.08179v2)
Abstract: Let $W$ be a string of length $n$ over an alphabet $\Sigma$, $k$ be a positive integer, and $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $\Sigma$ (and thus the frequency) is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $\mathcal{S}$ represents a set of confidential patterns, the ETFS problem asks for transforming $W$ to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in $\mathcal{O}(n^2 k)$ time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in $\mathcal{O}(n^{2-\delta})$ time, for any $\delta>0$, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an $\mathcal{O}(n^2\log^2 k)$-time algorithm to solve ETFS; and (ii) an $\mathcal{O}(n^2\log^2 n)$-time algorithm to solve AETFS, a generalization of ETFS in which the elements of $\mathcal{S}$ can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
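To make conditions (i) to (iii) concrete, the following minimal sketch checks whether a candidate string satisfies (i) and (ii) and measures its edit distance to $W$. It is not the algorithm of the paper and does not search for an optimal $X_{\mathrm{ED}}$. The function names (`kgrams`, `is_feasible`, `edit_distance`), the toy input, and the use of a separator letter `#` outside $\Sigma$ are assumptions made for illustration only.

```python
def kgrams(text, k, alphabet):
    """Length-k windows of text, left to right, that use only letters of alphabet."""
    return [text[i:i + k] for i in range(len(text) - k + 1)
            if all(c in alphabet for c in text[i:i + k])]


def is_feasible(w, x, k, sensitive, alphabet):
    """Check conditions (i) and (ii) for a candidate x (hypothetical helper)."""
    # (i) no sensitive pattern occurs in x
    if any(s in x for s in sensitive):
        return False
    # (ii) the sequence of non-sensitive length-k substrings over the
    # alphabet is identical in w and in x (same order, same multiplicity)
    keep_w = [g for g in kgrams(w, k, alphabet) if g not in sensitive]
    keep_x = [g for g in kgrams(x, k, alphabet) if g not in sensitive]
    return keep_w == keep_x


def edit_distance(a, b):
    """Textbook quadratic-time dynamic program for unit-cost edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


# Toy instance (assumed for illustration): alphabet {a, b}, k = 4.
W = "aabaaaababbbaab"
S = {"aaaa", "baaa", "bbaa"}
X = "aabaa#aaababbba#baab"   # '#' is an assumed separator not in the alphabet

print(is_feasible(W, X, 4, S, {"a", "b"}))
print(edit_distance(W, X))
```

The check only certifies feasibility of a given candidate; condition (iii) additionally requires that no feasible string is closer to $W$, which is the part the algorithms in the abstract address.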
- Hiding sequential and spatiotemporal patterns. IEEE Trans. Knowl. Data Eng., 22(12):1709–1723, 2010. URL: https://doi.org/10.1109/TKDE.2009.213, doi:10.1109/TKDE.2009.213.
- Knowledge hiding from tree and graph databases. Data Knowl. Eng., 72:148–171, 2012. URL: https://doi.org/10.1016/j.datak.2011.10.002, doi:10.1016/j.datak.2011.10.002.
- Geometric applications of a matrix-searching algorithm. Algorithmica, 2(1-4):195–208, 1987.
- Charu C. Aggarwal. Applications of frequent pattern mining. In Charu C. Aggarwal and Jiawei Han, editors, Frequent Pattern Mining, pages 443–467. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-07821-2_18, doi:10.1007/978-3-319-07821-2_18.
- Mining sequential patterns. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings of the Eleventh International Conference on Data Engineering, March 6-10, 1995, Taipei, Taiwan, pages 3–14. IEEE Computer Society, 1995. URL: https://doi.org/10.1109/ICDE.1995.380415, doi:10.1109/ICDE.1995.380415.
- On economical construction of the transitive closure of an oriented graph. Doklady Akademii Nauk, 194(3):487–488, 1970.
- String sanitization: A combinatorial approach. In Ulf Brefeld, Élisa Fromont, Andreas Hotho, Arno J. Knobbe, Marloes H. Maathuis, and Céline Robardet, editors, Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2019, Würzburg, Germany, September 16-20, 2019, Proceedings, Part I, volume 11906 of Lecture Notes in Computer Science, pages 627–644. Springer, 2019. URL: https://doi.org/10.1007/978-3-030-46150-8_37, doi:10.1007/978-3-030-46150-8_37.
- Combinatorial algorithms for string sanitization. ACM Trans. Knowl. Discov. Data, 15(1), December 2020. URL: https://doi.org/10.1145/3418683, doi:10.1145/3418683.
- String Sanitization Under Edit Distance. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020), volume 161 of Leibniz International Proceedings in Informatics (LIPIcs), pages 7:1–7:14, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/12132, doi:10.4230/LIPIcs.CPM.2020.7.
- Hide and mine in strings: Hardness and algorithms. In Claudia Plant, Haixun Wang, Alfredo Cuzzocrea, Carlo Zaniolo, and Xindong Wu, editors, 20th IEEE International Conference on Data Mining, ICDM 2020, Sorrento, Italy, November 17-20, 2020, pages 924–929. IEEE, 2020. URL: https://doi.org/10.1109/ICDM50108.2020.00103, doi:10.1109/ICDM50108.2020.00103.
- An information-theoretic approach to individual sequential data sanitization. In Paul N. Bennett, Vanja Josifovski, Jennifer Neville, and Filip Radlinski, editors, Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, San Francisco, CA, USA, February 22-25, 2016, pages 337–346. ACM, 2016. URL: https://doi.org/10.1145/2835776.2835828, doi:10.1145/2835776.2835828.
- Quadratic conditional lower bounds for string problems and dynamic time warping. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 79–97. IEEE Computer Society, 2015. URL: https://doi.org/10.1109/FOCS.2015.15, doi:10.1109/FOCS.2015.15.
- A succinct four Russians speedup for edit distance computation and one-against-many banded alignment. In Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
- Mining moving patterns for predicting next location. Inf. Syst., 54:156–168, 2015. URL: https://doi.org/10.1016/j.is.2015.07.001, doi:10.1016/j.is.2015.07.001.
- Security and privacy implications of data mining. In ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pages 15–19, 1996.
- A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM Journal on Computing, 32(6):1654–1673, 2003.
- Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984. URL: https://doi.org/10.1145/828.1884, doi:10.1145/828.1884.
- Revisiting sequential pattern hiding to enhance utility. In Chid Apté, Joydeep Ghosh, and Padhraic Smyth, editors, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 21-24, 2011, pages 1316–1324. ACM, 2011. URL: https://doi.org/10.1145/2020408.2020605, doi:10.1145/2020408.2020605.
- An integer programming approach for frequent itemset hiding. In Philip S. Yu, Vassilis J. Tsotras, Edward A. Fox, and Bing Liu, editors, Proceedings of the 2006 ACM CIKM International Conference on Information and Knowledge Management, Arlington, Virginia, USA, November 6-11, 2006, pages 748–757. ACM, 2006. URL: https://doi.org/10.1145/1183614.1183721, doi:10.1145/1183614.1183721.
- Exact knowledge hiding through database extension. IEEE Trans. Knowl. Data Eng., 21(5):699–713, 2009. URL: https://doi.org/10.1109/TKDE.2008.199, doi:10.1109/TKDE.2008.199.
- Permutation-based sequential pattern hiding. In Hui Xiong, George Karypis, Bhavani M. Thuraisingham, Diane J. Cook, and Xindong Wu, editors, 2013 IEEE 13th International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013, pages 241–250. IEEE Computer Society, 2013. URL: https://doi.org/10.1109/ICDM.2013.57, doi:10.1109/ICDM.2013.57.
- On the complexity of k-sat. J. Comput. Syst. Sci., 62(2):367–375, 2001. URL: https://doi.org/10.1006/jcss.2000.1727, doi:10.1006/jcss.2000.1727.
- Which problems have strongly exponential complexity? J. Comput. Syst. Sci., 63(4):512–530, 2001. URL: https://doi.org/10.1006/jcss.2001.1774, doi:10.1006/jcss.2001.1774.
- Philip N Klein. Multiple-source shortest paths in planar graphs. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 146–155. Society for Industrial and Applied Mathematics, 2005.
- The next-generation sequencing revolution and its impact on genomics. Cell, 155(1):27–38, 2013.
- Vladimir I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10:707, 1966.
- Optimal event sequence sanitization. In Suresh Venkatasubramanian and Jieping Ye, editors, Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, April 30 - May 2, 2015, pages 775–783. SIAM, 2015. URL: https://doi.org/10.1137/1.9781611974010.87, doi:10.1137/1.9781611974010.87.
- Approximate matching of regular expressions. Bulletin of Mathematical Biology, 51(1):5–37, 1989.
- Stanley R. M. Oliveira and Osmar R. Zaïane. Protecting sensitive knowledge by data sanitization. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), 19-22 December 2003, Melbourne, Florida, USA, pages 613–616. IEEE Computer Society, 2003. URL: https://doi.org/10.1109/ICDM.2003.1250990, doi:10.1109/ICDM.2003.1250990.
- Jeanette P Schmidt. All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM Journal on Computing, 27(4):972–992, 1998.
- Association rule hiding. IEEE Trans. Knowl. Data Eng., 16(4):434–447, 2004. URL: https://doi.org/10.1109/TKDE.2004.1269668, doi:10.1109/TKDE.2004.1269668.
- Hiding sensitive association rules with limited side effects. IEEE Trans. Knowl. Data Eng., 19(1):29–42, 2007. URL: https://doi.org/10.1109/TKDE.2007.250583, doi:10.1109/TKDE.2007.250583.
- Semantic trajectory mining for location prediction. In Isabel F. Cruz, Divyakant Agrawal, Christian S. Jensen, Eyal Ofek, and Egemen Tanin, editors, 19th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, ACM-GIS 2011, November 1-4, 2011, Chicago, IL, USA, Proceedings, pages 34–43. ACM, 2011. URL: https://doi.org/10.1145/2093973.2093980, doi:10.1145/2093973.2093980.