Space-Efficient Indexes for Uncertain Strings (2403.14256v1)

Published 21 Mar 2024 in cs.DS and cs.DB

Abstract: Strings in the real world are often encoded with some level of uncertainty. In the character-level uncertainty model, an uncertain string $X$ of length $n$ on an alphabet $\Sigma$ is a sequence of $n$ probability distributions over $\Sigma$. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ is at least $\frac{1}{z}$. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has $\mathcal{O}(nz)$ size, requires $\mathcal{O}(nz)$ time and $\mathcal{O}(nz)$ space to be constructed, and answers pattern matching queries in the optimal $\mathcal{O}(m+|\text{Occ}|)$ time, where $m$ is the length of $P$ and $|\text{Occ}|$ is the total number of occurrences of $P$ in $X$. For large $n$ and (moderate) $z$ values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We propose an index of $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected size, which can be constructed using $\mathcal{O}(\frac{nz}{\ell}\log z)$ expected space, and supports very fast pattern matching queries in expectation, for patterns of length $m\geq \ell$. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.
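The occurrence definition in the abstract can be made concrete with a brute-force check. The sketch below is not the index proposed in the paper. It only evaluates the definition directly in $\mathcal{O}(nm)$ time, which is the baseline an index is meant to beat. The function name `occurrences`, the list-of-dicts representation of $X$, the 0-based positions, and the restriction to integer $z$ are all assumptions made for illustration.

```python
from fractions import Fraction


def occurrences(X, P, z):
    """Return the 0-based positions i at which P occurs in X.

    X is a list of dicts mapping each letter to its probability at that
    position (hypothetical representation); letters missing from a dict
    have probability 0. P occurs at i if the product of the probabilities
    of P's letters at positions i .. i+len(P)-1 is at least 1/z.
    """
    threshold = Fraction(1, z)  # assumes z is a positive integer
    hits = []
    for i in range(len(X) - len(P) + 1):
        prob = Fraction(1)
        for j, letter in enumerate(P):
            prob *= X[i + j].get(letter, Fraction(0))
            if prob < threshold:
                # The product can only shrink further, so stop early.
                break
        else:
            # Loop finished without dropping below the threshold.
            hits.append(i)
    return hits


# Toy uncertain string of length 4 over the alphabet {A, B}.
X = [
    {"A": Fraction(1)},
    {"A": Fraction(1, 2), "B": Fraction(1, 2)},
    {"A": Fraction(3, 4), "B": Fraction(1, 4)},
    {"B": Fraction(1)},
]

print(occurrences(X, "AB", 4))  # [0, 2]
print(occurrences(X, "AB", 8))  # [0, 1, 2]
```

Exact rationals are used so that a product equal to $\frac{1}{z}$ is counted as an occurrence, as the "at least" in the definition requires. With floating-point probabilities the comparison at the threshold could go either way.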

