Minimizing the Minimizers via Alphabet Reordering (2405.04052v1)
Abstract: Minimizers sampling is one of the most widely used mechanisms for sampling strings [Roberts et al., Bioinformatics 2004]. Let $S=S[1]\ldots S[n]$ be a string over a totally ordered alphabet $\Sigma$. Further let $w\geq 2$ and $k\geq 1$ be two integers. The minimizer of $S[i\mathinner{.\,.} i+w+k-2]$ is the smallest position in $[i,i+w-1]$ where the lexicographically smallest length-$k$ substring of $S[i\mathinner{.\,.} i+w+k-2]$ starts. The set of minimizers over all $i\in[1,n-w-k+2]$ is the set $\mathcal{M}_{w,k}(S)$ of the minimizers of $S$. We consider the following basic problem: Given $S$, $w$, and $k$, can we efficiently compute a total order on $\Sigma$ that minimizes $|\mathcal{M}_{w,k}(S)|$? We show that this is unlikely by proving that the problem is NP-hard for any $w\geq 2$ and $k\geq 1$. Our result provides theoretical justification as to why there exist no exact algorithms for minimizing the minimizers samples, while there exists a plethora of heuristics for the same purpose.
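The definition above can be made concrete with a minimal Python sketch. It is an illustration only, not code from the paper: the function name `minimizers` and the convention of passing the alphabet order as a string (smallest letter first) are assumptions made here. It computes the set of minimizer positions directly from the definition, with ties broken toward the leftmost position, and reports positions 1-based to match the abstract.

```python
def minimizers(s, w, k, order):
    """Return the set of 1-based minimizer positions of s.

    `order` lists the alphabet from smallest to largest letter
    (a hypothetical convention chosen for this sketch).
    """
    rank = {c: r for r, c in enumerate(order)}
    # Encode every length-k substring as a tuple of ranks, so that tuple
    # comparison is lexicographic comparison under the chosen order.
    kmers = [tuple(rank[c] for c in s[j:j + k]) for j in range(len(s) - k + 1)]
    positions = set()
    # One window per start i; each window covers w consecutive k-mers.
    for i in range(len(s) - w - k + 2):
        # Smallest k-mer in the window; ties go to the leftmost position.
        best = min(range(i, i + w), key=lambda j: (kmers[j], j))
        positions.add(best + 1)  # convert to 1-based
    return positions


if __name__ == "__main__":
    # Two orders on the same string give minimizer sets of different sizes.
    print(sorted(minimizers("banana", 2, 2, "abn")))  # [2, 4]
    print(sorted(minimizers("banana", 2, 2, "nba")))  # [1, 3, 5]
```

For `"banana"` with $w=2$ and $k=2$, the order a < b < n yields two minimizers while n < b < a yields three, which is the quantity the reordering problem asks to minimize. This direct evaluation takes $O(nwk)$ time for one fixed order; it says nothing about how to find the best order.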
- Text indexing for long patterns: Anchors are all you need. Proc. VLDB Endow., 16(9):2117–2131, 2023. URL: https://www.vldb.org/pvldb/vol16/p2117-loukides.pdf, doi:10.14778/3598581.3598586.
- On the complexity of BWT-runs minimization via alphabet reordering. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms, ESA 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 173 of LIPIcs, pages 15:1–15:13. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. URL: https://doi.org/10.4230/LIPIcs.ESA.2020.15, doi:10.4230/LIPICS.ESA.2020.15.
- On the representation of de Bruijn graphs. J. Comput. Biol., 22(5):336–352, 2015. URL: https://doi.org/10.1089/cmb.2014.0160, doi:10.1089/CMB.2014.0160.
- KMC 2: fast and resource-frugal k-mer counting. Bioinform., 31(10):1569–1576, 2015. URL: https://doi.org/10.1093/bioinformatics/btv022, doi:10.1093/BIOINFORMATICS/BTV022.
- Finding an optimal alphabet ordering for Lyndon factorization is hard. In Markus Bläser and Benjamin Monmege, editors, 38th International Symposium on Theoretical Aspects of Computer Science, STACS 2021, March 16-19, 2021, Saarbrücken, Germany (Virtual Conference), volume 187 of LIPIcs, pages 35:1–35:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.STACS.2021.35, doi:10.4230/LIPICS.STACS.2021.35.
- Sampled suffix array with minimizers. Softw. Pract. Exp., 47(11):1755–1771, 2017. URL: https://doi.org/10.1002/spe.2481, doi:10.1002/SPE.2481.
- Differentiable learning of sequence-specific minimizer schemes with DeepMinimizer. J. Comput. Biol., 29(12):1288–1304, 2022. URL: https://doi.org/10.1089/cmb.2022.0275, doi:10.1089/CMB.2022.0275.
- Weighted minimizer sampling improves long read mapping. Bioinform., 36(Supplement-1):i111–i118, 2020. URL: https://doi.org/10.1093/bioinformatics/btaa435, doi:10.1093/BIOINFORMATICS/BTAA435.
- Richard M. Karp. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Proceedings of a symposium on the Complexity of Computer Computations, held March 20-22, 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA, The IBM Research Symposia Series, pages 85–103. Plenum Press, New York, 1972. URL: https://doi.org/10.1007/978-1-4684-2001-2_9, doi:10.1007/978-1-4684-2001-2_9.
- Efficient randomized pattern-matching algorithms. IBM J. Res. Dev., 31(2):249–260, 1987. URL: https://doi.org/10.1147/rd.312.0249, doi:10.1147/RD.312.0249.
- Heng Li. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinform., 32(14):2103–2110, 2016. URL: https://doi.org/10.1093/bioinformatics/btw152, doi:10.1093/BIOINFORMATICS/BTW152.
- Heng Li. Minimap2: pairwise alignment for nucleotide sequences. Bioinform., 34(18):3094–3100, 2018. URL: https://doi.org/10.1093/bioinformatics/bty191, doi:10.1093/BIOINFORMATICS/BTY191.
- Bidirectional string anchors: A new string sampling mechanism. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 64:1–64:21. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021. URL: https://doi.org/10.4230/LIPIcs.ESA.2021.64, doi:10.4230/LIPICS.ESA.2021.64.
- Bidirectional string anchors for improved text indexing and top-$k$ similarity search. IEEE Trans. Knowl. Data Eng., 35(11):11093–11111, 2023. URL: https://doi.org/10.1109/TKDE.2022.3231780, doi:10.1109/TKDE.2022.3231780.
- Compact universal k-mer hitting sets. In Martin C. Frith and Christian Nørgaard Storm Pedersen, editors, Algorithms in Bioinformatics - 16th International Workshop, WABI 2016, Aarhus, Denmark, August 22-24, 2016. Proceedings, volume 9838 of Lecture Notes in Computer Science, pages 257–268. Springer, 2016. URL: https://doi.org/10.1007/978-3-319-43681-4_21, doi:10.1007/978-3-319-43681-4_21.
- Feedback arc set problem and NP-hardness of minimum recurrent configuration problem of chip-firing game on directed graphs. Ann. Comb., 19:373–396, 2015. URL: https://link.springer.com/article/10.1007/s00026-015-0266-9, doi:10.1007/s00026-015-0266-9.
- Reducing storage requirements for biological sequence comparison. Bioinform., 20(18):3363–3369, 2004. URL: https://doi.org/10.1093/bioinformatics/bth408, doi:10.1093/bioinformatics/bth408.
- Winnowing: Local algorithms for document fingerprinting. In Alon Y. Halevy, Zachary G. Ives, and AnHai Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, pages 76–85. ACM, 2003. URL: https://doi.org/10.1145/872757.872770, doi:10.1145/872757.872770.
- Space-efficient representation of genomic k-mer count tables. Algorithms Mol. Biol., 17(1):5, 2022. URL: https://doi.org/10.1186/s13015-022-00212-0, doi:10.1186/S13015-022-00212-0.
- Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3):R46, 2014.
- Daniel H. Younger. Minimum feedback arc sets for a directed graph. IEEE Transactions on Circuit Theory, 10(2):238–245, 1963. doi:10.1109/TCT.1963.1082116.
- Improved design and analysis of practical minimizers. Bioinform., 36(Supplement-1):i119–i127, 2020. URL: https://doi.org/10.1093/bioinformatics/btaa472, doi:10.1093/BIOINFORMATICS/BTAA472.
- Sequence-specific minimizers via polar sets. Bioinform., 37(Supplement):187–195, 2021. URL: https://doi.org/10.1093/bioinformatics/btab313, doi:10.1093/BIOINFORMATICS/BTAB313.
- Creating and using minimizer sketches in computational genomics. J. Comput. Biol., 30(12):1251–1276, 2023. URL: https://doi.org/10.1089/cmb.2023.0094, doi:10.1089/CMB.2023.0094.