
Taxonomic classification with maximal exact matches in KATKA kernels and minimizer digests (2402.06935v2)

Published 10 Feb 2024 in cs.DS, q-bio.GN, and q-bio.PE

Abstract: For taxonomic classification, we are asked to index the genomes in a phylogenetic tree such that later, given a DNA read, we can quickly choose a small subtree likely to contain the genome from which that read was drawn. Although popular classifiers such as Kraken use $k$-mers, recent research indicates that using maximal exact matches (MEMs) can lead to better classifications. For example, we can build an augmented FM-index over the genomes in the tree concatenated in left-to-right order; for each MEM in a read, find the interval in the suffix array containing the starting positions of that MEM's occurrences in those genomes; find the minimum and maximum values stored in that interval; take the lowest common ancestor (LCA) of the genomes containing the characters at those positions. This solution is practical, however, only when the total size of the genomes in the tree is fairly small. In this paper we consider applying the same solution to three lossily compressed representations of the genomes' concatenation: a KATKA kernel, which discards characters that are not in the first or last occurrence of any $k_{\max}$-tuple, for a parameter $k_{\max}$; a minimizer digest; a KATKA kernel of a minimizer digest. With a test dataset and these three representations of it, simulated reads and various parameter settings, we checked how many reads' longest MEMs occurred only in the sequences from which those reads were generated ("true positive" reads). For some parameter settings we achieved significant compression while only slightly decreasing the true-positive rate.
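
The uncompressed procedure described in the abstract (suffix-array interval, minimum and maximum positions, then LCA) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a plain sorted suffix array instead of an augmented FM-index, a linear scan instead of range-minimum/maximum queries, and a toy four-genome tree. The names `build_index`, `sa_interval`, `lca`, and `classify_match`, the `$` separator, and the example genomes are all hypothetical. Finding the MEMs of a read is assumed to happen elsewhere; the sketch only classifies one given match.

```python
from bisect import bisect_right

def build_index(genomes):
    """Concatenate genomes (given in left-to-right leaf order) and build a naive suffix array."""
    text = "".join(g + "$" for g in genomes)  # "$" separator is an assumption of this sketch
    starts, offset = [], 0
    for g in genomes:
        starts.append(offset)       # starts[i] = offset of genome i in the concatenation
        offset += len(g) + 1
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # quadratic; fine for a toy example
    return text, sa, starts

def sa_interval(text, sa, pattern):
    """Return the half-open suffix-array interval [lo, hi) of suffixes starting with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo

def lca(parent, a, b):
    """Lowest common ancestor of nodes a and b in a tree given as child -> parent pointers."""
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    while b not in ancestors:
        b = parent[b]
    return b

def classify_match(match, text, sa, starts, parent):
    """Map one exact match to a tree node, or None if it does not occur in any genome."""
    lo, hi = sa_interval(text, sa, match)
    if lo == hi:
        return None
    positions = sa[lo:hi]           # linear scan; an augmented index would answer min/max directly
    leftmost = bisect_right(starts, min(positions)) - 1
    rightmost = bisect_right(starts, max(positions)) - 1
    # Genomes are stored in leaf order, so the LCA of the two extremes covers every occurrence.
    return lca(parent, leftmost, rightmost)

# Toy tree: leaves 0 and 1 under "A", leaves 2 and 3 under "B", both under "root".
genomes = ["ACGTACGA", "ACGTTTGA", "GGCATCAG", "GGCATTAG"]
parent = {"root": None, "A": "root", "B": "root", 0: "A", 1: "A", 2: "B", 3: "B"}
text, sa, starts = build_index(genomes)

for match in ["TACG", "ACGT", "GCAT", "TT"]:
    print(match, "->", classify_match(match, text, sa, starts, parent))
```

In this toy example a match found in only one genome maps to that leaf, a match shared by sibling genomes maps to their parent, and a match spread across both subtrees maps to the root, which is why longer, more specific matches tend to give smaller subtrees.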

