
DNA Sequence Classification with Compressors (2401.14025v1)

Published 25 Jan 2024 in q-bio.GN and cs.LG

Abstract: Recent studies in DNA sequence classification have leveraged sophisticated machine learning techniques, achieving notable accuracy in categorizing complex genomic data. Among these, methods such as k-mer counting have proven effective in distinguishing sequences from varied species like chimpanzees, dogs, and humans, becoming a staple in contemporary genomic research. However, these approaches often demand extensive computational resources, posing a challenge in terms of scalability and efficiency. Addressing this issue, our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis. This innovative approach utilizes a variety of compression algorithms, such as Gzip, Brotli, and LZMA, to efficiently process and classify genomic sequences. Not only does this method align with the current state-of-the-art in terms of accuracy, but it also offers a more resource-efficient alternative to traditional machine learning methods. Our comprehensive evaluation demonstrates the proposed method's effectiveness in accurately classifying DNA sequences from multiple species. We present a detailed analysis of the performance of each algorithm used, highlighting the strengths and limitations of our approach in various genomic contexts. Furthermore, we discuss the broader implications of our findings for bioinformatics, particularly in genomic data processing and analysis. The results of our study pave the way for more efficient and scalable DNA sequence classification methods, offering significant potential for advancements in genomic research and applications.
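As a rough illustration of the compressor-based idea the abstract refers to, the sketch below computes a normalized compression distance (NCD) between sequences and assigns a label by nearest-neighbour vote. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names (`compressed_len`, `ncd`, `classify`), the toy sequences, and the class labels are all hypothetical, and only the standard-library `gzip` and `lzma` compressors are used (Brotli needs a third-party package and is omitted here).

```python
import gzip
import lzma
from collections import Counter


def compressed_len(data: bytes, compressor=gzip.compress) -> int:
    """Length in bytes of the compressed form of `data`."""
    return len(compressor(data))


def ncd(x: str, y: str, compressor=gzip.compress) -> float:
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length.
    """
    cx = compressed_len(x.encode(), compressor)
    cy = compressed_len(y.encode(), compressor)
    cxy = compressed_len((x + y).encode(), compressor)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, train, k: int = 1, compressor=gzip.compress) -> str:
    """Label `query` by majority vote among its k nearest training sequences.

    `train` is a list of (sequence, label) pairs; distance is NCD.
    """
    ranked = sorted(train, key=lambda item: ncd(query, item[0], compressor))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two made-up classes built from different motifs.
TRAIN = [
    ("ACGTTGCA" * 18 + "ACGT", "class_a"),
    ("GGATCCTTAGC" * 15, "class_b"),
]
QUERY = "ACGTTGCA" * 20

if __name__ == "__main__":
    print(classify(QUERY, TRAIN, k=1, compressor=gzip.compress))
    print(classify(QUERY, TRAIN, k=1, compressor=lzma.compress))
```

No model is trained here: each prediction only needs compressed lengths of the query, each training sequence, and their concatenation, so swapping compressors is a one-argument change. Every query is still compared against every training sequence, so the cost grows with the size of the training set.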

References (18)
  1. Bruce Alberts. Molecular biology of the cell. Garland Publishing, New York, NY, 6 edition, November 2014.
  2. A locally adaptive data compression scheme. Communications of the ACM, 29(4):320–330, April 1986. ISSN 1557-7317. doi: 10.1145/5684.5688. URL http://dx.doi.org/10.1145/5684.5688.
  3. Michael Burrows. A block-sorting lossless data compression algorithm. SRS Research Report, 124, 1994.
  4. David Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, September 1952. ISSN 0096-8390. doi: 10.1109/jrproc.1952.273898. URL http://dx.doi.org/10.1109/JRPROC.1952.273898.
  5. Cell separation algorithm with enhanced search behaviour in mirna feature selection for cancer diagnosis. Information Systems, 104:101906, February 2022. ISSN 0306-4379. doi: 10.1016/j.is.2021.101906. URL http://dx.doi.org/10.1016/j.is.2021.101906.
  6. “low-resource” text classification: A parameter-free classification method with compressors. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 6810–6828, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.426. URL https://aclanthology.org/2023.findings-acl.426.
  7. An approach to dna sequence classification through machine learning: Dna sequencing, k mer counting, thresholding, sequence analysis. International Journal of Reliable and Quality E-Healthcare, 11(2):1–15, August 2022. ISSN 2160-956X. doi: 10.4018/ijrqeh.299963. URL http://dx.doi.org/10.4018/IJRQEH.299963.
  8. Spark-based parallel deep neural network model for classification of large scale rnas into pirnas and non-pirnas. IEEE Access, 8:136978–136991, 2020. ISSN 2169-3536. doi: 10.1109/access.2020.3011508. URL http://dx.doi.org/10.1109/ACCESS.2020.3011508.
  9. The similarity metric. IEEE Trans. Inf. Theory, 50(12):3250–3264, 2004. doi: 10.1109/TIT.2004.838101. URL https://doi.org/10.1109/TIT.2004.838101.
  10. Delucs: Deep learning for unsupervised clustering of dna sequences. PLOS ONE, 17(1):e0261531, January 2022. ISSN 1932-6203. doi: 10.1371/journal.pone.0261531. URL http://dx.doi.org/10.1371/journal.pone.0261531.
  11. K-mer-based machine learning method to classify ltr-retrotransposons in plant genomes. PeerJ, 9:e11456, May 2021. ISSN 2167-8359. doi: 10.7717/peerj.11456. URL http://dx.doi.org/10.7717/peerj.11456.
  12. Sukru Ozan. Dna sequence classification. https://github.com/sukruozan/DNA-Sequence-Classification, 2023.
  13. Determination of k-mer density in a dna sequence and subsequent cluster formation algorithm based on the application of electronic filter. Scientific Reports, 11(1), July 2021. ISSN 2045-2322. doi: 10.1038/s41598-021-93154-3. URL http://dx.doi.org/10.1038/s41598-021-93154-3.
  14. Nagesh Singh. Demystify dna sequencing with machine learning. https://www.kaggle.com/code/nageshsingh/demystify-dna-sequencing-with-machine-learning/notebook, 2023a.
  15. Nagesh Singh. Dna sequence dataset. https://www.kaggle.com/datasets/nageshsingh/dna-sequence-dataset, 2023b.
  16. A classification model for lncrna and mrna based on k-mers and a convolutional neural network. BMC Bioinformatics, 20(1), September 2019. ISSN 1471-2105. doi: 10.1186/s12859-019-3039-3. URL http://dx.doi.org/10.1186/s12859-019-3039-3.
  17. Explainable artificial intelligence model for identifying covid-19 gene biomarkers. Computers in Biology and Medicine, 154:106619, 2023. ISSN 0010-4825. doi: 10.1016/j.compbiomed.2023.106619. URL https://www.sciencedirect.com/science/article/pii/S0010482523000847.
  18. J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, 1977. doi: 10.1109/TIT.1977.1055714.
