An experimental sorting method for improving metagenomic data encoding (2401.01786v1)

Published 3 Jan 2024 in cs.IT, math.IT, and q-bio.GN

Abstract: Minimizing data storage poses a significant challenge in large-scale metagenomic projects. In this paper, we present a new method for improving the encoding of FASTQ files generated by metagenomic sequencing. This method incorporates metagenomic classification followed by a recursive filter for clustering reads by DNA sequence similarity to improve the overall reference-free compression. In the results, we show an overall improvement in the compression of several datasets. As hypothesized, we show a progressive compression gain for higher coverage depth and number of identified species. Additionally, we provide an implementation that is freely available at https://github.com/cobilab/mizar and can be customized to work with other FASTQ compression tools.
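
The idea that grouping similar reads can help a downstream compressor can be illustrated with a small sketch. This is not the paper's method and not the mizar implementation: it is a toy under stated assumptions. The reads are simulated from random "genomes", the similarity key is simply the lexicographically smallest k-mer of each read (a hypothetical stand-in for the classification and recursive filtering steps), and zlib stands in for a FASTQ compression tool. All function names and parameters below are illustrative.

```python
import random
import zlib


def minimizer(seq: str, k: int = 8) -> str:
    """Return the lexicographically smallest k-mer of seq.

    Used here as a crude similarity key (an assumption of this sketch):
    overlapping reads often share the same smallest k-mer.
    """
    if len(seq) < k:
        return seq
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))


def sort_reads(reads, k: int = 8):
    """Reorder reads so that reads sharing a minimizer become adjacent."""
    return sorted(reads, key=lambda r: (minimizer(r, k), r))


def compressed_size(reads) -> int:
    """Size in bytes of the newline-joined reads after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))


def simulate_reads(n_genomes=5, genome_len=20000, n_reads=5000,
                   read_len=80, seed=0):
    """Draw error-free reads uniformly from a few random genomes (toy data)."""
    rng = random.Random(seed)
    genomes = [
        "".join(rng.choice("ACGT") for _ in range(genome_len))
        for _ in range(n_genomes)
    ]
    reads = []
    for _ in range(n_reads):
        genome = rng.choice(genomes)
        start = rng.randrange(genome_len - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads


if __name__ == "__main__":
    reads = simulate_reads()
    print("original order:", compressed_size(reads), "bytes")
    print("sorted order:  ", compressed_size(sort_reads(reads)), "bytes")
```

Running the script prints the compressed size for both orderings, so the effect of reordering can be checked directly on the simulated data. A real FASTQ file also carries headers and quality scores, which this sketch ignores; reordering reads would require reordering those fields consistently.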
