An experimental sorting method for improving metagenomic data encoding (2401.01786v1)
Abstract: Minimizing data storage poses a significant challenge in large-scale metagenomic projects. In this paper, we present a new method for improving the encoding of FASTQ files generated by metagenomic sequencing. The method applies metagenomic classification and then a recursive filter that clusters reads by DNA sequence similarity, with the aim of improving overall reference-free compression. The results show an overall improvement in the compression of several datasets. As hypothesized, the compression gain increases progressively with coverage depth and with the number of identified species. Additionally, we provide an implementation that is freely available at https://github.com/cobilab/mizar and can be customized to work with other FASTQ compression tools.
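The core idea, that placing similar reads next to each other helps a reference-free compressor, can be illustrated with a toy experiment. The sketch below is not the Mizar implementation: the genomes are random strings, the per-read label is a hypothetical stand-in for the output of a metagenomic classifier, zlib stands in for a FASTQ compressor, and the recursive similarity filter is not reproduced. All function names and parameters are assumptions made for illustration.

```python
import random
import zlib


def simulate_reads(n_genomes=20, genome_len=3000, reads_per_genome=300,
                   read_len=100, seed=0):
    """Sample fixed-length reads from random toy genomes, in shuffled order.

    Each read is returned as (label, sequence); the label plays the role of
    a species assignment that a real classifier would have to infer.
    """
    rng = random.Random(seed)
    reads = []
    for label in range(n_genomes):
        genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
        for _ in range(reads_per_genome):
            start = rng.randrange(genome_len - read_len + 1)
            reads.append((label, genome[start:start + read_len]))
    rng.shuffle(reads)  # mimic the arbitrary read order of a sequencing run
    return reads


def compressed_size(reads):
    """Size in bytes of the zlib-compressed, newline-joined sequences."""
    payload = "\n".join(seq for _, seq in reads).encode("ascii")
    return len(zlib.compress(payload, 9))


def group_by_label(reads):
    """Reorder reads so that those sharing a label are contiguous."""
    return sorted(reads, key=lambda r: r[0])


reads = simulate_reads()
shuffled_size = compressed_size(reads)
grouped_size = compressed_size(group_by_label(reads))
print("shuffled:", shuffled_size, "bytes")
print("grouped: ", grouped_size, "bytes")
```

In this toy setting the grouped ordering compresses to fewer bytes, because overlapping reads from the same genome fall within the compressor's 32 KB match window. Real FASTQ data also carries headers and quality scores, which this sketch ignores.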