FastqZip: An Improved Reference-Based Genome Sequence Lossy Compression Framework (2404.02163v1)
Abstract: Storing and archiving the data produced by next-generation sequencing (NGS) is a heavy burden for research institutions. Reference-based compression algorithms handle such data effectively. We focus on compressing FASTQ files with an improved reference-based algorithm that achieves a higher compression ratio than other state-of-the-art algorithms. We propose FastqZip, which maps reads to a reference with a new method, supports read reordering and lossy quality scores, and applies the BSC or ZPAQ algorithm as the final lossless compression stage to obtain a higher compression ratio at relatively fast speed. The sequences can always be reconstructed losslessly, while the quality scores can be compressed either losslessly or lossily; reordering the reads further improves the compression ratio. We evaluate FastqZip on five datasets and show that it outperforms the state-of-the-art (SOTA) compressor Genozip by around 10% in compression ratio, with an acceptable slowdown.
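To make the ideas in the abstract concrete, the following is a minimal Python sketch of generic reference-based read encoding, read reordering by mapped position, and lossy quality binning. It is not FastqZip's actual mapping method: the function names, the substitution-only mismatch model, the assumption that the mapping position is already known, and the bin size of 8 are all illustrative assumptions.

```python
from typing import List, Tuple

# Hypothetical record layout: (offset within the read, base observed in the read).
Mismatch = Tuple[int, str]
Encoded = Tuple[int, int, List[Mismatch]]  # (position, read length, substitutions)


def encode_read(read: str, reference: str, pos: int) -> Encoded:
    """Encode a read as its reference position plus substitutions.

    Assumes the mapping position `pos` is already known and that the read
    differs from the reference only by substitutions (no indels).
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    mismatches = [(i, b) for i, (b, r) in enumerate(zip(read, window)) if b != r]
    return pos, len(read), mismatches


def decode_read(reference: str, encoded: Encoded) -> str:
    """Losslessly rebuild the read from the reference and its encoding."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


def reorder(encoded_reads: List[Encoded]) -> List[Encoded]:
    """Sort reads by mapped position so neighbouring records look alike."""
    return sorted(encoded_reads, key=lambda e: e[0])


def bin_quality(qual: str, bin_size: int = 8, offset: int = 33) -> str:
    """Lossy step: round each Phred+33 score down to the start of its bin."""
    return "".join(
        chr((ord(c) - offset) // bin_size * bin_size + offset) for c in qual
    )


reference = "ACGTACGTAC"
read = "GTAAGT"  # matches reference[2:8] except for one base
encoded = encode_read(read, reference, 2)
restored = decode_read(reference, encoded)
```

In this sketch the read shrinks to a position, a length, and a single substitution, and `decode_read` recovers it exactly, which is the sense in which the sequence stays lossless. Sorting by position and binning quality scores both make the resulting streams more repetitive, so a general-purpose back end (BSC or ZPAQ in the paper) has more redundancy to exploit.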