FastqZip: An Improved Reference-Based Genome Sequence Lossy Compression Framework (2404.02163v1)

Published 22 Feb 2024 in cs.IT and math.IT

Abstract: Storing and archiving the data produced by next-generation sequencing (NGS) places a heavy burden on research institutions. Reference-based compression algorithms handle such data effectively. Our work focuses on compressing FASTQ files with an improved reference-based algorithm that achieves a higher compression ratio than other state-of-the-art algorithms. We propose FastqZip, which uses a new method for mapping sequences to a reference, supports read reordering and lossy quality scores, and applies the BSC or ZPAQ algorithm for final lossless compression, yielding a higher compression ratio at relatively fast speed. Our method guarantees that sequences can be reconstructed losslessly while allowing either lossless or lossy compression of the quality scores. Reads are reordered to further improve the compression ratio. We evaluate our algorithms on five datasets and show that FastqZip outperforms the state-of-the-art (SOTA) algorithm Genozip by around 10% in compression ratio, with an acceptable slowdown.
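To make the two ideas in the abstract concrete, here is a minimal Python sketch of reference-based read encoding (lossless for the bases) and quality-score binning (lossy). It is an illustration only and not FastqZip's actual mapping or quantization method, which the abstract does not specify. The function names, the substitution-only mismatch model, the assumption that the alignment position is already known, and the bin size of 8 are all hypothetical choices made for this example.

```python
def encode_read(read, reference, pos):
    """Encode a read as (pos, length, mismatches) relative to the reference.

    Assumes the read aligns at `pos` with substitutions only (no indels);
    this is a simplification for illustration.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    # Store only the positions where the read differs from the reference.
    mismatches = [(i, b) for i, (a, b) in enumerate(zip(window, read)) if a != b]
    return pos, len(read), mismatches


def decode_read(record, reference):
    """Rebuild the original read exactly from its encoded record."""
    pos, length, mismatches = record
    bases = list(reference[pos:pos + length])
    for i, base in mismatches:
        bases[i] = base
    return "".join(bases)


def quantize_quality(qual, bin_size=8, offset=33):
    """Lossy step: map each Phred+33 score to the lower edge of its bin."""
    return "".join(
        chr(((ord(c) - offset) // bin_size) * bin_size + offset) for c in qual
    )


reference = "ACGTACGTTAGC"
read = "ACGATAG"  # differs from reference[4:11] at one base
record = encode_read(read, reference, pos=4)
restored = decode_read(record, reference)
binned = quantize_quality("IIII#5")
```

In this sketch the bases round-trip exactly while the quality string loses resolution, which mirrors the lossless-sequence and optionally lossy-quality split described in the abstract. A real pipeline would then pass the encoded streams to a general-purpose compressor such as BSC or ZPAQ.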
