Impact of GenomeScope assumptions on estimate accuracy

Determine the extent to which deviations from the assumptions of the GenomeScope and GenomeScope 2.0 k-mer spectrum models—including non-uniform distribution of heterozygosity across the genome, the presence of variant types beyond single-nucleotide polymorphisms, and genomic regions with more than two copies—impact the accuracy of parameter estimates produced by these models, and ascertain how this impact varies among species.

Background

GenomeScope and GenomeScope 2.0 infer genome properties (notably heterozygosity and repeat content) by fitting probabilistic models to k-mer spectra. These models assume, among other things, that heterozygosity is uniformly distributed across the genome, that variants are primarily SNPs, and that repetitive regions predominantly occur in two copies.

The authors note that real genomes can violate these assumptions: heterozygosity may be non-uniform, many variants are not SNPs, and repeats may have more than two copies. Any of these violations can bias the estimates. The magnitude of this bias and its dependence on species remain unresolved, motivating a focused investigation into the robustness of GenomeScope-derived estimates under realistic genomic conditions.
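To make the uniform-heterozygosity assumption concrete, the following is a minimal sketch of a toy calculation, not the GenomeScope fitting procedure itself. It assumes independent SNPs at a per-base rate r, so that a k-mer overlaps no variant site with probability (1 - r)^k, and it ignores k-mers that straddle region boundaries. The function names, the value k = 21, the rate 0.01, and the 25% clustering scenario are all illustrative assumptions.

```python
def het_kmer_fraction(r: float, k: int) -> float:
    """Fraction of k-mers overlapping at least one variant site,
    assuming independent SNPs at a uniform per-base rate r."""
    return 1.0 - (1.0 - r) ** k


def invert_uniform(f: float, k: int) -> float:
    """Recover a per-base rate from a heterozygous k-mer fraction f,
    assuming the same uniform-SNP model generated it."""
    return 1.0 - (1.0 - f) ** (1.0 / k)


def clustered_fraction(r_mean: float, k: int, share: float) -> float:
    """Toy non-uniform case (hypothetical): all variants sit in a `share`
    of the genome at local rate r_mean / share; the rest is homozygous.
    Boundary effects between regions are ignored."""
    local_rate = r_mean / share
    return share * het_kmer_fraction(local_rate, k)


k = 21          # illustrative k-mer length
r_true = 0.01   # illustrative genome-wide average heterozygosity

f_uniform = het_kmer_fraction(r_true, k)
f_clustered = clustered_fraction(r_true, k, share=0.25)

# Inverting with the uniform model recovers r_true only when the
# uniform assumption actually holds.
r_est_uniform = invert_uniform(f_uniform, k)
r_est_clustered = invert_uniform(f_clustered, k)

print(f"uniform:   het k-mer fraction={f_uniform:.4f}, estimated r={r_est_uniform:.4f}")
print(f"clustered: het k-mer fraction={f_clustered:.4f}, estimated r={r_est_clustered:.4f}")
```

In this toy setting, clustering the same average number of variants into a quarter of the genome yields fewer heterozygous k-mers, so inverting with the uniform model underestimates the true rate. How large such effects are for real genomes and for the full GenomeScope model is exactly the open question stated above.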

References

The GenomeScope model is still somewhat "unrealistic" for several reasons: different regions within genome have different probability of being heterozygous (i.e. heterozygosity is not uniformly distributed in a genome); many variants are not just SNPs; and/or a large proportion of the genome might be covered by repetitions with more than two copies. How much of a problem this presents in the estimates is still an open question, and the answer is most likely dependent on the studied species.

Guide to k-mer approaches for genomics across the tree of life (2404.01519 - Jenike et al., 1 Apr 2024) in Supplementary Text 2