Determine the source of unexpected sequences in COLO829 PacBio HiFi reads

Determine the biological origin of approximately 95 kilobases of sequence, spread across 43 PacBio HiFi reads from the COLO829 tumor sample, that lack supermaximal exact matches (SMEMs) of at least 51 base pairs when mapped to a human pangenome index of 100 haplotype-resolved human assemblies, including T2T-CHM13. These sequences could not be assembled, and NCBI BLAST reported only multiple weak matches to Bos taurus genomes.

Background

As an application of the pangenome index, the authors mapped PacBio HiFi reads from the COLO829 tumor sample to a human pangenome index (human100) and extracted read regions of at least 1 kb that were not covered by supermaximal exact matches (SMEMs) of 51 bp or longer.

This procedure yielded 95 kb of sequence across 43 reads that could not be assembled; subsequent NCBI BLAST searches suggested only multiple weak hits to cow genomes, leaving the origin of these sequences unresolved.

References

NCBI BLAST suggested multiple weak hits to cow genomes. We could not identify the source of these sequences but there were few of them anyway.

BWT construction and search at the terabase scale (2409.00613 - Li, 1 Sep 2024) in Results, Identifying novel sequences