Cause of duplicate OMArk HOGs in Fistulifera solaris

Determine the biological or technical factors responsible for the unexpectedly high number of duplicate conserved Hierarchical Orthologous Groups (HOGs) that OMArk reports for the Fistulifera solaris protein-coding gene annotation. Inspection of the genome assembly statistics did not reveal an explanation.

Background

To assess annotation quality, the study applied OMArk, which evaluates conserved Hierarchical Orthologous Groups (HOGs) from the OMA database and properly handles alternative transcript isoforms. HOG completeness was high across most diatom assemblies, but an unusually large number of duplicate HOGs was observed for Thalassiosira profunda and Fistulifera solaris.

For Thalassiosira profunda, the duplicates were consistent with BUSCO genome-level scores, suggesting concordant signals across metrics. In contrast, the origin of the duplicate HOGs in Fistulifera solaris could not be explained by the available genome assembly statistics, motivating a focused investigation into whether the duplicates arise from biological features (e.g., genome architecture, recent duplications) or technical artifacts (e.g., assembly or annotation issues).
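One crude first check for the question above is to ask whether the copies of each duplicated HOG lie on the same contig or on different contigs. The sketch below is illustrative only. It assumes two hypothetical lookup tables, `gene_to_hog` and `gene_to_contig`, that would have to be built from the annotation and the HOG assignments. It does not parse OMArk's actual output files or reproduce OMArk's own duplicate classification, and all gene, HOG, and contig names are invented.

```python
from collections import defaultdict


def summarize_hog_copies(gene_to_hog, gene_to_contig):
    """Return, for every HOG with two or more gene copies, the copy
    count and whether all copies lie on a single contig."""
    hog_members = defaultdict(list)
    for gene, hog in gene_to_hog.items():
        hog_members[hog].append(gene)

    summary = {}
    for hog, genes in hog_members.items():
        if len(genes) < 2:
            continue  # single-copy HOG, not a duplicate
        contigs = {gene_to_contig[g] for g in genes}
        summary[hog] = {
            "copies": len(genes),
            "same_contig": len(contigs) == 1,
        }
    return summary


def duplicate_fraction(gene_to_hog, gene_to_contig):
    """Fraction of observed HOGs that have more than one gene copy."""
    total = len(set(gene_to_hog.values()))
    duplicated = len(summarize_hog_copies(gene_to_hog, gene_to_contig))
    return duplicated / total if total else 0.0


# Hypothetical toy input (all identifiers are made up for illustration).
gene_to_hog = {
    "g1": "HOG_A", "g2": "HOG_A",   # two copies, same contig
    "g3": "HOG_B", "g4": "HOG_B",   # two copies, different contigs
    "g5": "HOG_C",                  # single copy
}
gene_to_contig = {"g1": "c1", "g2": "c1", "g3": "c1", "g4": "c2", "g5": "c2"}

summary = summarize_hog_copies(gene_to_hog, gene_to_contig)
fraction = duplicate_fraction(gene_to_hog, gene_to_contig)
```

A summary like this cannot by itself separate biological from technical causes. At most it indicates whether duplicated HOGs tend to cluster locally or are spread across the assembly, which may help decide what to inspect next.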

References

In contrast, the source of duplicates in F. solaris remains unclear. We explored the genome assembly statistics (Fig. 8) but found no obvious explanation.

Annotation of protein-coding genes in 49 diatom genomes from the Bacillariophyta clade (2410.05467 - Nenasheva et al., 2024), in Technical Validation, OMArk results (discussion around Fig. 6–8)