Clarify the comparative utility of self-supervised versus supervised genomic models for non-coding variant effect prediction

Determine, through standardized independent benchmarking, the relative utility of current self-supervised genomic language models (for example, DNABERT, LOGO, Nucleotide Transformer, Caduceus) and supervised sequence-to-activity models (for example, Enformer, Basenji2, Borzoi) for non-coding variant effect prediction, in both zero-shot and fine-tuned settings, and identify the conditions under which each approach performs best for downstream variant interpretation tasks.
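For concreteness, the zero-shot setting typically scores a variant by the log-likelihood ratio a pre-trained masked language model assigns to the alternate versus reference allele at the masked variant position. The sketch below assumes a Hugging Face-style masked language model with single-nucleotide tokenization and no special tokens; `model` and `tokenizer` are placeholders rather than any specific published checkpoint, and k-mer tokenizers such as DNABERT's would require mapping the variant onto its overlapping k-mer tokens.

```python
import torch

def zero_shot_variant_score(model, tokenizer, ref_seq: str,
                            alt_allele: str, pos: int) -> float:
    """Log-likelihood ratio score for a single-nucleotide variant.

    Masks the variant position in the reference context and returns
    log P(alt) - log P(ref) under the masked language model; positive
    values mean the model prefers the alternate allele in this context.
    Assumes single-nucleotide tokenization without special tokens, so
    token positions align with sequence positions.
    """
    enc = tokenizer(ref_seq, return_tensors="pt", add_special_tokens=False)
    input_ids = enc["input_ids"].clone()
    input_ids[0, pos] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(input_ids=input_ids).logits  # (1, L, vocab_size)
    log_probs = torch.log_softmax(logits[0, pos], dim=-1)

    ref_id = tokenizer.convert_tokens_to_ids(ref_seq[pos])
    alt_id = tokenizer.convert_tokens_to_ids(alt_allele)
    return (log_probs[alt_id] - log_probs[ref_id]).item()
```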

Background

The review contrasts two major paradigms for sequence-based modeling in genomics: supervised sequence-to-activity models trained on functional genomics data and self-supervised genomic language models pre-trained on DNA sequences. While supervised models have been extensively evaluated on variant effect prediction tasks, evaluations of self-supervised models have been more limited and heterogeneous, which makes direct comparisons challenging.
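By contrast with masked-allele scoring, supervised sequence-to-activity models typically score a variant as the difference between predicted functional-genomics signals for the alternate and reference sequences. Below is a minimal sketch of this pattern, assuming a hypothetical `model` that maps a one-hot (1, L, 4) tensor to (1, bins, tracks) predictions in the style of Enformer or Basenji2; the names and shapes are illustrative assumptions, not any specific model's API.

```python
import numpy as np
import torch

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> torch.Tensor:
    """One-hot encode a DNA sequence to shape (L, 4); Ns stay all-zero."""
    x = torch.zeros(len(seq), 4)
    for i, base in enumerate(seq.upper()):
        if base in BASE_INDEX:
            x[i, BASE_INDEX[base]] = 1.0
    return x

def supervised_variant_score(model, ref_seq: str, alt_seq: str) -> np.ndarray:
    """Per-track variant effect: predicted activity for alt minus ref.

    Assumes `model` maps a (1, L, 4) one-hot tensor to (1, bins, tracks)
    functional-genomics predictions, as in Enformer-style architectures.
    """
    with torch.no_grad():
        pred_ref = model(one_hot(ref_seq).unsqueeze(0))
        pred_alt = model(one_hot(alt_seq).unsqueeze(0))
    # Summarize each track by summing predictions over the output bins,
    # then take the alt - ref difference as the per-track effect size.
    return (pred_alt - pred_ref).sum(dim=1).squeeze(0).numpy()
```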

Independent benchmarking efforts have reported mixed results regarding the relative performance of self-supervised models versus supervised models on downstream genomics tasks, especially for non-coding variant effect prediction. This ambiguity motivates a clear, standardized assessment to understand when self-supervised pretraining confers advantages over supervised training for these tasks.
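A standardized assessment would score both model families on the same labeled variant sets and report shared metrics. The harness below is a sketch of one such comparison, assuming binary labels (functional non-coding variant versus matched control) and a dictionary of hypothetical per-model score arrays; the metric choices are illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def benchmark_variant_scores(labels: np.ndarray,
                             model_scores: dict[str, np.ndarray]) -> dict:
    """Evaluate several models on one labeled non-coding variant set.

    labels: binary array, 1 = functional variant, 0 = matched control.
    model_scores: model name -> per-variant effect scores; absolute
    values are used so the direction of effect is ignored.
    """
    results = {}
    for name, scores in model_scores.items():
        s = np.abs(scores)
        results[name] = {
            "auroc": roc_auc_score(labels, s),
            "auprc": average_precision_score(labels, s),
        }
    return results
```

Running the same harness across zero-shot, fine-tuned, and supervised scores on identical variant sets is what makes the comparison in the question well posed.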

References

The utility of current self-supervised models relative to supervised models remains unclear in independent benchmarking studies, indicating that additional work is needed to make optimal use of self-supervised tasks in genomic applications.

Leveraging genomic deep learning models for non-coding variant effect prediction (2411.11158 - Kathail et al., 17 Nov 2024) in Section “Conclusions and future perspectives”