
Effectiveness of DNALM pre-training in eukaryotes

Determine whether the pre-training strategies used in DNA language models (DNALMs) for eukaryotic genomes effectively capture key biological properties and consistently outperform traditional approaches.


Background

Transformer-based genomic foundation models treat DNA as a language and are pre-trained using self-supervised objectives such as masked language modeling. While these models have achieved strong performance on several tasks, most have focused on the reference genome and short sequence contexts, raising uncertainty about how well their pre-training captures core biological properties in complex eukaryotic genomes.
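To make the pre-training objective concrete, below is a minimal sketch of BERT-style masked language modeling applied to DNA. It assumes a simple single-nucleotide vocabulary; the vocabulary, the 15% mask rate, and the names VOCAB and mask_sequence are illustrative choices, not the configuration of any specific DNALM.

```python
import random

# Illustrative single-nucleotide vocabulary with a mask token.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}
MASK_RATE = 0.15  # BERT-style default; actual DNALMs may differ

def mask_sequence(seq, rng=random):
    """Return (input_ids, labels) with ~15% of positions masked.

    labels holds -100 (the usual 'ignore' index) everywhere except
    masked positions, where it keeps the original token id so the
    model is trained to recover the hidden base from context.
    """
    input_ids, labels = [], []
    for base in seq:
        tok = VOCAB[base]
        if rng.random() < MASK_RATE:
            input_ids.append(VOCAB["[MASK]"])
            labels.append(tok)      # model must predict this base
        else:
            input_ids.append(tok)
            labels.append(-100)     # excluded from the loss
    return input_ids, labels

ids, labels = mask_sequence("ACGTACGTGGTACC")
```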

This paper introduces BMFM-DNA, which incorporates a SNP-aware, variant-encoded pre-training approach to address a limitation of prior DNALMs: they are trained on the reference genome and overlook natural genomic variation. The authors explicitly note that it remains unresolved whether existing DNALM pre-training strategies genuinely learn key biological properties and consistently surpass traditional methods, motivating further evaluation and methodology development.
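One plausible way to encode variation into pre-training sequences is to replace reference bases at known SNP sites with IUPAC ambiguity codes, so the tokenizer's vocabulary carries the variant information. The sketch below illustrates that idea only; the IUPAC table and the helper encode_variants are hypothetical, and BMFM-DNA's actual variant encoding may differ.

```python
# IUPAC ambiguity codes for two-allele single-base SNPs.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def encode_variants(ref_seq, snps):
    """Inject SNPs into a reference sequence.

    snps maps a 0-based position to its alternate allele; the
    reference base at that position is replaced by the IUPAC code
    covering both alleles.
    """
    seq = list(ref_seq)
    for pos, alt in snps.items():
        ref = seq[pos]
        if alt != ref:
            seq[pos] = IUPAC[frozenset((ref, alt))]
    return "".join(seq)

# e.g. a G>A SNP at position 2 becomes the ambiguity code 'R'
print(encode_variants("ACGTACGT", {2: "A"}))  # -> "ACRTACGT"
```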

References

However, in eukaryotes, questions remain about whether the pre-training strategies of DNA LLMs (DNALMs) effectively capture key biological properties and consistently outperform traditional approaches.

BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects (arXiv:2507.05265, Li et al., 26 Jun 2025), Section 1 (Introduction)