Predicting gene expression directly from DNA sequence

Determine a reliable and accurate method for predicting gene expression levels directly from genomic DNA sequence, a task the source paper identifies as an open problem.

Background

The paper motivates the challenge by noting that gene expression is regulated by both proximal and distal genomic elements, and that current deep learning approaches struggle to capture long-range regulatory grammar and to generalize to unseen data. Benchmarks indicate limitations in modeling distal effects, underscoring the difficulty of deriving expression levels directly from sequence information.

GTA is proposed as a cross-modal approach that uses a frozen LLM with token alignment of genomic features to address these challenges. The authors nonetheless explicitly acknowledge that predicting gene expression from sequence remains an open problem, and they frame their contribution as part of an ongoing research effort.

References

Therefore, predicting gene expression from sequence remains an open problem, the solving of which may lead to pivotal discoveries related to improving human healthcare.

Long-range gene expression prediction with token alignment of large language model (2410.01858 - Honig et al., 2 Oct 2024) in Section 1 (Introduction)