
GENERator: A Long-Context Generative Genomic Foundation Model (2502.07272v3)

Published 11 Feb 2025 in cs.CL and q-bio.GN

Abstract: Advancements in DNA sequencing technologies have significantly improved our ability to decode genomic sequences. However, the prediction and interpretation of these sequences remain challenging due to the intricate nature of genetic material. LLMs have introduced new opportunities for biological sequence analysis. Recent developments in genomic LLMs have underscored the potential of LLMs in deciphering DNA sequences. Nonetheless, existing models often face limitations in robustness and application scope, primarily due to constraints in model structure and training data scale. To address these limitations, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences that translate into proteins structurally analogous to known families. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles. These capabilities position the GENERator as a pivotal tool for genomic research and biotechnological advancement, enhancing our ability to interpret and predict complex biological systems and enabling precise genomic interventions. Implementation details and supplementary resources are available at https://github.com/GenerTeam/GENERator.

Summary

  • The paper introduces a transformer-based generative model processing up to 98,000 base pairs with 1.2 billion parameters for eukaryotic DNA.
  • The paper demonstrates state-of-the-art performance in next K-mer prediction, sequence classification, and protein-coding sequence generation.
  • The paper highlights practical applications in precision medicine and synthetic biology by designing functionally relevant genomic sequences.

Evaluating GENERator: A Long-Context Generative Genomic Foundation Model

This paper presents GENERator, a substantial advancement in the application of LLMs to genomic sequence analysis and synthesis. Developed with a focus on the eukaryotic domain, this generative genomic foundation model incorporates architectural and training adaptations that allow it to handle long DNA sequences effectively, addressing challenges in both sequence interpretation and design.

Model Architecture and Training

GENERator employs a transformer decoder architecture, similar to state-of-the-art LLMs in NLP. In contrast to previous genomic models constrained by shorter context lengths, it is notable for its context length of 98,000 base pairs and its 1.2 billion parameters. The model was trained on a comprehensive dataset of 386 billion base pairs of eukaryotic DNA sourced from the RefSeq database, focusing on gene regions. This emphasis enables it to capture the semantic richness of biologically functional sequences and yields state-of-the-art performance across a range of genomic tasks.
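For readers who want to experiment with the released checkpoints, the following sketch shows how such a long-context causal genomic model might be loaded and prompted through the Hugging Face transformers API. The repository identifier and generation settings are illustrative assumptions, not values taken from the paper; consult the linked GitHub repository for the official model names and tokenizer details.

```python
# Minimal sketch of loading a long-context causal genomic LM and sampling a
# continuation for a DNA prompt. The repository id below is an assumption
# based on the paper's GitHub organization; check
# https://github.com/GenerTeam/GENERator for the released checkpoints.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "GenerTeam/GENERator-eukaryote-1.2b-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval()

# A short eukaryotic DNA prompt; real use cases can supply up to ~98 kb.
prompt = "ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGG"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(
        **inputs,
        max_new_tokens=256,   # each token covers several bp under K-mer tokenization
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```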

Robust Results

GENERator exhibits strong performance across a series of established benchmarks and newly proposed tasks, including next K-mer prediction and sequence classification. Its consistent accuracy reinforces the value of large-scale, semantically focused training data. These tasks also extend beyond short genomic sequences, preparing the field for practical applications involving sequences that better reflect the complexity of natural gene segments.
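Next K-mer prediction can be understood as ordinary next-token prediction over a K-mer vocabulary. The hedged sketch below estimates top-1 next-token accuracy on a held-out sequence; it illustrates the idea rather than reproducing the paper's exact benchmark protocol, and it reuses the `model` and `tokenizer` objects loaded in the previous snippet.

```python
# Hedged sketch of a next-K-mer prediction check: for each position in a
# held-out DNA sequence, take the model's most likely next token and compare
# it with the true token. Not the paper's exact evaluation protocol.
import torch

def next_kmer_accuracy(model, tokenizer, sequence: str) -> float:
    ids = tokenizer(sequence, return_tensors="pt").input_ids  # (1, T)
    with torch.no_grad():
        logits = model(ids).logits                            # (1, T, vocab)
    # Logits at position t predict the token at position t + 1.
    predictions = logits[:, :-1, :].argmax(dim=-1)
    targets = ids[:, 1:]
    return (predictions == targets).float().mean().item()

acc = next_kmer_accuracy(model, tokenizer, "ATGGCC" * 200)
print(f"next-token (K-mer) accuracy: {acc:.3f}")
```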

Biological Relevance and Sequence Design

Most notably, GENERator aligns with the central dogma of molecular biology, generating protein-coding sequences that translate into proteins structurally similar to known families. Performance on tasks involving the Histone and Cytochrome P450 protein families demonstrates its utility in producing biologically meaningful sequences with both functional fidelity and structural stability. GENERator also demonstrates a strong capability for designing promoter sequences with specified activity profiles, enabling precision in biological research and biotechnology applications.
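To make the central-dogma evaluation concrete, the sketch below takes a model-generated DNA string, extracts the first ATG-initiated open reading frame, and translates it into a protein with Biopython; the resulting peptide could then be compared against known families, for example via structure prediction. The prompt, helper function, and post-processing are illustrative placeholders rather than the paper's pipeline.

```python
# Illustrative sketch: trim a generated DNA sequence to its first ATG-initiated
# open reading frame and translate it to a protein. Placeholder logic, not the
# paper's exact post-processing.
import re
from Bio.Seq import Seq

def first_orf_protein(dna: str) -> str:
    """Translate the first ATG-initiated open reading frame found in `dna`."""
    dna = re.sub(r"[^ACGT]", "", dna.upper())
    start = dna.find("ATG")
    if start == -1:
        return ""
    orf = dna[start:]
    orf = orf[: len(orf) - len(orf) % 3]  # keep a whole number of codons
    return str(Seq(orf).translate(to_stop=True))

# `generated_dna` would come from model.generate(...) as in the first snippet.
generated_dna = "GGCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGTAA"
protein = first_orf_protein(generated_dna)
print(protein)  # the peptide could then be compared to known protein families
```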

Future Directions

While this model marks a notable achievement in genomic language modeling, future work involves extending its training to encompass prokaryotic and viral genomes, akin to its counterpart Evo, to provide comprehensive genomic analyses. This separation is practical for now given the differences between eukaryotic and prokaryotic systems. Further biological validation is planned to cement GENERator's role in precise genomic interventions, advancing synthetic biology and biotechnological applications.

Implications and Collaboration

GENERator provides a robust framework for understanding and manipulating genomic sequences, with implications for precision medicine, gene therapy, and synthetic biology. In pursuit of open research and collaboration, all accompanying resources, including data, code, and model weights, are expected to be made freely accessible on a designated platform.

This research emphasizes the value of domain-specific LLMs tailored to genomic sequences, advancing foundational understanding and application capabilities in genomics. As such, it sets a precedent for future research to harness transformer models' potential in biological contexts, providing richer insights into the complexities of life encoded within DNA.
