Genomic Next-Token Predictors
- Genomic next-token predictors are large-scale autoregressive models that predict the next DNA token in a sequence, adapting LLM modeling strategies to genomic annotation and generative tasks.
- They use tailored transformer architectures with specialized tokenization (k-mer, BPE, or single nucleotides) and long-range context processing to enable high-resolution genomic modeling.
- These models support multi-task and cross-modal learning, driving innovations in genomic classification, functional annotation, and generative design of DNA sequences.
Genomic next-token predictors are a class of large-scale, autoregressive sequence models trained to predict the next DNA token given a genomic context. Leveraging modeling strategies and architectural innovations from LLMs, these predictors have achieved state-of-the-art results in genomic sequence understanding, annotation, and generative tasks by optimizing over diverse datasets, scaling to long genomic contexts, and enabling cross-modal and multi-task learning. The following sections review the foundational principles, leading models, pretraining methodologies, evaluation strategies, multi-task/cross-modal extensions, and downstream implications of genomic next-token predictors.
1. Model Architectures and Tokenization Schemes
Autoregressive genomic models use causal decoders, typically based on transformer architectures, but with modifications tailored to DNA. Major design choices include the use of rotary position embeddings (RoPE), customized normalization (e.g., no-bias LayerNorm, RMSNorm), and tokenization strategies centering on k-mer or BPE vocabularies. Notable models and their key components are summarized below.
| Model | Architecture | Tokenization | Max Context |
|---|---|---|---|
| Omni-DNA | Auto-regressive Transformer (LLaMA/OLMo-like), 8–16 layers, 8–16 heads, sizes 20M–1B params | BPE, initial vocab 4096, variable k-mers | 250 tokens (∼250 nt) |
| HyenaDNA | Hyena block (implicit convolution replacing attention), 2–8 layers | 1-mer (single nucleotide) | up to 1M tokens (single-nucleotide resolution) |
| GENERator | Decoder-only Transformer (Llama-like), 26 layers, 32 heads, 1.2B params | Fixed 6-mer, vocab size 4128 | 16,384 tokens (∼98 kbp) |
BPE and fixed k-mer tokenizations (6-mer in GENERator) extract sequence motifs, while single-nucleotide vocabularies (HyenaDNA) enable high-resolution modeling. RoPE position embeddings have demonstrated lower pretraining loss compared to alternatives such as ALiBi, and enable improved extrapolation for longer sequences (Li et al., 5 Feb 2025, Wu et al., 11 Feb 2025, Nguyen et al., 2023).
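To make the tokenization trade-off concrete, the sketch below (illustrative Python with hypothetical helper names, not drawn from the cited codebases) contrasts single-nucleotide tokenization with non-overlapping fixed 6-mer tokenization of a DNA string.

```python
# Minimal sketch contrasting single-nucleotide and fixed k-mer tokenization.
# Vocabularies and helper names are illustrative, not taken from HyenaDNA or GENERator.
from itertools import product

NUCLEOTIDES = "ACGT"

# 1-mer vocabulary (HyenaDNA-style single-nucleotide resolution)
char_vocab = {nt: i for i, nt in enumerate(NUCLEOTIDES)}

# Fixed 6-mer vocabulary (GENERator-style); 4^6 = 4096 sequence tokens, and adding
# special tokens would bring the total near the reported 4128.
kmer_vocab = {"".join(kmer): i for i, kmer in enumerate(product(NUCLEOTIDES, repeat=6))}

def tokenize_1mer(seq: str) -> list[int]:
    """One token per nucleotide: maximal resolution, longest token sequences."""
    return [char_vocab[nt] for nt in seq]

def tokenize_6mer(seq: str) -> list[int]:
    """Non-overlapping 6-mers: roughly 6x shorter sequences, coarser resolution."""
    return [kmer_vocab[seq[i:i + 6]] for i in range(0, len(seq) - len(seq) % 6, 6)]

seq = "ACGTACGTACGT"
print(tokenize_1mer(seq))   # 12 tokens
print(tokenize_6mer(seq))   # 2 tokens
```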
2. Pretraining Objectives and Optimization
Genomic next-token predictors employ standard autoregressive next-token prediction (NTP), formalized as:

$$\mathcal{L}_{\mathrm{NTP}} = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right),$$

where $x_t$ is the token at position $t$ and $p_\theta(\cdot \mid x_{<t})$ is the output softmax distribution over the vocabulary, given the context $x_{<t}$.
Pretraining involves no masked-token objective; sequences are encoded strictly left-to-right under causal attention. Training typically occurs over large, deduplicated corpora comprising multi-species reference genomes (Omni-DNA), the human reference genome (HyenaDNA), or gene-rich RefSeq datasets covering multiple taxa (GENERator). Optimization protocols include AdamW with warmup and decay schedules, mixed-precision training, and batch sizes designed to maximize hardware utilization via distributed data parallelism (e.g., FSDP, DeepSpeed) (Li et al., 5 Feb 2025, Nguyen et al., 2023, Wu et al., 11 Feb 2025).
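A minimal PyTorch-style sketch of this objective is given below; the model interface and tensor names are assumptions for illustration, not the cited implementations.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model: torch.nn.Module, tokens: torch.Tensor) -> torch.Tensor:
    """Causal next-token-prediction loss.

    tokens: (batch, seq_len) integer token ids.
    model:  any causal decoder returning (batch, seq_len, vocab) logits (assumed interface).
    """
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict token t from tokens < t
    logits = model(inputs)                            # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),          # flatten to (N, vocab)
        targets.reshape(-1),                          # flatten to (N,)
    )
```

In practice this loss is minimized with AdamW under the warmup/decay schedules described above.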
3. Scaling, Data Regimes, and Training Setup
The effectiveness of next-token predictors is modulated by both model capacity and the scale/diversity of pretraining data:
- Omni-DNA: Trained on ∼30B unique nt (deduplicated from 170B nt), yielding ∼300B training tokens over multiple epochs, with model sizes ranging from 20M to 1B parameters. Sequences of 250 tokens are batch-processed (batch size 384) over 800K steps.
- HyenaDNA: Pretrained on the GRCh38 human genome (∼3.2B nt), with context lengths from 1K to 1M tokens. Utilizes small models (up to 6.6M params), but leverages sub-quadratic scaling for extremely long contexts.
- GENERator: Trained on 386B bp of annotated eukaryotic gene regions (30.9M protein-coding genes), with random 0–5 bp shifts prior to 6-mer tokenization and a batch size of ∼2M tokens (128 sequences per batch). Full pretraining comprised 6 epochs over 386B nucleotides and ∼185K steps on 32 A100 GPUs (Li et al., 5 Feb 2025, Nguyen et al., 2023, Wu et al., 11 Feb 2025).
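A minimal sketch of the random-shift augmentation described for GENERator, under the assumption that the shift simply drops a random 0–5 bp prefix before non-overlapping 6-mer chunking (illustrative helper, not the actual pipeline):

```python
import random

def shifted_6mer_chunks(seq: str, max_shift: int = 5) -> list[str]:
    """Drop a random 0-5 bp prefix so 6-mer boundaries vary between epochs."""
    shift = random.randint(0, max_shift)
    seq = seq[shift:]
    usable = len(seq) - len(seq) % 6          # truncate the trailing partial 6-mer
    return [seq[i:i + 6] for i in range(0, usable, 6)]

# Example: the same sequence yields different 6-mer segmentations across calls.
print(shifted_6mer_chunks("ACGTACGTACGTACGT"))
```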
Key hyperparameters and procedural elements are compared in the table below.
| Model | Corpus Size | Batch Size | Optimizer/Regime | Pretraining Steps |
|---|---|---|---|---|
| Omni-DNA | ∼30B nt | 384 | AdamW, β₁=0.9, β₂=0.95 | 800K |
| HyenaDNA | ∼3.2B nt | 64–256 | AdamW, β₁=0.9,β₂=0.999 | 10–20K |
| GENERator | 386B bp | ∼128 seqs | AdamW, β₁=0.9, β₂=0.95 | ∼185K |
4. Quantitative Performance and Benchmarking
Autoregressive next-token predictors have established superior performance relative to bidirectional masked language models (MLMs) and fixed-context attention baselines on standard benchmarks:
- Omni-DNA: On the Nucleotide Transformer (NT) downstream suite (18 tasks), Omni-DNA 1B attains an average score of 0.767 (MCC or F1, task-dependent), outperforming NT (2.5B, 0.709) and DNABERT-2 (120M, 0.679). On Genomic Benchmark (GB, 8 tasks), Omni-DNA (116M) achieves 0.879 accuracy, ranking first in 5 tasks and second in 3 (Li et al., 5 Feb 2025).
- HyenaDNA: For held-out human chromosomes, perplexity decreases with context length (e.g., 2.91 at 1M tokens, 8×256 model). HyenaDNA provides state-of-the-art results on 12/18 NT datasets and 7/8 GB datasets, often surpassing larger pretrained models with far fewer parameters and lower compute (Nguyen et al., 2023).
- GENERator: Delivers ∼15–20pp improvement in zero-shot next-K-mer accuracy over prior LMs, with next-16 bp accuracy of ∼58% at 512-token input (compared to ∼45% for BPE-8192 and ∼12% for 1-mer). The model maintains >65% accuracy for larger context sizes and outperforms NT-multi (2.5B, MLM) by 15–20pp across mammalian benchmarks (Wu et al., 11 Feb 2025).
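The next-K-mer accuracy metric cited above can be sketched as follows; the exact evaluation protocol may differ, and the generation interface here is a hypothetical placeholder.

```python
def next_kmer_accuracy(model_generate, contexts, references, k: int = 16) -> float:
    """Zero-shot next-K-mer accuracy (one plausible formulation, assumed here):
    a prediction counts as correct only if the generated continuation matches
    the reference exactly over the next k bases.

    model_generate(context, n_bases) -> str   # hypothetical generation interface
    contexts, references: parallel lists of DNA strings.
    """
    hits = 0
    for ctx, ref in zip(contexts, references):
        pred = model_generate(ctx, k)
        hits += int(pred[:k] == ref[:k])
    return hits / max(len(contexts), 1)
```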
These results confirm that scaling and well-designed tokenization, combined with long-range context modeling and autoregressive optimization, substantially improve both generation and comprehension of genomic sequences.
5. Multi-Task and Cross-Modal Finetuning
Genomic next-token predictors natively support multi-task and cross-modal extensions via vocabulary expansion and task prompting. This unified fine-tuning strategy allows many downstream tasks to be integrated within a single model: task-specific tokens and response outputs are appended to the embedding matrices, and prompts are prepended to unify input formats (e.g., for 10 acetylation and methylation tasks in Omni-DNA).
Key strategies for maintaining output distribution stability include introducing as few new tokens as possible and initializing them carefully, NEFTune-based embedding noise, and label-token replication. Multi-task fine-tuning yields synergistic improvements (e.g., Omni-DNA@mult achieves an average score of 0.739 across 10 tasks, surpassing single-task models), and supports cross-modal tasks such as DNA-to-text (DNA2Func) and DNA-to-image (“Needle-in-DNA”) with high macro F1 (e.g., 0.987 in DNA→image) (Li et al., 5 Feb 2025).
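One way to realize the vocabulary expansion and careful initialization described above is sketched below; the mean-initialization heuristic is an assumption for illustration rather than the exact scheme used by Omni-DNA.

```python
import torch

def expand_embeddings(embedding: torch.nn.Embedding, num_new_tokens: int) -> torch.nn.Embedding:
    """Append rows for new task/label tokens, initialized at the mean of the
    existing embeddings so the pretrained output distribution is minimally
    perturbed (a common heuristic; the cited papers may use other schemes)."""
    old_weight = embedding.weight.data
    mean_init = old_weight.mean(dim=0, keepdim=True).repeat(num_new_tokens, 1)
    new_weight = torch.cat([old_weight, mean_init], dim=0)
    expanded = torch.nn.Embedding(new_weight.size(0), new_weight.size(1))
    expanded.weight.data.copy_(new_weight)
    return expanded

# Example: add 10 task-prompt tokens plus 2 label tokens to a 4096-entry vocabulary.
emb = torch.nn.Embedding(4096, 512)
emb = expand_embeddings(emb, num_new_tokens=12)
print(emb.weight.shape)   # torch.Size([4108, 512])
```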
GENERator demonstrates prompt-based conditional sequence generation (e.g., designing promoter/enhancer activity profiles) and functional annotation, further highlighting the versatility of next-token autoregressive frameworks (Wu et al., 11 Feb 2025).
6. Alternative Sequence Architectures and Long-Range Modeling
HyenaDNA (Editor’s term: "implicit convolution transformer") replaces attention with implicit convolutional operators, achieving sub-quadratic scaling per layer and enabling context lengths up to 1M tokens, orders of magnitude beyond transformer-based models. This long-range capability allows, for the first time, in-context learning in genomics through soft prompting (learnable embeddings) and few-shot demonstration paradigms, all without further updating the main weights. Standard in-context learning strategies (prefix tuning and k-shot demonstration) recover performance approaching fine-tuning, especially when larger numbers of learnable prompts are used (Nguyen et al., 2023).
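A minimal sketch of the soft-prompting setup, assuming the frozen embedding layer and decoder backbone are passed in as nn.Module objects (generic interfaces, not the actual HyenaDNA API):

```python
import torch

class SoftPrompt(torch.nn.Module):
    """Learnable prompt embeddings prepended to a frozen backbone's input.

    backbone_embed: frozen token-embedding layer of the pretrained model.
    backbone:       frozen decoder module mapping embeddings to logits.
    (Both are assumed interfaces for illustration.)
    """
    def __init__(self, backbone_embed, backbone, n_prompt: int, d_model: int):
        super().__init__()
        self.embed = backbone_embed
        self.backbone = backbone
        self.prompt = torch.nn.Parameter(torch.randn(n_prompt, d_model) * 0.02)
        for p in list(self.embed.parameters()) + list(self.backbone.parameters()):
            p.requires_grad_(False)            # only the prompt embeddings are tuned

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)                               # (B, L, D)
        prompt = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.backbone(torch.cat([prompt, x], dim=1))     # (B, n_prompt + L, ...)
```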
A plausible implication is that alternative architectures with sub-quadratic complexity may play a critical role in extending next-token prediction to pan-genomic and variant effect modeling at full human/chromosomal scales.
7. Applications, Evaluation Paradigms, and Future Prospects
The unified next-token prediction paradigm enables diverse downstream genomic analyses:
- Classification: Single- or multi-label identification of sequence function (histone, promoter, splice, gene type, taxonomy).
- Functional Annotation: Sequence-to-text mapping (e.g., DNA2Func) producing gene or RNA class labels and descriptions.
- Generative Design: Synthesis of functional DNA, such as protein-coding sequences and regulatory elements, with outputs validated by protein structure prediction (e.g., AlphaFold) and structural similarity metrics (TM-score).
- Cross-Modal Generation: Mapping DNA to high-dimensional outputs (e.g., images), enabled by pre-discretization (VQ-VAE) and prompt engineering; see the quantization sketch after this list.
- In-Context Genomics: Soft prompting/few-shot learning for rapid adaptation to new classification or annotation tasks (Li et al., 5 Feb 2025, Wu et al., 11 Feb 2025, Nguyen et al., 2023).
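For the cross-modal bullet above, the pre-discretization step can be sketched as a nearest-neighbour lookup against a learned VQ-VAE codebook, after which the integer codes serve as additional vocabulary tokens; the function and tensor names below are illustrative assumptions, not the pipeline used in the cited work.

```python
import torch

def vq_codes(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous encoder latents to discrete code indices.

    latents:  (N, D) outputs of a (pretrained) VQ-VAE encoder.
    codebook: (K, D) learned code vectors.
    Returns (N,) integer codes that can be appended to the DNA token vocabulary.
    """
    distances = torch.cdist(latents, codebook)   # (N, K) Euclidean distances
    return distances.argmin(dim=-1)

# Example with random stand-in tensors (a real pipeline would use trained weights).
codes = vq_codes(torch.randn(8, 64), torch.randn(512, 64))
print(codes.shape)   # torch.Size([8])
```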
A next-token predictor trained on large-scale, raw DNA captures sequence grammar and higher-order motifs, enabling model scaling, cross-task generalization, and new capabilities in generative and in-context learning. Extensions to even longer contexts, alternative architectures (e.g., state-space and recurrent models), more flexible tokenization (e.g., alternative vector quantization schemes), and cross-omic applications represent active research directions.
In summary, genomic next-token predictors, exemplified by Omni-DNA, HyenaDNA, and GENERator, constitute a foundational modeling toolkit for modern computational genomics, fundamentally advancing the state-of-the-art in sequence modeling, function annotation, design, and interpretation (Li et al., 5 Feb 2025, Nguyen et al., 2023, Wu et al., 11 Feb 2025).