
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution (2306.15794v2)

Published 27 Jun 2023 in cs.LG and q-bio.GN

Abstract: Genomic (DNA) sequences encode an enormous amount of information for gene regulation and protein synthesis. Similar to natural language models, researchers have proposed foundation models in genomics to learn generalizable features from unlabeled genome data that can then be fine-tuned for downstream tasks such as identifying regulatory elements. Due to the quadratic scaling of attention, previous Transformer-based genomic models have used 512 to 4k tokens as context (<0.001% of the human genome), significantly limiting the modeling of long-range interactions in DNA. In addition, these methods rely on tokenizers or fixed k-mers to aggregate meaningful DNA units, losing single nucleotide resolution where subtle genetic variations can completely alter protein function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a language model based on implicit convolutions, was shown to match attention in quality while allowing longer context lengths and lower time complexity. Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide level - an up to 500x increase over previous dense attention-based models. HyenaDNA scales sub-quadratically in sequence length (training up to 160x faster than Transformer), uses single nucleotide tokens, and has full global context at each layer. We explore what longer context enables - including the first use of in-context learning in genomics. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and pretraining data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on 7 of 8 datasets, on average by +10 accuracy points. Code at https://github.com/HazyResearch/hyena-dna.

Citations (151)

Summary

  • The paper introduces HyenaDNA, a genomic foundation model that processes up to one million tokens at single-nucleotide resolution, an up to 500x context increase over prior dense attention-based models.
  • HyenaDNA trains up to 160x faster than Transformers and reaches state-of-the-art results on 12 of 18 Nucleotide Transformer benchmarks, highlighting its computational efficiency.
  • The Hyena architecture's implicit convolutions capture long-range genomic dependencies and enable the first use of in-context learning in genomics, with potential applications in precision medicine.

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

The paper presents HyenaDNA, a genomic foundation model that adapts the Hyena language model architecture to DNA sequences. It addresses the limitations of Transformer-based genomic models by extending usable context length while preserving single-nucleotide resolution. This work integrates biology with advanced computational models to decode and predict genomic sequences effectively.

Key Contributions

  1. Extended Context Length: HyenaDNA processes genomic sequences of up to one million tokens while maintaining single-nucleotide resolution, an up to 500x increase over previous models, which were constrained to roughly 512 to 4k tokens by the quadratic scaling of attention.
  2. Single Nucleotide Resolution: Moving beyond fixed k-mers, HyenaDNA uses single-nucleotide tokens, preserving the genetic nuances critical for tasks like SNP identification (see the tokenizer sketch after this list).
  3. Model Efficiency: HyenaDNA exhibits sub-quadratic complexity in sequence length, training up to 160x faster than traditional Transformers. This efficiency stems largely from Hyena's implicit convolutional operations.
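To make the single-nucleotide tokenization concrete, here is a minimal sketch of a character-level DNA tokenizer. The vocabulary and function names are illustrative assumptions, not the exact classes used in the HyenaDNA repository; the point is simply that each base maps to its own token, so no resolution is lost to k-mer aggregation.

```python
# Minimal sketch of single-nucleotide (character-level) tokenization.
# Vocabulary and names are illustrative, not the repo's exact tokenizer.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "[PAD]": 5}

def encode(seq: str) -> list[int]:
    """Map each nucleotide to its own token id (no k-mer aggregation)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]

def decode(ids: list[int]) -> str:
    """Invert the mapping, dropping padding tokens."""
    inv = {v: k for k, v in VOCAB.items()}
    return "".join(inv[i] for i in ids if inv[i] != "[PAD]")

tokens = encode("ACGTN" * 3)              # 15 tokens for 15 nucleotides
assert decode(tokens) == "ACGTNACGTNACGTN"
```

Because every nucleotide is its own token, a single-base substitution (an SNP) changes exactly one token rather than being absorbed into a k-mer.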

Numerical Results

HyenaDNA achieves state-of-the-art performance across numerous benchmarks. It outperforms previous models on 12 of 18 datasets from the Nucleotide Transformer suite while using significantly fewer parameters and less pretraining data. On the GenomicBenchmarks suite, it surpasses the prior state of the art on 7 of 8 datasets, improving accuracy by roughly 10 points on average and by up to 20 points on specific tasks such as enhancer identification.

Methodology

Hyena Architecture:

The model leverages Hyena's implicit convolution-based long-range capabilities. The architecture provides a global receptive field at each layer, which is critical for capturing long-range genomic dependencies. Training uses a curriculum-style sequence-length warm-up that gradually increases the context length, improving both stability and efficiency on ultra-long sequences.
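The efficiency claim rests on the fact that a long, implicitly parameterized convolution can be applied with the FFT in O(L log L) time, so every layer sees the full sequence without quadratic attention. The snippet below is a simplified sketch of that single operation under stated assumptions; the full Hyena operator also generates the filter with a small network over positional features and combines several such convolutions with data-controlled gating, which are omitted here.

```python
# Simplified sketch of the FFT-based long convolution behind Hyena's
# sub-quadratic global receptive field (gating and filter generation omitted).
import torch

def fft_long_conv(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Linear (causal) convolution of x (B, L, D) with a length-L filter k (L, D) via FFT."""
    L = x.shape[1]
    # Zero-pad to 2L so the circular FFT convolution equals a linear, causal one.
    x_f = torch.fft.rfft(x, n=2 * L, dim=1)
    k_f = torch.fft.rfft(k, n=2 * L, dim=0)
    y = torch.fft.irfft(x_f * k_f.unsqueeze(0), n=2 * L, dim=1)
    return y[:, :L]  # keep the first L (causal) outputs

B, L, D = 2, 1024, 8                      # batch, sequence length, channels
x = torch.randn(B, L, D)
k = torch.randn(L, D) / L                 # stand-in for the implicit, MLP-generated filter
y = fft_long_conv(x, k)                   # (B, L, D); each position can attend to all prior positions
```

Because the filter is as long as the input, each output position depends on the entire prefix of the sequence, yet the cost grows as L log L rather than L².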

Theoretical and Practical Implications

The reach of HyenaDNA extends beyond typical genomic prediction tasks. It enables, for the first time, in-context learning in genomics, where the model adapts to new tasks without retraining the model. This flexibility points toward personalized genomic prediction at the patient level, benefiting precision medicine and therapeutic interventions.
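As a rough illustration of the few-shot flavor of this capability, the sketch below assembles a prompt by concatenating labeled demonstration sequences, each followed by a class token, ahead of the query; the model is then asked to continue with the correct label token, with no weight updates. The separator and label tokens here are assumptions for illustration and not the exact vocabulary used in the paper, which explores both soft-prompting and instruction-tuned few-shot variants.

```python
# Hedged sketch of few-shot prompt assembly for a genomic classification task.
# Special tokens are illustrative placeholders, not the paper's actual vocabulary.
LABEL_TOKENS = {0: "[CLS_NEG]", 1: "[CLS_POS]"}
SEP = "[SEP]"

def build_icl_prompt(demos: list[tuple[str, int]], query: str) -> str:
    """Concatenate (sequence, label-token) demonstrations ahead of the query sequence."""
    parts = [f"{seq}{LABEL_TOKENS[label]}" for seq, label in demos]
    parts.append(query)  # the model should continue with the correct label token
    return SEP.join(parts)

prompt = build_icl_prompt(
    demos=[("ACGTACGTTTGA", 1), ("GGGCCATTACAG", 0)],
    query="ACGTACGTCCGA",
)
```

The long context is what makes this practical: many demonstration sequences can be packed into a single prompt without exceeding the model's one-million-token window.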

Future Directions

The paper opens several avenues for further research. Pretraining on a more diverse set of genomes could improve model generality and reduce bias. Extending the framework to multimodal biological data beyond DNA promises broader applicability, analogous to advances seen in multimodal AI.

Overall, HyenaDNA sets a new precedent in genomic sequence modeling by combining long-range interaction modeling with precise nucleotide-level detail while remaining computationally feasible. This work expands both the scope and depth of AI applications in genomics.