DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome (2306.15006v2)

Published 26 Jun 2023 in q-bio.GN, cs.AI, cs.CE, and cs.CL

Abstract: Decoding the linguistic intricacies of the genome is a crucial problem in biology, and pre-trained foundational models such as DNABERT and Nucleotide Transformer have made significant strides in this area. Existing works have largely hinged on k-mer, fixed-length permutations of A, T, C, and G, as the token of the genome language due to its simplicity. However, we argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE), a statistics-based data compression algorithm that constructs tokens by iteratively merging the most frequent co-occurring genome segment in the corpus. We demonstrate that BPE not only overcomes the limitations of k-mer tokenization but also benefits from the computational efficiency of non-overlapping tokenization. Based on these insights, we introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints, reduce time and memory expenditure, and enhance model capability. Furthermore, we identify the absence of a comprehensive and standardized benchmark for genome understanding as another significant impediment to fair comparative analysis. In response, we propose the Genome Understanding Evaluation (GUE), a comprehensive multi-species genome classification dataset that amalgamates $36$ distinct datasets across $9$ tasks, with input lengths ranging from $70$ to $10000$. Through comprehensive experiments on the GUE benchmark, we demonstrate that DNABERT-2 achieves comparable performance to the state-of-the-art model with $21 \times$ fewer parameters and approximately $92 \times$ less GPU time in pre-training.


Summary

  • The paper introduces a novel BPE-based genomic tokenization that overcomes the inefficiencies of traditional k-mer approaches by reducing redundancy and computational load.
  • It integrates techniques such as ALiBi and Flash Attention to remove fixed input-length limits and reduce time and memory costs on large-scale genomic benchmarks.
  • The launch of the GUE benchmark highlights DNABERT-2’s strength in cross-species analysis, achieving top rankings with substantially fewer computational resources.

An Analysis of DNABERT-2: A Foundation Model for Multi-Species Genomics

The paper "DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes" introduces a novel approach to genome tokenization and foundation model training, offering insights into effective genomic data analysis. This work addresses key inefficiencies of previous genomic LLMs, particularly those relying on k-mer tokenization, and introduces alternative methodologies to enhance computational and sample efficiencies.

Theoretical and Methodological Insights

The authors focus primarily on the inefficiencies inherent in k-mer tokenization, in both its overlapping and non-overlapping forms. These traditional methods have been popular because they are simple to apply to genomic data, but they carry costs in information content and computation. Overlapping k-mer tokenization, while information-rich, produces heavy redundancy between adjacent tokens: each token shares most of its characters with its neighbors, so masked language modeling becomes nearly trivial due to information leakage. Non-overlapping k-mer tokenization is computationally more efficient but sample-inefficient, because near-identical sequences that differ by a single-base shift map to entirely different token sequences.
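
To make the contrast concrete, the following is a minimal sketch of the two schemes; the helper functions and the example sequence are illustrative and not taken from the paper's code.

```python
def overlapping_kmers(seq: str, k: int = 6) -> list[str]:
    """Slide a window of size k one base at a time (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def nonoverlapping_kmers(seq: str, k: int = 6) -> list[str]:
    """Chop the sequence into consecutive non-overlapping k-mers (stride k)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

seq = "ATTGGCACTG"

print(overlapping_kmers(seq, 3))
# ['ATT', 'TTG', 'TGG', 'GGC', 'GCA', 'CAC', 'ACT', 'CTG']
# Adjacent tokens share k-1 characters, so a masked token is largely
# recoverable from its neighbours (the information-leakage problem).

print(nonoverlapping_kmers(seq, 3))
# ['ATT', 'GGC', 'ACT']
# Shifting the sequence by a single base would yield a completely different
# token list, which is the sample-inefficiency problem noted above.
```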

To overcome these drawbacks, the authors propose using Byte Pair Encoding (BPE) for genomic tokenization. This approach, extensively validated in natural language processing, balances efficiency and effectiveness by building a compact, statistics-driven vocabulary: tokens are formed by iteratively merging the most frequent co-occurring segments in the corpus. In the authors' experiments, BPE has been shown to substantially shorten tokenized sequences, improving processing efficiency and learning effectiveness.
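
The toy sketch below illustrates the BPE merge rule described above, repeatedly merging the most frequent adjacent pair of symbols. Production tokenizers are considerably more sophisticated; this function and its toy corpus are only a didactic approximation.

```python
from collections import Counter

def train_bpe(sequences: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn a list of merge rules, starting from single nucleotides."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of current tokens across the corpus.
        pair_counts = Counter()
        for toks in corpus:
            pair_counts.update(zip(toks, toks[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # most frequent pair (ties: first seen)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the new merge everywhere in the corpus.
        for toks in corpus:
            i = 0
            while i < len(toks) - 1:
                if (toks[i], toks[i + 1]) == best:
                    toks[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(train_bpe(["ATATGCGC", "ATGCATGC"], num_merges=4))
# [('A', 'T'), ('G', 'C'), ('AT', 'GC'), ...] for this toy corpus
```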

Implementation and Benchmarking

Based on these insights, the paper presents DNABERT-2, a new foundation model trained with a BPE vocabulary of empirically chosen size. DNABERT-2 not only sidesteps the limitations of k-mer tokenization but also incorporates Attention with Linear Biases (ALiBi) and Flash Attention to further improve its capability and flexibility. Together, these components let the model handle varying input sequence lengths without a hard positional limit and reduce time and memory costs, allowing it to match or outperform DNABERT and the Nucleotide Transformer variants on large-scale genomic benchmarks at a fraction of the compute.
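
As a rough illustration of the ALiBi idea, the sketch below builds a symmetric (encoder-style) linear bias that is added to attention scores before the softmax; the slope schedule follows the geometric series from the ALiBi paper, while the function name and tensor shapes are illustrative assumptions rather than DNABERT-2's actual implementation.

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear penalty proportional to query-key distance."""
    # Slopes form a geometric sequence starting at 2^(-8 / num_heads).
    start = 2.0 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(num_heads)])
    # Absolute distance |i - j| between query position i and key position j.
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()      # (seq_len, seq_len)
    return -slopes[:, None, None] * dist            # (num_heads, seq_len, seq_len)

scores = torch.randn(8, 16, 16)        # (heads, queries, keys) raw attention scores
scores = scores + alibi_bias(8, 16)    # add the distance penalty before softmax
attn = torch.softmax(scores, dim=-1)
```

Because the penalty depends only on relative distance, no learned position embeddings are needed, which is what lets the model extrapolate to inputs longer than those seen in pre-training.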

The introduction of the Genome Understanding Evaluation (GUE) benchmark is another notable contribution. The benchmark is curated for a well-calibrated difficulty that reflects model capability, spanning a wide range of tasks, from species classification to transcription factor prediction, across multiple species, with 36 datasets covering 9 tasks and input lengths from 70 to 10,000. Serving as a standardized platform for model evaluation, GUE facilitates fair comparative analyses and could stimulate further advances in genomic modeling.
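
As an illustration of how such a standardized benchmark might be used in practice, the sketch below fine-tunes a sequence-classification head with the HuggingFace Transformers API on one hypothetical GUE-style task. The checkpoint id, CSV file names, and column layout are assumptions; consult the official DNABERT-2 repository for the released artifacts and scripts.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "zhihan1996/DNABERT-2-117M"   # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, trust_remote_code=True
)

# GUE tasks are binary/multi-class classification over labelled DNA sequences;
# a generic CSV with "sequence" and "label" columns stands in for one task split.
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
data = data.map(
    lambda x: tokenizer(x["sequence"], truncation=True,
                        padding="max_length", max_length=512),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=data["train"],
    eval_dataset=data["test"],
)
trainer.train()
```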

Numerical and Performance Outcomes

Significant numerical outcomes underscore the improvements achieved by DNABERT-2. It reaches performance comparable to the state-of-the-art model with roughly 21 times fewer parameters and approximately 92 times less GPU time in pre-training, relative to NT-2500M-multi. Moreover, after additional domain-specific pre-training, DNABERT-2 shows further incremental performance gains.

On the GUE benchmark, DNABERT-2 consistently ranks at the top, particularly excelling in cross-species tasks, which affirms the effectiveness of multi-species pre-training. Its robustness across a diverse array of genomic tasks, including those with long input sequences, illustrates the model's applicability to broad genomic studies.

Implications and Future Directions

Practically, the methodologies presented have significant implications for genomic data processing, with potential impact on areas such as the study of transcriptional regulation, variant-effect prediction, and the analysis of genetic disorders. The adaptability and efficiency of DNABERT-2 make it a valuable tool for genome-scale modeling and bioinformatics applications.

Theoretically, this work encourages the genomic modeling community to reconsider tokenization strategies, exploring data-efficient techniques beyond traditional methodologies. BPE's application in genome sequences sets a precedent for blending NLP advancements with genomics.

Future research may extend towards optimizing short-sequence modeling and tackling the inherent challenges in extremely long genomic sequences. Additionally, exploring training targets that consider the bidirectional nature of DNA may yield further advancements in model performance and biological inference.

Overall, DNABERT-2 stands as a significant step forward in genomic language modeling, simultaneously addressing computational challenges and setting new standards for benchmarking in genome analysis.
