
BEND: Benchmarking DNA Language Models on biologically meaningful tasks (2311.12570v4)

Published 21 Nov 2023 in q-bio.GN and cs.LG

Abstract: The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modelling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA Language Models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://github.com/frederikkemarin/BEND.

Authors (7)
  1. Frederikke Isa Marin
  2. Felix Teufel
  3. Marc Horlacher
  4. Dennis Madsen
  5. Dennis Pultz
  6. Ole Winther
  7. Wouter Boomsma

Summary

  • The paper introduces BEND, a benchmark that evaluates DNA language models on seven biologically meaningful tasks including gene finding and enhancer annotation.
  • The methodology uses a uniform CNN framework to assess various models, revealing that NT-MS excels in some tasks while struggling with long-range genomic interactions.
  • The results emphasize the need for improved strategies in modeling distant genomic dependencies, guiding future advancements in precision medicine and functional genomics.

Benchmarking DNA Language Models on Biologically Meaningful Tasks

The paper "BEND: Benchmarking DNA Language Models on biologically meaningful tasks" introduces BEND, a benchmark for evaluating DNA language models (DNA LMs) on tasks derived from biologically significant processes. Rapid advances in genome sequencing contrast with the slow pace of annotating the functional elements within these genomes, making genomic data fertile ground for unsupervised language modelling. BEND provides a standardized framework of seven curated tasks, enabling a comprehensive assessment of DNA LMs' capabilities.

Tasks and Approach

BEND covers a spectrum of genomic tasks: gene finding, enhancer annotation, chromatin accessibility, histone modification, CpG methylation, and noncoding variant effect prediction (for both expression and disease). These tasks were selected to probe both local sequence understanding and the ability to capture long-range genomic interactions. By integrating tasks of varying length scales and feature types, BEND offers a robust evaluation setup for DNA LMs.

The evaluation covers a diverse suite of models, including established DNA LMs such as the Nucleotide Transformer and DNABERT, alongside newly trained baselines such as an AWD-LSTM and a dilated ResNet LM. All models are assessed through a uniform framework: a two-layer CNN is trained on the embeddings produced by each LM while the LM's own weights remain frozen. This setup ensures comparability and isolates the contribution of the unsupervised embeddings to downstream performance.
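The frozen-embedding probing setup described above can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: the hidden width, kernel sizes, and class count here are assumed placeholder values, and the random tensor stands in for embeddings that a frozen DNA LM would produce.

```python
# Sketch of a two-layer CNN probe over frozen per-nucleotide LM embeddings.
# Hyperparameters (hidden=64, kernel_size=9, n_classes=9) are illustrative
# assumptions, not BEND's exact values.
import torch
import torch.nn as nn

class TwoLayerCNNProbe(nn.Module):
    """Per-position classifier trained on top of a frozen DNA LM."""
    def __init__(self, embed_dim: int, n_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, n_classes, kernel_size=9, padding=4),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, seq_len, embed_dim) from the frozen LM;
        # Conv1d expects channels first, so transpose in and out.
        return self.net(emb.transpose(1, 2)).transpose(1, 2)

# In practice embeddings would come from the frozen LM, e.g.:
#   with torch.no_grad(): emb = dna_lm(tokens).last_hidden_state
emb = torch.randn(2, 512, 768)              # stand-in for LM output
probe = TwoLayerCNNProbe(embed_dim=768, n_classes=9)
logits = probe(emb)                          # (2, 512, 9) per-position logits
```

Only the probe's parameters receive gradients during downstream training, so differences in task performance can be attributed to the quality of each LM's embeddings rather than to fine-tuning capacity.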

Results and Insights

The results highlight both the potential and the limitations of existing DNA LMs. Notably, the Nucleotide Transformer trained on a multi-species dataset (NT-MS) emerges as a strong performer across most tasks, though its results are inconsistent relative to specialized expert methods. For instance, while NT-MS performs comparably to the state-of-the-art gene finder AUGUSTUS, it does not surpass traditional supervised models like Basset for chromatin accessibility.

A critical takeaway is the challenge of modeling long-range dependencies in genomic data. Enhancer annotation, which inherently requires understanding interactions over tens of kilobases, remains particularly difficult for all models, emphasizing a gap in current LM capabilities. The sparse nature and extensive range of genomic signals demand refined architectures or novel training objectives that can effectively leverage distant contextual information.

The analysis of variant effect predictions also yields intriguing insights. Despite being unsupervised, certain DNA LMs achieve reasonable performance on variant effect prediction; DNABERT in particular shows promise when assessing effects on gene expression. However, the generally low zero-shot performance indicates the need for complementary approaches or enhancements to LM training paradigms.
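One common zero-shot scheme of the kind discussed here scores a variant by how much the LM's embedding at the variant position shifts when the reference base is replaced by the alternate. The sketch below illustrates the idea under stated assumptions: `embed` is a hypothetical stand-in for a frozen DNA LM, and the choice of cosine distance is one option among several (other works use log-likelihood ratios instead).

```python
# Minimal sketch of embedding-distance zero-shot variant effect scoring.
# `embed` is a deterministic stand-in for a frozen DNA LM (one 768-d
# vector per nucleotide); in a real pipeline it would be an LM forward pass.
import numpy as np

def embed(seq: str) -> np.ndarray:
    """Hypothetical LM: same sequence always yields the same embeddings."""
    rng = np.random.default_rng(abs(hash(seq)) % (2**32))
    return rng.standard_normal((len(seq), 768))

def zero_shot_effect(seq: str, pos: int, alt: str) -> float:
    """Cosine distance between ref and alt embeddings at the variant site."""
    ref_vec = embed(seq)[pos]
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    alt_vec = embed(alt_seq)[pos]
    cos = ref_vec @ alt_vec / (np.linalg.norm(ref_vec) * np.linalg.norm(alt_vec))
    return 1.0 - cos  # larger distance -> larger predicted effect

score = zero_shot_effect("ACGTACGTAC", pos=4, alt="G")  # in [0, 2]
```

Because no labels are used, such scores depend entirely on what the pretrained embeddings encode, which is why weak zero-shot results point back at the pretraining objective rather than the downstream head.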

Implications and Future Directions

The establishment of BEND highlights the progress and existing hurdles in using LLMs for genomic data analysis. The benchmark not only assesses model performance but also illuminates the specific genomic features captured by different embedding strategies, thus refining the development pipeline for future DNA LMs.

Such advances could expand the capabilities of bioinformatics tools, particularly in integrating genomic insights into applications such as precision medicine and functional genomics research. The ability to add new tasks and extend the benchmark to additional organisms also positions BEND as a valuable resource for evaluating cross-species generalization, a critical aspect of transfer learning in genomics.

The journey towards effective long-range genomic modeling remains unfinished, motivating the exploration of innovative model architectures or hybrid approaches that can better harness the complexity of genomic sequences. As DNA LMs continue to evolve, benchmarks like BEND will remain pivotal in standardizing evaluations and guiding methodological advances in this dynamic field.
