
BEND: Benchmarking DNA Language Models on biologically meaningful tasks (2311.12570v4)

Published 21 Nov 2023 in q-bio.GN and cs.LG

Abstract: The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modelling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA Language Models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://github.com/frederikkemarin/BEND.

Authors (7)
  1. Frederikke Isa Marin
  2. Felix Teufel
  3. Marc Horlacher
  4. Dennis Madsen
  5. Dennis Pultz
  6. Ole Winther
  7. Wouter Boomsma

Summary

  • The paper introduces BEND, a benchmark that evaluates DNA language models on seven biologically meaningful tasks including gene finding and enhancer annotation.
  • The methodology uses a uniform CNN framework to assess various models, revealing that NT-MS excels in some tasks while struggling with long-range genomic interactions.
  • The results emphasize the need for improved strategies in modeling distant genomic dependencies, guiding future advancements in precision medicine and functional genomics.

Benchmarking DNA Language Models on Biologically Meaningful Tasks

The paper "BEND: Benchmarking DNA Language Models on biologically meaningful tasks" introduces BEND, a benchmark for evaluating DNA language models (DNA LMs) on tasks derived from biologically significant processes. Rapid advances in genome sequencing contrast with the slow pace of annotating the functional elements within these genomes, making genomic data fertile ground for unsupervised language modelling. BEND provides a standardized framework of seven curated tasks, enabling a comprehensive assessment of DNA LMs' capabilities.

Tasks and Approach

BEND covers a spectrum of genomic tasks: gene finding, enhancer annotation, chromatin accessibility, histone modification, CpG methylation, and noncoding variant effect prediction (for both expression and disease). These tasks were selected to probe both local sequence understanding and the ability to capture long-range genomic interactions. By integrating tasks of varying length scales and feature types, BEND offers a robust evaluation setup for DNA LMs.

The evaluation covers a diverse suite of models, including established DNA LMs such as the Nucleotide Transformer and DNABERT, alongside newly trained baselines such as an AWD-LSTM and a dilated ResNet LM. All models are assessed through a uniform framework: a two-layer CNN is trained on the embeddings produced by each LM while the LM's own weights remain frozen. This setup ensures comparability and isolates the contribution of the unsupervised embeddings to downstream performance.
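The frozen-embedding probing setup described above can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: the hidden width, kernel sizes, and class count here are assumed placeholder values, and the random tensor stands in for embeddings that a frozen DNA LM would produce.

```python
# Sketch of a two-layer CNN probe over frozen per-nucleotide LM embeddings.
# Hyperparameters (hidden=64, kernel_size=9, n_classes=9) are illustrative
# assumptions, not BEND's exact values.
import torch
import torch.nn as nn

class TwoLayerCNNProbe(nn.Module):
    """Per-position classifier trained on top of a frozen DNA LM."""
    def __init__(self, embed_dim: int, n_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(embed_dim, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, n_classes, kernel_size=9, padding=4),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (batch, seq_len, embed_dim) from the frozen LM;
        # Conv1d expects channels first, so transpose in and out.
        return self.net(emb.transpose(1, 2)).transpose(1, 2)

# In practice embeddings would come from the frozen LM, e.g.:
#   with torch.no_grad(): emb = dna_lm(tokens).last_hidden_state
emb = torch.randn(2, 512, 768)              # stand-in for LM output
probe = TwoLayerCNNProbe(embed_dim=768, n_classes=9)
logits = probe(emb)                          # (2, 512, 9) per-position logits
```

Only the probe's parameters receive gradients during downstream training, so differences in task performance can be attributed to the quality of each LM's embeddings rather than to fine-tuning capacity.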

Results and Insights

The results highlight both the potential and the limitations of existing DNA LMs. Notably, the Nucleotide Transformer trained on a multi-species dataset (NT-MS) emerges as a strong performer across most tasks, though its results are inconsistent relative to specialized expert methods. For instance, while NT-MS performs comparably to the state-of-the-art gene finder AUGUSTUS, it does not surpass traditional supervised models like Basset for chromatin accessibility.

A critical takeaway is the challenge of modeling long-range dependencies in genomic data. Enhancer annotation, which inherently requires understanding interactions over tens of kilobases, remains particularly difficult for all models, emphasizing a gap in current LM capabilities. The sparse nature and extensive range of genomic signals demand refined architectures or novel training objectives that can effectively leverage distant contextual information.

The analysis of variant effect predictions also yields intriguing insights. Despite being unsupervised, certain DNA LMs achieve reasonable performance on variant effect prediction; DNABERT in particular shows promise when assessing effects on gene expression. However, the generally low zero-shot performance indicates the need for complementary approaches or enhancements to LM training paradigms.
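One common zero-shot scheme of the kind discussed here scores a variant by how much the LM's embedding at the variant position shifts when the reference base is replaced by the alternate. The sketch below illustrates the idea under stated assumptions: `embed` is a hypothetical stand-in for a frozen DNA LM, and the choice of cosine distance is one option among several (other works use log-likelihood ratios instead).

```python
# Minimal sketch of embedding-distance zero-shot variant effect scoring.
# `embed` is a deterministic stand-in for a frozen DNA LM (one 768-d
# vector per nucleotide); in a real pipeline it would be an LM forward pass.
import numpy as np

def embed(seq: str) -> np.ndarray:
    """Hypothetical LM: same sequence always yields the same embeddings."""
    rng = np.random.default_rng(abs(hash(seq)) % (2**32))
    return rng.standard_normal((len(seq), 768))

def zero_shot_effect(seq: str, pos: int, alt: str) -> float:
    """Cosine distance between ref and alt embeddings at the variant site."""
    ref_vec = embed(seq)[pos]
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    alt_vec = embed(alt_seq)[pos]
    cos = ref_vec @ alt_vec / (np.linalg.norm(ref_vec) * np.linalg.norm(alt_vec))
    return 1.0 - cos  # larger distance -> larger predicted effect

score = zero_shot_effect("ACGTACGTAC", pos=4, alt="G")  # in [0, 2]
```

Because no labels are used, such scores depend entirely on what the pretrained embeddings encode, which is why weak zero-shot results point back at the pretraining objective rather than the downstream head.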

Implications and Future Directions

The establishment of BEND highlights the progress and existing hurdles in using LLMs for genomic data analysis. The benchmark not only assesses model performance but also illuminates the specific genomic features captured by different embedding strategies, thus refining the development pipeline for future DNA LMs.

Such advances could expand the capabilities of bioinformatics tools, particularly in integrating genomic insights into applications such as precision medicine and functional genomics research. The ability to add new tasks and extend the benchmark to additional organisms also positions BEND as a valuable resource for evaluating cross-species generalization, a critical aspect of transfer learning in genomics.

The journey towards effective long-range genomic modeling remains unfinished, motivating the exploration of innovative model architectures or hybrid approaches that can better harness the complexity of genomic sequences. As DNA LMs continue to evolve, benchmarks like BEND will remain pivotal in standardizing evaluations and guiding methodological advances in this dynamic field.
