BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model (2505.23579v1)

Published 29 May 2025 in cs.LG

Abstract: Unlocking deep, interpretable biological reasoning from complex genomic data is a major AI challenge hindering scientific discovery. Current DNA foundation models, despite strong sequence representation, struggle with multi-step reasoning and lack inherent transparent, biologically intuitive explanations. We introduce BioReason, a pioneering architecture that, for the first time, deeply integrates a DNA foundation model with an LLM. This novel connection enables the LLM to directly process and reason with genomic information as a fundamental input, fostering a new form of multimodal biological understanding. BioReason's sophisticated multi-step reasoning is developed through supervised fine-tuning and targeted reinforcement learning, guiding the system to generate logical, biologically coherent deductions. On biological reasoning benchmarks including KEGG-based disease pathway prediction - where accuracy improves from 88% to 97% - and variant effect prediction, BioReason demonstrates an average 15% performance gain over strong single-modality baselines. BioReason reasons over unseen biological entities and articulates decision-making through interpretable, step-by-step biological traces, offering a transformative approach for AI in biology that enables deeper mechanistic insights and accelerates testable hypothesis generation from genomic data. Data, code, and checkpoints are publicly available at https://github.com/bowang-lab/BioReason

Summary

Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model

This paper presents an approach to integrating DNA foundation models with LLMs in order to improve biological reasoning over genomic data. Current DNA foundation models excel at sequence representation for tasks such as predicting variant effects and identifying regulatory elements, yet their "black box" nature means they rarely produce transparent, biologically intuitive explanations. Conversely, LLMs trained with techniques such as supervised fine-tuning and reinforcement learning demonstrate strong reasoning skills but lack an architecture for processing raw genomic sequences. The paper addresses both gaps by introducing BioReason, a multimodal framework that couples the two model types.

Architecture and Methodology

BioReason integrates a DNA foundation model, such as Evo2 or Nucleotide Transformer, with Qwen3 variants as the LLM backbone. This architecture processes genomic sequences directly alongside text-based queries, so that DNA-based embeddings and natural-language tokens feed into the same model. As a result, BioReason expresses its biological deductions and predictions in natural language. Training combines supervised fine-tuning with reinforcement learning via Group Relative Policy Optimization (GRPO). According to the paper, this multimodal design allows BioReason to reason over unseen biological entities and improves performance on the benchmarks described in the evaluation.
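The two mechanisms named above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not describe how DNA embeddings are mapped into the LLM's input space, so the linear projection in `fuse_embeddings` is an assumption, and every function and variable name is hypothetical. `group_relative_advantages` shows the group-normalized advantage commonly associated with GRPO, where each sampled response's reward is standardized against the other responses drawn for the same prompt; the reward design used in the paper is not specified here.

```python
from statistics import mean, pstdev


def project(vectors, weight):
    """Multiply each row vector (length d_in) by a weight matrix (d_in x d_out)."""
    d_out = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]


def fuse_embeddings(dna_emb, text_emb, weight):
    """Assumed fusion: project DNA embeddings to the LLM hidden size,
    then place them in front of the text token embeddings."""
    return project(dna_emb, weight) + text_emb


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Toy sizes: 2 DNA positions with 2-dim embeddings, LLM hidden size 3.
    dna = [[1.0, 2.0], [0.0, 1.0]]
    text = [[9.0, 9.0, 9.0]]
    w = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
    fused = fuse_embeddings(dna, text, w)
    print(len(fused), len(fused[0]))  # 3 rows, each of width 3

    # Four sampled answers to one prompt, rewarded 1.0 when correct.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In this sketch, responses that score above their group mean receive positive advantages and the rest receive negative ones, so no separate value network is needed to provide a baseline. When every response in a group earns the same reward, all advantages are zero.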

Evaluation and Results

The authors curated datasets drawing on ClinVar and OMIM, alongside a new KEGG-derived reasoning dataset, to benchmark BioReason's capabilities. Evaluation focused on multi-step mechanistic reasoning, disease pathway prediction, and variant effect prediction. BioReason consistently surpassed single-modality models in accuracy while also producing interpretable reasoning traces. On KEGG-based disease pathway prediction, accuracy improved from 88% to 97%, and across benchmarks the paper reports an average performance gain of 15% over strong single-modality baselines.
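Because the abstract reports both a specific accuracy change on the KEGG task (88% to 97%) and a separate average gain of 15%, it helps to be explicit about what kind of "improvement" a figure describes. The short sketch below computes three common readings of the KEGG numbers; the function name is hypothetical, and only the 88 and 97 inputs come from the paper's abstract.

```python
def accuracy_gain(baseline_pct, improved_pct):
    """Return (absolute gain in points, relative gain in %, error reduction in %)."""
    absolute_points = improved_pct - baseline_pct
    relative_pct = absolute_points / baseline_pct * 100
    # Error rate is the complement of accuracy on a 0-100 scale.
    error_reduction_pct = absolute_points / (100 - baseline_pct) * 100
    return absolute_points, relative_pct, error_reduction_pct


if __name__ == "__main__":
    points, relative, error_cut = accuracy_gain(88, 97)
    print(points)               # 9
    print(round(relative, 1))   # 10.2
    print(round(error_cut, 1))  # 75.0
```

Read this way, the KEGG result is a 9-point absolute gain, roughly a 10% relative gain, and a 75% reduction in error rate. The 15% figure is the average across benchmarks, not the KEGG result.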

On variant effect prediction tasks, BioReason's hybrid models outperformed both DNA-only and LLM-only approaches, which suggests that combining sequence representation with step-by-step deduction helps on this task. The authors present these results as evidence that BioReason can support testable hypothesis generation, with potential applications in precision medicine and genomics research.

Implications and Future Directions

The research has both practical and theoretical implications. Practically, BioReason offers a tool for deciphering complex disease mechanisms and accelerating hypothesis generation from genomic data. Theoretically, it establishes a framework for AI-driven biological studies based on multimodal integration. Future work could extend BioReason to other biological sequences and refine its reasoning across more diverse datasets.

The paper identifies current limitations, such as computational overhead and dataset biases, and acknowledges the need for better uncertainty quantification and model scalability. Future advancements may involve incorporating orthologous sequences, adapting to RNA and protein modalities, and applying BioReason in genome-wide studies.

Conclusion

BioReason advances computational biology by bridging DNA foundation models with LLMs. Its ability to generate interpretable reasoning traces makes it a promising tool for genomics, offering deeper mechanistic insights to support scientific discovery. With further refinement and expansion into new biological domains, BioReason could broaden the use of AI in genomics and precision medicine.

Authors (11)

GitHub

bowang-lab/BioReason · GitHub

Tweets

https://twitter.com/razoralign/status/1929607224969908686

https://twitter.com/BoWang87/status/1946787889750417821

https://twitter.com/arnavshah0/status/1929338000153936249

https://twitter.com/AllThingsApx/status/1929894629995925751

https://twitter.com/Dr_Alex_Crimi/status/1930383419080364105

https://twitter.com/tmramalho/status/1938436169764614248
