BioDisco: Modular Hypothesis Generation Framework

Updated 18 May 2026
  • BioDisco is a modular multi-agent framework for automated biomedical hypothesis discovery that integrates language models, knowledge graphs, and literature retrieval.
  • It employs seven specialized agents that orchestrate iterative hypothesis refinement and scoring, leveraging both textual and structured evidence.
  • Its dual-mode evidence integration and temporal evaluation protocols enable statistically principled assessments of novelty and significance.

BioDisco is a modular multi-agent framework for automated hypothesis generation in biomedicine. It systematically integrates LLM-based reasoning with a dual-mode evidence system, leveraging both biomedical knowledge graphs and automated literature retrieval to ground hypotheses in up-to-date, high-precision evidence. BioDisco advances the state-of-the-art by embedding an internal iterative feedback and scoring mechanism, as well as a statistically principled temporal evaluation protocol for assessing future-discoverability and novelty. Flexibility in architecture, tool integration, and evaluation makes BioDisco practical for diverse hypothesis discovery pipelines in biomedical research (Ke et al., 2 Aug 2025).

1. Multi-Agent Orchestration and Workflow

BioDisco employs seven specialized agents coordinated via a central Planner (A₇). Each agent addresses a distinct phase of hypothesis generation and refinement:

  • Background (A₁): Retrieves and summarizes relevant literature from PubMed based on user input $U$.
  • Explorer (A₂): Extracts a focused subgraph $G' \subseteq G$ from a biomedical knowledge graph (e.g., PrimeKG) by mapping user queries to graph nodes and relation subsets.
  • Scientist (A₃): Synthesizes an initial hypothesis set $H_0$ leveraging both textual and graph-based evidence $(B, G')$.
  • Critic (A₄): Computes multi-dimensional feedback scores for each candidate hypothesis.
  • Reviewer (A₅): Analyses low-performing hypotheses for evidence gaps and triggers targeted expansions (additional literature or KG subgraphs).
  • Refiner (A₆): Revises hypotheses based on Reviewer feedback and new evidence.
  • Planner (A₇): Manages iteration, terminating early if hypotheses reach a threshold quality $S^*$ or iterating up to $T = 3$ cycles.

The data flow at iteration $t$ is:

$$H_t = \text{Refiner}_t\bigl(\text{Reviewer}_t\bigl(\text{Critic}_t(H_{t-1}, B_{t-1}, G'_{t-1})\bigr)\bigr)$$

with $B_t$ and $G'_t$ updated by new evidence from the Background and Explorer agents.
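The control flow above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not BioDisco's actual code: the function `run_refinement_loop`, the `toy_*` stand-ins, and the "every hypothesis meets the threshold" stopping rule are all hypothetical, and the evidence updates to $B_t$ and $G'_t$ are omitted for brevity.

```python
from typing import Callable, List, Tuple


def run_refinement_loop(
    initial: List[str],
    critic: Callable[[str], float],
    reviewer: Callable[[str, float], str],
    refiner: Callable[[str, str], str],
    threshold: float,
    max_iters: int = 3,
) -> Tuple[List[str], List[float], int]:
    """Score, review, and refine hypotheses for at most max_iters cycles.

    Returns the final hypotheses, their scores, and the number of
    refinement cycles that were actually run.
    """
    hypotheses = list(initial)
    for cycles_done in range(max_iters):
        scores = [critic(h) for h in hypotheses]
        # Assumed stopping rule: every hypothesis meets the threshold.
        if all(s >= threshold for s in scores):
            return hypotheses, scores, cycles_done
        # Only hypotheses below the threshold are reviewed and refined.
        hypotheses = [
            refiner(h, reviewer(h, s)) if s < threshold else h
            for h, s in zip(hypotheses, scores)
        ]
    return hypotheses, [critic(h) for h in hypotheses], max_iters


# Toy stand-ins (hypothetical): the score counts "[ev]" evidence markers.
def toy_critic(hypothesis: str) -> float:
    return float(hypothesis.count("[ev]"))


def toy_reviewer(hypothesis: str, score: float) -> str:
    return "[ev]"  # pretend the review always requests one more evidence item


def toy_refiner(hypothesis: str, feedback: str) -> str:
    return f"{hypothesis} {feedback}"


if __name__ == "__main__":
    print(run_refinement_loop(["A"], toy_critic, toy_reviewer, toy_refiner, threshold=2))
```

In a real pipeline each callable would wrap an LLM-backed agent, and the Reviewer step would also trigger new literature or KG retrieval before the Refiner runs.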

2. Dual-Mode Evidence Integration

Evidence collection in BioDisco comprises two channels:

  • a) Knowledge Graph (KG) Retrieval: User input $U$ is mapped to one or more graph nodes via cosine similarity search in the BioSimCSE–BioLinkBERT embedding space. The Explorer retrieves the subgraph $G'$ by traversing neighbors outward from the matched nodes up to a configurable hop depth and filtering on a chosen set of relation types. The resulting subgraph is summarized by an LLM.
  • b) Automated Literature Retrieval: The system formulates Boolean PubMed queries from keywords, hypotheses, and critiques using TIAB/MeSH fields and temporal filters. If too little literature is retrieved, the constraints are recursively relaxed. Abstracts and metadata are summarized into the literature background $B$.
  • c) Scoring Over Combined Evidence: The Critic agent produces a total score for each hypothesis according to:

[equation lost in extraction]

with default parameter values [lost in extraction], and each scoring term derived from four underlying components: novelty, relevance, significance, and verifiability.
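The KG retrieval step in (a) amounts to a bounded breadth-first traversal with a relation filter. The sketch below is one plain-Python way to express that idea; the function name `k_hop_subgraph`, the triple representation, and the toy node and relation names are assumptions for illustration, and edges are treated as undirected during traversal, which the source does not specify.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def k_hop_subgraph(
    triples: Iterable[Triple],
    seeds: Iterable[str],
    max_hops: int,
    allowed_relations: Set[str],
) -> Set[Triple]:
    """Collect triples reachable from the seeds within max_hops hops,
    keeping only edges whose relation type is allowed."""
    # Index allowed edges by both endpoints (undirected traversal, assumed).
    adjacency: Dict[str, List[Triple]] = {}
    for head, relation, tail in triples:
        if relation not in allowed_relations:
            continue
        adjacency.setdefault(head, []).append((head, relation, tail))
        adjacency.setdefault(tail, []).append((head, relation, tail))

    depth = {seed: 0 for seed in seeds}
    queue = deque(depth)
    kept: Set[Triple] = set()
    while queue:
        node = queue.popleft()
        if depth[node] >= max_hops:
            continue  # do not expand beyond the hop limit
        for triple in adjacency.get(node, []):
            kept.add(triple)
            other = triple[2] if triple[0] == node else triple[0]
            if other not in depth:
                depth[other] = depth[node] + 1
                queue.append(other)
    return kept


# Toy graph with placeholder names (not real biomedical facts).
TOY_TRIPLES = [
    ("gene_A", "interacts_with", "gene_B"),
    ("gene_B", "interacts_with", "gene_C"),
    ("gene_A", "associated_with", "disease_Y"),
    ("gene_A", "targeted_by", "drug_X"),
]

if __name__ == "__main__":
    allowed = {"interacts_with", "associated_with"}
    print(sorted(k_hop_subgraph(TOY_TRIPLES, ["gene_A"], 1, allowed)))
```

Against a production graph store the same traversal would normally be pushed down into the database query rather than run in application code.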

3. Iterative Feedback, Scoring, and Refinement

Each hypothesis is scored against four metrics, each mapped to a discrete scale [scale lost in extraction], yielding a total

[equation lost in extraction]

If [condition lost in extraction], the Reviewer prompts targeted evidence expansions. The refinement protocol is formalized as:

[equation lost in extraction]

This scoring-driven, agent-mediated loop improves groundedness and novelty in hypotheses across successive refinements.

4. Temporal and Comparative Evaluation

BioDisco introduces a temporal evaluation protocol that strictly constrains all evidence (LLM training cutoff, KG version, PubMed publication dates) to a fixed cutoff date, then measures alignment with discoveries made after that cutoff. Two key tasks are employed:

  • Semantic Similarity: A generated hypothesis is compared to its gold (ground-truth) counterpart using cosine distance in embedding space:

[equation lost in extraction]

On the Qi dataset, generated/gold pairs yield median [value lost in extraction] vs. negative controls at [value lost in extraction].

  • Relation Classification: Using TruthHypo (post-2024 entity pairs), a classifier maps [mapping lost in extraction]. Macro F$_1$ [value lost in extraction], accuracy = 0.85 on 300 held-out samples, with per-category precision and recall shown below:

Category        Precision   Recall
Chemical–Gene   0.90        0.90
Disease–Gene    0.82        0.82
Gene–Gene       0.83        0.83
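Both evaluation tasks rest on simple arithmetic, sketched below. The helper names are hypothetical, and the embedding vectors are assumed to be plain lists of floats from whatever encoder is in use. Cosine distance is taken as one minus cosine similarity. The macro-F$_1$ helper takes the unweighted mean of per-category F$_1$, which is an assumption about how the source aggregates; applied to the rounded table entries it gives 0.85, which is arithmetic on the table, not a figure quoted from the source.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance, defined here as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def macro_f1(per_class: Dict[str, Tuple[float, float]]) -> float:
    """Unweighted mean of per-class F1 over (precision, recall) pairs."""
    return sum(f1(p, r) for p, r in per_class.values()) / len(per_class)


# Per-category (precision, recall) as listed in the table above.
TABLE = {
    "Chemical-Gene": (0.90, 0.90),
    "Disease-Gene": (0.82, 0.82),
    "Gene-Gene": (0.83, 0.83),
}

if __name__ == "__main__":
    print(round(macro_f1(TABLE), 2))
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Because precision equals recall in every row, each per-category F$_1$ equals that shared value, so the macro average is simply the mean of the three table entries.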

A plausible implication is that BioDisco's evidence integration and iterative cycle yield predictions more aligned with subsequent scientific findings than ablated architectures.

5. Bradley–Terry Paired Comparison and Human Judgement

The Bradley–Terry model is used for system ablation and for comparing human vs. LLM preferences in pairwise hypothesis assessment. The paired probability is

[equation lost in extraction]

with logit

[equation lost in extraction]

where one parameter encodes each system's latent ability and another adjusts for presentation order. The fitted abilities have non-overlapping 95% confidence intervals across the pipeline ablations, indicating that tool integration plus iterative refinement achieves the strongest preference scores for novelty and significance.
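For reference, a standard Bradley–Terry formulation with an order effect, consistent with the description above, can be written as follows. The notation is assumed and may differ from the source: $\theta_i$ denotes the latent ability of system $i$, $\gamma$ the presentation-order effect favouring the hypothesis shown first, and $i \succ j$ the event that $i$ (shown first) is preferred over $j$.

```latex
\[
P(i \succ j)
  = \frac{\exp(\theta_i + \gamma)}{\exp(\theta_i + \gamma) + \exp(\theta_j)},
\qquad
\operatorname{logit} P(i \succ j) = \theta_i - \theta_j + \gamma .
\]
```

Setting $\gamma = 0$ recovers the plain Bradley–Terry model, in which only the difference in abilities determines the preference probability.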

Human evaluation by domain experts in cardiovascular and immunology research, quantified via a Bayesian polytomous Rasch model,

[equation lost in extraction]

further indicates improvement in hypothesis ratings, especially for novelty and significance, after the refinement cycle.
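One common polytomous Rasch model is the partial credit model, sketched below to show how ordered rating categories get their probabilities. This is an assumption about the model family, not the source's specification: which latent quantities the source uses (for example hypothesis quality or rater severity) and its Bayesian priors are not recoverable here, and the function name `partial_credit_probs` is hypothetical.

```python
import math
from typing import List


def partial_credit_probs(theta: float, thresholds: List[float]) -> List[float]:
    """Category probabilities P(X = k), k = 0..m, under a partial credit model.

    theta is the latent trait; thresholds are the m step difficulties.
    Category k has log-weight sum_{j <= k} (theta - thresholds[j]),
    with category 0 fixed at log-weight 0.
    """
    log_weights = [0.0]
    for step in thresholds:
        log_weights.append(log_weights[-1] + (theta - step))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_weights)
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Three ordered rating categories with two step difficulties.
    print(partial_credit_probs(theta=0.5, thresholds=[-1.0, 1.0]))
```

With a single threshold the model reduces to the dichotomous Rasch model, where the probability of the higher category is the logistic function of `theta - threshold`.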

6. Experimental Evidence and Ablation Results

Empirical results demonstrate that BioDisco (i) substantially improves semantic similarity to ground-truth hypotheses over unrelated controls, and (ii) achieves high F$_1$ on held-out relation classification. Ablation studies across four metrics (novelty, significance, relevance, verifiability) and five system configurations show monotonic increases in ability scores from single-LLM baselines through multi-agent, tooling, and iterative refinement stages, with the full BioDisco pipeline performing best. Tool access (KG + literature) provides greater benefit than iterative refinement alone, though both are synergistic.

7. System Modularity, Deployment, and Usage

BioDisco is available as a Python library with full modularity, allowing researchers to specify custom LLMs, connect to distinct biomedical KGs (via Neo4j, GraphQL, etc.), and substitute literature sources through an interchangeable retrieval API. Generation and refinement of hypotheses are accessible via a minimal code interface:

[code example lost in extraction]

The system's architecture, agent separation, and pluggability enable rapid adaptation to new biomedical domains and evidence sources with minimal code changes. This suggests tractability for integration into customized or larger-scale scientific discovery workflows (Ke et al., 2 Aug 2025).

