BioDisco: Modular Hypothesis Generation Framework

Updated 18 May 2026

BioDisco is a modular multi-agent framework for automated biomedical hypothesis discovery that integrates language models, knowledge graphs, and literature retrieval.
It employs seven specialized agents that orchestrate iterative hypothesis refinement and scoring, leveraging both textual and structured evidence.
Its dual-mode evidence integration and temporal evaluation protocols enable statistically principled assessments of novelty and significance.

BioDisco is a modular multi-agent framework for automated hypothesis generation in biomedicine. It integrates LLM-based reasoning with a dual-mode evidence system, drawing on both biomedical knowledge graphs and automated literature retrieval to ground hypotheses in up-to-date, high-precision evidence. BioDisco advances the state of the art by embedding an internal iterative feedback and scoring mechanism, together with a statistically principled temporal evaluation protocol for assessing future discoverability and novelty. Flexibility in architecture, tool integration, and evaluation makes BioDisco practical for diverse hypothesis discovery pipelines in biomedical research (Ke et al., 2 Aug 2025).

1. Multi-Agent Orchestration and Workflow

BioDisco employs seven specialized agents coordinated via a central Planner (A₇). Each agent addresses a distinct phase of hypothesis generation and refinement:

Background (A₁): Retrieves and summarizes relevant literature from PubMed based on user input $U$.
Explorer (A₂): Extracts a focused subgraph $G'\subseteq G$ from a biomedical knowledge graph (e.g., PrimeKG) by mapping user queries to graph nodes and relation subsets.
Scientist (A₃): Synthesizes an initial hypothesis set $H_0$ leveraging both textual and graph-based evidence $(B, G')$.
Critic (A₄): Computes multi-dimensional feedback scores for each candidate hypothesis.
Reviewer (A₅): Analyses low-performing hypotheses for evidence gaps and triggers targeted expansions (additional literature or KG subgraphs).
Refiner (A₆): Revises hypotheses based on Reviewer feedback and new evidence.
Planner (A₇): Manages iteration, terminating early if hypotheses reach a threshold quality $S^*$ or iterating up to $T=3$ cycles.

The data flow at iteration $t$ is:

$H_t = \text{Refiner}_t(\text{Reviewer}_t(\text{Critic}_t(H_{t-1}, B_{t-1}, G'_{t-1})))$

with $B_t$ and $G'_t$ updated by new evidence from the Background and Explorer agents.
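The Planner's control flow can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BioDisco's implementation: `Hypothesis`, `run_refinement`, and the `critic` and `refine` callables are hypothetical names, the Reviewer step is folded into `refine`, and the stopping rule (every hypothesis reaches the threshold) is one plausible reading of "terminating early if hypotheses reach a threshold quality".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    score: float = 0.0


def run_refinement(
    initial: List[Hypothesis],
    critic: Callable[[Hypothesis], float],
    refine: Callable[[Hypothesis], Hypothesis],
    threshold: float,
    max_iters: int = 3,  # T = 3 in the text
) -> List[Hypothesis]:
    """Score, then refine low scorers, for at most max_iters cycles."""
    hypotheses = list(initial)
    for h in hypotheses:
        h.score = critic(h)
    for _ in range(max_iters):
        # Assumed early-stopping rule: all hypotheses meet the threshold S*.
        if all(h.score >= threshold for h in hypotheses):
            break
        # Only hypotheses below the threshold are revised.
        hypotheses = [h if h.score >= threshold else refine(h) for h in hypotheses]
        for h in hypotheses:
            h.score = critic(h)
    return hypotheses
```

In a real pipeline the `critic` and `refine` callables would wrap LLM calls and evidence updates; here they are left as parameters so the loop itself can be tested in isolation.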

2. Dual-Mode Evidence Integration

Evidence collection in BioDisco comprises two channels:

a) Knowledge Graph (KG) Retrieval: User input $U$ is mapped to one or more graph nodes via cosine similarity search in the BioSimCSE–BioLinkBERT embedding space. The Explorer retrieves the subgraph $G'\subseteq G$ by traversing direct neighbors up to a configurable hop depth and filtering on a chosen set of relation types. The resulting subgraph $G'$ is summarized by an LLM.
b) Automated Literature Retrieval: The system formulates Boolean PubMed queries from keywords, hypotheses, and critiques using TIAB/MeSH fields and temporal filters. If insufficient literature is retrieved, constraints are recursively relaxed. Abstracts and metadata are summarized into the background evidence $B$.
c) Scoring Over Combined Evidence: The Critic agent produces a total score for each hypothesis, with each scoring term derived from four underlying components: novelty, relevance, significance, and verifiability (the exact scoring formula and its default settings are given in Ke et al., 2 Aug 2025).
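The KG retrieval channel can be illustrated with a small, self-contained sketch: match a query embedding to graph nodes by cosine similarity, then collect edges within a fixed number of hops while filtering on relation type. All names here (`match_nodes`, `k_hop_subgraph`, the toy embeddings and triples) are hypothetical, edges are treated as undirected for traversal, and a real deployment would query an embedding index and a graph database rather than in-memory dictionaries.

```python
import math
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_nodes(query_vec: List[float],
                node_vecs: Dict[str, List[float]],
                top_k: int = 1) -> List[str]:
    """Return the top_k node names most similar to the query embedding."""
    ranked = sorted(node_vecs,
                    key=lambda n: cosine(query_vec, node_vecs[n]),
                    reverse=True)
    return ranked[:top_k]


def k_hop_subgraph(edges: Iterable[Edge],
                   seeds: Iterable[str],
                   hops: int,
                   relations: Set[str]) -> Set[Edge]:
    """Collect edges reachable within `hops` steps over allowed relations."""
    adjacency: Dict[str, List[Edge]] = {}
    for head, rel, tail in edges:
        if rel in relations:  # relation-type filter
            adjacency.setdefault(head, []).append((head, rel, tail))
            adjacency.setdefault(tail, []).append((head, rel, tail))
    visited = set(seeds)
    frontier = set(seeds)
    kept: Set[Edge] = set()
    for _ in range(hops):
        next_frontier: Set[str] = set()
        for node in frontier:
            for edge in adjacency.get(node, []):
                kept.add(edge)
                other = edge[2] if edge[0] == node else edge[0]
                if other not in visited:
                    visited.add(other)
                    next_frontier.add(other)
        frontier = next_frontier
    return kept
```

Keeping node matching and traversal as separate functions mirrors the two steps described above and lets either be swapped out independently.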

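3. Iterative Feedback, Scoring, and Refinement

Each hypothesis in the current set $H_t$ is scored against the four metrics, each mapped to a discrete scale, and the per-metric scores are combined into a total score. If the total falls below the threshold $S^*$, the Reviewer prompts targeted evidence expansions, and the Refiner then revises the hypothesis using the new evidence, following the iteration data flow given in Section 1 (the exact scale and the formal update rule are given in Ke et al., 2 Aug 2025).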

This scoring-driven, agent-mediated loop improves groundedness and novelty in hypotheses across successive refinements.
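A small sketch of the scoring step may help make the threshold check concrete. It assumes the total is a plain sum of the four per-metric scores and uses integer scores as a stand-in for the discrete scale; both are assumptions for illustration, and `total_score`, `needs_expansion`, and `weakest_metric` are hypothetical helper names rather than part of BioDisco.

```python
from typing import Dict

METRICS = ("novelty", "relevance", "significance", "verifiability")


def total_score(scores: Dict[str, int]) -> int:
    """Sum the four per-metric scores (assumed aggregation rule)."""
    missing = [m for m in METRICS if m not in scores]
    if missing:
        raise ValueError(f"missing metric scores: {missing}")
    return sum(scores[m] for m in METRICS)


def needs_expansion(scores: Dict[str, int], threshold: int) -> bool:
    """True when the total is below the quality threshold S*."""
    return total_score(scores) < threshold


def weakest_metric(scores: Dict[str, int]) -> str:
    """Metric with the lowest score, a natural target for new evidence."""
    return min(METRICS, key=lambda m: scores[m])
```

A Reviewer-like component could use `weakest_metric` to decide whether to request more literature or a larger KG subgraph, although how BioDisco makes that choice is not specified here.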

4. Temporal and Comparative Evaluation

BioDisco introduces a temporal evaluation protocol that strictly constrains all evidence (LLM cutoff, KG version, PubMed publication dates) to a fixed cutoff date, then measures alignment with post-cutoff discoveries. Two key tasks are employed:

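Semantic Similarity: A generated hypothesis $h$ is compared to its ground-truth counterpart $h^{*}$ using cosine similarity in embedding space:

$\text{sim}(h, h^{*}) = \frac{\mathbf{e}_h \cdot \mathbf{e}_{h^{*}}}{\lVert \mathbf{e}_h \rVert \, \lVert \mathbf{e}_{h^{*}} \rVert}$

where $\mathbf{e}_x$ denotes the embedding of text $x$. On the Qi dataset, generated/gold pairs yield a higher median similarity than negative controls (numerical values are reported in Ke et al., 2 Aug 2025).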
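The comparison between matched pairs and negative controls can be sketched as follows. The vectors are toy stand-ins for sentence embeddings, and `cosine_similarity` and `median_similarity` are hypothetical helper names; this illustrates the kind of summary statistic described, not the paper's evaluation code.

```python
import math
from statistics import median
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def median_similarity(pairs: List[Tuple[Vector, Vector]]) -> float:
    """Median cosine similarity over (generated, reference) embedding pairs."""
    return median(cosine_similarity(g, r) for g, r in pairs)


# Toy data: matched pairs point in similar directions, controls do not.
matched = [([1.0, 0.0], [1.0, 0.0]),
           ([1.0, 1.0], [1.0, 0.0]),
           ([0.0, 1.0], [0.0, 1.0])]
controls = [([1.0, 0.0], [0.0, 1.0]),
            ([1.0, 0.0], [1.0, 1.0]),
            ([0.0, 1.0], [1.0, 0.0])]

matched_median = median_similarity(matched)
control_median = median_similarity(controls)
```

Negative controls are typically built by pairing each generated hypothesis with an unrelated reference, so a clear gap between the two medians suggests the similarity is not an artifact of shared vocabulary.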

Relation Classification: Using TruthHypo (post-2024 entity pairs), a classifier maps each entity pair to a relation label. Accuracy is 0.85 on 300 held-out samples, with a high macro F$_1$; precision and recall per category are shown below:

Category	Precision	Recall
Chemical–Gene	0.90	0.90
Disease–Gene	0.82	0.82
Gene–Gene	0.83	0.83
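As a worked check on the table, per-category F$_1$ can be derived from precision and recall and then averaged. This assumes macro F$_1$ is the unweighted mean over these three categories, which is an assumption about how the metric was aggregated rather than something the table states.

```python
# Precision and recall per category, copied from the table above.
per_category = {
    "Chemical-Gene": (0.90, 0.90),
    "Disease-Gene": (0.82, 0.82),
    "Gene-Gene": (0.83, 0.83),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


f1_by_category = {name: f1(p, r) for name, (p, r) in per_category.items()}
macro_f1 = sum(f1_by_category.values()) / len(f1_by_category)
print(round(macro_f1, 2))  # 0.85
```

Because precision equals recall in every row, each per-category F$_1$ equals that shared value, and their mean of about 0.85 agrees with the reported accuracy.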

A plausible implication is that BioDisco's evidence integration and iterative cycle yield predictions more aligned with subsequent scientific findings than ablated architectures.

5. Bradley–Terry Paired Comparison and Human Judgement

Evaluation via the Bradley–Terry model is used for system ablation and for comparing human vs. LLM preferences in pairwise hypothesis assessment. The probability that system $i$ is preferred over system $j$ is

$P(i \succ j) = \frac{\exp(\lambda_i + \delta)}{\exp(\lambda_i + \delta) + \exp(\lambda_j)}$

with logit

$\operatorname{logit} P(i \succ j) = \lambda_i - \lambda_j + \delta$

where $\lambda_i$ encodes the latent ability of system $i$ and $\delta$ adjusts for presentation order. The fitted model yields non-overlapping 95% confidence intervals across the pipeline ablations, confirming that tool integration plus iterative refinement achieves the strongest preference scores for novelty and significance.
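A minimal sketch of a Bradley–Terry comparison with a presentation-order term is given below. It uses the standard logistic form, in which the log-odds of the first-presented system winning are the ability difference plus an order offset; the function names and the convention that the offset favors the first-presented item are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple


def bt_win_prob(ability_first: float,
                ability_second: float,
                order_effect: float = 0.0) -> float:
    """Probability that the first-presented system is preferred."""
    logit = ability_first - ability_second + order_effect
    return 1.0 / (1.0 + math.exp(-logit))


def log_likelihood(abilities: Dict[str, float],
                   order_effect: float,
                   comparisons: List[Tuple[str, str, bool]]) -> float:
    """Log-likelihood of outcomes given as (first, second, first_won)."""
    total = 0.0
    for first, second, first_won in comparisons:
        p = bt_win_prob(abilities[first], abilities[second], order_effect)
        total += math.log(p if first_won else 1.0 - p)
    return total
```

Fitting the model means choosing abilities and the order term that maximize this log-likelihood (or, in a Bayesian treatment, combining it with priors); the optimizer is omitted to keep the sketch short.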

Human evaluation by domain experts in cardiovascular and immunology research, quantified via a Bayesian polytomous Rasch model (specified in Ke et al., 2 Aug 2025), further indicates improvement in hypothesis ratings, especially for novelty and significance, after the refinement cycle.

6. Experimental Evidence and Ablation Results

Empirical results demonstrate that BioDisco (i) substantially improves semantic similarity to ground-truth hypotheses over unrelated controls, and (ii) achieves high F$_1$ on held-out relation classification. Ablation studies across four metrics (novelty, significance, relevance, verifiability) and five system configurations show monotonic increases in ability scores from single-LLM baselines through multi-agent, tooling, and iterative refinement stages, with the full BioDisco pipeline performing best. Tool access (KG + literature) provides greater benefit than iterative refinement alone, though both are synergistic.

7. System Modularity, Deployment, and Usage

BioDisco is available as a Python library with full modularity, allowing researchers to specify custom LLMs, connect to distinct biomedical KGs (via Neo4j, GraphQL, etc.), and substitute literature sources through an interchangeable retrieval API. Generation and refinement of hypotheses are exposed through a minimal code interface.

The system's architecture, agent separation, and pluggability enable rapid adaptation to new biomedical domains and evidence sources with minimal code changes. This suggests tractability for integration into customized or larger-scale scientific discovery workflows (Ke et al., 2 Aug 2025).
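The pluggability described here can be illustrated with a structural interface for evidence sources. The sketch below is not BioDisco's API: `EvidenceSource`, `StaticLiterature`, `StaticGraph`, and `gather_evidence` are hypothetical names chosen for illustration, and the in-memory backends stand in for real PubMed or graph-database connectors.

```python
from typing import List, Protocol, Tuple


class EvidenceSource(Protocol):
    """Anything that can return evidence snippets for a query."""

    def retrieve(self, query: str) -> List[str]: ...


class StaticLiterature:
    """Toy literature backend over an in-memory list of abstracts."""

    def __init__(self, abstracts: List[str]) -> None:
        self.abstracts = abstracts

    def retrieve(self, query: str) -> List[str]:
        return [a for a in self.abstracts if query.lower() in a.lower()]


class StaticGraph:
    """Toy KG backend over (head, relation, tail) triples."""

    def __init__(self, triples: List[Tuple[str, str, str]]) -> None:
        self.triples = triples

    def retrieve(self, query: str) -> List[str]:
        q = query.lower()
        return [f"{h} {r} {t}" for h, r, t in self.triples
                if q in (h.lower(), t.lower())]


def gather_evidence(query: str, sources: List[EvidenceSource]) -> List[str]:
    """Concatenate evidence from every configured source, in order."""
    evidence: List[str] = []
    for source in sources:
        evidence.extend(source.retrieve(query))
    return evidence
```

Because `EvidenceSource` is a `Protocol`, any class with a matching `retrieve` method can be passed in without inheriting from a base class, which is one common way to keep retrieval backends interchangeable.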

References (1)

Ke et al. (2025). BioDisco: Multi-agent hypothesis generation with dual-mode evidence, iterative feedback and temporal evaluation.
