BioDisco: Modular Hypothesis Generation Framework
- BioDisco is a modular multi-agent framework for automated biomedical hypothesis discovery that integrates language models, knowledge graphs, and literature retrieval.
- It employs seven specialized agents that orchestrate iterative hypothesis refinement and scoring, leveraging both textual and structured evidence.
- Its dual-mode evidence integration and temporal evaluation protocols enable statistically principled assessments of novelty and significance.
BioDisco is a modular multi-agent framework for automated hypothesis generation in biomedicine. It systematically integrates LLM-based reasoning with a dual-mode evidence system, leveraging both biomedical knowledge graphs and automated literature retrieval to ground hypotheses in up-to-date, high-precision evidence. BioDisco advances the state-of-the-art by embedding an internal iterative feedback and scoring mechanism, as well as a statistically principled temporal evaluation protocol for assessing future-discoverability and novelty. Flexibility in architecture, tool integration, and evaluation makes BioDisco practical for diverse hypothesis discovery pipelines in biomedical research (Ke et al., 2 Aug 2025).
1. Multi-Agent Orchestration and Workflow
BioDisco employs seven specialized agents coordinated via a central Planner (A₇). Each agent addresses a distinct phase of hypothesis generation and refinement:
- Background (A₁): Retrieves and summarizes relevant literature from PubMed based on the user input.
- Explorer (A₂): Extracts a focused subgraph from a biomedical knowledge graph (e.g., PrimeKG) by mapping user queries to graph nodes and relation subsets.
- Scientist (A₃): Synthesizes an initial hypothesis set from both textual and graph-based evidence.
- Critic (A₄): Computes multi-dimensional feedback scores for each candidate hypothesis.
- Reviewer (A₅): Analyses low-performing hypotheses for evidence gaps and triggers targeted expansions (additional literature or KG subgraphs).
- Refiner (A₆): Revises hypotheses based on Reviewer feedback and new evidence.
- Planner (A₇): Manages iteration, terminating early if hypotheses reach a threshold quality and otherwise iterating up to a maximum number of cycles.
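At each iteration, data flows through the agents in the order listed: the current hypotheses go to the Critic for scoring, then to the Reviewer for gap analysis and to the Refiner for revision, with the literature and KG evidence updated by new evidence from the Background and Explorer agents.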
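The control flow described above can be illustrated with a minimal sketch. The names `Hypothesis`, `run_planner`, and the toy critic, reviewer, and refiner below are hypothetical stand-ins chosen for illustration, not BioDisco's actual interface. The sketch assumes a single numeric score per hypothesis and a fixed quality threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    score: float = 0.0
    notes: List[str] = field(default_factory=list)  # feedback applied so far


def run_planner(
    hypotheses: List[Hypothesis],
    critic: Callable[[Hypothesis], float],
    reviewer: Callable[[Hypothesis], str],
    refiner: Callable[[Hypothesis, str], Hypothesis],
    threshold: float,
    max_cycles: int,
) -> List[Hypothesis]:
    """Score, review, and refine until all hypotheses pass or cycles run out."""
    for h in hypotheses:
        h.score = critic(h)
    for _ in range(max_cycles):
        if all(h.score >= threshold for h in hypotheses):
            break  # early termination: every hypothesis meets the threshold
        revised = []
        for h in hypotheses:
            if h.score < threshold:
                h = refiner(h, reviewer(h))  # revise using reviewer feedback
                h.score = critic(h)          # re-score the revised hypothesis
            revised.append(h)
        hypotheses = revised
    return hypotheses


# Toy stand-ins (assumptions): the score grows with each round of feedback.
def toy_critic(h: Hypothesis) -> float:
    return float(len(h.notes))


def toy_reviewer(h: Hypothesis) -> str:
    return f"add evidence for: {h.text}"


def toy_refiner(h: Hypothesis, feedback: str) -> Hypothesis:
    return Hypothesis(h.text, notes=h.notes + [feedback])


result = run_planner(
    [Hypothesis("gene A modulates pathway B")],
    toy_critic, toy_reviewer, toy_refiner,
    threshold=2.0, max_cycles=5,
)
```

In this toy run the loop stops as soon as the single hypothesis reaches the threshold, without using the full cycle budget.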
2. Dual-Mode Evidence Integration
Evidence collection in BioDisco comprises two channels:
- a) Knowledge Graph (KG) Retrieval: The user input is mapped to one or more graph nodes via cosine similarity search in the BioSimCSE–BioLinkBERT embedding space. The Explorer then retrieves a subgraph by traversing neighbors of those nodes up to a configurable hop depth and filtering on a chosen set of relation types. The resulting subgraph is summarized by an LLM.
- b) Automated Literature Retrieval: The system formulates Boolean PubMed queries from keywords, hypotheses, and critiques using TIAB/MeSH fields and temporal filters. If too little literature is retrieved, the constraints are recursively relaxed. Abstracts and metadata are condensed into a literature summary.
- c) Scoring Over Combined Evidence: The Critic agent produces a total score for each hypothesis from the combined KG and literature evidence under default settings, with each scoring term derived from four underlying components: novelty, relevance, significance, and verifiability.
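The KG retrieval channel (a) above can be sketched in two steps: nearest-node lookup by cosine similarity, then a breadth-first traversal bounded by hop depth and restricted to allowed relation types. This is a minimal illustration on toy data. The function names, the edge-triple layout, and the choice to treat edges as undirected are assumptions rather than details from the source, and the embeddings here are hand-written vectors, not BioSimCSE–BioLinkBERT outputs.

```python
import math
from collections import deque
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_nodes(query_vec, node_vecs: Dict[str, List[float]], top_k: int = 1) -> List[str]:
    """Return the top_k node names ranked by cosine similarity to the query."""
    ranked = sorted(node_vecs, key=lambda n: cosine(query_vec, node_vecs[n]), reverse=True)
    return ranked[:top_k]


def k_hop_subgraph(edges: List[Edge], seeds: List[str], max_hops: int,
                   relations: Set[str]) -> List[Edge]:
    """Collect edges within max_hops of the seeds, keeping allowed relations only."""
    adjacency: Dict[str, List[Edge]] = {}
    for head, rel, tail in edges:
        if rel in relations:  # relation-type filter
            adjacency.setdefault(head, []).append((head, rel, tail))
            adjacency.setdefault(tail, []).append((head, rel, tail))
    depth = {s: 0 for s in seeds}
    queue = deque(seeds)
    kept: List[Edge] = []
    seen: Set[Edge] = set()
    while queue:
        node = queue.popleft()
        if depth[node] >= max_hops:
            continue  # do not expand beyond the hop limit
        for edge in adjacency.get(node, []):
            if edge not in seen:
                seen.add(edge)
                kept.append(edge)
            head, _, tail = edge
            neighbor = tail if head == node else head
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return kept


# Toy graph and embeddings (placeholders, not real biomedical data).
edges = [
    ("geneA", "interacts_with", "geneB"),
    ("geneB", "interacts_with", "geneC"),
    ("geneA", "associated_with", "diseaseX"),
    ("geneC", "interacts_with", "geneD"),
]
node_vecs = {"geneA": [1.0, 0.0], "geneB": [0.0, 1.0], "diseaseX": [0.7, 0.7]}
seeds = nearest_nodes([0.9, 0.1], node_vecs)
subgraph = k_hop_subgraph(edges, seeds, max_hops=2, relations={"interacts_with"})
```

In a deployed system the traversal would more likely be pushed down to the graph database as a query. The in-memory version only shows how hop depth and relation filtering bound the retrieved subgraph.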
3. Iterative Feedback, Scoring, and Refinement
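Each hypothesis is scored against four metrics (novelty, relevance, significance, and verifiability), each mapped to a discrete value, and the per-metric scores are combined into a total score. If the total falls below a quality threshold, the Reviewer prompts targeted evidence expansions. In the refinement step, the Refiner then revises the hypothesis using the Reviewer's feedback and the newly retrieved evidence.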
This scoring-driven, agent-mediated loop improves groundedness and novelty in hypotheses across successive refinements.
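As a rough illustration of the scoring step, the sketch below combines four per-metric scores into a total and flags hypotheses that fall below a threshold. The weighted-sum aggregation, the equal default weights, the integer score values, and the threshold used in the example are all assumptions made for illustration. The source does not state these values here.

```python
from typing import Dict, List, Optional

METRICS = ("novelty", "relevance", "significance", "verifiability")


def total_score(scores: Dict[str, int],
                weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted sum of the four metric scores (equal weights by assumption)."""
    weights = weights or {m: 1.0 for m in METRICS}
    missing = [m for m in METRICS if m not in scores]
    if missing:
        raise ValueError(f"missing metric scores: {missing}")
    return sum(weights[m] * scores[m] for m in METRICS)


def needs_expansion(scored: Dict[str, Dict[str, int]], threshold: float) -> List[str]:
    """Return ids of hypotheses whose total score is below the threshold."""
    return [hid for hid, s in scored.items() if total_score(s) < threshold]


# Hypothetical critic output for two hypotheses.
scored = {
    "h1": {"novelty": 4, "relevance": 5, "significance": 4, "verifiability": 3},
    "h2": {"novelty": 2, "relevance": 4, "significance": 2, "verifiability": 3},
}
flagged = needs_expansion(scored, threshold=14)
```

Only the flagged hypotheses would be passed on for evidence expansion and revision. The rest are left unchanged for that cycle.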
4. Temporal and Comparative Evaluation
BioDisco introduces a temporal evaluation protocol that strictly constrains all evidence (LLM knowledge cutoff, KG version, PubMed publication dates) to a fixed cutoff date, then measures alignment with discoveries made after that date. Two key tasks are employed:
- Semantic Similarity: Each generated hypothesis is compared to its gold-standard counterpart using cosine similarity in embedding space. On the Qi dataset, generated/gold pairs yield a higher median similarity than negative controls.
- Relation Classification: Using TruthHypo (post-2024 entity pairs), a classifier maps each entity pair to a relation label. Macro F₁ and accuracy are both 0.85 on 300 held-out samples, with precision and recall per category shown below:
| Category | Precision | Recall |
|---|---|---|
| Chemical–Gene | 0.90 | 0.90 |
| Disease–Gene | 0.82 | 0.82 |
| Gene–Gene | 0.83 | 0.83 |
A plausible implication is that BioDisco's evidence integration and iterative cycle yield predictions more aligned with subsequent scientific findings than ablated architectures.
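Two pieces of the evaluation above lend themselves to a short worked example: restricting evidence to a cutoff date, and recomputing macro F₁ from the per-category precision and recall in the table. The function names and the record layout are hypothetical. The unweighted mean of per-category F₁ is the usual definition of macro F₁ and is assumed to be the one intended here.

```python
from datetime import date
from typing import Dict, List, Tuple


def before_cutoff(records: List[Tuple[str, date]], cutoff: date) -> List[str]:
    """Keep only record ids published on or before the cutoff date."""
    return [rid for rid, published in records if published <= cutoff]


def f1(precision: float, recall: float) -> float:
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_f1(per_class: Dict[str, Tuple[float, float]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1(p, r) for p, r in per_class.values()) / len(per_class)


# Precision and recall copied from the table above.
TABLE = {
    "Chemical–Gene": (0.90, 0.90),
    "Disease–Gene": (0.82, 0.82),
    "Gene–Gene": (0.83, 0.83),
}
macro = round(macro_f1(TABLE), 2)

# Hypothetical evidence records with publication dates.
records = [("paper_a", date(2023, 5, 1)), ("paper_b", date(2024, 6, 1))]
visible = before_cutoff(records, cutoff=date(2023, 12, 31))
```

Because precision equals recall in every row, each category's F₁ equals that shared value, and the mean (0.90 + 0.82 + 0.83) / 3 rounds to 0.85, in line with the 0.85 reported above.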
5. Bradley–Terry Paired Comparison and Human Judgement
The Bradley–Terry model is used for system ablation and for comparing human and LLM preferences in pairwise hypothesis assessment. The probability that one system's hypothesis is preferred over another's is modeled on the logit scale as the difference between the two systems' latent abilities plus a term that adjusts for presentation order. The fitted abilities have non-overlapping 95% confidence intervals across the pipeline ablations, indicating that tool integration plus iterative refinement achieves the strongest preference scores for novelty and significance.
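For readers unfamiliar with the model, the sketch below shows one common parameterization of a Bradley–Terry comparison with an order effect: the logit of the preference probability is the ability difference plus an additive term favoring the item shown first. The function name, argument names, and the additive form of the order term are assumptions. The source's exact parameterization is not reproduced here.

```python
import math


def preference_probability(ability_first: float, ability_second: float,
                           order_effect: float = 0.0) -> float:
    """P(first-shown item is preferred) under a Bradley-Terry model.

    The logit is the ability difference plus an additive presentation-order
    term (assumed form).
    """
    logit = ability_first - ability_second + order_effect
    return 1.0 / (1.0 + math.exp(-logit))


# Equal abilities and no order effect give an even split.
p_equal = preference_probability(1.0, 1.0)
# A stronger system shown first is preferred more often than not.
p_stronger = preference_probability(2.0, 1.0)
# With equal abilities, a positive order effect tilts toward the first item.
p_order = preference_probability(1.0, 1.0, order_effect=0.3)
```

Fitting the abilities from observed pairwise judgements is a logistic regression problem. Only the forward probability is shown here.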
Human evaluation by domain experts in cardiovascular and immunology research, quantified via a Bayesian polytomous Rasch model, further indicates improvement in hypothesis ratings, especially for novelty and significance, after the refinement cycle.
6. Experimental Evidence and Ablation Results
Empirical results demonstrate that BioDisco (i) substantially improves semantic similarity to ground-truth hypotheses over unrelated controls and (ii) achieves high F₁ on held-out relation classification. Ablation studies across four metrics (novelty, significance, relevance, verifiability) and five system configurations show monotonic increases in ability scores from single-LLM baselines through multi-agent, tooling, and iterative refinement stages, with the full BioDisco pipeline performing best. Tool access (KG + literature) provides greater benefit than iterative refinement alone, and the two are synergistic.
7. System Modularity, Deployment, and Usage
BioDisco is available as a Python library with full modularity, allowing researchers to specify custom LLMs, connect to distinct biomedical KGs (via Neo4j, GraphQL, etc.), and substitute literature sources through an interchangeable retrieval API. Generation and refinement of hypotheses are accessible via a minimal code interface.
The system's architecture, agent separation, and pluggability enable rapid adaptation to new biomedical domains and evidence sources with minimal code changes. This suggests tractability for integration into customized or larger-scale scientific discovery workflows (Ke et al., 2 Aug 2025).
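The kind of pluggability described above is often achieved with structural typing, where any object exposing the expected method can serve as a backend. The sketch below illustrates that design idea only. `LiteratureSource`, `InMemorySource`, and `gather_evidence` are hypothetical names and do not reproduce BioDisco's actual interface.

```python
from typing import Iterable, List, Protocol


class LiteratureSource(Protocol):
    """Hypothetical contract: any retrieval backend exposing search()."""

    def search(self, query: str, limit: int) -> List[str]:
        ...


class InMemorySource:
    """Toy backend that matches abstracts containing every query term."""

    def __init__(self, abstracts: Iterable[str]) -> None:
        self._abstracts = list(abstracts)

    def search(self, query: str, limit: int) -> List[str]:
        terms = query.lower().split()
        hits = [a for a in self._abstracts if all(t in a.lower() for t in terms)]
        return hits[:limit]


def gather_evidence(source: LiteratureSource, query: str, limit: int = 5) -> List[str]:
    """Depends only on the protocol, so backends can be swapped freely."""
    return source.search(query, limit)


# Placeholder abstracts, not real publications.
source = InMemorySource([
    "Gene A expression rises under stress",
    "Pathway B is unrelated to stress",
    "Gene A and pathway B interact",
])
evidence = gather_evidence(source, "gene a", limit=5)
```

A backend querying a remote service would implement the same `search` method, leaving `gather_evidence` and its callers unchanged.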