BioSage: AI for Cross-Disciplinary Research

Updated 30 November 2025
  • BioSage is a compound AI architecture that integrates LLMs, retrieval-augmented generation, and agents to overcome knowledge silos in interdisciplinary research.
  • Its modular design supports summarization, research debate, and brainstorming with transparent, citation-backed evidence logging.
  • Benchmark evaluations show significant performance gains over standard LLM and RAG approaches, validating its cross-domain efficacy.

BioSage is a compound AI architecture designed for cross-disciplinary knowledge retrieval and synthesis, particularly targeting challenges in AI, data science, biomedical, and biosecurity research domains. The system integrates LLMs, retrieval-augmented generation (RAG), and orchestrated specialized agents to facilitate scientific discovery, research collaboration, and evidence-based reasoning. BioSage emphasizes transparency, traceability, and user-centered workflows, supporting activities such as summarization, research debate, and brainstorming across traditionally siloed disciplines, while offering robust benchmarked performance improvements compared to LLM and basic RAG baselines (Volkova et al., 23 Nov 2025).

1. System Architecture

BioSage’s architecture comprises a user-facing front end and a multi-agent backend orchestrator (Figure 1, Figure 2). The front end features a Query UI that captures natural-language questions, presents multi-step agent reasoning, and preserves conversational context. An Interaction DB logs each user turn and agent step, providing transparency and supporting ongoing analysis.

The core system orchestrates a pipeline of specialized agents:

  • Query Planning Agent acts as a semantic router, parsing the user’s question to emit either domain tags (v1) or high-precision search keywords (v2), determining which downstream agents are invoked.
  • Retrieval Agent executes a five-stage workflow: query planning, domain-evidence gathering via RAG, text selection, latent knowledge probing, and cross-domain synthesis, resulting in citation-backed answers (Figure 3).
  • Translation Agents (Figure 4) facilitate cross-domain reasoning both via persistent memory for terminological mappings and via explicit agent-to-agent dialogue that bridges conceptual gaps between disparate fields.
  • Reasoning Agents (Figures 7 and 8) implement a hierarchical approach: macro-reasoning (e.g., evidentiary thread integration), micro-reasoning (stepwise logic with self-correction), and a metacognitive layer that refines strategies dynamically.

The system is modular, supporting fan-out queries across separate domain indices (biology, AI/ML, biosecurity), with RAG built on hybrid semantic and keyword search (LlamaIndex). Embeddings are computed using sentence-transformers/all-MiniLM-L6-v2, and all retrieval operations are indexed via OpenSearch.
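
As a concrete illustration (not the paper's code), a per-domain index might be populated along the following lines, assuming the opensearch-py client with the k-NN plugin enabled; the index and field names are hypothetical:

from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def index_domain(domain, docs):
    # One index per domain (e.g., "biosage-biology") supports fan-out queries.
    index_name = f"biosage-{domain}"
    client.indices.create(index=index_name, body={
        "settings": {"index.knn": True},
        "mappings": {"properties": {
            "text": {"type": "text"},
            # all-MiniLM-L6-v2 produces 384-dimensional embeddings.
            "embedding": {"type": "knn_vector", "dimension": 384},
        }},
    })
    for doc in docs:
        client.index(index=index_name, body={
            "text": doc,
            "embedding": model.encode(doc).tolist(),
        })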

A dedicated ethics and safety module enforces dual-use risk assessment and automated guardrails during agent interactions.

2. Algorithms and Agent Methodologies

BioSage’s agent workflows are grounded in explicit prompt templates (Appendix A.1) and algorithmic pipelines:

  • Query Planning utilizes two prompt-template versions: v1 emits domain tags; v2 emits search keywords. The v2 pseudocode is:

import json

def plan_query_v2(question):
    # Fill the v2 template, then ask the LLM for JSON whose "keywords"
    # field drives retrieval (fill_template and LLM are pseudocode stand-ins).
    prompt = fill_template(v2_template, question)
    response = LLM(prompt)
    return json.loads(response)["keywords"]
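
For instance, given the question “How could attention patterns in protein language models inform biosecurity screening?”, the v2 planner might return keywords such as ["protein language models", "attention patterns", "biosecurity screening"]; both the query and the output here are illustrative, not taken from the paper.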

  • Retrieval Scoring computes embedding similarity with cosine distance:

\text{score}(q, d) = \cos(\mathbf{e}_q, \mathbf{e}_d), \quad \text{where } \mathbf{e}_q, \mathbf{e}_d \in \mathbb{R}^n

These scores are combined with term-based TF-IDF scores via weighted sum:

s_{\text{hybrid}} = \alpha\, s_{\text{semantic}} + (1 - \alpha)\, s_{\text{tfidf}}
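
A minimal sketch of this hybrid scorer, assuming scikit-learn and sentence-transformers (the function and the equal default weighting are illustrative, not BioSage's exact implementation):

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def hybrid_scores(query, docs, alpha=0.5):
    # Semantic component: cosine similarity of MiniLM embeddings.
    semantic = cosine_similarity(model.encode([query]), model.encode(docs))[0]
    # Term-based component: TF-IDF cosine similarity over the same corpus.
    tfidf = TfidfVectorizer().fit(docs + [query])
    term = cosine_similarity(tfidf.transform([query]), tfidf.transform(docs))[0]
    # Weighted sum, per the hybrid formula above.
    return alpha * semantic + (1 - alpha) * term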

  • Agent Orchestration employs the Query Planner as a semantic router, dispatching queries to Retrieval, Translation, and Reasoning agents. Each agent’s input, output, sequence of actions, and confidence scores are logged in the Interaction DB. Agents may re-enter execution loops until the synthesized evidence is judged complete; a minimal illustration of this loop appears after the synthesis snippet below.
  • Response Synthesis uses a “Synthesist” LLM prompt that merges multi-domain notes, resolves contradictions, and formats the result as JSON:

# Gather every piece of evidence logged by the upstream agents, then have
# the Synthesist LLM merge it into a single citation-backed answer.
evidence_list = collect_all_evidence()
synthesis_prompt = fill_template(synthesist_template, question, evidence_list)
final_answer = LLM(synthesis_prompt)
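
Putting the pieces together, the orchestration loop referenced above might look like the following sketch, where retrieval_agent, translation_agent, and evidence_is_complete are hypothetical stand-ins for the agents described in Section 1:

def answer(question, max_rounds=3):
    # Route the question to high-precision keywords (v2 planner).
    keywords = plan_query_v2(question)
    evidence = []
    for _ in range(max_rounds):
        # Fan out to the domain indices, then bridge terminology gaps.
        evidence += retrieval_agent(keywords)
        evidence += translation_agent(question, evidence)
        # Stop once the gathered evidence is judged sufficient.
        if evidence_is_complete(question, evidence):
            break
    synthesis_prompt = fill_template(synthesist_template, question, evidence)
    return LLM(synthesis_prompt)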

3. User Workflows and Interaction Design

BioSage is engineered around direct integration with scientific workflows and user transparency (Section 3.1). Three primary workflows are supported (Figure 1):

  • Summarization: condenses hundreds of papers into concise, citation-backed digests.
  • Research Debate: surfaces contradictory evidence, promoting critical engagement with alternative hypotheses.
  • Brainstorming: promotes hypothesis generation by linking findings from divergent research fields.

Human–agent interaction is made transparent through visible tool invocation events (e.g., “Invoking Translation Agent for biochem→ML gap”) and structured JSON outputs to facilitate programmatic parsing by downstream analysis tools and scripts. Interaction DB logging supports iterative user experience refinement.
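
A hypothetical shape for such a structured output (the schema below is illustrative, not the paper’s actual format) might be:

{
  "answer": "…",
  "citations": [{"doc_id": "…", "domain": "biology", "score": 0.83}],
  "agent_trace": [{"agent": "Translation", "event": "biochem→ML gap bridged"}]
}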

Design principles dictate integration into primary research workflows, maximal transparency through exposure of intermediate rationales, and conversation memory for context continuity. Activity logs feed into ongoing UX improvement cycles.

4. Benchmark Evaluation and Quantitative Outcomes

BioSage’s efficacy is assessed across several scientific QA benchmarks:

| Benchmark | Agent/Model | Baseline (%) | Agent (%) | Absolute Δ (pp) | Relative Gain |
|---|---|---|---|---|---|
| LitQA2 | GPT-4o | 20.2 | 29.6 | +9.4 | 46.5% |
| LitQA2 | Llama 3.1 70B | 13.6 | 29.0 | +15.4 | |
| WMDP | GPT-4o | 69.0 | 92.5 | +23.5 | 34.1% |
| WMDP | Llama 3.1 70B | 71.5 | 87.0 | +15.5 | |
| GPQA | Llama 3.1 70B | 36.0 | 42.2 | +6.2 | |
| HLE-Bio | GPT-4o | 5.9 | 6.3 | +0.4 | |
| Cross-discipline | GPT-4o + RAG | | | +11.1 | |
| Cross-discipline | Retrieval Agent v2 | | | +5.0 | |

Performance improvements of 13–21% over vanilla LLM and RAG approaches are consistently documented for both Llama 3.1 70B and GPT-4o across the four main tasks [(Volkova et al., 23 Nov 2025), Figure 5, Figure 6].

Causal investigation of agent interventions (Section 4.4, Figure 7) via NOTEARS [Zheng et al. 2018] and CausalNex [QuantumBlack 2020] revealed substantive effects. Addition of the Retrieval Agent increased Type–Token Ratio (TTR) by +0.16, Noun Ratio by +0.11, SMOG Index (readability) by +2.26, and WMDP performance by +0.22 absolute. Vanilla RAG produced mixed effects: TTR improved (+0.24) but overall net performance marginally decreased (−0.02), supporting the design rationale for agent-based orchestration.
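
For reference, the two lexical metrics are straightforward to compute; a minimal sketch assuming the textstat package (tokenization here is naive whitespace splitting):

import textstat

def type_token_ratio(text):
    # TTR = unique tokens / total tokens; higher means more lexical variety.
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

answer_text = "..."  # an agent-generated answer
print(type_token_ratio(answer_text), textstat.smog_index(answer_text))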

5. Specialized Agents and Cognitive Layers

The Translation Agents provide two modes: (i) persistent memory, where conceptual mappings are stored in a single agent’s memory, and (ii) multi-agent dialogue, where “out-of-domain” and “in-domain” experts negotiate explicit mappings. This mechanism operationalizes semantic alignment between disparate research subfields (Figure 4).
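
A hypothetical skeleton of the persistent-memory mode, in which negotiated term mappings are cached across queries (the class and prompt are illustrative, not from the paper):

class TranslationMemory:
    """Caches cross-domain term mappings so each is negotiated only once."""

    def __init__(self):
        self._mappings = {}  # (source_domain, target_domain, term) -> mapping

    def translate(self, term, source, target):
        key = (source, target, term)
        if key not in self._mappings:
            # On a cache miss, fall back to an LLM call (or, in mode (ii),
            # an explicit out-of-domain/in-domain agent dialogue).
            self._mappings[key] = LLM(
                f"Explain the {source} term '{term}' for a {target} researcher.")
        return self._mappings[key]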

Reasoning Agents are hierarchically arranged. The macro-reasoning layer integrates multi-source evidence into high-level insights. The micro-reasoning layer performs sequential logical transformations and supports “second-thought” iterations for self-correction. The metacognitive layer dynamically evaluates and refines overall reasoning strategies over iterative interaction cycles (Figures 7, 8).
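
The three layers can be pictured as nested calls; the skeleton below is purely illustrative, with macro_integrate, micro_step, self_check, and metacognitive_refine as hypothetical stand-ins:

def reason(question, evidence):
    # Macro layer: weave multi-source evidence into high-level threads.
    threads = macro_integrate(evidence)
    steps = []
    for thread in threads:
        # Micro layer: stepwise logic with a "second-thought" retry.
        step = micro_step(question, thread)
        if not self_check(step):
            step = micro_step(question, thread, hint=step)
        steps.append(step)
    # Metacognitive layer: evaluate and refine the overall strategy.
    return metacognitive_refine(question, steps)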

Interaction and output data from these agents are logged at every step, enabling transparent auditability and downstream traceability.

6. Current Directions and Future Work

BioSage’s ongoing development aims to extend capability into multimodal scientific data retrieval and reasoning (Section 6):

  • Multimodal RAG: RAG is being enhanced to support ingestion of figures, charts, and tables using large vision–LLMs (LVLMs), drawing on relevant methodologies [Huang et al. 2023, Huang et al. 2024, Zhou et al. 2023].
  • Document–Image Joint Embeddings: Approaches are under development for indexing and querying structured data derived from scientific documents and associated imagery.
  • Benchmark Development: Inspired by MMBench [Liu et al. 2024], MMMU [Yue et al. 2024], and LMMs-Eval [Zhang et al. 2025], plans are advancing to create comprehensive multimodal, cross-disciplinary tasks, incorporating real-world laboratory protocols and bioinformatics charts.
  • User Studies: Human–AI interaction studies are evaluating real scientists using BioSage’s research-centric workflows in laboratory and field settings (Volkova et al., 2025).

A plausible implication is that by integrating these multimodal capabilities with systematic agent orchestration, BioSage could reduce knowledge silos and accelerate translational insights across scientific disciplines.

References to Key Figures, Tables, and Data

  • Figure 1: User-centric design and transparent multi-agent workflow.
  • Figure 2: Backend architecture with semantic routing and agent orchestration.
  • Figure 3: Retrieval Agent workflow.
  • Figure 4: Translation Agent architectures.
  • Figures 5 & 9: Benchmark result plots (LitQA2, WMDP, GPQA, HLE-Bio, cross-disciplinary).
  • Figure 7: Causal effect heatmaps of agent interventions.
  • Figures 7 & 8: Reasoning agent layers.
  • Appendix A.1–A.3: Agent prompt templates and implementation details.

The BioSage compound AI architecture exemplifies the fusion of LLMs, advanced retrieval, agent-based reasoning, and human-centric design, yielding measurable benefit for cross-disciplinary scientific research (Volkova et al., 23 Nov 2025).
