OpenEvidence: Clinical Evidence Synthesis Tools
- OpenEvidence is a dual-system platform for clinical evidence synthesis and meta-analysis, leveraging advanced statistical and natural language approaches.
- It utilizes modular data pipelines, specialized retrieval methods, and interactive interfaces to integrate historical trial data with real-time clinical queries.
- Empirical evaluations demonstrate its superior accuracy and utility over general-purpose LLMs, emphasizing transparent methodology and reproducibility.
OpenEvidence (OE) denotes two distinct but conceptually aligned software systems for clinical evidence synthesis and decision support: (1) an open-source, meta-analytic JASP module for statistical aggregation of randomized trial results (Bartoš et al., 2023), and (2) a high-throughput, web-based clinical question answering platform specialized for point-of-care physician queries (Feng et al., 27 Jun 2026). Both are designed to contextualize new evidence against the historical record, leveraging advanced statistical or natural language modeling. OE is characterized by fit-for-purpose engineering around biomedicine: data pipelines for ingesting and structuring authoritative clinical information, domain-tailored retrieval and modeling, and output interfaces optimized for clinical interpretation and educational use.
1. Software Architecture, Technology Stack, and Data Pipelines
The JASP-based OE system is implemented as the “Cochrane Meta-Analysis” module, relying on a modular architecture (Bartoš et al., 2023):
- Data layer: Two compressed R-serialized (‘.rds’) files contain (i) continuous outcomes and (ii) dichotomous outcomes, each holding ≈1.5 million historical trial-level observations indexed from the Cochrane Database of Systematic Reviews (CDSR).
- Modeling layer: Statistical routines are called via R packages: metafor for classical (frequentist) meta-analysis and metaBMA for Bayesian estimation and Bayes factor calculations, all orchestrated via Rserve and exposed through JASP’s internal module interface.
- Frontend/backend: The interface consists of a JavaScript front end, C++ back end, and embedded R code. Interactivity is governed by dynamic metadata queries for rendering searchable selectors, subgroup toggling, and real-time update of visualizations.
- Data ingestion: CDSR meta-analysis tables (2000–2021) are automatically harvested via NCBI EUtils API, parsed from rm5 XML, and transformed into plain text tables with a reproducible build pipeline. Updates require rerunning ingestion scripts after each new CDSR release, with potential for full automation.
The web/API-based OE clinical QA platform (Feng et al., 27 Jun 2026) features:
- Data ingestion: Web/API interface accepts real physician queries; automated filters strip personal health information (PHI) and non-clinical requests.
- Question normalization: Lightweight, zero-temperature LLM-based rewriting removes identifiers and standardizes queries.
- Specialty tagging: Each question is associated with the submitter’s self-reported medical specialty (from National Provider Identifier registry).
- Evidence retrieval: Domain-specific search and ranking of biomedical literature (PubMed, guidelines, monographs); optional live web search for recency.
- Answer synthesis: A proprietary, deterministic LLM (fixed temperature, seed) generates answers with inline, uniform-format citations from retrieved evidence.
- Logging and delivery: Final answers and metadata are returned to physicians and logged for continuous analytics.
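The stages listed above can be illustrated with a minimal Python sketch. Everything here is hypothetical: the regex patterns, the `Query` and `Answer` types, the term-overlap ranking, and the echo-style `synthesize` step are stand-ins chosen for illustration, not the platform's actual (proprietary) components.

```python
import re
from dataclasses import dataclass, field

# Hypothetical PHI patterns; a production filter would be far more thorough.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN-like numbers
    re.compile(r"\bMRN[:\s]*\d+\b", re.I),     # medical record numbers
]

@dataclass
class Query:
    text: str
    specialty: str

@dataclass
class Answer:
    text: str
    citations: list = field(default_factory=list)

def scrub_phi(text):
    """Replace anything matching a PHI pattern with a redaction token."""
    for pattern in PHI_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def normalize(text):
    """Collapse whitespace (stand-in for the LLM-based rewriting step)."""
    return " ".join(text.split())

def retrieve(query, corpus, k=2):
    """Rank documents by naive term overlap (stand-in for domain retrieval)."""
    terms = set(query.text.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: -len(terms & set(doc["text"].lower().split())),
    )
    return ranked[:k]

def synthesize(query, docs):
    """Stand-in for answer generation: echo evidence with [n] citations."""
    parts = [f"{doc['text']} [{i}]" for i, doc in enumerate(docs, 1)]
    return Answer(" ".join(parts), [doc["id"] for doc in docs])

def answer(raw_text, specialty, corpus):
    """Run the full pipeline: scrub, normalize, retrieve, synthesize."""
    query = Query(normalize(scrub_phi(raw_text)), specialty)
    return synthesize(query, retrieve(query, corpus))
```

The point of the sketch is the ordering: identifiers are removed before any retrieval or generation step sees the query, and every sentence in the answer carries an index into the list of retrieved sources.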
2. Methodological Foundations and Statistical Models
For the JASP/OE module (Bartoš et al., 2023):
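- Classical meta-analysis is implemented using fixed and random effects models. For study $i$ with estimate $y_i$ and variance $v_i$, pooled estimates are:
- Fixed effects: $\hat{\mu}_{\mathrm{FE}} = \sum_i w_i y_i / \sum_i w_i$ with $w_i = 1/v_i$.
- Random effects: $\hat{\mu}_{\mathrm{RE}} = \sum_i w_i^{*} y_i / \sum_i w_i^{*}$ with $w_i^{*} = 1/(v_i + \hat{\tau}^2)$.
- Heterogeneity is assessed using the $Q$-statistic, the DerSimonian–Laird estimator for the between-study variance $\tau^2$, and the $I^2$ index.
- Bayesian meta-analysis is realized via a hierarchical model:
$$y_i \mid \theta_i \sim \mathcal{N}(\theta_i, v_i), \qquad \theta_i \mid \mu, \tau \sim \mathcal{N}(\mu, \tau^2), \qquad \mu \sim \pi(\mu), \quad \tau \sim \pi(\tau)$$
Default priors $\pi(\mu)$ and $\pi(\tau)$ are field-specific (e.g., dedicated priors for dichotomous epilepsy outcomes). Posterior and Bayes factors are computed in metaBMA using bridge sampling.
- Bayesian model averaging (BMA): OE combines fixed and random effect posteriors weighted by their predictive evidence:
$$p(\mu \mid y) = \sum_{m \in \{\mathrm{FE}, \mathrm{RE}\}} p(\mu \mid y, \mathcal{M}_m)\, p(\mathcal{M}_m \mid y)$$
where $p(\mathcal{M}_m \mid y) \propto p(y \mid \mathcal{M}_m)\, p(\mathcal{M}_m)$.
- Sequential updating: New trial results $(y_{k+1}, v_{k+1})$ are incorporated by updating the posterior, which metaBMA computes transparently.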
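As a rough illustration of the classical estimators and the model-averaging weights, the following pure-Python sketch computes inverse-variance pooled estimates, the DerSimonian–Laird between-study variance, Cochran's Q, I², and posterior model probabilities from log marginal likelihoods. It uses the textbook formulas, not the metafor or metaBMA code paths; function names and the example numbers are assumptions made for illustration.

```python
import math

def fixed_effect(y, v):
    """Inverse-variance fixed-effect estimate and its variance."""
    w = [1.0 / vi for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return mu, 1.0 / sum(w)

def dersimonian_laird(y, v):
    """Return (Q, tau2, I2) using the DerSimonian-Laird moment estimator."""
    w = [1.0 / vi for vi in v]
    mu_fe, _ = fixed_effect(y, v)
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)          # truncated at zero
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, tau2, i2

def random_effects(y, v):
    """Random-effects estimate, its variance, and the tau2 used."""
    _, tau2, _ = dersimonian_laird(y, v)
    w_star = [1.0 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    return mu, 1.0 / sum(w_star), tau2

def posterior_model_probs(log_marglik, prior_probs):
    """Posterior model probabilities from log marginal likelihoods."""
    m = max(log_marglik)                   # subtract max for stability
    unnorm = [math.exp(l - m) * p for l, p in zip(log_marglik, prior_probs)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

if __name__ == "__main__":
    # Made-up effect sizes and variances, for illustration only.
    y = [0.10, 0.30, 0.35, 0.65]
    v = [0.04, 0.02, 0.05, 0.01]
    print(fixed_effect(y, v))
    print(random_effects(y, v))
    print(dersimonian_laird(y, v))
```

In a sequential-updating setting, appending a new study to `y` and `v` and rerunning these functions gives the updated classical estimates. The Bayesian counterpart would instead require recomputing marginal likelihoods, which this sketch takes as given inputs.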
For the clinical QA OE platform (Feng et al., 27 Jun 2026):
- Deterministic LLM inference, evidence-grounded synthesis, and systematic citation provision are central to answer reliability and verifiability.
3. User Interface, Workflow, and Visualization
JASP/OE Meta-Analysis Module
- Interface access via JASP’s module menu; parallel analysis panels for “Classical” and “Bayesian” dichotomous/continuous endpoints.
- Workflow: Database selector for meta-analysis retrieval; subgroup toggling; effect size and model selection; configuration of priors (Bayesian mode); manual addition of new studies; export options.
- Visualization: Forest plots (updated interactively), funnel plots (with Egger’s test), prior/posterior densities, cumulative meta-analysis graphs.
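The Egger's test shown alongside the funnel plot can be sketched as an ordinary least-squares regression of the standardized effect $y_i / s_i$ on the precision $1 / s_i$, where $s_i$ is the standard error of study $i$. An intercept far from zero indicates funnel-plot asymmetry. The function below is a minimal stdlib illustration of that classical formulation, not the module's implementation. It returns the t statistic but leaves the p-value lookup (t distribution with n − 2 degrees of freedom) to the caller.

```python
import math

def egger_regression(y, se):
    """Regress y_i/se_i on 1/se_i; return (intercept, intercept SE, t)."""
    z = [yi / si for yi, si in zip(y, se)]   # standardized effects
    x = [1.0 / si for si in se]              # precisions
    n = len(y)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z)) / sxx
    intercept = zbar - slope * xbar
    resid = [zi - (intercept + slope * xi) for xi, zi in zip(x, z)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)          # residual variance
    se_int = math.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    t = intercept / se_int if se_int > 0 else float("nan")
    return intercept, se_int, t
```

At least three studies are needed for the residual variance to be defined, and the test is known to have low power with few studies, so the result is best read together with the funnel plot itself.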
OE Clinical QA Platform
- Workflow: Free-text clinical query entry; automatic specialty classification; evidence retrieval and ranking; deterministic answer generation; inline citation presentation.
- Design emphasis: The workflow suggests a user experience tailored toward clinical efficiency and clearly documented sources.
4. Evaluation Methodology and Empirical Results
A large-scale expert evaluation compared OE against state-of-the-art general-purpose models (GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.8) using 620 real-world point-of-care queries (“Real-POCQi”) and 187 HealthBench questions (Feng et al., 27 Jun 2026). Key features:
- Evaluator pool: 149 actively practicing U.S. physicians across 30 specialties and 36 states; each rated only specialty-matched questions.
- Blind, pairwise comparisons of text-only and text+citation answers; systems pseudonymized per query.
- Scoring axes: Accuracy, clinical utility, source quality, verifiability, completeness (all 5-point preference scales).
- Win-difference metric: the share of pairwise comparisons won minus the share lost, in percentage points, with 95% CIs via cluster bootstrapping.
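A win difference with a cluster-bootstrap confidence interval can be sketched as follows. The sketch makes several assumptions: each 5-point preference rating has already been collapsed to +1 (win), 0 (tie), or −1 (loss); the win difference is 100 × (wins − losses) / n; and the cluster is the rating physician. The study's exact definitions may differ.

```python
import random
from collections import defaultdict

def win_difference(outcomes):
    """Percentage-point win difference for outcomes coded +1, 0, -1."""
    return 100.0 * sum(outcomes) / len(outcomes)

def cluster_bootstrap_ci(ratings, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI that resamples whole clusters with replacement.

    ratings: list of (cluster_id, outcome) pairs.
    """
    by_cluster = defaultdict(list)
    for cluster_id, outcome in ratings:
        by_cluster[cluster_id].append(outcome)
    clusters = list(by_cluster)
    rng = random.Random(seed)              # fixed seed for repeatability
    stats = []
    for _ in range(n_boot):
        sample = []
        for cluster_id in rng.choices(clusters, k=len(clusters)):
            sample.extend(by_cluster[cluster_id])
        stats.append(win_difference(sample))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Resampling physicians rather than individual ratings keeps each rater's judgments together, so the interval reflects between-rater variability instead of treating correlated ratings as independent.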
OE achieved positive one-vs-rest win differences on all axes in the primary analysis; every 95% confidence interval lies above zero:
- Accuracy: 95% CI [18.4, 30.8]
- Clinical utility: 95% CI [21.5, 37.4]
- Source quality: 95% CI [31.7, 45.8]
- Completeness: 95% CI [23.0, 38.6]
- Verifiability: 95% CI [19.5, 32.8]
General-purpose LLMs never outperformed OE and, when their answers were shorter than comparators, had negative win differences. Sensitivity analyses stratified by citations, question type (authentic vs. exam-style), answer length, and user status produced consistent results.
LLM-generated ratings (“LLM judges”) agreed that OE was best overall but produced divergent rankings of the other LLMs and displayed judge bias and overconfidence. Ensemble LLM scoring did not resolve discrepancies with human specialist judgments.
5. Implementation, Extensibility, and Reproducibility
JASP/OE Module
- Open-source: Source code and build scripts at https://github.com/jasp-stats/jaspCochrane. Licensed under GNU Affero GPL v3.
- Installation: Requires JASP (≥0.17.1), R (≥4.0), necessary packages (metafor, metaBMA, BayesFactor, Rserve), and periodically updated ‘.rds’ files.
- Extensibility: Users may add new trials either interactively or by editing raw data files, specify custom priors, export data/plots, and fork/extend the module with new LMMs or routine types (e.g., network meta-analysis).
- Reproducibility: Analysis steps, from selection to output, are transparent and easily replicable.
OE Clinical QA Platform
- Full deployment details are proprietary, but high-level architecture is public. The Real-POCQi benchmark, response datasets, and evaluation code are released for research use.
6. Strengths, Limitations, and Future Directions
Strengths
- Evidence contextualization: Both systems support rapid incorporation and contextualization of new studies within the existing literature, with robust visual and quantitative diagnostics (Bartoš et al., 2023).
- Clinical relevance: Directly targets the realities of clinical decision making by optimizing for real, point-of-care queries and leveraging specialty-matched retrieval and synthesis (Feng et al., 27 Jun 2026).
- Open workflow: JASP module is entirely free/open-source and programmatically extensible.
- Educational transparency: Both systems emphasize reproducible, step-by-step interfaces for exploration and learning.
Limitations
- Scope: JASP module is limited to Cochrane-reviewed trials (≈16,000/1.5 million candidates); non-Cochrane systematic reviews are omitted due to formatting heterogeneity. Analysis is limited to univariate continuous/dichotomous outcomes (Bartoš et al., 2023).
- Updating: No fully automated database pipeline; rebuild scripts require periodic manual execution.
- Clinical QA platform: System details are partly proprietary; the public code covers only benchmarking, not the inference engine.
Future Directions
- Workflow automation: Semi-automated extraction from PDFs or web sources, automated quality scoring, fully automated updating pipeline (Bartoš et al., 2023).
- Methodological expansion: Planned support for network meta-analysis and multivariate models.
- Cloud deployment: Targeted for collaborative, real-time usage without local installation.
- Publication bias integration: GUI options for selection models, p-curve adjustments.
- Benchmark advancement: Continued release and extension of real-world clinical query datasets and direct, specialist-graded evaluation (Feng et al., 27 Jun 2026).
7. Significance and Implications
OpenEvidence, in both its open-source meta-analytic and specialized clinical QA instantiations, exemplifies domain-specific adaptation for medical evidence synthesis and decision support. Empirical evaluation shows substantial performance gains of the specialized OE pipeline over general-purpose LLMs in accuracy, utility, source quality, verifiability, and completeness when tested on real-world queries by domain experts (Feng et al., 27 Jun 2026). These findings underscore the importance of specialty-aware information retrieval, deterministic evidence-grounded answer synthesis, and reproducible, contextually aligned evaluation frameworks in the development and assessment of clinical AI systems. The open release of both analytic and evaluation components is positioned to facilitate ongoing methodological advances and benchmarking in clinical evidence synthesis (Bartoš et al., 2023, Feng et al., 27 Jun 2026).