The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review

Published 6 Sep 2024 in cs.DL and cs.AI | arXiv:2409.04600v1

Abstract: Objective: This study aims to summarize the usage of LLMs in the process of creating a scientific review. We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field. Materials and Methods: The search was conducted in June 2024 in the PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on that uses OpenAI's GPT-4o model. ChatGPT was used to clean the extracted data and generate code for the figures in this manuscript; ChatGPT and Scite.ai were used in drafting all components of the manuscript except the methods and discussion sections. Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLMs emerged as the dominant architecture for review automation (n=126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used LLMs during their creation. Most studies focused on automating a particular stage of the review, such as searching for publications (n=60, 34.9%) and data extraction (n=54, 31.4%). When comparing the pooled performance of GPT-based and BERT-based models, the former were better at data extraction, with mean precision of 83.0% (SD=10.4) and mean recall of 86.0% (SD=9.8), while being slightly less accurate at the title and abstract screening stage (mean accuracy 77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results look promising, and we anticipate that LLMs will change the way scientific reviews are conducted in the near future.

Explain it Like I'm 14

What this paper is about

This paper looks at how LLMs—computer programs like ChatGPT that read and write—can help scientists do “systematic reviews.” A systematic review is a careful, step‑by‑step summary of what many studies say about the same topic. The authors didn’t just talk about LLMs—they used LLM tools themselves to run a full review of research on using LLMs in reviews.

The main questions the paper asks

The authors wanted to find out:

  • Which parts of a review can LLMs help automate?
  • Which LLMs are being used most?
  • How well do these tools perform on different review tasks?
  • How much time can they save, and what are the limits?
  • How many papers actually used LLMs to create a real review, not just test a small piece?

How the study was done (in everyday language)

Think of a systematic review like an assembly line for sorting and summarizing lots of studies. It has stages, such as:

  • Searching for studies
  • Screening titles and abstracts (the short summary)
  • Reading full texts
  • Extracting data (pulling out key facts)
  • Checking quality and bias
  • Summarizing findings
  • Writing and making figures

Here’s what the team did:

  • They searched four big databases (PubMed, Scopus, Dimensions, Google Scholar) in June 2024 to find papers about using LLMs in reviews.
  • They put all the papers into a platform called Covidence, which helps manage review steps.
  • They added their own LLM plug-in to Covidence using OpenAI’s GPT‑4o model. This plug-in helped with screening and data extraction.
  • To make the LLM more reliable, they used “calibration” (testing and adjusting prompts on small samples) and “majority vote” (running the model three times and choosing the most common answer); a small sketch of this appears after this list.
  • Human reviewers worked alongside the LLM for checks: two humans set rules and tested prompts; humans verified full‑text results and corrected the LLM’s mistakes when needed.
  • They measured performance with simple stats:
    • Accuracy: How often the tool was right.
    • Precision: Of the studies it chose, how many were actually correct.
    • Recall: Of all the correct studies, how many did it find.
  • They also tracked time saved and costs.
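
To make these ideas concrete, here is a minimal Python sketch (not the authors' code, which the paper does not publish) of the vote-and-score step: three screening runs per abstract are reduced by majority vote, and the resulting decisions are scored with the accuracy, precision, and recall definitions above. The labels and sample data are illustrative.

```python
from collections import Counter

def majority_vote(run_labels):
    """Reduce three independent LLM screening decisions to one final label."""
    return Counter(run_labels).most_common(1)[0][0]

def screening_scores(predicted, truth):
    """Accuracy, precision, and recall, treating 'include' as the positive class."""
    tp = sum(p == t == "include" for p, t in zip(predicted, truth))
    fp = sum(p == "include" and t == "exclude" for p, t in zip(predicted, truth))
    fn = sum(p == "exclude" and t == "include" for p, t in zip(predicted, truth))
    tn = sum(p == t == "exclude" for p, t in zip(predicted, truth))
    return {
        "accuracy": (tp + tn) / len(truth),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Three hypothetical runs for two abstracts, then scoring against human labels:
runs = [["include", "exclude", "include"], ["exclude", "exclude", "include"]]
decisions = [majority_vote(r) for r in runs]            # ['include', 'exclude']
print(screening_scores(decisions, ["include", "include"]))
```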

What they found and why it matters

Key results explained simply:

  • Out of 3,788 papers found, 172 met the rules for inclusion.
  • ChatGPT/GPT‑based models were used most (about 73% of the time). BERT‑based models were second (about 19%).
  • Most projects automated just one stage of the review—not the whole process. The most common stages automated were:
    • Searching for publications (35%)
    • Data extraction (31%)
    • Title/abstract screening (25%)
    • Summarizing evidence (19%)
  • Only 26 papers (about 15%) were actual reviews that used LLMs during their creation. Most were “methods papers” showing how a piece of automation could work.

Performance highlights:

  • Data extraction: GPT models did better than BERT overall.
    • GPT precision ≈ 83% (picked correct items most of the time)
    • GPT recall ≈ 86% (found most of the correct items)
  • Title/abstract screening: BERT had slightly higher accuracy than GPT (≈ 81% vs. ≈ 77%), but differences depended on the specific task.
  • Geography: Most studies came from the US, then Australia, the UK, China, and Germany.
  • Opinions: Most authors were positive about using LLMs (about 70%), with some mixed views and a few negative.
  • Quality: About 42% of studies were rated high quality based on simple criteria (like journal impact factor and reporting metrics).
  • Time and cost: Using LLMs saved an estimated 334 person‑hours with about $500 in compute costs.

Why this matters:

  • These results show LLMs can make reviews faster and help with tough, time‑consuming parts—especially pulling out key facts (data extraction).
  • Even with promising results, humans still need to supervise to catch errors and ensure trust.

What this means for the future

Simple takeaways:

  • LLMs will likely change how scientists do reviews by speeding up dull, repetitive tasks.
  • They may help “democratize” reviews—making them easier for more people and teams to run.
  • Full end‑to‑end automated reviews are still rare but coming.
  • Ethical and quality concerns remain: LLMs can be convincing even when wrong, so transparency (disclosing LLM use and prompts) and human oversight are important.
  • For now, LLMs look especially suited to scoping reviews and parts of systematic reviews; more complex meta‑analyses (with numbers and statistics) need extra care.

Bottom line

LLMs like ChatGPT can be powerful helpers in scientific reviews—especially for finding and extracting information—saving time and effort. While they’re not perfect and still need human guidance, they’re already making a positive difference and are likely to become a regular part of how science organizes and uses evidence.

Practical Applications

Overview

This paper systematically assesses how LLMs are being used to automate stages of scientific reviews and demonstrates an end-to-end, domain-agnostic workflow integrated into Covidence using GPT-4o. Key findings include:

  • Strong performance and broad adoption of GPT-based models for data extraction (mean precision ≈83%, recall ≈86%), with BERT-based models slightly better for title/abstract screening accuracy.
  • A practical, calibrated, and auditable human-in-the-loop workflow (prompt testing, three-pass majority vote, escalation rules) saving ~334 person-hours at ~$500 compute cost.
  • Immediate utility in searching for publications, screening, data extraction, and drafting/summarization, with cautious use for full meta-analysis.
  • Growing acceptance and public funding, plus an emerging need for transparency and ethical standards (e.g., PRISMA extensions for LLM-assisted reviews).

Below are actionable applications derived from these findings, categorized by deployment horizon.

Immediate Applications

These can be operationalized with current tools, models, and workflows demonstrated in the paper (e.g., GPT-4o, Covidence add-on, Scite.ai, NotebookLM), with human oversight.

  • LLM Co-Reviewer for Title/Abstract Screening (Academia, Healthcare, Publishing)
    • Use case: Rapid screening of thousands of citations with human calibration and majority voting (three-run inference), escalating “unclear” cases to humans.
    • Tools/workflows: Covidence LLM plugin; prompt calibration on 30 abstracts; 100-article consensus benchmarking; ensemble (GPT/BERT) pairing; majority vote + escalation.
    • Assumptions/dependencies: Access to LLM APIs; robust prompt design; adherence to PRISMA; defined thresholds for reconciliation of false positives; data privacy policies.
  • LLM-Assisted Full-Text Data Extraction (Academia, HTA Agencies, Pharma, Environmental Science)
    • Use case: Structured extraction (author/year, review type, model, metrics, sample sizes, funding) from PDFs with precision checks and human verification for low-accuracy fields.
    • Tools/workflows: GPT-4o extraction templates; precision thresholds (<80% triggers manual validation); R data.frame outputs; OCR when needed. A threshold-gating sketch follows this list.
    • Assumptions/dependencies: High-quality and accessible PDFs; domain-relevant extraction schemas; validation datasets; governance for data reliability.
  • AI-Enhanced Search Strategy and Publication Discovery (Librarianship, R&D, Policy)
    • Use case: Query expansion, synonym/keyword generation, translation, and initial triage of results to increase recall without sacrificing precision.
    • Tools/workflows: LLM-generated search strings; integrations with PubMed/Scopus/Dimensions APIs; de-duplication pipelines. A query sketch follows this list.
    • Assumptions/dependencies: API access and terms of use; monitoring for query drift; evaluation against gold standards; reproducible logs.
  • Evidence Summarization and Drafting Co-Pilot (Academia, Healthcare, Policy Briefs)
    • Use case: Draft narrative sections, summary tables, and evidence syntheses to accelerate review writing, followed by expert edits.
    • Tools/workflows: Scite.ai and NotebookLM for source-grounded drafting; GPT for narrative synthesis; citation tracking and consistency checks.
    • Assumptions/dependencies: Human verification to mitigate factual inconsistency or overconfident statements; clear provenance; disclosure of LLM involvement.
  • Code and Plot Generation for Review Figures (Academia, Publishers, Data Teams)
    • Use case: Generate reproducible R code for figures (PRISMA flows, model performance plots) to reduce clerical overhead.
    • Tools/workflows: GPT-generated code snippets with human validation; version control; reproducible environments.
    • Assumptions/dependencies: Correct environment setup; code review; reproducibility auditing.
  • Multi-Model Stage Pairing Pipeline (Software Vendors, Research Offices)
    • Use case: Optimize pipeline by pairing BERT-based models for screening and GPT-based models for extraction/summarization based on observed strengths.
    • Tools/workflows: Orchestrators (Python/R) to route tasks per model; performance tracking dashboards; stage-specific prompts.
    • Assumptions/dependencies: Model availability; cost management; documented performance thresholds and fallback rules.
  • Time-Cost ROI Adoption for Review Units (Research Consortia, CROs, HTA Agencies)
    • Use case: Institutional deployment of the calibrated workflow to realize ~300+ person-hours saved per review at modest compute cost (~$500 as reported).
    • Tools/workflows: ROI calculators; standard operating procedures; training modules; audit trails.
    • Assumptions/dependencies: Adequate staff training; API budgets; leadership buy-in; oversight structures.
  • Transparency and Disclosure Templates (Journals, Universities, Funders)
    • Use case: Standardize reporting of LLM usage (prompts, sections affected, tools used) to meet ethical and compliance expectations.
    • Tools/workflows: PRISMA-LLM or PRISMA-FLLM checklists; manuscript disclosure templates; OSF protocol registration.
    • Assumptions/dependencies: Journal acceptance; policy alignment; culture of transparency.
  • Reviewer Training and Capacity Building (Education, Libraries)
    • Use case: Incorporate calibrated LLM workflows into methods courses, librarian training, and doctoral seminars to democratize evidence synthesis skills.
    • Tools/workflows: Sandbox projects; prompt libraries; benchmarking exercises; human-in-the-loop QA practices.
    • Assumptions/dependencies: Access to LLMs; curriculum updates; institutional licenses.
  • Quality Assurance Workflow for AI-Assisted Reviews (All Sectors)
    • Use case: Implement precision/recall thresholds, consensus benchmarks, and escalation pathways across stages (screening, extraction, synthesis).
    • Tools/workflows: Benchmarks (e.g., 100 abstracts, 30 full texts); category-specific precision thresholds; audit tools; inter-rater reliability tracking.
    • Assumptions/dependencies: Reference datasets; governance; consistent metric reporting; periodic recalibration.
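
Two of the applications above lend themselves to small illustrations. First, the data-extraction item's "<80% precision triggers manual validation" rule can be expressed as a simple routing gate. This is a hedged sketch, not the paper's code; the field names echo the extraction schema described above, while the benchmark scores are invented.

```python
# Fields whose benchmarked precision falls below the threshold are routed to a
# human pass instead of being accepted straight from the LLM.
EXTRACTION_FIELDS = ["author_year", "review_type", "model", "metrics",
                     "sample_size", "funding"]
PRECISION_THRESHOLD = 0.80

def route_fields(benchmark_precision):
    """Per-field decision: accept LLM output or escalate to manual validation."""
    return {
        field: ("accept"
                if benchmark_precision.get(field, 0.0) >= PRECISION_THRESHOLD
                else "manual_validation")
        for field in EXTRACTION_FIELDS
    }

# Precision measured on a small calibration set, then applied corpus-wide:
scores = {"author_year": 0.97, "review_type": 0.88, "model": 0.91,
          "metrics": 0.74, "sample_size": 0.69, "funding": 0.85}
print(route_fields(scores))
```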
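
Second, for the search-strategy item, the discovery step can be sketched against PubMed's public E-utilities API: synonym lists (which an LLM would generate in the paper's workflow; the ones below are hand-written placeholders) are OR-joined into a boolean query and submitted to the esearch endpoint.

```python
import json
import urllib.parse
import urllib.request

# Placeholder synonym sets standing in for LLM-generated query expansion.
concept_a = ["large language model", "LLM", "GPT-4", "ChatGPT"]
concept_b = ["systematic review", "literature review"]
query = f"({' OR '.join(concept_a)}) AND ({' OR '.join(concept_b)})"

params = urllib.parse.urlencode(
    {"db": "pubmed", "term": query, "retmode": "json", "retmax": 50})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"

with urllib.request.urlopen(url) as resp:  # network access required
    result = json.load(resp)["esearchresult"]

print(result["count"], "hits; first PMIDs:", result["idlist"][:5])
```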

Long-Term Applications

These require further research, scaling, validation, and/or policy development before routine deployment.

  • End-to-End Semi/Fully Automated Systematic Reviews Including Meta-Analysis (HTA, Guideline Bodies, Global Health)
    • Use case: Continuous pipelines that not only screen and extract but also perform quantitative meta-analyses, effect size computation, and sensitivity analyses.
    • Potential products: “Meta-Analysis Assistant” modules; numerical data extraction engines for outcomes; automated forest plot generators. A pooling sketch follows this list.
    • Assumptions/dependencies: Robust numeric extraction accuracy; reliable risk-of-bias assessment; causal interpretation safeguards; regulator/journal acceptance.
  • Living Evidence Ecosystems Integrated with Bibliographic Databases (Publishers, Policy)
    • Use case: Always-on “living reviews” that update as new citations appear, alert stakeholders, and maintain versioned syntheses.
    • Potential products: Evidence monitoring dashboards; event-driven update services; knowledge graph-backed repositories.
    • Assumptions/dependencies: Stable APIs; versioning and provenance frameworks; subscription and governance models.
  • Domain-Specific, Fine-Tuned, Privacy-Preserving Local LLMs (Healthcare, Finance, Government)
    • Use case: On-prem or private-cloud models (BERT, LLaMA, Mistral) fine-tuned on domain corpora to meet privacy and compliance needs.
    • Potential products: “On-Prem Evidence Synthesizer”; secure model hosting; federated fine-tuning toolkits.
    • Assumptions/dependencies: Compute resources; fine-tuning datasets; MLOps for updates; regulatory compliance (HIPAA/GDPR).
  • Standardization and Accreditation for AI-Assisted Reviews (Regulators, Journals, Funders)
    • Use case: Formal checklists, audits, and certifications (e.g., PRISMA extensions for LLMs) to validate processes and ensure trustworthiness.
    • Potential products: Accreditation programs; audit tools capturing prompts, decisions, and metrics; repositories for registered protocols.
    • Assumptions/dependencies: Multi-stakeholder consensus; pilot evaluations; legal and ethical frameworks.
  • Multi-Agent Reviewer Systems and Cross-Model Ensembles (Software, Research Automation)
    • Use case: Independent agents (models) debating or cross-verifying decisions for higher confidence, especially in complex or ambiguous cases.
    • Potential products: Debate-style LLM orchestration; self-consistency pipelines; confidence scorers.
    • Assumptions/dependencies: Cost-effective orchestration; robust disagreement resolution strategies; measurable gains over single-model setups.
  • Automated Risk-of-Bias and Quality Assessment (Academia, HTA)
    • Use case: Automate tools like ROBIS or RoB 2 judgments, study design classification, and bias detection to standardize assessments.
    • Potential products: Bias detection modules; structured critique generators; training datasets of annotated bias judgments.
    • Assumptions/dependencies: Large, high-quality labeled corpora; domain-generalizable methods; human validation loops.
  • Integration into HTA and Regulatory Submissions (Pharma, MedTech, Regulators)
    • Use case: AI-prepared evidence packages for submissions and post-market surveillance with traceable provenance and QA.
    • Potential products: “Regulatory Evidence Packager”; safety signal detection from literature; audit-ready provenance chains.
    • Assumptions/dependencies: Policy acceptance; reproducible audit trails; standardized performance metrics; liability frameworks.
  • Educational Credentialing in AI-Assisted Review Methods (Universities, Professional Bodies)
    • Use case: Micro-credentials and certificates for LLM-based evidence synthesis methods to professionalize skills.
    • Potential products: Methodology tracks; capstone projects; competency assessments.
    • Assumptions/dependencies: Consensus curricula; access to tools; accreditation bodies’ involvement.
  • Commercial SaaS for “Review-as-a-Service” (Software, Publishers, CROs)
    • Use case: End-to-end platforms offering calibrated pipelines, living updates, and team collaboration, with compliance features.
    • Potential products: “Evidence Synthesis Studio”; “AI Reviewer Co-Pilot”; “Search Strategy Wizard”; cross-domain industry verticals (e.g., ESG/finance).
    • Assumptions/dependencies: Security and compliance (SOC2/ISO27001); integration with enterprise systems; pricing models; customer trust.
  • Cross-Domain Expansion (Finance, Energy, Legal/Compliance)
    • Use case: Rapid literature reviews for risk management, ESG reporting, regulatory monitoring, and technology landscape analyses.
    • Potential products: Sector-specific knowledge bases; alerting systems; compliance review co-pilots.
    • Assumptions/dependencies: Domain-specific corpora; task-specific evaluation frameworks; handling of proprietary content.
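
To indicate the numeric core that the "Meta-Analysis Assistant" modules above would need, here is a minimal fixed-effect inverse-variance pooling sketch. This is standard meta-analytic arithmetic rather than anything from the paper, and the study effects are invented.

```python
import math

def pool_fixed_effect(effects, standard_errors):
    """Inverse-variance weighted mean effect and its pooled standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Three hypothetical studies: (effect size, standard error)
effect, se = pool_fixed_effect([0.30, 0.45, 0.25], [0.10, 0.15, 0.12])
low, high = effect - 1.96 * se, effect + 1.96 * se
print(f"pooled effect = {effect:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```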

Notes on Feasibility, Assumptions, and Dependencies

  • Human-in-the-loop is essential: The paper’s workflow relies on calibration, majority voting, and escalation—especially for low-precision categories and full-text screening nuances.
  • Data access and quality: Reliable OCR and full-text access are critical; legal and licensing constraints apply.
  • Model selection per stage: Leverage strengths (e.g., GPT for extraction, BERT for screening) and monitor updates; performance can shift with model versions.
  • Governance and transparency: Disclose prompts, tools, and affected sections; adopt PRISMA-aligned reporting; maintain provenance logs and audit trails.
  • Ethical and policy considerations: Misinformation risks in summarization, bias detection limits, and meta-analysis challenges require stringent QA and validation.
  • Cost and infrastructure: Budget for API usage or invest in on-prem alternatives; ensure reproducible environments and robust MLOps for scaling.
  • Standard metrics and benchmarks: Use accuracy, precision/recall/F1, specificity, and balanced accuracy (sketched below); maintain gold-standard datasets for revalidation over time.
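
As a companion to the earlier screening sketch, this full metric set can be computed from a single confusion matrix. The counts below are illustrative; the formulas are the standard definitions.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard screening metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # a.k.a. sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
    }

print(confusion_metrics(tp=86, fp=17, fn=14, tn=83))
```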
