
Large Language Models with Human-In-The-Loop Validation for Systematic Review Data Extraction (2501.11840v1)

Published 21 Jan 2025 in cs.HC

Abstract: Systematic reviews are time-consuming endeavors. Historically speaking, knowledgeable humans have had to screen and extract data from studies before it can be analyzed. However, LLMs hold promise to greatly accelerate this process. After a pilot study which showed great promise, we investigated the use of freely available LLMs for extracting data for systematic reviews. Using three different LLMs, we extracted 24 types of data, 9 explicitly stated variables and 15 derived categorical variables, from 112 studies that were included in a published scoping review. Overall we found that Gemini 1.5 Flash, Gemini 1.5 Pro, and Mistral Large 2 performed reasonably well, with 71.17%, 72.14%, and 62.43% of data extracted being consistent with human coding, respectively. While promising, these results highlight the dire need for a human-in-the-loop (HIL) process for AI-assisted data extraction. As a result, we present a free, open-source program we developed (AIDE) to facilitate user-friendly, HIL data extraction with LLMs.


Summary

  • The paper investigates using LLMs with human-in-the-loop validation to automate data extraction for systematic reviews; the best-performing model evaluated, Gemini 1.5 Pro, achieved 72.14% overall agreement with human coding.
  • Accuracy in LLM-based data extraction for systematic reviews is highly dependent on the specific model used, prompt design, and the type of data being extracted (explicit vs. derived categorical).
  • Despite the potential for automation, a human-in-the-loop process is crucial for verifying and ensuring the overall accuracy and reliability of data extracted by LLMs for systematic reviews.

This paper explores the use of LLMs with Human-In-The-Loop (HIL) verification to enhance data extraction for systematic reviews. The authors evaluate several freely available LLMs, focusing on their ability to automate, and thereby accelerate, a data extraction process that is traditionally labor-intensive.

Research Context and Objectives

Systematic reviews are essential for synthesizing research but are notoriously laborious, involving extensive manual data extraction and coding. The arrival of LLMs offers a promising avenue for automating parts of this cumbersome process. This paper evaluates the efficacy of several LLMs in extracting both explicitly stated and derived categorical data from a sample of research studies. The work is organized into three parts: a pilot study, a main study with an expanded dataset, and the introduction of a software tool, AIDE, to streamline the process.

Pilot Study

Methodology

  • LLMs Used: Claude, ChatPDF, GPT-4.
  • Data Set: 8 studies focused on virtual characters in K-12 education.
  • Variables: 9 explicit and 15 derived categorical variables.

Findings

  • Explicit Variable Extraction: Claude achieved 93.06% agreement with human coders, outperforming GPT-4 (91.67%) and ChatPDF (75%).
  • Categorical Variable Extraction: Again, Claude led with 80% agreement.
  • Overall Accuracy: Claude showed the highest overall agreement at 84.90%.

Main Study

Methodology

  • LLMs Used: Gemini 1.5 Flash, Gemini 1.5 Pro, and Mistral Large 2, all accessed via their APIs for efficiency in handling the larger sample.
  • Data Set: 112 studies from a published scoping review.
  • Variables: Same as the pilot study; data extraction was automated with Python scripts calling each model's API (a minimal sketch follows this list).
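
The paper does not publish its extraction scripts, so the sketch below only illustrates the general API-based approach under stated assumptions: it uses the google-generativeai SDK, and the prompt wording and variable list are placeholders rather than the paper's actual materials.

```python
# Minimal sketch of API-based extraction, assuming the google-generativeai SDK.
# The variable list and prompt wording are illustrative placeholders,
# not the paper's actual extraction protocol.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Hypothetical subset of the 9 explicitly stated variables.
EXPLICIT_VARIABLES = ["publication year", "sample size", "grade level"]

def extract_variables(study_text: str) -> str:
    """Ask the model for each variable; 'NOT REPORTED' marks absent values."""
    prompt = (
        "From the study below, extract the following variables. "
        "Answer one per line as 'variable: value', "
        "writing 'NOT REPORTED' if a variable is absent.\n"
        f"Variables: {', '.join(EXPLICIT_VARIABLES)}\n\n"
        f"Study text:\n{study_text}"
    )
    return model.generate_content(prompt).text
```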

Findings

  • Explicit Variables, Exact Match: Gemini 1.5 Pro scored 83.33% agreement, followed by Gemini 1.5 Flash (81.75%) and Mistral Large 2 (71.73%).
  • Derived Categorical Variables, Exact Match: Gemini 1.5 Pro achieved 65.42%, with Mistral Large 2 trailing at 56.85%.
  • Overall Accuracy: Gemini 1.5 Pro led with 72.14% agreement under the Exact Match analysis (a sketch of the agreement computation follows this list).
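
The percentages above are exact-match agreement rates between model output and the human coding from the published scoping review. A minimal sketch of that computation follows; the normalization step (case and whitespace) is an assumption, as the paper may apply different matching criteria.

```python
def exact_match_agreement(llm_values: list[str], human_values: list[str]) -> float:
    """Percentage of extracted values that exactly match the human coding,
    after trivial normalization (case and surrounding whitespace)."""
    assert len(llm_values) == len(human_values) and human_values
    matches = sum(
        a.strip().lower() == b.strip().lower()
        for a, b in zip(llm_values, human_values)
    )
    return 100.0 * matches / len(human_values)

# Usage: flatten 112 studies x 24 variables into parallel value lists, then
# print(f"{exact_match_agreement(llm, human):.2f}% agreement")
```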

Key Insights

  1. Accuracy Dependence: The accuracy of LLM data extraction hinges on both the prompt structure and the type of data being extracted.
  2. LLM Sensitivity: LLMs are markedly sensitive to variations in input formats and prompts, which directly affects performance (see the prompt sketch after this list).
  3. HIL Necessity: Despite LLM capabilities, a human-in-the-loop remains indispensable for ensuring the accuracy and reliability of extracted data.
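
To make the prompt-sensitivity point concrete, here is one hedged illustration: an open-ended prompt invites free-form answers that rarely match a codebook exactly, while a constrained prompt pins the model to a fixed label set. The role categories below are hypothetical, not the scoping review's actual codebook.

```python
# Illustrative only: constraining a derived categorical variable.
# The category labels are hypothetical, not the scoping review's codebook.
CATEGORIES = ["tutor", "peer", "teachable agent", "other"]

# Open-ended phrasing: answers vary in wording and rarely match codes exactly.
loose_prompt = "What role does the virtual character play in this study?"

# Constrained phrasing: forces output into the fixed label set.
constrained_prompt = (
    "Classify the virtual character's role in this study. "
    f"Answer with exactly one of: {', '.join(CATEGORIES)}. "
    "Do not add any explanation."
)
```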

AIDE: Software Introduction

  • Purpose: AIDE (AI-assisted Data Extraction) lets researchers harness LLMs for extraction while retaining the necessary human oversight.
  • Functionality: It features a graphical user interface, easy API integration, and automatic verification that routes items to human validation efficiently, sparing researchers exhaustive, fatiguing manual checks (a sketch of one such HIL pattern follows this list).
  • Availability: AIDE is open-source and built to integrate free API models, keeping LLM use cost-efficient.
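
The summary does not detail AIDE's verification mechanism, but one plausible HIL pattern, offered here purely as an assumption rather than AIDE's actual design, is to auto-accept values on which multiple models agree and queue disagreements for human review:

```python
def triage_for_review(extractions: dict[str, dict[str, str]]) -> tuple[dict, list]:
    """Split variables into auto-accepted values (all models agree after
    normalization) and a queue of disagreements for human validation.

    extractions: {model_name: {variable: value}}
    """
    accepted: dict[str, str] = {}
    review_queue: list[str] = []
    variables = next(iter(extractions.values())).keys()
    for var in variables:
        values = {m[var].strip().lower() for m in extractions.values()}
        if len(values) == 1:
            accepted[var] = values.pop()   # unanimous: accept without review
        else:
            review_queue.append(var)       # disagreement: human decides
    return accepted, review_queue
```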

Conclusion and Future Directions

The paper underscores that while LLMs hold significant potential to facilitate research synthesis through partial automation of data extraction, a well-defined HIL framework remains critical. Future work may focus on improving accuracy through better prompt optimization and on extending the approach to a wider variety of studies and variables. The authors also encourage gathering quantifiable evidence of the time and resource savings tools like AIDE provide, for a holistic evaluation of their utility in the field.
