
Large Language Models with Human-In-The-Loop Validation for Systematic Review Data Extraction (2501.11840v1)

Published 21 Jan 2025 in cs.HC

Abstract: Systematic reviews are time-consuming endeavors. Historically speaking, knowledgeable humans have had to screen and extract data from studies before it can be analyzed. However, LLMs hold promise to greatly accelerate this process. After a pilot study which showed great promise, we investigated the use of freely available LLMs for extracting data for systematic reviews. Using three different LLMs, we extracted 24 types of data, 9 explicitly stated variables and 15 derived categorical variables, from 112 studies that were included in a published scoping review. Overall we found that Gemini 1.5 Flash, Gemini 1.5 Pro, and Mistral Large 2 performed reasonably well, with 71.17%, 72.14%, and 62.43% of data extracted being consistent with human coding, respectively. While promising, these results highlight the dire need for a human-in-the-loop (HIL) process for AI-assisted data extraction. As a result, we present a free, open-source program we developed (AIDE) to facilitate user-friendly, HIL data extraction with LLMs.


Summary

  • The paper investigates using LLMs with human-in-the-loop validation to automate data extraction for systematic reviews; the best-performing model evaluated, Gemini 1.5 Pro, achieved 72.14% overall agreement with human coding.
  • Accuracy in LLM-based data extraction for systematic reviews is highly dependent on the specific model used, prompt design, and the type of data being extracted (explicit vs. derived categorical).
  • Despite the potential for automation, a human-in-the-loop process is crucial for verifying and ensuring the overall accuracy and reliability of data extracted by LLMs for systematic reviews.

This paper explores the use of LLMs with Human-In-The-Loop (HIL) verification to enhance data extraction for systematic reviews. The authors evaluate several freely available LLMs, focusing on their ability to automate, and thereby accelerate, a data extraction process that is traditionally labor-intensive.

Research Context and Objectives

Systematic reviews are essential for synthesizing research but are notoriously laborious, involving extensive manual data extraction and coding. The arrival of LLMs offers a promising avenue for automating parts of this cumbersome process. This paper evaluates the efficacy of several LLMs in extracting both explicitly stated and derived categorical data from a sample of research studies. The work is organized into three parts: a pilot study, a main study with an expanded dataset, and the introduction of a software tool, AIDE, to streamline the process.

Pilot Study

Methodology

  • LLMs Used: Claude, ChatPDF, GPT-4.
  • Data Set: 8 studies focused on virtual characters in K-12 education.
  • Variables: 9 explicit and 15 derived categorical variables.

Findings

  • Explicit Variable Extraction: Claude achieved 93.06% agreement with human coders, outperforming GPT-4 (91.67%) and ChatPDF (75%).
  • Categorical Variable Extraction: Again, Claude led with 80% agreement.
  • Overall Accuracy: Claude showed the highest overall agreement at 84.90%.

Main Study

Methodology

  • LLMs Used: Gemini 1.5 Flash, Gemini 1.5 Pro, and Mistral Large 2, all accessed via their APIs for efficiency in handling the larger sample.
  • Data Set: 112 studies from a published scoping review.
  • Variables: Same as the pilot study; data extraction was automated with Python scripts calling each model's API (a minimal sketch follows this list).
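
The paper does not publish its extraction scripts, so the sketch below only illustrates the general API-based approach under stated assumptions: it uses the google-generativeai SDK, and the prompt wording and variable list are placeholders rather than the paper's actual materials.

```python
# Minimal sketch of API-based extraction, assuming the google-generativeai SDK.
# The variable list and prompt wording are illustrative placeholders,
# not the paper's actual extraction protocol.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Hypothetical subset of the 9 explicitly stated variables.
EXPLICIT_VARIABLES = ["publication year", "sample size", "grade level"]

def extract_variables(study_text: str) -> str:
    """Ask the model for each variable; 'NOT REPORTED' marks absent values."""
    prompt = (
        "From the study below, extract the following variables. "
        "Answer one per line as 'variable: value', "
        "writing 'NOT REPORTED' if a variable is absent.\n"
        f"Variables: {', '.join(EXPLICIT_VARIABLES)}\n\n"
        f"Study text:\n{study_text}"
    )
    return model.generate_content(prompt).text
```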

Findings

  • Explicit Variables, Exact Match: Gemini 1.5 Pro scored 83.33% agreement, followed by Gemini 1.5 Flash (81.75%) and Mistral Large 2 (71.73%).
  • Derived Categorical Variables, Exact Match: Gemini 1.5 Pro achieved 65.42%, with Mistral Large 2 trailing at 56.85%.
  • Overall Accuracy: Gemini 1.5 Pro led with 72.14% agreement under the Exact Match analysis (a sketch of the agreement computation follows this list).
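
The percentages above are exact-match agreement rates between model output and the human coding from the published scoping review. A minimal sketch of that computation follows; the normalization step (case and whitespace) is an assumption, as the paper may apply different matching criteria.

```python
def exact_match_agreement(llm_values: list[str], human_values: list[str]) -> float:
    """Percentage of extracted values that exactly match the human coding,
    after trivial normalization (case and surrounding whitespace)."""
    assert len(llm_values) == len(human_values) and human_values
    matches = sum(
        a.strip().lower() == b.strip().lower()
        for a, b in zip(llm_values, human_values)
    )
    return 100.0 * matches / len(human_values)

# Usage: flatten 112 studies x 24 variables into parallel value lists, then
# print(f"{exact_match_agreement(llm, human):.2f}% agreement")
```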

Key Insights

  1. Accuracy Dependence: The accuracy of LLM data extraction hinges on both the prompt structure and the type of data being extracted.
  2. LLM Sensitivity: LLMs are markedly sensitive to variations in input formats and prompts, which directly affects performance (see the prompt sketch after this list).
  3. HIL Necessity: Despite LLM capabilities, a human-in-the-loop remains indispensable for ensuring the accuracy and reliability of extracted data.
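
To make the prompt-sensitivity point concrete, here is one hedged illustration: an open-ended prompt invites free-form answers that rarely match a codebook exactly, while a constrained prompt pins the model to a fixed label set. The role categories below are hypothetical, not the scoping review's actual codebook.

```python
# Illustrative only: constraining a derived categorical variable.
# The category labels are hypothetical, not the scoping review's codebook.
CATEGORIES = ["tutor", "peer", "teachable agent", "other"]

# Open-ended phrasing: answers vary in wording and rarely match codes exactly.
loose_prompt = "What role does the virtual character play in this study?"

# Constrained phrasing: forces output into the fixed label set.
constrained_prompt = (
    "Classify the virtual character's role in this study. "
    f"Answer with exactly one of: {', '.join(CATEGORIES)}. "
    "Do not add any explanation."
)
```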

AIDE: Software Introduction

  • Purpose: AIDE (AI-assisted Data Extraction) lets researchers harness LLMs for extraction while retaining the necessary human oversight.
  • Functionality: It features a graphical user interface, easy API integration, and automatic verification that routes items to human validation efficiently, sparing researchers exhaustive, fatiguing manual checks (a sketch of one such HIL pattern follows this list).
  • Availability: AIDE is open-source and built to integrate free API models, keeping LLM use cost-efficient.
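
The summary does not detail AIDE's verification mechanism, but one plausible HIL pattern, offered here purely as an assumption rather than AIDE's actual design, is to auto-accept values on which multiple models agree and queue disagreements for human review:

```python
def triage_for_review(extractions: dict[str, dict[str, str]]) -> tuple[dict, list]:
    """Split variables into auto-accepted values (all models agree after
    normalization) and a queue of disagreements for human validation.

    extractions: {model_name: {variable: value}}
    """
    accepted: dict[str, str] = {}
    review_queue: list[str] = []
    variables = next(iter(extractions.values())).keys()
    for var in variables:
        values = {m[var].strip().lower() for m in extractions.values()}
        if len(values) == 1:
            accepted[var] = values.pop()   # unanimous: accept without review
        else:
            review_queue.append(var)       # disagreement: human decides
    return accepted, review_queue
```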

Conclusion and Future Directions

The paper underscores that while LLMs hold significant potential to facilitate research synthesis through partial automation of data extraction, a well-defined HIL framework remains critical. Future work may focus on improving accuracy through better prompt optimization and on extending the approach to a wider variety of studies and variables. The authors also encourage gathering quantifiable evidence of the time and resource savings tools like AIDE provide, for a holistic evaluation of their utility in the field.
