Analysis of "A Reproducibility and Generalizability Study of LLMs for Query Generation"
The paper "A Reproducibility and Generalizability Study of LLMs for Query Generation" elaborates on the application of LLMs for the task of generating Boolean queries, specifically in the context of systematic literature reviews (SLRs). Encompassing a structured comparative analysis, the authors aim to reproduce and generalize the findings of previous studies from Wang et al. and Alaniz et al., focusing on reproducibility and generalization challenges within LLM-driven query generation.
Key Contributions
The authors build an automated pipeline that generates and evaluates Boolean queries using multiple LLMs, both closed-source (GPT-3.5, GPT-4) and open-source (Mistral, Zephyr). The experiments are carried out on two well-known collections, CLEF TAR and Seed, to benchmark the approach against traditional methods.
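To make the pipeline's generation step concrete, the following is a minimal sketch in Python, assuming the OpenAI chat completions SDK; the prompt wording, the default model choice, and the `generate_boolean_query` helper are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of an LLM-based Boolean query generation step.
# Assumes the OpenAI Python SDK (>= 1.0); the prompt wording is
# illustrative, not the exact prompt used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_boolean_query(topic: str, model: str = "gpt-4") -> str:
    """Ask an LLM to draft a Boolean query for a systematic review topic."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are an information specialist who designs "
                        "Boolean queries for systematic reviews."},
            {"role": "user",
             "content": f"Generate a Boolean query for PubMed on the "
                        f"following review topic:\n{topic}"},
        ],
    )
    return response.choices[0].message.content


# Hypothetical topic, used only for illustration.
query = generate_boolean_query(
    "Effectiveness of exercise interventions for chronic lower back pain")
print(query)
```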
Their primary contributions can be summarized as follows:
- Performance Evaluation: The paper critically assesses LLMs against established baselines using precision, recall, and F1-score (a minimal metric computation is sketched after this list). The results reveal considerable performance variability across LLMs, with commercial models such as GPT-4 showing superior precision when generating Boolean queries for certain benchmarks.
- Reproducibility Issues: The authors report significant difficulty in reproducing the results of prior work, citing inconsistency in generated queries across multiple runs: because LLM decoding is stochastic, the same prompt can yield different queries each time. This variability poses critical reproducibility problems, underscoring the necessity of research transparency and detailed experimental documentation.
- Data and Methodological Challenges: The work dissects the datasets used in the earlier studies, identifying duplication and other issues within the Seed collection that hinder reproducibility. Furthermore, details on query types were not reported, highlighting the need for meticulous reporting in LLM studies.
- Model Comparison: By evaluating several models, the authors provide a comprehensive view of how open-source models compare to proprietary ones. The open-source models show promise in terms of query length and the inclusion of search fields, but lag behind commercial models in overall retrieval effectiveness.
- Guided Query Generation: The guided approach did not consistently produce robust queries across all models, indicating that input structure and prompt phrasing play a nuanced role when applying LLMs to domain-specific tasks such as SLR query formulation.
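As referenced in the first item above, here is a minimal sketch of how set-based precision, recall, and F1 are typically computed in this setting, treating the retrieved and relevant document sets (e.g., PMIDs) as inputs; the function and variable names are illustrative, not the paper's evaluation code.

```python
def evaluate_query(retrieved: set[str], relevant: set[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 for a single review topic.

    `retrieved` holds the document IDs returned by the Boolean query;
    `relevant` holds the IDs of the review's included studies.
    """
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: 2 of 4 retrieved documents are relevant, out of 5 relevant overall.
print(evaluate_query({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5", "d6", "d7"}))
# -> {'precision': 0.5, 'recall': 0.4, 'f1': 0.444...}
```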
Implications
This paper provides insights into the growing application of LLMs for automating labor-intensive research processes such as Boolean query generation for SLRs, emphasizing the need for reproducible and reliable models. The findings suggest that while LLMs are a promising tool with clear precision advantages, their stochastic nature continues to undermine reproducibility and generalizability.
Given the output variability and the cost of issuing multiple API calls, LLMs in their current form require further tuning and methodological refinement before they can be considered a mainstream solution for systematic reviews. Future directions may involve training specialized LLM variants or integrating retrieval-augmented generation techniques to strengthen query validation.
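The run-to-run variability driving these costs can be quantified directly. The sketch below, under the same illustrative API assumptions as the earlier snippet, repeats a prompt several times and counts distinct outputs; decoding parameters such as temperature (and, where the API supports it, a fixed seed) reduce but do not eliminate this variability.

```python
# Sketch: measure output variability by repeating the same prompt.
# Reuses the illustrative generate_boolean_query helper from above.
def count_distinct_queries(topic: str, runs: int = 10) -> int:
    """Generate a query `runs` times and count distinct outputs.

    A result of 1 would mean fully reproducible generation; with
    stochastic decoding, several distinct queries are typical.
    """
    queries = {generate_boolean_query(topic) for _ in range(runs)}
    return len(queries)


distinct = count_distinct_queries(
    "Effectiveness of exercise interventions for chronic lower back pain")
print(f"{distinct} distinct queries out of 10 runs")
```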
On a broader scale, the paper advocates more transparent and comprehensive reporting standards in LLM research, which would enable peers to replicate and extend prior work effectively. Moreover, as the field evolves, robust interpretability frameworks for LLM outputs become increasingly important for establishing trust and applicability in academic settings.
In conclusion, the paper affirms the potential of LLMs while candidly addressing pressing reproducibility challenges in AI-driven information retrieval. The research contributes to the discourse on LLM efficacy in structured academic settings and draws attention to areas requiring further investigative rigor.