The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews (2404.15667v4)
Abstract: Systematic review (SR) is a popular research method in software engineering (SE), but conducting an SR takes an average of 67 weeks. Automating any step of the SR process could therefore reduce the effort associated with SRs. Our objective is to investigate whether LLMs can accelerate title-abstract screening, both by simplifying abstracts for human screeners and by automating the screening itself. We performed an experiment in which humans screened titles and abstracts of 20 papers from a prior SR, with both original and simplified abstracts. The experiment was then reproduced with the GPT-3.5 and GPT-4 LLMs performing the same screening tasks. We also studied whether different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs, and whether redesigning the prompt used in the LLM reproduction of screening leads to improved performance. Text simplification did not increase the screeners' performance, but it reduced the time spent screening. Screeners' scientific literacy skills and researcher status predicted screening performance. Some LLM-and-prompt combinations performed as well as human screeners on the screening tasks. Our results indicate that GPT-4 is better than its predecessor, GPT-3.5, and that Few-shot and One-shot prompting outperform Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. More research is needed before LLMs can be recommended for use in the SR screening process. We recommend that future SR studies publish replication packages with screening data to enable more conclusive experiments with LLM screening.
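To make the compared prompting techniques concrete, the following is a minimal sketch of how include/exclude screening prompts could be assembled for the Zero-shot, One-shot, and Few-shot conditions. The function name, message wording, and criteria text are illustrative assumptions, not the study's actual prompts; a Chain-of-Thought variant would additionally append worked reasoning to each example's answer.

```python
def build_screening_prompt(title, abstract, criteria, examples=()):
    """Assemble a title-abstract screening prompt for an LLM.

    `examples` is a sequence of (title, abstract, decision) tuples:
    empty -> Zero-shot, one tuple -> One-shot, several -> Few-shot.
    """
    parts = [
        "You are screening papers for a systematic review.",
        f"Inclusion criteria: {criteria}",
        "Answer INCLUDE or EXCLUDE.",
    ]
    # Labeled examples turn the prompt into One-shot / Few-shot.
    for ex_title, ex_abstract, decision in examples:
        parts.append(
            f"Example:\nTitle: {ex_title}\n"
            f"Abstract: {ex_abstract}\nAnswer: {decision}"
        )
    # The paper under screening comes last; the model completes "Answer:".
    parts.append(f"Title: {title}\nAbstract: {abstract}\nAnswer:")
    return "\n\n".join(parts)
```

The resulting string would be sent as a single user message to the chosen model (GPT-3.5 or GPT-4 in the study); the decision is then parsed from the model's completion.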
- Aleksi Huotala
- Miikka Kuutila
- Paul Ralph
- Mika Mäntylä