Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models (2405.01686v2)
Abstract: Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of NLP models to date. In this work, we evaluate whether modern LLMs can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs -- including ones trained on biomedical texts -- perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim.
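To make concrete what "extracting numerical results" must feed into, the sketch below shows a standard fixed-effect (inverse-variance) pooling of odds ratios for a dichotomous outcome. The trial counts are hypothetical placeholders, not data from the paper; the formulas are the textbook log-odds-ratio estimator and its variance.

```python
import math

# Hypothetical extracted counts for a dichotomous outcome (e.g., mortality):
# each trial contributes (events_treatment, n_treatment, events_control, n_control).
trials = [
    (12, 100, 20, 100),
    (8, 80, 15, 85),
    (30, 250, 41, 245),
]

def log_odds_ratio(a, n1, c, n2):
    """Log odds ratio and its variance for one 2x2 table (assumes no zero cells)."""
    b, d = n1 - a, n2 - c
    lor = math.log((a * d) / (b * c))
    var = 1 / a + 1 / b + 1 / c + 1 / d
    return lor, var

# Fixed-effect pooling: weight each trial by the inverse of its variance.
weights, estimates = [], []
for a, n1, c, n2 in trials:
    lor, var = log_odds_ratio(a, n1, c, n2)
    weights.append(1 / var)
    estimates.append(lor)

pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
se = math.sqrt(1 / sum(weights))
print(f"pooled OR = {math.exp(pooled):.2f}, "
      f"95% CI = ({math.exp(pooled - 1.96 * se):.2f}, "
      f"{math.exp(pooled + 1.96 * se):.2f})")
```

This is why accurate extraction matters: a single miscounted cell in one trial's 2x2 table shifts that trial's weight and estimate, and with it the pooled result and confidence interval.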
- Hye Sun Yun
- David Pogrebitskiy
- Iain J. Marshall
- Byron C. Wallace