A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations (2305.16326v4)

Published 10 May 2023 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: The biomedical literature is rapidly expanding, posing a significant challenge for manual curation and knowledge discovery. Biomedical Natural Language Processing (BioNLP) has emerged as a powerful solution, enabling the automated extraction of information and knowledge from this extensive literature. Recent attention has been directed towards LLMs due to their impressive performance. However, there remains a critical gap in understanding the effectiveness of LLMs in BioNLP tasks and their broader implications for method development and downstream users. Currently, there is a lack of baseline performance data, benchmarks, and practical recommendations for using LLMs in the biomedical domain. To address this gap, we present a systematic evaluation of four representative LLMs: GPT-3.5 and GPT-4 (closed-source), LLaMA 2 (open-source), and PMC LLaMA (domain-specific), across 12 BioNLP datasets covering six applications (named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification). The evaluation is conducted under four settings: zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning. We compare these models against state-of-the-art (SOTA) approaches that fine-tune (domain-specific) BERT or BART models, which are well-established methods in BioNLP tasks. The evaluation covers both quantitative and qualitative assessments; the latter involves manually reviewing, collectively, hundreds of thousands of LLM outputs for inconsistencies, missing information, and hallucinations in extractive and classification tasks. The qualitative review also examines accuracy, completeness, and readability in text summarization tasks. Additionally, a cost analysis of closed-source GPT models is conducted.
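Of the four prompting settings, dynamic K-nearest few-shot is the least self-explanatory: demonstrations are retrieved per query by similarity to the input rather than fixed in advance. The sketch below illustrates the general idea only; it is not the paper's implementation. The TF-IDF retriever, the toy chemical-NER examples, and the prompt template are all illustrative assumptions standing in for whatever retriever and templates the authors actually used.

```python
# Minimal sketch of dynamic K-nearest few-shot prompt construction.
# Assumptions (not from the paper): TF-IDF cosine similarity as the retriever,
# toy chemical-NER demonstrations, and an illustrative prompt template.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (sentence, gold annotation) pairs drawn from a labeled training set.
train_examples = [
    ("Aspirin reduced headache severity.", "CHEMICAL: Aspirin"),
    ("Metformin is used to treat type 2 diabetes.", "CHEMICAL: Metformin"),
    ("The patient reported no adverse events.", "CHEMICAL: none"),
]

def build_prompt(query: str, k: int = 2) -> str:
    texts = [sent for sent, _ in train_examples]
    vectorizer = TfidfVectorizer().fit(texts + [query])
    train_vecs = vectorizer.transform(texts)
    query_vec = vectorizer.transform([query])
    # Select the k training sentences most similar to the query.
    scores = cosine_similarity(query_vec, train_vecs)[0]
    top_k = scores.argsort()[::-1][:k]
    demos = "\n\n".join(
        f"Sentence: {texts[i]}\nEntities: {train_examples[i][1]}" for i in top_k
    )
    # The retrieved demonstrations precede the unlabeled query in the prompt.
    return f"{demos}\n\nSentence: {query}\nEntities:"

print(build_prompt("Ibuprofen relieved the patient's joint pain."))
```

In contrast, the static few-shot setting would fix the demonstrations once for all queries, and the zero-shot setting would omit them entirely.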

Authors (21)
  1. Qingyu Chen (57 papers)
  2. Jingcheng Du (13 papers)
  3. Yan Hu (75 papers)
  4. Vipina Kuttichi Keloth (4 papers)
  5. Xueqing Peng (12 papers)
  6. Kalpana Raja (3 papers)
  7. Rui Zhang (1138 papers)
  8. Zhiyong Lu (113 papers)
  9. Hua Xu (78 papers)
  10. Qianqian Xie (60 papers)
  11. Qiao Jin (74 papers)
  12. Aidan Gilson (6 papers)
  13. Maxwell B. Singer (2 papers)
  14. Xuguang Ai (7 papers)
  15. Po-Ting Lai (14 papers)
  16. Zhizheng Wang (10 papers)
  17. Jiming Huang (1 paper)
  18. Huan He (45 papers)
  19. Fongci Lin (3 papers)
  20. W. Jim Zheng (4 papers)
Citations (37)