A systematic evaluation of large language models for biomedical natural language processing: benchmarks, baselines, and recommendations (2305.16326v4)

Published 10 May 2023 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: The biomedical literature is rapidly expanding, posing a significant challenge for manual curation and knowledge discovery. Biomedical Natural Language Processing (BioNLP) has emerged as a powerful solution, enabling the automated extraction of information and knowledge from this extensive literature. Recent attention has been directed towards LLMs due to their impressive performance. However, there remains a critical gap in understanding the effectiveness of LLMs in BioNLP tasks and their broader implications for method development and downstream users. Currently, there is a lack of baseline performance data, benchmarks, and practical recommendations for using LLMs in the biomedical domain. To address this gap, we present a systematic evaluation of four representative LLMs: GPT-3.5 and GPT-4 (closed-source), LLaMA 2 (open-source), and PMC LLaMA (domain-specific), across 12 BioNLP datasets covering six applications (named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification). The evaluation is conducted under four settings: zero-shot, static few-shot, dynamic K-nearest few-shot, and fine-tuning. We compare these models against state-of-the-art (SOTA) approaches that fine-tune (domain-specific) BERT or BART models, which are well-established methods in BioNLP tasks. The evaluation covers both quantitative and qualitative assessments; the latter involves manually reviewing, collectively, hundreds of thousands of LLM outputs for inconsistencies, missing information, and hallucinations in extractive and classification tasks. The qualitative review also examines accuracy, completeness, and readability in text summarization tasks. Additionally, a cost analysis of closed-source GPT models is conducted.
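Of the four prompting settings, dynamic K-nearest few-shot is the least self-explanatory: demonstrations are retrieved per query by similarity to the input rather than fixed in advance. The sketch below illustrates the general idea only; it is not the paper's implementation. The TF-IDF retriever, the toy chemical-NER examples, and the prompt template are all illustrative assumptions standing in for whatever retriever and templates the authors actually used.

```python
# Minimal sketch of dynamic K-nearest few-shot prompt construction.
# Assumptions (not from the paper): TF-IDF cosine similarity as the retriever,
# toy chemical-NER demonstrations, and an illustrative prompt template.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# (sentence, gold annotation) pairs drawn from a labeled training set.
train_examples = [
    ("Aspirin reduced headache severity.", "CHEMICAL: Aspirin"),
    ("Metformin is used to treat type 2 diabetes.", "CHEMICAL: Metformin"),
    ("The patient reported no adverse events.", "CHEMICAL: none"),
]

def build_prompt(query: str, k: int = 2) -> str:
    texts = [sent for sent, _ in train_examples]
    vectorizer = TfidfVectorizer().fit(texts + [query])
    train_vecs = vectorizer.transform(texts)
    query_vec = vectorizer.transform([query])
    # Select the k training sentences most similar to the query.
    scores = cosine_similarity(query_vec, train_vecs)[0]
    top_k = scores.argsort()[::-1][:k]
    demos = "\n\n".join(
        f"Sentence: {texts[i]}\nEntities: {train_examples[i][1]}" for i in top_k
    )
    # The retrieved demonstrations precede the unlabeled query in the prompt.
    return f"{demos}\n\nSentence: {query}\nEntities:"

print(build_prompt("Ibuprofen relieved the patient's joint pain."))
```

In contrast, the static few-shot setting would fix the demonstrations once for all queries, and the zero-shot setting would omit them entirely.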

Authors (21)
  1. Qingyu Chen (57 papers)
  2. Jingcheng Du (13 papers)
  3. Yan Hu (75 papers)
  4. Vipina Kuttichi Keloth (4 papers)
  5. Xueqing Peng (12 papers)
  6. Kalpana Raja (3 papers)
  7. Rui Zhang (1138 papers)
  8. Zhiyong Lu (113 papers)
  9. Hua Xu (78 papers)
  10. Qianqian Xie (60 papers)
  11. Qiao Jin (74 papers)
  12. Aidan Gilson (6 papers)
  13. Maxwell B. Singer (2 papers)
  14. Xuguang Ai (7 papers)
  15. Po-Ting Lai (14 papers)
  16. Zhizheng Wang (10 papers)
  17. Jiming Huang (1 paper)
  18. Huan He (45 papers)
  19. Fongci Lin (3 papers)
  20. W. Jim Zheng (4 papers)
Citations (37)