Overview of "AlpaCare: Instruction Fine-tuned LLMs for Medical Applications"
The paper "AlpaCare: Instruction Fine-tuned LLMs for Medical Applications" addresses a significant challenge in adapting LLMs for specialized tasks in the medical domain through the process of instruction fine-tuning (IFT). This paper introduces AlpaCare, a model fine-tuned on a particularly diverse dataset named MedInstruct-52k, distinguishing itself from prior approaches primarily utilizing narrowly scoped biomedical datasets. The authors emphasize the importance of dataset diversity to enhance LLM performance both in medical applications and generalization across broader tasks.
Methodology
To overcome the limitations of existing medical LLMs, the authors developed MedInstruct-52k, a diverse instruction-response dataset generated by state-of-the-art models, namely GPT-4 and ChatGPT. The creation process begins with a high-quality, clinician-curated seed set of tasks spanning a range of medical topics, viewpoints, task types, and difficulty levels. GPT-4 uses this seed set as in-context demonstrations to automatically generate new tasks that emphasize diversity in domain-specific user intent, and ChatGPT then generates a high-quality response for each task. The resulting dataset is used to fine-tune LLaMA-series models, producing AlpaCare. Rough sketches of the generation pipeline and the fine-tuning step follow.
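A minimal Python sketch of the two-stage generation loop is shown below. The prompt wording, the file names, and the helper functions are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the two-stage MedInstruct-52k-style generation pipeline.
# Stage 1: a strong model (GPT-4) proposes new tasks from seed demos.
# Stage 2: a cheaper model (ChatGPT) writes the paired responses.
import json
import random

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_tasks(seed_tasks, n_demos=3):
    """Ask GPT-4 to propose new medical tasks, conditioning on a few
    clinician-curated seed examples to encourage diversity in topic,
    viewpoint, task type, and difficulty."""
    demos = random.sample(seed_tasks, n_demos)
    prompt = (
        "You are generating diverse medical instruction-tuning tasks.\n"
        "Example tasks:\n"
        + "\n".join(f"- {t['instruction']}" for t in demos)
        + "\nPropose 10 new tasks, varying the topic, viewpoint, task "
          "type, and difficulty. Return one task per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [l.lstrip("-0123456789. ").strip() for l in lines if l.strip()]

def generate_response(task):
    """Generate the paired response with the cheaper ChatGPT model."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

# seed_tasks.jsonl (hypothetical file): one clinician-curated task per
# line, e.g. {"instruction": "Explain the contraindications of metformin."}
seed_tasks = [json.loads(line) for line in open("seed_tasks.jsonl")]
pairs = [{"instruction": t, "output": generate_response(t)}
         for t in generate_tasks(seed_tasks)]

with open("medinstruct_sample.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```

Splitting the work this way mirrors the cost/quality trade-off the paper describes: the more expensive model handles the diversity-critical step of inventing tasks, while the cheaper model fills in responses at scale.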
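The generated pairs can then be used for standard supervised fine-tuning. The sketch below uses Hugging Face Transformers with an Alpaca-style prompt template; the checkpoint name, hyperparameters, and template are assumptions rather than the paper's exact recipe, and loss masking of the prompt tokens is omitted for brevity.

```python
# Sketch of supervised instruction fine-tuning on the generated pairs.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "huggyllama/llama-7b"  # any LLaMA-series checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def format_example(ex):
    # Alpaca-style template (an assumption; AlpaCare builds on Alpaca).
    text = (f"### Instruction:\n{ex['instruction']}\n\n"
            f"### Response:\n{ex['output']}{tokenizer.eos_token}")
    return tokenizer(text, truncation=True, max_length=512)

train = (load_dataset("json", data_files="medinstruct_52k.jsonl")["train"]
         .map(format_example, remove_columns=["instruction", "output"]))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="alpacare-sft",
                           num_train_epochs=3,
                           per_device_train_batch_size=4,
                           learning_rate=2e-5),
    train_dataset=train,
    # mlm=False yields standard causal-LM labels from the input_ids
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```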
Experimental Results
AlpaCare was empirically evaluated across a spectrum of tasks:
- Free-form Instruction Evaluation: AlpaCare demonstrated strong performance on medical instruction-following tasks, with significant improvements over baselines such as Alpaca and various specialized medical LLMs. Notably, on test sets such as iCliniq and MedInstruct-test, AlpaCare achieved gains of up to 38.1% over the best baselines (a sketch of this kind of pairwise, judge-based evaluation appears after this list).
- Medical Benchmarks: Benchmarks such as MedQA, HeadQA, PubMedQA, and MedMCQA were used to assess the model's medical knowledge. AlpaCare consistently delivered superior results, confirming its robustness in understanding and generating medical content.
- Generalization Tasks: Despite being tuned on a medical-specific dataset, AlpaCare maintained robust generalization, achieving a 6.7% gain across multiple general-domain benchmarks. This finding underscores the impact of dataset diversity.
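The free-form evaluation above can be approximated with a pairwise LLM-judge harness that reports a win rate against a baseline. In the sketch below, the judge prompt, the choice of gpt-3.5-turbo as judge, and the field names are hypothetical; the paper's actual evaluation harness may differ.

```python
# Sketch of pairwise, judge-based free-form evaluation: an LLM judge
# compares the candidate model's answer against a reference answer
# and the harness reports the fraction of wins.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading two answers to a medical instruction.
Instruction: {instruction}
Answer A: {a}
Answer B: {b}
Which answer is more correct and helpful? Reply with exactly "A" or "B"."""

def model_wins(instruction, model_answer, reference_answer):
    """Return True if the judge prefers the model's answer (slot A)."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            instruction=instruction, a=model_answer, b=reference_answer)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("A")

def win_rate(examples):
    """examples: [{"instruction": ..., "model": ..., "reference": ...}]"""
    wins = sum(model_wins(e["instruction"], e["model"], e["reference"])
               for e in examples)
    return wins / len(examples)
```

In practice such harnesses also swap the A/B order across calls and average the results, since LLM judges exhibit position bias.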
Comparative Analysis
The paper includes an in-depth comparison with previous models tuned on larger but less diverse datasets. Surprisingly, the model tuned on the smaller MedInstruct-52k surpassed its counterparts, suggesting that dataset diversity plays a critical role in effective instruction fine-tuning. Human evaluations corroborated these findings: AlpaCare's outputs were preferred over those of other models for both correctness (+12%) and helpfulness (+49%).
Implications and Future Directions
The authors highlight the potential of synthetic yet diverse datasets for building specialized LLM applications, particularly in fields that demand technical and sensitive knowledge such as healthcare. The implications of AlpaCare extend beyond immediate improvements in medical dialogue systems, promising advances in patient-facing aids, clinical decision support, and medical education tools.
Looking forward, the framework for creating machine-generated IFT datasets could benefit other domain-specific applications, potentially reducing the costs of data collection and manual annotation. However, the paper also notes open directions: integrating real-time data sources and improving factual accuracy remain critical avenues for research, given the tendency of LLMs to hallucinate.
In conclusion, the paper contributes significantly to the field of specialized LLM applications by demonstrating that a strategically crafted diverse dataset can dramatically improve both domain-specific performance and generalizability of fine-tuned models. The publicly available dataset and codebase also serve as valuable resources for ongoing research and development in medical AI.