Aligning Large Language Models for Clinical Tasks (2309.02884v2)

Published 6 Sep 2023 in cs.CL

Abstract: LLMs have demonstrated remarkable adaptability, showcasing their capacity to excel in tasks for which they were not explicitly trained. However, despite their impressive NLP capabilities, effective alignment of LLMs remains a crucial challenge when deploying them for specific clinical applications. The ability to generate responses with factually accurate content and to engage in non-trivial reasoning steps is crucial for the LLMs to be eligible for applications in clinical medicine. Employing a combination of techniques including instruction-tuning and in-prompt strategies like few-shot and chain-of-thought prompting has significantly enhanced the performance of LLMs. Our proposed alignment strategy for medical question-answering, known as 'expand-guess-refine', offers a parameter and data-efficient solution. A preliminary analysis of this method demonstrated outstanding performance, achieving a score of 70.63% on a subset of questions sourced from the USMLE dataset.

The paper explores the alignment of LLMs for clinical tasks, emphasizing the necessity of factual accuracy and reasoning capabilities for medical applications. It introduces an alignment strategy named 'expand-guess-refine' for medical question-answering, presented as a parameter- and data-efficient solution. The paper highlights that despite the remarkable natural language processing capabilities of LLMs, their deployment in clinical medicine requires careful alignment to ensure the extraction of pertinent information and reasoned analysis, while also mitigating hallucination and ensuring ethical compliance.

The authors discuss instruction tuning as a method to train models to follow instructions, which improves the truthfulness and structure of their responses. Finetuning, which adjusts the weights of a pre-trained model using a supervised dataset, is also discussed; its disadvantages include the need for a fresh dataset for each new task and the intensive computation required. Few-shot learning is presented as an alternative, using a few in-context demonstrations to guide the model. Chain-of-thought (CoT) prompting is introduced as a method to enhance the model's reasoning by simulating a step-by-step thinking process. The paper notes that while in-context prompting strategies like few-shot prompting and CoT improve reasoning, their performance may not reach that of finetuned models.
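
To make the few-shot and CoT mechanisms concrete, the following illustrative template shows how such a prompt is typically structured: each in-context demonstration pairs a question with worked step-by-step reasoning, and the target question is appended last. The placeholder text in angle brackets is an assumption for illustration, not content from the paper or the USMLE dataset.

```python
# Illustrative structure of a few-shot chain-of-thought prompt; the
# angle-bracket placeholders are assumptions, not material from the paper.
FEW_SHOT_COT_TEMPLATE = """\
Question: <demonstration clinical vignette>
Options: A) <...>  B) <...>  C) <...>  D) <...>
Reasoning: <worked step-by-step reasoning for the demonstration>
Answer: <correct option letter>

Question: {question}
Options: {options}
Reasoning:"""

def build_prompt(question: str, options: str) -> str:
    """Append the target question after the worked demonstration(s)."""
    return FEW_SHOT_COT_TEMPLATE.format(question=question, options=options)
```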

The paper mentions studies that trained smaller-scale LLMs on scientific and biomedical corpora, and notes that larger models with enhanced reasoning, complemented by alignment methodologies, outperform smaller models trained on domain-specific datasets. The PubMedGPT 2.7B model, which achieved a score of 50.3% on the USMLE dataset, challenges this notion.

The paper discusses the USMLE dataset, a subset of the MedQA dataset, which is used to evaluate the performance of LLMs in medical question answering [jin_what_2020-1]. It references work by Liévin et al. showing that code-davinci-002, a 175B-parameter, code-finetuned GPT-3.5-series model, scored 53.1% on the USMLE dataset when combined with retrieval augmentation and multiple-prompting, using a BM25 retriever built over Wikipedia articles for grounding [lievin_can_2023-1]. That work indicated that the GPT-3.5 model can leverage implicit knowledge and reasoning in USMLE question-answering tasks, and demonstrated that increasing inference-time compute by sampling multiple CoT generations can surpass the USMLE pass mark.

The MultiMedQA dataset was created by aggregating medical question-answering (QA) datasets with a new dataset of commonly searched health questions [singhal_large_2022-3]. Its authors also introduced an instruction prompt tuning technique to align LLMs to medical domain tasks. Their model, built on an instruction-tuned variant of the 540B-parameter PaLM model (Flan-PaLM), achieved an accuracy of 67.6% on the USMLE dataset [chowdhery_palm_2022-3], using few-shot prompting, chain-of-thought prompting, and self-consistency [wang_self-consistency_2023].
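
As a rough illustration of the self-consistency mechanism referenced above: several chain-of-thought generations are sampled at a non-zero temperature, the final answer is extracted from each, and the majority answer is returned. The sampled model call and the answer-extraction pattern below are assumptions made for the sketch, not details from the paper.

```python
# Minimal sketch of self-consistency over sampled chain-of-thought generations.
# The model choice, temperature, and "Answer: X" extraction format are assumptions.
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

def generate_cot(prompt: str) -> str:
    """Sample one chain-of-thought completion (non-zero temperature so samples differ)."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

def self_consistent_answer(prompt: str, n_samples: int = 5) -> str:
    """Majority-vote over the final answers extracted from several CoT samples."""
    answers = []
    for _ in range(n_samples):
        cot = generate_cot(prompt)
        match = re.search(r"Answer:\s*([A-D])", cot)
        if match:
            answers.append(match.group(1))
    return Counter(answers).most_common(1)[0][0] if answers else ""
```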

The paper notes that the GPT-4 base model, without any finetuning, scored 83.76% on the USMLE dataset with zero-shot prompting [nori_capabilities_nodate-1]. It also discusses the Med-PaLM 2 model, which uses medical domain-specific finetuning and a novel prompting strategy called ensemble refinement, combining chain-of-thought prompting, self-consistency, and self-refinement mechanisms [madaan_self-refine_2023]. This model achieved state-of-the-art performance on the USMLE dataset with 85.4% accuracy.

The paper addresses the issue of factual inconsistency in LLMs and suggests integrating a non-parametric memory with the LLM to create a hybrid model. Defining an explicit knowledgebase and augmenting LLM generation with information retrieved from the non-parametric memory makes it possible to examine the source of the information in the LLM-generated output [lewis_retrieval-augmented_2021]. The paper cites studies of Retrieval Augmented Generation (RAG) models [lewis_retrieval-augmented_2021, guu_realm_2020-3] that have shown superior performance on open-domain question-answering benchmarks [joshi_triviaqa_2017-5, kwiatkowski_natural_2019-1, berant_semantic_2013, baudis_modeling_2015-1].

The authors propose a retrieval augmented generation strategy using dense vectors and a prompting strategy called 'expand-guess-refine'. This strategy operates in a zero-shot manner, without model finetuning.

The paper used OpenAI's gpt-3.5-turbo 175B-parameter model for the preliminary evaluation. The vector database was compiled by segmenting the text of 18 medical books, which were converted into text via optical character recognition. The full text corpus was split into chunks of at most 3000 characters with 1000-character overlaps using the recursive text splitter (RTS) method, which splits text based on a parameterized list of characters. The chunks were then embedded with the 1536-dimensional OpenAI text-embedding-ada-002 model and stored in a FAISS vector database [johnson_billion-scale_2017-2].
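
A minimal sketch of this indexing pipeline is given below, assuming the OpenAI Python SDK and faiss-cpu. The chunk size, overlap, embedding model, and FAISS store come from the paper; the libraries, helper names, corpus file path, and the simplified overlap chunker (standing in for the recursive text splitter) are assumptions.

```python
# Sketch of the vector-database construction described above; not the paper's code.
import faiss                      # pip install faiss-cpu
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 1000) -> list[str]:
    """Fixed-size chunks with overlap; a simplified stand-in for the recursive
    text splitter, which additionally prefers breaks on a list of separators."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with text-embedding-ada-002 (1536-dimensional vectors)."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

corpus_text = open("medical_books.txt").read()   # OCR-extracted book text (assumed file)
chunks = chunk_text(corpus_text)
vectors = embed(chunks)                          # in practice, embed in batches
faiss.normalize_L2(vectors)                      # cosine similarity via inner product
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

def retrieve(query: str, k: int = 5) -> list[str]:
    """Return the top-k chunks for a query (used by the 'guess' step below)."""
    q = embed([query])
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return [chunks[i] for i in ids[0]]
```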

The 'expand-guess-refine' prompting strategy consists of three components (sketched in code after the list):

  • Expand: Reshapes the context by expanding and elaborating on important points and rephrases the question into a direct query format.
  • Guess: Predicts the response to the expanded question before seeing the answer choices, with the assistance of top-k retrieved documents sourced from the vector database.
  • Refine: Compiles a prompt using the transformed context, the guess, and the actual options provided. The LLM is tasked with selecting the most appropriate answer from the provided answer choices.
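
A minimal sketch of this three-step loop follows, reusing the client and the retrieve helper from the indexing sketch above. The prompt wording and helper names are assumptions; only the three-step structure follows the paper.

```python
# Illustrative expand-guess-refine loop; prompts are assumed, not the paper's.
def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        temperature=0,
    )
    return resp.choices[0].message.content

def expand_guess_refine(context: str, question: str, options: str) -> str:
    # Expand: elaborate the important points and restate the question directly.
    expanded = ask(
        "You rewrite clinical vignettes.",
        f"Expand and elaborate on the important points of this case, then rephrase "
        f"the question as a direct query.\n\nCase: {context}\n\nQuestion: {question}",
    )
    # Guess: answer the expanded question before seeing the options,
    # grounded in the top-k chunks retrieved from the vector database.
    evidence = "\n\n".join(retrieve(expanded, k=5))
    guess = ask(
        "You answer medical questions using the supplied reference text.",
        f"Reference:\n{evidence}\n\nQuestion:\n{expanded}\n\nGive your best answer.",
    )
    # Refine: choose among the actual options, given the expanded context and the guess.
    return ask(
        "You select the single best answer choice.",
        f"Case:\n{expanded}\n\nPreliminary answer:\n{guess}\n\nOptions:\n{options}\n\n"
        f"Select the most appropriate answer from the options.",
    )
```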

In a preliminary analysis conducted on the first 100 questions and 50 random questions from the USMLE development data split, the model achieved an accuracy of 70.63%, compared with 59.44% for ChatGPT. The improvement achieved by the expand-guess-refine approach was statistically significant (p = 0.031) under a two-sample test for equality of proportions.
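
For reference, the two-sample test for equality of proportions can be computed as sketched below. The counts passed to the function would be the number of correctly answered questions for each method and the number of questions evaluated; no concrete numbers are filled in, since the summary does not report the raw counts, and the paper's exact test variant (e.g., continuity correction) is not specified here.

```python
# Generic two-sample z-test for equality of proportions (no continuity correction).
# k1, n1 and k2, n2 are the correct counts and totals for the two methods.
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(k1: int, n1: int, k2: int, n2: int) -> tuple[float, float]:
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)            # pooled proportion under H0: p1 == p2
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))             # z statistic and two-sided p-value
```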

The paper concludes by emphasizing the importance of enabling seamless updating of LLM knowledgebases and of making the model's knowledge explainable. It also suggests that augmenting the implicit knowledgebase of the LLM with a high-quality, task-specific non-parametric knowledgebase can significantly improve performance.

Authors (2)
  1. Supun Manathunga (2 papers)
  2. Isuru Hettigoda (1 paper)