MedLFQA: Enhancing the Factuality of Long-Form Medical Responses Using OLAPH
The paper presents MedLFQA, a new benchmark dataset tailored for long-form question answering (LFQA) in the biomedical domain and designed specifically to evaluate the factuality of responses generated by LLMs. To improve that factuality, the authors introduce OLAPH, a novel framework that iteratively refines LLM outputs through a systematic preference optimization process.
Key Contributions
- MedLFQA Benchmark Dataset: The authors reconstructed existing biomedical LFQA datasets into MedLFQA, which pairs each question with a long-form answer and two types of crucial statements: Must Have (MH) and Nice to Have (NH). These statements enable automatic evaluation of the factual accuracy of model responses.
- OLAPH Framework: OLAPH stands for "Optimizing Large language models' Answers with Preferences of mitigating Hallucination". It improves factuality through a multi-step iterative training process combining supervised fine-tuning (SFT) with direct preference optimization (DPO). At each step, the best responses are selected using evaluation metrics covering word composition, semantic similarity, and factuality.
Methodology
MedLFQA Dataset Reconstruction
The MedLFQA dataset is built by integrating and reformatting several existing LFQA datasets, including LiveQA, MedicationQA, HealthSearchQA, and K-QA. The reconstruction involves not just providing answers but also generating MH and NH statements to precisely evaluate the factuality and relevance of the responses.
- Evaluation Metrics:
- Word Composition: Measures appropriate word usage via n-gram overlap with the reference answer (e.g., ROUGE).
- Semantic Similarity: Uses BLEURT and BERTScore to capture semantic similarity beyond surface word overlap.
- Factuality: Uses Comprehensiveness (the fraction of Must Have statements a response includes) and Hallucination (the fraction of statements a response contradicts) to reward crucial claims and penalize false information.
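The two factuality metrics can be sketched as simple ratios over per-statement verdicts. This is a minimal illustration, assuming an upstream NLI model has already judged each MH/NH statement against the response; the function name and label strings are hypothetical, not the paper's API.

```python
def factuality_scores(mh_labels, nh_labels):
    """Compute factuality metrics from per-statement NLI verdicts.

    mh_labels / nh_labels: one verdict per Must Have / Nice to Have statement,
    each "entail", "neutral", or "contradict", judged against the response.
    Comprehensiveness: fraction of Must Have statements the response entails.
    Hallucination: fraction of all statements the response contradicts.
    """
    comp = (sum(l == "entail" for l in mh_labels) / len(mh_labels)) if mh_labels else 0.0
    all_labels = list(mh_labels) + list(nh_labels)
    hall = (sum(l == "contradict" for l in all_labels) / len(all_labels)) if all_labels else 0.0
    return comp, hall

# A response entailing 2 of 3 Must Have statements and contradicting 1 of 4
# total statements scores 2/3 comprehensiveness and 1/4 hallucination.
comp, hall = factuality_scores(["entail", "entail", "contradict"], ["neutral"])
```

A high-quality response should drive comprehensiveness toward 1 and hallucination toward 0.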
OLAPH Framework Details
- Supervised Fine-tuning (SFT): The model is first fine-tuned on a small labeled dataset to learn the long-form question-answering task.
- Preference Optimization: Multiple candidate answers are generated via temperature sampling and scored with the evaluation metrics; the highest- and lowest-scoring answers form preference pairs.
- Direct Preference Optimization (DPO): The model is iteratively trained to prefer the higher-scored answers, discouraging low-quality responses.
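The sampling-and-ranking step above can be sketched as follows. The equal-weight combination of metric scores is an assumption for illustration; the paper combines word-composition, semantic-similarity, and factuality scores, but the exact weighting here is hypothetical, as are the function names.

```python
def composite_score(m):
    # Hypothetical equal-weight mix of the three metric families; a real
    # pipeline would tune how these scores are combined into one ranking.
    return m["words"] + m["semantic"] + (m["comprehensiveness"] - m["hallucination"])

def build_preference_pair(candidates):
    """candidates: list of (response_text, metrics) pairs sampled from the
    model at varied temperatures. Returns (chosen, rejected) for DPO."""
    ranked = sorted(candidates, key=lambda c: composite_score(c[1]), reverse=True)
    return ranked[0][0], ranked[-1][0]

candidates = [
    ("answer A", {"words": 0.4, "semantic": 0.6, "comprehensiveness": 0.9, "hallucination": 0.1}),
    ("answer B", {"words": 0.5, "semantic": 0.5, "comprehensiveness": 0.3, "hallucination": 0.6}),
    ("answer C", {"words": 0.2, "semantic": 0.4, "comprehensiveness": 0.5, "hallucination": 0.2}),
]
chosen, rejected = build_preference_pair(candidates)  # best vs. worst candidate
```

The chosen/rejected pair then feeds directly into DPO training.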
The iterative process ensures that the model improves its response quality and factuality step-by-step, reducing hallucinations and aligning its answers with medically accurate information.
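For reference, the per-pair DPO objective that drives this preference training can be written in a few lines of plain Python. This is the standard DPO loss from Rafailov et al. (2023), not code from the paper; the argument names are illustrative, and beta is a tunable strength hyperparameter.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin
    compares policy-vs-reference log-probabilities of the chosen and
    rejected responses (each a summed token log-probability)."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy matches the reference model, the margin is 0 and the
# loss equals log(2); raising the chosen response's log-prob lowers it.
```

Minimizing this loss pushes the policy to assign relatively more probability to the chosen (higher-scored) response than the reference model does.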
Results
Zero-shot Evaluation
The authors evaluated various open-foundation biomedical LLMs, including LLaMA2, Mistral, Meditron, Self-BioRAG, and BioMistral. The results revealed inconsistent model performance, particularly in factuality: general-purpose base models such as LLaMA2 and Mistral generally scored lower on factuality than specialized biomedical models such as Meditron and BioMistral.
Iterative Learning
Analyses of the iterative learning process reveal substantial improvements in the model's factuality, eventually matching the high standard set by GPT-4. Figure 2 in the paper shows that through iterative DPO training, models such as BioMistral 7B achieve scores comparable to expert-annotated answers and GPT-4 responses.
Implications and Future Directions
The findings underscore the necessity for robust LFQA benchmarks in the biomedical domain, given the critical importance of factuality in medical responses. The MedLFQA dataset and the OLAPH framework together represent a significant step towards developing more reliable and accurate biomedical LLMs.
Future research should focus on:
- Enhanced Dataset Accuracy: Continued refinement of the MedLFQA dataset to eliminate potential inaccuracies and update outdated information.
- Scalability: Testing and validating the OLAPH framework on models of varying parameter sizes to assess its broader applicability.
- Patient-specific Conversational Agents: Extending the framework to multi-turn conversations for comprehensive patient history understanding.
Conclusion
This paper introduces and validates a novel methodology to enhance the factual accuracy of long-form medical responses generated by LLMs. The MedLFQA dataset provides a rigorous benchmark for factuality, while the OLAPH framework drives iterative improvement in response quality. These contributions offer a promising direction for the development of reliable clinical conversational agents, potentially aiding medical professionals by providing accurate, detailed, and comprehensible medical information.