KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques (2403.05881v3)
Abstract: Large language models (LLMs) have demonstrated impressive generative capabilities and hold the potential to innovate in medicine. However, applying LLMs in real clinical settings remains challenging because the generated content often lacks factual consistency. In this work, we develop an augmented LLM framework, KG-Rank, which leverages a medical knowledge graph (KG) together with ranking and re-ranking techniques to improve the factuality of long-form question answering (QA) in the medical domain. Specifically, upon receiving a question, KG-Rank automatically identifies the medical entities it contains and retrieves the related triples from the medical KG to gather factual information. KG-Rank then applies multiple ranking techniques to refine the ordering of these triples, providing more relevant and precise information for LLM inference. To the best of our knowledge, KG-Rank is the first application of a KG combined with ranking models in medical QA specifically for generating long answers. Evaluation on four selected medical QA datasets shows that KG-Rank achieves an improvement of over 18% in ROUGE-L score. We further extend KG-Rank to open domains, including law, business, music, and history, where it achieves a 14% improvement in ROUGE-L score, indicating the effectiveness and broad potential of KG-Rank.
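The retrieve-then-rank pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the bag-of-words cosine scorer stands in for the paper's ranking and re-ranking models, and `rank_triples` and `build_prompt` are hypothetical helper names.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_triples(question: str, triples: list[tuple[str, str, str]],
                 top_k: int = 3) -> list[tuple[str, str, str]]:
    """Score each retrieved KG triple against the question and keep the
    top_k most relevant ones (a simple stand-in for the ranking step)."""
    q_vec = Counter(question.lower().split())
    scored = []
    for triple in triples:
        t_vec = Counter(" ".join(triple).lower().split())
        scored.append((cosine(q_vec, t_vec), triple))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [triple for _, triple in scored[:top_k]]

def build_prompt(question: str,
                 ranked: list[tuple[str, str, str]]) -> str:
    """Assemble the ranked triples and the question into an LLM prompt."""
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in ranked)
    return f"Use these medical facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

# Toy example: triples mentioning the question's entity rank highest.
question = "what are the side effects of metformin"
triples = [
    ("metformin", "has_side_effect", "nausea"),
    ("aspirin", "treats", "pain"),
    ("metformin", "treats", "type 2 diabetes"),
]
ranked = rank_triples(question, triples, top_k=2)
prompt = build_prompt(question, ranked)
```

In the actual framework, entity identification, KG retrieval, and the ranking models are each substantially more sophisticated; the sketch only shows how ranked triples are selected and placed into the prompt before LLM inference.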