K-QA: A Real-World Medical Q&A Benchmark (2401.14493v1)
Abstract: Ensuring the accuracy of responses provided by LLMs is crucial, particularly in clinical settings where incorrect information may directly impact patient health. To address this challenge, we construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health (an AI-driven clinical platform). We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements. Additionally, we formulate two NLI-based evaluation metrics approximating recall and precision: (1) comprehensiveness, measuring the percentage of essential clinical information covered by the generated answer, and (2) hallucination rate, measuring the number of statements from the physician-curated response contradicted by the LLM answer. Finally, we use K-QA along with these metrics to evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes developed by the authors. Our findings indicate that in-context learning improves the comprehensiveness of the models, and augmented retrieval is effective in reducing hallucinations. We make K-QA available to the community to spur research into medically accurate NLP applications.
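The two metrics described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical `nli(premise, hypothesis)` classifier returning one of `"entail"`, `"contradict"`, or `"neutral"`, and the function names are ours.

```python
from typing import Callable, List

NLILabel = str  # assumed labels: "entail" | "contradict" | "neutral"

def comprehensiveness(must_have: List[str], answer: str,
                      nli: Callable[[str, str], NLILabel]) -> float:
    """Recall-like score: fraction of essential physician-curated
    statements that the generated answer entails."""
    if not must_have:
        return 1.0
    entailed = sum(1 for s in must_have if nli(answer, s) == "entail")
    return entailed / len(must_have)

def hallucination_count(statements: List[str], answer: str,
                        nli: Callable[[str, str], NLILabel]) -> int:
    """Precision-like score: number of physician-curated statements
    that the generated answer contradicts."""
    return sum(1 for s in statements if nli(answer, s) == "contradict")
```

In practice the `nli` predicate would be an entailment model run over each decomposed statement against the full LLM answer; the decomposition into self-contained statements is what makes statement-level NLI checks meaningful.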