Improving Health Question Answering with Reliable and Time-Aware Evidence Retrieval (2404.08359v1)

Published 12 Apr 2024 in cs.CL, cs.AI, and cs.IR

Abstract: In today's digital world, seeking answers to health questions on the Internet is a common practice. However, existing question answering (QA) systems often rely on using pre-selected and annotated evidence documents, thus making them inadequate for addressing novel questions. Our study focuses on the open-domain QA setting, where the key challenge is to first uncover relevant evidence in large knowledge bases. By utilizing the common retrieve-then-read QA pipeline and PubMed as a trustworthy collection of medical research documents, we answer health questions from three diverse datasets. We modify different retrieval settings to observe their influence on the QA pipeline's performance, including the number of retrieved documents, sentence selection process, the publication year of articles, and their number of citations. Our results reveal that cutting down on the amount of retrieved documents and favoring more recent and highly cited documents can improve the final macro F1 score up to 10%. We discuss the results, highlight interesting examples, and outline challenges for future research, like managing evidence disagreement and crafting user-friendly explanations.

Enhancing Performance of Health Question Answering Systems through Optimal Evidence Retrieval Strategies

Introduction to Health Question Answering Systems

Health question answering (QA) systems draw on large collections of medical research to answer health-related questions. Given the abundance of medical literature and the rapid evolution of clinical recommendations, sourcing the most relevant and up-to-date evidence is pivotal. Traditional QA systems, however, often fall short on novel queries because they rely on pre-selected and annotated evidence documents. This paper instead studies the open-domain QA setting, a more realistic scenario in which pertinent evidence must first be retrieved from an extensive document corpus before an answer is formulated. By exploring various retrieval settings, including the number of documents retrieved and the use of metadata such as publication year and citation count, the research aims to improve QA performance in the health domain.

The Intricacies of Open-Domain QA Systems

Open-domain QA systems, characterized by their ability to query extensive document collections, primarily consist of two components: the retriever and the reader. The retriever sources documents that potentially contain the answer, while the reader extracts and formulates the answer from the evidence the retriever provides. The paper posits that the QA system's performance hinges predominantly on the effectiveness of the retriever: the quality and relevance of the retrieved documents largely determine the accuracy of the final answer.
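To make the two-component design concrete, the following Python sketch outlines a generic retrieve-then-read pipeline. The names (Document, PubMedRetriever, ExtractiveReader, answer) are illustrative placeholders under assumed interfaces, not the authors' implementation.

from dataclasses import dataclass

@dataclass
class Document:
    pmid: str        # PubMed identifier
    text: str        # abstract or passage text
    year: int        # publication year
    citations: int   # citation count

class PubMedRetriever:
    """Hypothetical retriever over a PubMed index (e.g., BM25 or a dense index)."""
    def retrieve(self, question: str, k: int = 10) -> list[Document]:
        raise NotImplementedError  # backed by a search index in practice

class ExtractiveReader:
    """Hypothetical reader that maps (question, evidence) to an answer label."""
    def predict(self, question: str, evidence: str) -> str:
        raise NotImplementedError  # e.g., a fine-tuned transformer classifier

def answer(question: str, retriever: PubMedRetriever, reader: ExtractiveReader, k: int = 10) -> str:
    # Retrieve candidate evidence, concatenate it, and let the reader decide.
    docs = retriever.retrieve(question, k=k)
    evidence = " ".join(d.text for d in docs)
    return reader.predict(question, evidence)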

To validate this, experiments were designed around PubMed's collection of medical research documents, testing various configurations of the retrieve-then-read pipeline. These configurations included adjustments to the number of documents and sentences retrieved, as well as filters based on the publication year and citation count of the documents. The findings indicate that optimizing the retrieval strategy alone can improve the macro F1 score by up to 10%.
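As a rough illustration, the configuration sketch below captures the four settings varied in these experiments: document count, sentence count, publication year, and citation count. The field names are assumptions chosen to mirror the description in the text, and the snippet reuses the Document dataclass from the previous sketch.

from dataclasses import dataclass
from typing import Optional

@dataclass
class RetrievalConfig:
    num_documents: int = 10              # how many PubMed articles to retrieve
    num_sentences: Optional[int] = None  # keep only the top-n sentences, if set
    min_year: Optional[int] = None       # discard articles published before this year
    min_citations: Optional[int] = None  # discard rarely cited articles

def passes_filters(doc: Document, cfg: RetrievalConfig) -> bool:
    """Apply the metadata filters to a single retrieved document."""
    if cfg.min_year is not None and doc.year < cfg.min_year:
        return False
    if cfg.min_citations is not None and doc.citations < cfg.min_citations:
        return False
    return True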

Methodological Approach

The paper conducted a series of experiments to evaluate how different evidence retrieval configurations affect the health QA system's accuracy. Three health-related question datasets were used, with PubMed serving as the source of evidence. By fixing the reader component and varying the retrieval strategies, the research isolated the effect of retrieval adjustments on system performance.

Key experiments included varying the number of documents retrieved and extracting the top sentences from these documents for QA processing. The paper also examined how document quality, assessed by recency and citation count, influences QA accuracy. Precision, recall, and F1 score were used to evaluate the system's effectiveness across the different settings.
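One simple way to run such a comparison is to fix the reader, sweep over retrieval configurations, and score each run with macro-averaged precision, recall, and F1, for example with scikit-learn. The dataset format (question and gold answer pairs) and the pipeline callable below are assumptions made for illustration.

from sklearn.metrics import precision_recall_fscore_support

def evaluate(pipeline, dataset, cfg):
    """Run the retrieve-then-read pipeline on (question, gold_answer) pairs."""
    gold, predicted = [], []
    for question, gold_answer in dataset:
        predicted.append(pipeline(question, cfg))
        gold.append(gold_answer)
    precision, recall, f1, _ = precision_recall_fscore_support(
        gold, predicted, average="macro", zero_division=0
    )
    return {"precision": precision, "recall": recall, "macro_f1": f1}

# Example: compare a large evidence set against a smaller, filtered one.
# for cfg in [RetrievalConfig(num_documents=20),
#             RetrievalConfig(num_documents=5, num_sentences=3, min_year=2015)]:
#     print(cfg, evaluate(qa_pipeline, dev_set, cfg))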

Insights and Implications

The investigation revealed several key insights pertinent to the optimization of open-domain health QA systems:

  • Retrieving fewer documents tends to improve QA performance, suggesting a higher signal-to-noise ratio when the evidence set is smaller.
  • Extracting the most relevant sentences from the selected documents further refines the evidence, although the ideal number of sentences varies between datasets.
  • Favoring recent and highly cited documents as sources of evidence generally improves QA accuracy, underscoring the value of document metadata in the retrieval process; a minimal re-ranking sketch follows this list.
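The third point can be operationalized with a metadata-aware re-ranking step that blends the retriever's relevance score with recency and citation count. The weighting scheme below (a linear combination with log-scaled citations) is an illustrative choice rather than the method used in the paper, and it reuses the Document fields (year, citations) from the first sketch.

import math
from datetime import date

def rerank(scored_docs, w_rel=1.0, w_recency=0.3, w_cite=0.3, top_k=5):
    """scored_docs: (Document, relevance_score) pairs; returns the top_k documents."""
    current_year = date.today().year

    def combined(item):
        doc, relevance = item
        recency = 1.0 / (1 + max(0, current_year - doc.year))  # newer articles score closer to 1
        citations = math.log1p(doc.citations)                  # dampen very large citation counts
        return w_rel * relevance + w_recency * recency + w_cite * citations

    ranked = sorted(scored_docs, key=combined, reverse=True)
    return [doc for doc, _ in ranked[:top_k]]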

Future Directions

Building on these findings, future research could integrate evidence-strength assessment and conflict-resolution mechanisms into the QA pipeline. Models that account for the varying strength of evidence across different types of medical studies may offer a more nuanced approach to retrieval. New strategies for handling evidence disagreement and for producing user-friendly explanations could further improve the utility of health QA systems for end users.

Concluding Remarks

This paper contributes to the ongoing refinement of health question answering systems by highlighting the critical role of evidence retrieval strategies in overall performance. By systematically analyzing the document selection process and incorporating document quality metrics, the research offers valuable insights for building more accurate and reliable health QA systems. As medical research continues to evolve, so will the methods for navigating its vast literature to support health information seeking and decision-making.

Authors (2)
  1. Juraj Vladika
  2. Florian Matthes