
Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping (2312.06457v1)

Published 11 Dec 2023 in cs.AI, cs.CL, and cs.IR

Abstract: Identifying disease phenotypes from electronic health records (EHRs) is critical for numerous secondary uses. Manually encoding physician knowledge into rules is particularly challenging for rare diseases due to inadequate EHR coding, necessitating review of clinical notes. Large language models (LLMs) offer promise in text understanding but may not efficiently handle real-world clinical documentation. We propose a zero-shot LLM-based method enriched by retrieval-augmented generation and MapReduce, which pre-identifies disease-related text snippets to be used in parallel as queries for the LLM to establish diagnosis. We show that this method, as applied to pulmonary hypertension (PH), a rare disease characterized by elevated arterial pressures in the lungs, significantly outperforms physician logic rules ($F_1$ score of 0.75 vs. 0.62). This method has the potential to enhance rare disease cohort identification, expanding the scope of robust clinical research and care gap identification.
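
The pipeline outlined in the abstract (retrieve disease-related snippets, query the LLM on each snippet in parallel, then combine the answers) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration only: the keyword list, the character window, the `stub_llm` placeholder (a simple rule standing in for a real model call), and the vote-threshold reduce step are hypothetical and are not taken from the paper.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical retrieval terms; the paper's actual retrieval queries are not given here.
KEYWORDS = ("pulmonary hypertension", "pulmonary artery pressure")


def retrieve_snippets(note, keywords=KEYWORDS, window=80):
    """Retrieval step: return a text window around every keyword hit in a note."""
    lower = note.lower()
    snippets = []
    for kw in keywords:
        for match in re.finditer(re.escape(kw), lower):
            start = max(0, match.start() - window)
            end = min(len(note), match.end() + window)
            snippets.append(note[start:end])
    return snippets


def stub_llm(snippet):
    """Placeholder for a zero-shot LLM query (assumption, not a real model).

    Returns True when the snippet mentions the disease without a simple negation.
    """
    text = snippet.lower()
    negated = re.search(
        r"\b(no|without|denies|ruled out)\b[^.]*pulmonary hypertension", text
    )
    return "pulmonary hypertension" in text and negated is None


def map_reduce_diagnosis(notes, llm=stub_llm, min_positive=1):
    """Map: query the model on each snippet in parallel. Reduce: threshold the votes."""
    snippets = [s for note in notes for s in retrieve_snippets(note)]
    if not snippets:
        return False  # nothing disease-related was retrieved
    with ThreadPoolExecutor(max_workers=4) as pool:
        votes = list(pool.map(llm, snippets))
    return sum(votes) >= min_positive


if __name__ == "__main__":
    notes = [
        "Echo shows elevated pressures consistent with pulmonary hypertension.",
        "Follow-up visit for knee pain.",
    ]
    print(map_reduce_diagnosis(notes))
```

In a real system the `llm` argument would wrap a model call, and the reduce step could itself be another model query over the per-snippet answers rather than a fixed threshold; the threshold is used here only to keep the sketch self-contained.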

Authors (12)
  1. Will E. Thompson
  2. David M. Vidmar
  3. Jessica K. De Freitas
  4. John M. Pfeifer
  5. Brandon K. Fornwalt
  6. Ruijun Chen
  7. Gabriel Altay
  8. Kabir Manghnani
  9. Andrew C. Nelsen
  10. Kellie Morland
  11. Martin C. Stumpe
  12. Riccardo Miotto