A Hybrid Framework with Large Language Models for Rare Disease Phenotyping (2405.10440v3)
Abstract: Rare diseases pose significant challenges in diagnosis and treatment due to their low prevalence and heterogeneous clinical presentations. Unstructured clinical notes contain valuable information for identifying rare diseases, but manual curation is time-consuming and prone to subjectivity. This study aims to develop a hybrid approach combining dictionary-based NLP tools with LLMs to improve rare disease identification from unstructured clinical reports. We propose a novel hybrid framework that integrates the Orphanet Rare Disease Ontology (ORDO) and the Unified Medical Language System (UMLS) to create a comprehensive rare disease vocabulary. The proposed hybrid approach demonstrates superior performance compared to traditional NLP systems and standalone LLMs. Notably, the approach uncovers a significant number of potential rare disease cases not documented in structured diagnostic records, highlighting its ability to identify previously unrecognized patients.
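The two-stage idea in the abstract (dictionary-based candidate extraction followed by LLM validation) can be illustrated with a minimal sketch. This is not the authors' implementation. The vocabulary, the placeholder concept identifiers, the function names, and the rule-based `stub_validator` standing in for an LLM call are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List, NamedTuple


class Mention(NamedTuple):
    term: str        # vocabulary term that matched (lowercase)
    concept_id: str  # identifier attached to the term
    context: str     # text window around the match


# Toy vocabulary with lowercase keys. The identifiers are placeholders,
# NOT real ORDO or UMLS codes; a real system would load them from the ontologies.
TOY_VOCAB: Dict[str, str] = {
    "cystic fibrosis": "PLACEHOLDER:0001",
    "huntington disease": "PLACEHOLDER:0002",
}


def find_candidates(note: str, vocab: Dict[str, str], window: int = 40) -> List[Mention]:
    """Dictionary step: case-insensitive whole-phrase matching against the vocabulary."""
    mentions = []
    for term, concept_id in vocab.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, note, flags=re.IGNORECASE):
            start = max(0, m.start() - window)
            end = min(len(note), m.end() + window)
            mentions.append(Mention(term, concept_id, note[start:end]))
    return mentions


def stub_validator(mention: Mention) -> bool:
    """Hypothetical stand-in for an LLM judgment: reject a mention when a
    simple negation cue appears in the context before the term."""
    before = mention.context.lower().split(mention.term)[0]
    return re.search(r"\b(no|denies|ruled out|negative for)\b", before) is None


def hybrid_phenotype(
    note: str,
    vocab: Dict[str, str],
    validator: Callable[[Mention], bool],
) -> List[Mention]:
    """Keep only the dictionary candidates that the validator accepts."""
    return [m for m in find_candidates(note, vocab) if validator(m)]


note = "Patient with cystic fibrosis. No family history of Huntington disease."
confirmed = hybrid_phenotype(note, TOY_VOCAB, stub_validator)
print([m.term for m in confirmed])
```

Passing the validator as a callable keeps the dictionary step independent of the model, so the stub could be replaced by a function that prompts an LLM with `mention.context` and parses its yes/no answer.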
- Jinge Wu
- Hang Dong
- Zexi Li
- Arijit Patra
- Honghan Wu
- Haowei Wang
- Runci Li
- Chengliang Dai
- Waqar Ali
- Phil Scordis