Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit (2010.01165v2)
Abstract: Electronic health records (EHRs) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT), which provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary, including UMLS and SNOMED-CT; b) a feature-rich annotation interface for customising and training IE models; and c) integrations with the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1: 0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at three large London hospitals, with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning on ~6K clinician-annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets, and concept types, indicating cross-domain, EHR-agnostic utility for accelerated clinical and research use cases.
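To make the concept-extraction task concrete, the following is a deliberately minimal, illustrative sketch of vocabulary-based concept linking in plain Python. It is not MedCAT's actual self-supervised algorithm (which handles disambiguation, spelling variation, and context), and the mini-vocabulary and function names here are hypothetical; only the UMLS CUI format is real.

```python
# Toy illustration of linking text spans to a concept vocabulary
# (e.g. UMLS/SNOMED-CT). NOT MedCAT's actual algorithm -- just the
# simplest possible dictionary lookup, for intuition.

# Hypothetical mini-vocabulary: surface form -> UMLS concept identifier (CUI).
VOCAB = {
    "heart attack": "C0027051",           # synonym of myocardial infarction
    "myocardial infarction": "C0027051",  # same concept, different surface form
    "diabetes": "C0011849",
}

def extract_concepts(text: str) -> list[tuple[str, str]]:
    """Return (matched term, concept id) pairs found in `text`."""
    found = []
    lowered = text.lower()
    # Try longer terms first so multi-word terms win over substrings.
    for term in sorted(VOCAB, key=len, reverse=True):
        if term in lowered:
            found.append((term, VOCAB[term]))
    return found

print(extract_concepts("Patient admitted with heart attack and diabetes."))
# -> [('heart attack', 'C0027051'), ('diabetes', 'C0011849')]
```

Note how two surface forms map to one CUI: a real linker must additionally disambiguate ambiguous terms and learn synonyms from unlabelled text, which is where MedCAT's self-supervised training comes in.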
- Zeljko Kraljevic
- Thomas Searle
- Anthony Shek
- Lukasz Roguski
- Kawsar Noor
- Daniel Bean
- Aurelie Mascio
- Leilei Zhu
- Amos A Folarin
- Angus Roberts
- Rebecca Bendayan
- Mark P Richardson
- Robert Stewart
- Wai Keong Wong
- Zina Ibrahim
- Richard JB Dobson
- Anoop D Shah
- James T Teo