SNOBERT: A Benchmark for clinical notes entity linking in the SNOMED CT clinical terminology (2405.16115v1)
Abstract: The extraction and analysis of insights from medical data, primarily stored in free-text formats by healthcare workers, present significant challenges due to the data's unstructured nature. Medical coding, a crucial process in healthcare, remains minimally automated due to the complexity of medical ontologies and restricted access to medical texts for training Natural Language Processing models. In this paper, we propose a method, "SNOBERT," for linking text spans in clinical notes to specific concepts in SNOMED CT using BERT-based models. The method consists of two stages: candidate selection and candidate matching. The models were trained on one of the largest publicly available datasets of labeled clinical notes. SNOBERT outperforms other classical methods based on deep learning, as confirmed by the results of a challenge in which it was applied.
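The two-stage structure named in the abstract (candidate selection, then candidate matching) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract states that SNOBERT uses BERT-based models, whereas the sketch below substitutes a fixed lexicon lookup for span selection and character-trigram cosine similarity for matching, purely so that it runs with the standard library. The names `TERMINOLOGY`, `SURFACE_FORMS`, `select_candidates`, `match_candidate`, and `link_entities` are hypothetical, and the concept IDs are placeholders rather than real SNOMED CT codes.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical mini-terminology: placeholder IDs, NOT real SNOMED CT codes.
TERMINOLOGY = {
    "CONCEPT_A": "myocardial infarction",
    "CONCEPT_B": "hypertensive disorder",
    "CONCEPT_C": "diabetes mellitus",
}

# Stage 1 stand-in: a fixed list of surface forms marks candidate spans.
# (Assumption: a lexicon lookup replaces the BERT-based span detector.)
SURFACE_FORMS = ["infarction", "hypertension", "diabetes"]


def select_candidates(note):
    """Stage 1: return (start, end, text) for each candidate span in the note."""
    spans = []
    for form in SURFACE_FORMS:
        for m in re.finditer(re.escape(form), note, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), m.group(0)))
    return sorted(spans)


def trigrams(text):
    """Count character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def match_candidate(span_text):
    """Stage 2: pick the concept whose name is most similar to the span."""
    span_vec = trigrams(span_text)
    return max(TERMINOLOGY, key=lambda cid: cosine(span_vec, trigrams(TERMINOLOGY[cid])))


def link_entities(note):
    """Run both stages: character offsets plus the linked concept ID per span."""
    return [(start, end, match_candidate(text))
            for start, end, text in select_candidates(note)]


print(link_entities("Patient with hypertension and diabetes."))
```

Keeping the two stages as separate functions mirrors the split described in the abstract: either stand-in could be swapped for a learned model without changing the other.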
Authors:
- Mikhail Kulyabin
- Gleb Sokolov
- Aleksandr Galaida
- Andreas Maier
- Tomas Arias-Vergara