Biomedical Entity Linking for Dutch: Fine-tuning a Self-alignment BERT Model on an Automatically Generated Wikipedia Corpus (2405.11941v1)

Published 20 May 2024 in cs.CL

Abstract: Biomedical entity linking, a main component in automatic information extraction from health-related texts, plays a pivotal role in connecting textual entities (such as diseases, drugs and body parts mentioned by patients) to their corresponding concepts in a structured biomedical knowledge base. The task remains challenging despite recent developments in natural language processing. This paper presents the first evaluated biomedical entity linking model for the Dutch language. We use MedRoBERTa.nl as the base model and perform second-phase pretraining through self-alignment on a Dutch biomedical ontology extracted from the UMLS and Dutch SNOMED. We derive a corpus from Wikipedia of ontology-linked Dutch biomedical entities in context and fine-tune our model on this dataset. We evaluate our model on the Dutch portion of the Mantra GSC corpus and achieve 54.7% classification accuracy and 69.8% 1-distance accuracy. We then perform a case study on a collection of unlabeled patient-support forum data and show that our model is hampered by the limited quality of the preceding entity recognition step. Manual evaluation of a small sample indicates that, of the correctly extracted entities, around 65% are linked to the correct concept in the ontology. Our results indicate that biomedical entity linking in a language other than English remains challenging, but our Dutch model can be used for high-level analysis of patient-generated text.
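The abstract describes a SapBERT-style approach: mentions and ontology concept names are embedded with the same self-aligned encoder, and a mention is linked to the nearest concept in embedding space. The sketch below illustrates only that inference step under stated assumptions; the Hugging Face model identifier, the toy three-concept ontology, and the example mention are illustrative placeholders, not the paper's actual resources or code.

```python
# Minimal sketch of bi-encoder entity linking by nearest-neighbour search.
# Assumption: "CLTL/MedRoBERTa.nl" is the Hugging Face ID of the base model;
# the ontology below is a toy stand-in for the Dutch UMLS/SNOMED subset.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "CLTL/MedRoBERTa.nl"  # assumed model ID, for illustration only
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

# Toy ontology: (concept ID, preferred Dutch term). In practice each concept
# would carry many synonyms from the UMLS and Dutch SNOMED.
ontology = [
    ("C0018681", "hoofdpijn"),      # headache
    ("C0027497", "misselijkheid"),  # nausea
    ("C0015967", "koorts"),         # fever
]

@torch.no_grad()
def embed(texts):
    """Encode a list of strings into L2-normalised [CLS] embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    cls = model(**batch).last_hidden_state[:, 0, :]
    return torch.nn.functional.normalize(cls, dim=-1)

concept_ids = [cid for cid, _ in ontology]
concept_vecs = embed([name for _, name in ontology])

def link(mention: str) -> str:
    """Return the concept ID whose name embedding is closest to the mention."""
    scores = embed([mention]) @ concept_vecs.T  # cosine similarity, since vectors are normalised
    return concept_ids[int(scores.argmax())]

print(link("last van hoofdpijn"))  # expected: C0018681
```

In a full pipeline, an upstream entity recognition step would supply the mention spans; as the case study in the abstract notes, the quality of that step strongly affects the linking results.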

Authors (4)
  1. Fons Hartendorp
  2. Tom Seinen
  3. Erik van Mulligen
  4. Suzan Verberne