
BELHD: Improving Biomedical Entity Linking with Homonym Disambiguation (2401.05125v1)

Published 10 Jan 2024 in cs.CL

Abstract: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). Name-based methods are a popular approach to the task: they identify the most appropriate name in the KB for a given mention, either via dense retrieval or autoregressive modeling. However, because these methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance, especially for KBs where homonyms account for a large share of entity mentions (e.g. UMLS and NCBI Gene). We therefore present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. Specifically, BELHD builds upon the BioSyn model (Sung et al., 2020), introducing two crucial extensions. First, it preprocesses the KB, expanding each homonym with an automatically chosen disambiguating string and thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy for selecting candidates for contrastive learning that enhances the overall training signal. Experiments with 10 corpora and five entity types show that BELHD improves upon state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the core prediction model and thus can also improve other methods, which we exemplify for GenBioEL (Yuan et al., 2022), a generative name-based BEL approach. Code is available at: link added upon publication.
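
The KB preprocessing step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the disambiguating string is chosen, so this example assumes each KB entry already carries a hypothetical `hint` field (for instance a species name for genes). The entity identifiers, the toy KB, and the function name `expand_homonyms` are illustrative assumptions, not taken from the paper or its code.

```python
from collections import defaultdict


def expand_homonyms(kb):
    """Append a disambiguating string to every name shared by several entities.

    kb: list of (entity_id, name, hint) triples. The `hint` field is a
    hypothetical stand-in for the automatically chosen disambiguating string.
    Returns a list of (entity_id, name) pairs in the original order.
    """
    # Collect the set of entities behind each name.
    ids_by_name = defaultdict(set)
    for entity_id, name, _hint in kb:
        ids_by_name[name].add(entity_id)

    expanded = []
    for entity_id, name, hint in kb:
        # A name is a homonym when more than one entity uses it.
        if len(ids_by_name[name]) > 1:
            name = f"{name} ({hint})"
        expanded.append((entity_id, name))
    return expanded


# Toy KB with made-up identifiers: two entities share the name "TP53".
kb = [
    ("gene:1", "TP53", "human"),
    ("gene:2", "TP53", "pig"),
    ("gene:3", "BRCA1", "human"),
]
expanded = expand_homonyms(kb)
for entity_id, name in expanded:
    print(entity_id, name)
```

After this step every name maps to exactly one entity, so a name-based model that returns a KB name also returns an unambiguous entity. Names that were already unique, such as `BRCA1` in the toy KB, are left unchanged.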

References (47)
  1. Entity linking via explicit mention-mention coreference modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4644–4658, Seattle, United States. Association for Computational Linguistics.
  2. Bio-ID track overview. In BioCreative VI Challenge Evaluation Workshop, volume 482, page 376.
  3. Amos Bairoch. 2018. The Cellosaurus, a cell-line knowledge resource. Journal of biomolecular techniques: JBT, 29(2):25.
  4. Longformer: The long-document transformer. arXiv:2004.05150.
  5. O. Bodenreider. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32:D267–D270.
  6. Gene: a gene-centered information resource at NCBI. Nucleic Acids Research, 43(D1):D36–D42.
  7. A lightweight neural model for biomedical entity linking. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 12657–12665.
  8. Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Research, 51:D1257–D1262.
  9. Highly parallel autoregressive entity linking with discriminative correction. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.
  10. Autoregressive entity retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  11. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  12. NCBI disease corpus: A resource for disease name recognition and concept normalization. Journal of Biomedical Informatics, 47:1–10.
  13. BELB: a biomedical entity linking benchmark. Bioinformatics.
  14. Linnaeus: a species name identification system for biomedical literature. BMC bioinformatics, 11(1):1–17.
  15. Learning dense representations for entity retrieval. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL).
  16. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
  17. NLM-Chem-BC7: manually annotated full-text resources for chemical entity annotation and indexing in biomedical articles. Database, 2022.
  18. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. Journal of biomedical informatics, 118:103779.
  19. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547.
  20. A comprehensive evaluation of biomedical entity linking models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14462–14478, Singapore. Association for Computational Linguistics.
  21. Bert might be overkill: A tiny but effective biomedical entity linker based on residual convolutional neural networks. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1631–1639, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  22. Contrastive representation learning: A framework and review. IEEE Access, 8:193907–193934.
  23. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics, 29:2909–2917.
  24. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database, 2016(baw068).
  25. A comparative study of pretrained language models for long clinical text. Journal of the American Medical Informatics Association, 30:340–347.
  26. Self-alignment pretraining for biomedical entity representations. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4228–4238, Online. Association for Computational Linguistics.
  27. Assigning species information to corresponding genes by a sequence labeling framework. Database, 2022.
  28. S1000: a better taxonomic name corpus for biomedical information extraction. Bioinformatics, 39.
  29. Andrés Marzal and Enrique Vidal. 1993. Computation of normalized edit distance and applications. IEEE Trans. Pattern Anal. Mach. Intell., 15:926–932.
  30. Marcel Milich and Alan Akbik. 2023. Zelda: A comprehensive benchmark for supervised entity disambiguation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2061–2072, Dubrovnik, Croatia. Association for Computational Linguistics.
  31. Sunil Mohan and Donghui Li. 2019. MedMentions: A large biomedical corpus annotated with UMLS concepts. In Proceedings of the 2019 Conference on Automated Knowledge Base Construction (AKBC 2019).
  32. The SPECIES and ORGANISMS resources for fast and accurate identification of taxonomic names in text. PLOS ONE, 8(6):e65390.
  33. Entity disambiguation with entity definitions. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 1297–1303, Dubrovnik, Croatia. Association for Computational Linguistics.
  34. GERBIL – benchmarking named entity recognition and linking consistently. Semantic Web, 9:605–625.
  35. On the surprising effectiveness of name matching alone in autoregressive entity linking. In Proceedings of the First Workshop on Matching From Unstructured and Structured Data (MATCHING 2023), pages 58–69, Toronto, ON, Canada. Association for Computational Linguistics.
  36. Scott Federhen. 2012. The NCBI Taxonomy database. Nucleic Acids Research, 40:D136–D143.
  37. Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics, 9.
  38. Biomedical entity representations with synonym marginalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3641–3650, Online. Association for Computational Linguistics.
  39. Cross-domain data integration for named entity disambiguation in biomedical text. Findings of the Association for Computational Linguistics: EMNLP 2021.
  40. Beeds: Large-scale biomedical event extraction using distant supervision and question answering. In Proceedings of the 21st Workshop on Biomedical Language Processing, pages 298–309, Dublin, Ireland. Association for Computational Linguistics.
  41. Chih-Hsuan Wei and Hung-Yu Kao. 2011. Cross-species gene normalization by species inference. BMC Bioinformatics, 12.
  42. SR4GN: A species recognition software tool for gene normalization. PLOS ONE, 7:e38460.
  43. GNormPlus: An integrative approach for tagging genes, gene families, and protein domains. BioMed Research International, 2015:e918710.
  44. GNorm2: an improved gene name recognition and normalization system. Bioinformatics, 39.
  45. Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6397–6407, Online. Association for Computational Linguistics.
  46. Generative biomedical entity linking via knowledge base-guided pre-training and synonyms-aware fine-tuning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4038–4048, Seattle, United States. Association for Computational Linguistics.
  47. Knowledge-rich self-supervision for biomedical entity linking. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 868–880.
Authors (2)
  1. Samuele Garda (5 papers)
  2. Ulf Leser (42 papers)
Citations (2)