Solving the Right Problem is Key for Translational NLP: A Case Study in UMLS Vocabulary Insertion (2311.15106v1)
Abstract: As the immense opportunities enabled by LLMs become more apparent, NLP systems will be increasingly expected to excel in real-world settings. However, in many instances, powerful models alone will not yield translational NLP solutions, especially if the formulated problem is not well aligned with the real-world task. In this work, we study the case of UMLS vocabulary insertion, an important real-world task in which hundreds of thousands of new terms, referred to as atoms, are added to the UMLS, one of the most comprehensive open-source biomedical knowledge bases. Previous work aimed to develop an automated NLP system to make this time-consuming, costly, and error-prone task more efficient. Nevertheless, practical progress in this direction has been difficult to achieve due to a problem formulation and evaluation gap between research output and the real-world task. To address this gap, we introduce a new formulation for UMLS vocabulary insertion (UVI) which mirrors the real-world task, datasets which faithfully represent it, and several strong baselines we developed by repurposing existing solutions. Additionally, we propose an effective rule-enhanced biomedical LLM which enables important new model behavior, outperforms all strong baselines, and provides measurable qualitative improvements to editors who carry out the UVI task. We hope this case study provides insight into the considerable importance of problem formulation for the success of translational NLP solutions.
- Bernal Jimenez Gutierrez
- Yuqing Mao
- Vinh Nguyen
- Kin Wah Fung
- Yu Su
- Olivier Bodenreider