MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts (1902.09476v1)

Published 25 Feb 2019 in cs.CL and cs.LG

Abstract: This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.

Authors (2)

Sunil Mohan
Donghui Li

Summary

The paper introduces MedMentions, a comprehensive corpus of over 4,000 PubMed abstracts annotated with 350K+ UMLS-linked biomedical mentions.
Annotation quality was checked carefully: the paper reports a 97.3% agreement rate, which supports the corpus's use for biomedical named entity recognition and linking.
Baseline evaluations using TaggerOne yielded mention-level F1 of 0.453 and document-level F1 of 0.548, establishing a benchmark for future research.

MedMentions: Comprehensive Biomedical Corpus Annotated with UMLS Concepts

The paper "MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts" by Sunil Mohan and Donghui Li introduces a resource intended to advance Biomedical Named Entity Recognition (NER) and Linking. MedMentions is a large, manually annotated corpus for recognizing biomedical concepts across many biomedical disciplines, with each mention linked to a concept in the Unified Medical Language System (UMLS) and its hierarchy of semantic types.

Corpus Characteristics and Contributions

MedMentions stands out for both its scope and its granularity. It comprises over 4,000 PubMed abstracts containing more than 350,000 mentions, each linked to a concept in the UMLS 2017 ontology of approximately 3.2 million concepts. Beyond entity recognition, the corpus is meant to support document retrieval, which matters more as the biomedical literature grows in volume and complexity.

The full corpus draws on the 127 UMLS semantic types, which are organized in a hierarchy. A sub-corpus, "MedMentions ST21pv" (21 semantic types from preferred vocabularies), restricts the annotations to a subset of UMLS targeted at semantic indexing and document retrieval, focusing on concepts of particular relevance to biomedical research. Compared with earlier annotated corpora that focus on a narrow set of entity types, MedMentions is broader both in the entity types it covers and in corpus size.
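To make the idea of a type-restricted sub-corpus concrete, here is a minimal sketch of how annotations could be filtered against a semantic-type hierarchy. It is not the authors' procedure: the type names, the `PARENT` map, the `SELECTED` set, and both helper functions are hypothetical, and the real UMLS hierarchy and the actual 21 types are not reproduced here.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy semantic-type hierarchy, child -> parent (None marks the root).
# All type names are hypothetical placeholders, not real UMLS type IDs.
PARENT: Dict[str, Optional[str]] = {
    "T_root": None,
    "T_disorder": "T_root",
    "T_disease": "T_disorder",
    "T_chemical": "T_root",
    "T_drug": "T_chemical",
}

# Hypothetical set of types retained in the sub-corpus.
SELECTED: Set[str] = {"T_disorder", "T_drug"}

# An annotation is (mention text, concept ID, semantic type).
Annotation = Tuple[str, str, str]


def nearest_selected(sem_type: str,
                     parent: Dict[str, Optional[str]],
                     selected: Set[str]) -> Optional[str]:
    """Walk up the hierarchy and return the first selected type, if any."""
    current: Optional[str] = sem_type
    while current is not None:
        if current in selected:
            return current
        current = parent.get(current)
    return None


def filter_annotations(annotations: List[Annotation],
                       parent: Dict[str, Optional[str]],
                       selected: Set[str]) -> List[Annotation]:
    """Keep annotations whose type is, or descends from, a selected type."""
    kept: List[Annotation] = []
    for mention, concept, sem_type in annotations:
        target = nearest_selected(sem_type, parent, selected)
        if target is not None:
            kept.append((mention, concept, target))
    return kept


annotations = [
    ("asthma", "C001", "T_disease"),    # descends from T_disorder: kept
    ("aspirin", "C002", "T_drug"),      # selected directly: kept
    ("solvent", "C003", "T_chemical"),  # no selected ancestor: dropped
]
subset = filter_annotations(annotations, PARENT, SELECTED)
print(subset)
```

Relabeling a kept annotation with its nearest selected ancestor is one design choice among several; a filter could equally keep the original fine-grained type.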

Methodology and Model Evaluation

The release is motivated by the need for large, richly annotated datasets for training modern machine learning models, particularly models that must handle concepts unseen during training (zero-shot learning). The annotation was done with close attention to precision: the paper reports an agreement rate of 97.3% in its quality evaluation, which supports the dataset's reliability for research use.

As a baseline, the paper reports metrics for joint entity recognition and linking using TaggerOne, an established semi-Markov model. The model is evaluated at the mention level and at the document level using precision, recall, and F1 score. The results are modest: F1 is 0.453 at the mention level and 0.548 at the document level, which leaves substantial room for future methods to improve.
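The difference between the two evaluation levels can be illustrated with a short sketch. It assumes, as a simplification, that a mention-level match requires the same document, span, and concept, and that the document level compares the set of concepts found per document. The paper's exact matching criteria may differ, and all identifiers and data below are made up.

```python
from typing import Iterable, Set, Tuple

# A mention is (doc_id, start, end, concept_id); the values are illustrative.
Mention = Tuple[str, int, int, str]


def prf(gold: Set, pred: Set) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over two sets of hashable items."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mention_level(gold: Iterable[Mention], pred: Iterable[Mention]):
    """Assumed criterion: document, span, and concept must all match."""
    return prf(set(gold), set(pred))


def document_level(gold: Iterable[Mention], pred: Iterable[Mention]):
    """Assumed criterion: compare the (doc_id, concept_id) pairs only."""
    gold_pairs = {(d, c) for d, _, _, c in gold}
    pred_pairs = {(d, c) for d, _, _, c in pred}
    return prf(gold_pairs, pred_pairs)


gold = {
    ("doc1", 0, 7, "C001"),
    ("doc1", 20, 27, "C001"),
    ("doc1", 40, 48, "C002"),
}
pred = {
    ("doc1", 0, 7, "C001"),    # exact match
    ("doc1", 21, 27, "C001"),  # right concept, wrong span boundary
    ("doc1", 40, 48, "C003"),  # right span, wrong concept
}

print("mention-level  P/R/F1:", mention_level(gold, pred))
print("document-level P/R/F1:", document_level(gold, pred))
```

In this toy example the document-level score is higher than the mention-level score because a boundary error on a repeated concept is not penalized once the concept is found anywhere in the document, which is consistent with the direction of the gap in the reported baseline.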

Implications and Research Opportunities

The release of MedMentions is intended to spur further work on recognizing, linking, and retrieving biomedical entities. The corpus's scale and its detailed UMLS annotations invite new algorithmic approaches, particularly in zero-shot learning and multi-type entity extraction. Its broad ontology and its annotated sub-corpus serve research needs ranging from basic concept recognition to more complex downstream tasks such as relationship extraction.

MedMentions opens several directions for future research. One is improving the ability of entity linking algorithms to handle unseen labels, a challenge that is more pronounced in this domain because of the very large vocabulary covered by UMLS. The corpus also offers a testbed for studying how NLP systems scale to the larger and harder problems typical of biomedical data science.
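One simple way to quantify the unseen-label problem for a given train/test split is to measure how many test concepts never occur in training. The sketch below is a hypothetical illustration: the function name, the `(mention, concept_id)` record layout, and the toy data are assumptions, not part of the MedMentions release.

```python
from typing import Dict, List, Tuple

# Each record is (mention text, concept ID); the data below is invented.
Record = Tuple[str, str]


def unseen_stats(train: List[Record], test: List[Record]) -> Dict[str, float]:
    """Fractions of test concepts and test mentions absent from training."""
    train_concepts = {concept for _, concept in train}
    test_concepts = {concept for _, concept in test}
    unseen_concepts = test_concepts - train_concepts
    unseen_mentions = sum(1 for _, c in test if c not in train_concepts)
    return {
        "unseen_concept_fraction":
            len(unseen_concepts) / len(test_concepts) if test_concepts else 0.0,
        "unseen_mention_fraction":
            unseen_mentions / len(test) if test else 0.0,
    }


train = [("asthma", "C001"), ("aspirin", "C002"), ("asthma attack", "C001")]
test = [("asthma", "C001"), ("ibuprofen", "C003"),
        ("naproxen", "C004"), ("aspirin", "C002")]

stats = unseen_stats(train, test)
print(stats)
```

A linker that can only predict labels seen in training is capped at a recall of one minus the unseen mention fraction, which is why a large ontology makes zero-shot capability important.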

In conclusion, MedMentions is a substantial contribution to biomedical NLP: it addresses limitations of earlier corpora by offering a broader, more complex dataset suited to contemporary machine learning approaches. Its data splits and baseline give researchers a common benchmark for entity recognition and linking, and the corpus may in turn support the development of biomedical information systems that aid scientific discovery and clinical insight.

GitHub

chanzuckerberg/MedMentions: A corpus of Biomedical papers annotated with mentions of UMLS entities.