Evaluation of Word Embeddings in Biomedical NLP Tasks
The paper "A Comparison of Word Embeddings for the Biomedical Natural Language Processing" provides an empirical evaluation of word embeddings derived from different textual sources for biomedical NLP. It compares embeddings trained on four distinct corpora (clinical notes, biomedical publications, Wikipedia, and news) and evaluates them both qualitatively and quantitatively, through measurements of semantic similarity and through their impact on downstream biomedical NLP tasks.
Methods and Resources
The paper utilizes embeddings from four data sources:
- Clinical Notes (EHR): Records spanning 15 years from Mayo Clinic with a vocabulary size of 103k.
- Biomedical Publications (MedLit): A large subset from PubMed Central (PMC) with over 1 million articles.
- Wikipedia and News (GloVe and Google News): Publicly available pre-trained embeddings.
The EHR and MedLit embeddings were trained with the word2vec skip-gram model, with hyperparameters chosen according to task-specific requirements. Intrinsic evaluation used four datasets of medically annotated term pairs to measure semantic similarity: Pedersen, Hliaoutakis, MayoSRS, and UMNSRS. Extrinsic evaluation covered a range of downstream NLP tasks, including information extraction (IE), information retrieval (IR), and relation extraction (RE), to test the impact of the embeddings across applications.
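The intrinsic evaluation pairs a model's cosine similarity for each term pair against expert ratings, then measures agreement with Spearman rank correlation. A minimal sketch of that procedure, using invented two-dimensional toy vectors and made-up term pairs (not the paper's data or scores):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank(xs):
    """1-based ranks with ties averaged, as Spearman requires."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = math.sqrt(sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb))
    return num / den

# Illustrative embeddings and expert-rated term pairs (hypothetical values).
emb = {
    "myocardial infarction": [0.90, 0.10],
    "heart attack": [0.85, 0.15],
    "renal failure": [0.30, 0.70],
    "diabetes": [0.10, 0.95],
}
pairs = [
    ("myocardial infarction", "heart attack", 4.0),   # expert similarity score
    ("myocardial infarction", "renal failure", 2.0),
    ("myocardial infarction", "diabetes", 1.0),
]
model_scores = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
expert_scores = [s for _, _, s in pairs]
rho = spearman(model_scores, expert_scores)  # 1.0 here: orderings agree exactly
```

In the paper's setting, a higher rho for EHR-trained vectors than for GloVe or Google News vectors is what "stronger correlation with expert-derived scores" means.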
Key Findings
- Semantic Understanding: Embeddings trained on clinical and biomedical corpora (EHR and MedLit) aligned more closely with human expert judgments of semantic similarity than general-domain embeddings such as GloVe and Google News. In intrinsic evaluations, EHR embeddings consistently showed the strongest correlation with expert-derived scores across all four datasets.
- Task Performance:
- In IE tasks, particularly the i2b2 2006 smoking status extraction shared task, EHR embeddings achieved the highest F1 score of 0.900. Notably, Google News embeddings also performed competitively, indicating that carefully selected general-corpus embeddings can sometimes suffice for specific biomedical tasks.
- For IR tasks, embedding-based query expansion did not outperform traditional expansion methods, and domain-specific embeddings showed no significant improvement over general-domain embeddings.
- In the DDIExtraction 2013 challenge for RE tasks, general domain embeddings (Google News) showed competitive performance against medically specific embeddings. This suggests that general linguistic context captured by such embeddings is valuable for text with general sentence structure but specific scientific content.
- Cross-Domain Application: The paper highlights that word embeddings from non-domain-specific resources can still be relevant for certain biomedical NLP tasks. This is particularly useful when access to domain-specific data is limited.
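The query expansion tried in the IR experiments typically works by appending each query term's nearest neighbors in embedding space to the query before retrieval. A hedged sketch of that idea, with invented toy vectors (the vocabulary and vectors are illustrative, not taken from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def expand_query(query_terms, embeddings, k=1):
    """Append the k nearest neighbors (by cosine) of each in-vocabulary query term."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in embeddings:
            continue  # out-of-vocabulary terms are left unexpanded
        neighbors = sorted(
            (cand for cand in embeddings if cand != term),
            key=lambda cand: cosine(embeddings[term], embeddings[cand]),
            reverse=True,
        )
        expanded.extend(neighbors[:k])
    return expanded

# Toy vectors: "renal" is deliberately the closest neighbor of "kidney".
emb = {
    "kidney": [1.00, 0.00],
    "renal": [0.95, 0.05],
    "lung": [0.00, 1.00],
}
```

With these vectors, `expand_query(["kidney"], emb)` returns `["kidney", "renal"]`. The paper's finding is that swapping which embeddings fill `emb` (EHR vs. Google News) made little difference to retrieval effectiveness.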
Implications and Future Directions
The research demonstrates the nuanced performance of word embeddings across different tasks and underlines the importance of context when selecting embeddings for NLP applications. While embeddings from specific corpora can significantly enhance performance on dedicated tasks, certain tasks achieve comparable outcomes with publicly available, pre-trained, general-purpose embeddings.
Future research could expand testing to additional downstream tasks like entity recognition or document classification and investigate embeddings across diversified EHR datasets for broader generalizability. Moreover, integrating privacy-preserving methods for multi-institutional data could enrich the scope of shareable word embeddings in practice.
Conclusion
This paper underscores that no single set of word embeddings is most effective across all biomedical NLP tasks. While domain-specific embeddings tend to capture nuanced medical semantics more accurately, general-purpose embeddings are often practically sufficient and offer a cost-effective alternative. This suggests that open-access, non-domain-specific resources can be used flexibly in biomedical data-driven applications, especially when access to domain data is constrained.