BioSentVec: creating sentence embeddings for biomedical texts

Published 22 Oct 2018 in cs.CL, cs.AI, and cs.LG | (1810.09302v6)

Abstract: Sentence embeddings have become an essential part of today's NLP systems, especially together with advanced deep learning methods. Although pre-trained sentence encoders are available in the general domain, none exists for biomedical texts to date. In this work, we introduce BioSentVec: the first open set of sentence embeddings trained with over 30 million documents from both scholarly articles in PubMed and clinical notes in the MIMIC-III Clinical Database. We evaluate BioSentVec embeddings in two sentence pair similarity tasks in different text genres. Our benchmarking results demonstrate that the BioSentVec embeddings can better capture sentence semantics compared to the other competitive alternatives and achieve state-of-the-art performance in both tasks. We expect BioSentVec to facilitate the research and development in biomedical text mining and to complement the existing resources in biomedical word embeddings. BioSentVec is publicly available at https://github.com/ncbi-nlp/BioSentVec




Summary

The paper presents BioSentVec, a set of 700-dimensional sentence embeddings tailored to biomedical and clinical texts.
The embeddings are trained with an adapted sent2vec model on approximately 181 million PubMed sentences and 41 million MIMIC-III sentences.
BioSentVec outperforms general-domain embeddings on sentence similarity and multi-label classification, reaching state-of-the-art Pearson correlations and F1-scores.

BioSentVec: Developing Sentence Embeddings for Biomedical Texts

This paper introduces BioSentVec, a new set of sentence embeddings tailored to biomedical and clinical texts. The authors address a gap in NLP for the biomedical domain by training embeddings on over 30 million documents from PubMed and the MIMIC-III Clinical Database. General-domain embeddings often lose accuracy when applied to domain-specific tasks such as biomedical text mining, and BioSentVec is designed to overcome this limitation.

Methodology

BioSentVec is trained with the sent2vec model on two prominent biomedical corpora: PubMed, contributing approximately 181 million sentences, and MIMIC-III, contributing around 41 million sentences. The resulting embeddings are 700-dimensional vectors. The authors also tried adding PMC full-text articles to the training data but observed no gain in performance, so training stayed focused on PubMed and MIMIC-III texts.
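To make the idea of a sentence vector concrete, here is a minimal sketch of sent2vec-style composition, in which a sentence embedding is the average of the vectors of its words and word n-grams. It is an illustration under stated assumptions, not the authors' training code: the vectors are deterministic pseudo-random stand-ins for learned embeddings, tokenization is a plain whitespace split, and the helper names (`toy_vector`, `sentence_features`, `embed_sentence`) are hypothetical. The use of bigrams and the dimension of 700 are chosen to mirror the released model.

```python
import random

DIM = 700  # dimensionality reported for the BioSentVec vectors


def toy_vector(token, dim=DIM):
    """Return a deterministic pseudo-random vector for a token.

    Assumption: this stands in for a learned unigram or bigram embedding.
    """
    rng = random.Random(token)  # string seeds are reproducible across runs
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def sentence_features(sentence):
    """Lowercase, split on whitespace, and return unigrams plus adjacent bigrams."""
    words = sentence.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def embed_sentence(sentence, dim=DIM):
    """Average the vectors of all unigram and bigram features of a sentence."""
    feats = sentence_features(sentence)
    if not feats:
        return [0.0] * dim  # empty input maps to the zero vector
    total = [0.0] * dim
    for feat in feats:
        for i, x in enumerate(toy_vector(feat, dim)):
            total[i] += x
    return [x / len(feats) for x in total]


if __name__ == "__main__":
    vec = embed_sentence("Breast cancers with HER2 amplification")
    print(len(vec))
```

In the real model the word and n-gram vectors are learned from the PubMed and MIMIC-III sentences rather than generated randomly; only the averaging step is what this sketch is meant to show.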

Evaluation and Results

The paper evaluates BioSentVec using two tasks:

Sentence Similarity: Performance was tested on the BIOSSES and MedSTS datasets. BioSentVec embeddings outperformed existing methods, such as doc2vec and averaged word embeddings, achieving state-of-the-art results in both unsupervised and supervised setups, as measured by Pearson correlation.
Multi-label Text Classification: On the Hallmarks of Cancer corpus, BioSentVec outperformed the other approaches compared, with significantly higher F1-scores.
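The unsupervised sentence-similarity evaluation can be sketched as follows: score each sentence pair by the cosine similarity of its two embeddings, then report the Pearson correlation between those scores and the human-annotated similarity ratings. The code below is a sketch under assumptions rather than the paper's evaluation script; the three-dimensional vectors and gold ratings are made-up toy values, not data from BIOSSES or MedSTS.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy sentence-pair embeddings (hypothetical values, 3-d for readability).
    pairs = [
        ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
        ([0.0, 1.0, 0.0], [0.5, 0.5, 0.0]),
        ([0.0, 0.0, 1.0], [1.0, 0.0, 0.1]),
    ]
    gold = [4.5, 2.5, 0.5]  # hypothetical human similarity ratings
    predicted = [cosine(u, v) for u, v in pairs]
    print(pearson(predicted, gold))
```

In a supervised setup, the cosine score would be replaced by the output of a model trained on annotated pairs, while the Pearson step stays the same.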

Implications and Future Directions

BioSentVec gives biomedical text mining a domain-specific set of sentence embeddings. Such embeddings can support more accurate information retrieval, sentence classification, and question answering over biomedical literature. The resource complements existing biomedical word embeddings, filling a gap in the NLP ecosystem for biomedicine.

Looking forward, future work could train the embeddings on larger and more diverse corpora, which might improve robustness and versatility. Evaluating real-world applications beyond the tested datasets would also give a clearer picture of how useful BioSentVec is across biomedical contexts.

BioSentVec is publicly available together with a Jupyter Notebook demonstrating practical use, which should ease integration into existing biomedical research workflows, including those built on deep learning methods.

By addressing the limitations of general-purpose embeddings applied outside their training domain, this research advances NLP in biomedicine and gives researchers tools better suited to their domain.
