
Conceptualized Representation Learning for Chinese Biomedical Text Mining (2008.10813v1)

Published 25 Aug 2020 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Biomedical text mining is becoming increasingly important as the number of biomedical documents and web data rapidly grows. Recently, word representation models such as BERT have gained popularity among researchers. However, it is difficult to estimate their performance on datasets containing biomedical texts, as the word distributions of general and biomedical corpora are quite different. Moreover, the medical domain has long-tail concepts and terminologies that are difficult for language models to learn. For Chinese biomedical text, this is even more difficult due to its complex structure and the variety of phrase combinations. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora and propose a novel conceptualized representation learning approach. We also release a new Chinese Biomedical Language Understanding Evaluation benchmark (ChineseBLUE). We examine the effectiveness of Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach. Experimental results on the benchmark show that our approach brings significant gains. We release the pre-trained model on GitHub: https://github.com/alibaba-research/ChineseBLUE.

Conceptualized Representation Learning for Chinese Biomedical Text Mining

The paper "Conceptualized Representation Learning for Chinese Biomedical Text Mining" presents a novel approach to enhancing the performance of pre-trained language models, specifically BERT, in Chinese biomedical text mining. The authors recognize the distinct challenges of this domain, including complex language structures and the long-tail distribution of biomedical concepts, which are not adequately addressed by models trained on general-domain corpora.

Core Contributions

The paper puts forth a conceptualized representation learning approach designed to infuse biomedical knowledge into pre-trained language models. The authors employ a coarse-to-fine masking strategy that combines whole entity masking and whole span masking. These strategies explicitly incorporate biomedical knowledge into the model, improving its ability to capture and exploit domain-specific information.
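The paper's exact masking code is not reproduced here, but a minimal sketch of the idea, assuming entity and phrase spans have already been identified (e.g. by a medical lexicon and a phrase segmenter, both placeholders in this example), might look like the following:

```python
import random

MASK = "[MASK]"

def coarse_to_fine_mask(tokens, entity_spans, phrase_spans, mask_prob=0.15):
    """Mask whole units: entire medical entities first (coarse), then whole phrases (fine)."""
    masked = list(tokens)
    labels = [None] * len(tokens)          # original tokens remembered at masked positions
    # Coarse step: mask every character of a selected medical entity.
    for start, end in entity_spans:
        if random.random() < mask_prob:
            for i in range(start, end):
                labels[i] = masked[i]
                masked[i] = MASK
    # Fine step: mask whole phrases/spans that were not already covered.
    for start, end in phrase_spans:
        untouched = all(labels[i] is None for i in range(start, end))
        if untouched and random.random() < mask_prob:
            for i in range(start, end):
                labels[i] = masked[i]
                masked[i] = MASK
    return masked, labels

# Example: characters of "患者出现腹痛症状", with "腹痛" tagged as a medical entity
# and "症状" as a phrase span (both spans are illustrative).
tokens = list("患者出现腹痛症状")
masked, labels = coarse_to_fine_mask(tokens, entity_spans=[(4, 6)], phrase_spans=[(6, 8)])
print(masked)
```

The key property is that a selected entity or phrase is masked as a unit rather than character by character, which is what pushes the model to predict complete medical concepts from context.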

Additionally, the paper introduces the Chinese Biomedical Language Understanding Evaluation benchmark, known as ChineseBLUE. This benchmark serves as the foundation for evaluating the performance of various Chinese pre-trained models, including BERT, BERT-wwm, RoBERTa, and the proposed approach.

Experimental Methodology and Results

The proposed model, MC-BERT, is pre-trained on a diverse set of Chinese biomedical corpora. It employs whole entity masking to obscure complete medical entities and whole span masking to obscure whole Chinese phrases, incorporating linguistic and domain knowledge into the model's training. The approach is evaluated on several NLP tasks, including named entity recognition (NER), paraphrase identification, question answering, information retrieval, intent classification, and text classification.
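To make the evaluation setup concrete, the sketch below loads a BERT-style checkpoint (such as the released MC-BERT weights) for a token-classification task like NER using the Hugging Face transformers library. The checkpoint path and label set are placeholders for illustration, not values specified by the paper.

```python
from transformers import BertForTokenClassification, BertTokenizerFast
import torch

# Hypothetical local path to the released MC-BERT weights; the label set is illustrative.
checkpoint = "./mc_bert_base"
labels = ["O", "B-DISEASE", "I-DISEASE"]

tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))
model.eval()

text = "患者出现腹痛症状"
encoded = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoded).logits        # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)[0].tolist()
print([labels[i] for i in predictions])
```

In practice the classification head would first be fine-tuned on the relevant ChineseBLUE task before the predictions are meaningful; the snippet only shows how the pre-trained encoder plugs into a downstream task.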

Results indicate that MC-BERT significantly outperforms BERT-base models across all tasks in the ChineseBLUE benchmark. Gains on named entity recognition and paraphrase identification in particular demonstrate the value of integrating domain knowledge into pre-trained models, and MC-BERT also improves on existing Chinese pre-trained models such as BERT-wwm and RoBERTa, affirming the importance of domain-specific pre-training.

Implications and Future Directions

The approach delineated in this paper underscores the significance of domain-adapted pre-training for NLP models, particularly in fields characterized by specialized terminology and complex language structures, such as biomedical text mining. By introducing a domain-specific evaluation benchmark, ChineseBLUE, the authors provide a valuable resource for the further development and assessment of models within this crucial area.

Practically, the approach facilitates more accurate and efficient biomedical information retrieval and analysis, which is essential for practitioners and researchers in the biomedical field. Theoretically, the work contributes to a broader understanding of how domain knowledge can be systematically and effectively embedded into language models.

Future research could explore adapting this approach to other languages or domains, or incorporating concept-level knowledge into the model. Further investigation of corpus diversity and its effect on model adaptability could also offer insights into optimizing pre-trained models.

In conclusion, the paper offers a robust framework for improving the adaptability of pre-trained language models to specialized domains, a step forward in the growing field of biomedical NLP.

Authors (6)
  1. Ningyu Zhang (148 papers)
  2. Qianghuai Jia (7 papers)
  3. Kangping Yin (2 papers)
  4. Liang Dong (38 papers)
  5. Feng Gao (239 papers)
  6. Nengwei Hua (3 papers)
Citations (62)