Building Chinese Biomedical Language Models via Multi-Level Text Discrimination (2110.07244v2)

Published 14 Oct 2021 in cs.CL and cs.AI

Abstract: Pre-trained language models (PLMs), such as BERT and GPT, have revolutionized the field of NLP, not only in the general domain but also in the biomedical domain. Most prior efforts in building biomedical PLMs have resorted simply to domain adaptation and focused mainly on English. In this work we introduce eHealth, a Chinese biomedical PLM built from scratch with a new pre-training framework. This new framework pre-trains eHealth as a discriminator through both token- and sequence-level discrimination. The former is to detect input tokens corrupted by a generator and recover their original identities from plausible candidates, while the latter is to further distinguish corruptions of the same original sequence from those of other sequences. As such, eHealth can learn language semantics at both token and sequence levels. Extensive experiments on 11 Chinese biomedical language understanding tasks of various forms verify the effectiveness and superiority of our approach. We release the pre-trained model at \url{https://github.com/PaddlePaddle/Research/tree/master/KG/eHealth} and will also release the code later.
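
The abstract describes two discrimination objectives: a token-level one that recovers the original token, from a small set of plausible candidates, at positions corrupted by a generator, and a sequence-level one that distinguishes corruptions of the same original sequence from corruptions of other sequences. The sketch below is only an illustrative approximation of how such a combined loss could be wired up; it is written in PyTorch for brevity (the released eHealth code builds on PaddlePaddle), and the module sizes, the mean-pooled sequence representation, the temperature, and the contrastive formulation are assumptions made for this example rather than details taken from the paper.

```python
# Illustrative sketch only: NOT the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelDiscriminator(nn.Module):
    """Toy discriminator combining token- and sequence-level objectives."""

    def __init__(self, vocab_size=21128, hidden=256, num_candidates=5):
        super().__init__()
        # Tiny stand-in encoder; a real discriminator would be a BERT-sized Transformer.
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Token-level head: score each plausible candidate token per position.
        self.token_proj = nn.Linear(hidden, hidden)
        # Sequence-level head: pooled representation for contrastive scoring.
        self.seq_proj = nn.Linear(hidden, hidden)
        self.num_candidates = num_candidates

    def forward(self, corrupted_ids, candidate_ids, gold_index, seq_group):
        # corrupted_ids: (B, L)    generator-corrupted input sequences
        # candidate_ids: (B, L, K) candidate tokens per position (original included)
        # gold_index:    (B, L)    index of the original token among the K candidates
        # seq_group:     (B,)      id of the original sequence each corruption comes from
        h = self.encoder(self.embed(corrupted_ids))            # (B, L, H)

        # Token-level discrimination: recover the original token from the candidates.
        cand_emb = self.embed(candidate_ids)                   # (B, L, K, H)
        scores = torch.einsum("blh,blkh->blk", self.token_proj(h), cand_emb)
        token_loss = F.cross_entropy(
            scores.reshape(-1, self.num_candidates), gold_index.reshape(-1)
        )

        # Sequence-level discrimination: corruptions of the same original sequence
        # should score higher with each other than with corruptions of other sequences.
        pooled = F.normalize(self.seq_proj(h.mean(dim=1)), dim=-1)    # (B, H)
        sim = pooled @ pooled.t() / 0.07                              # temperature (assumed)
        self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        log_prob = F.log_softmax(sim.masked_fill(self_mask, float("-inf")), dim=-1)
        positives = ((seq_group.unsqueeze(0) == seq_group.unsqueeze(1)) & ~self_mask).float()
        seq_loss = -(log_prob * positives).sum(-1) / positives.sum(-1).clamp(min=1)

        return token_loss + seq_loss.mean()


# Toy usage with random tensors (shapes only; real pre-training would feed
# generator-corrupted Chinese biomedical text).
model = MultiLevelDiscriminator()
B, L, K = 4, 16, 5
loss = model(
    corrupted_ids=torch.randint(0, 21128, (B, L)),
    candidate_ids=torch.randint(0, 21128, (B, L, K)),
    gold_index=torch.randint(0, K, (B, L)),
    seq_group=torch.tensor([0, 0, 1, 1]),
)
loss.backward()
```

Each batch here contains several corruptions per original sequence (encoded by seq_group), so the sequence-level term can contrast corruptions of the same source against those of other sources; the exact batching and loss weighting used by eHealth are not specified in the abstract.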

Authors (7)
  1. Quan Wang (130 papers)
  2. Songtai Dai (2 papers)
  3. Benfeng Xu (15 papers)
  4. Yajuan Lyu (16 papers)
  5. Yong Zhu (33 papers)
  6. Hua Wu (191 papers)
  7. Haifeng Wang (194 papers)
Citations (14)