Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing (2007.15779v6)

Published 31 Jul 2020 in cs.CL and cs.LG

Abstract: Pretraining large neural language models, such as BERT, has led to impressive gains on many NLP tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

The paper "Domain-Specific LLM Pretraining for Biomedical Natural Language Processing" addresses a prevailing assumption in the field of NLP: the benefits of mixed-domain pretraining. The authors argue that domain-specific pretraining from scratch can be more effective, particularly in domains with a vast amount of unlabeled text, such as biomedicine.

Key Insights

The investigation presented in the paper leads to several important findings:

  1. Superior Performance of Domain-Specific Pretraining: Pretraining language models from scratch on domain-specific text (e.g., PubMed abstracts) outperforms continual pretraining from general-domain models. This contradicts the common assumption that leveraging out-of-domain text during pretraining is beneficial.
  2. Impact of In-Domain Vocabulary: The authors demonstrate that an in-domain vocabulary, derived directly from the target-domain text, contributes significantly to performance: biomedical terms are kept intact rather than shattered into uninformative word pieces, so the model learns better representations for them (see the tokenizer sketch after this list).
  3. Effect of Pretraining Corpus: Pretraining exclusively on PubMed abstracts yields better performance than using a mix of abstracts and full-text articles from PubMed Central. Additionally, continual pretraining from general-domain text (such as Wikipedia) offers negligible benefit on biomedical tasks, reinforcing the value of purely in-domain text.
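
The effect of an in-domain vocabulary is easy to see directly. Below is a minimal sketch, assuming the Hugging Face `transformers` library and the publicly hosted checkpoint identifiers for BERT and PubMedBERT (substitute your own paths if they differ), that compares how the two WordPiece vocabularies segment a few illustrative biomedical terms.

```python
# Compare general-domain vs. biomedical WordPiece segmentation.
# Checkpoint identifiers and the sample terms are illustrative assumptions.
from transformers import AutoTokenizer

general = AutoTokenizer.from_pretrained("bert-base-uncased")
biomed = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)

for term in ["naloxone", "acetyltransferase", "lymphoma"]:
    print(term)
    print("  general-domain vocab:", general.tokenize(term))  # shattered into many pieces
    print("  biomedical vocab:    ", biomed.tokenize(term))   # kept largely intact
```

A vocabulary that keeps such terms whole gives each of them a dedicated embedding, instead of forcing the model to compose their meaning from generic fragments.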

Experimental Evaluation

The authors compile a comprehensive benchmark named the Biomedical Language Understanding and Reasoning Benchmark (BLURB), featuring datasets across diverse tasks such as named entity recognition (NER), evidence-based medical information extraction (PICO), relation extraction, sentence similarity, document classification, and question answering.
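
To make the task-specific fine-tuning setup concrete, the sketch below fine-tunes a PubMedBERT-style checkpoint as a token classifier on a BLURB-style NER example. The tiny in-memory sentence and its BIO tags are hypothetical stand-ins for a real dataset such as BC5CDR-disease, the checkpoint name is the public Hugging Face identifier (which may differ in your environment), and a fast tokenizer is assumed for word-to-piece alignment.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"  # assumed hub id
labels = ["O", "B-Disease", "I-Disease"]  # simple BIO scheme
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=len(labels))

# One toy sentence with word-level BIO tags (hypothetical annotation, not BLURB data).
words = ["Naloxone", "reverses", "opioid", "induced", "respiratory", "depression"]
word_tags = ["O", "O", "O", "O", "B-Disease", "I-Disease"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level tags to word pieces; only the first piece of each word gets a label.
aligned, prev = [], None
for wid in enc.word_ids(batch_index=0):
    if wid is None:
        aligned.append(-100)                    # special tokens: ignored by the loss
    elif wid != prev:
        aligned.append(labels.index(word_tags[wid]))
    else:
        aligned.append(-100)                    # continuation pieces: ignored by the loss
    prev = wid
enc["labels"] = torch.tensor([aligned])

# A single optimization step; a real run would loop over the BLURB training split.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**enc).loss
loss.backward()
optimizer.step()
print("loss:", loss.item())
```

A full experiment would iterate this step over the training split of the chosen BLURB task and report entity-level F1, as on the BLURB leaderboard.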

In a detailed comparative evaluation, the authors demonstrate:

  • PubMedBERT Outperforms Others: Benchmarking results show that PubMedBERT, a domain-specific language model pretrained from scratch on PubMed abstracts, consistently outperforms other models, including BioBERT, SciBERT, and BlueBERT, across the majority of BLURB tasks. For instance, PubMedBERT achieves an F1 of 85.62 on BC5-disease, compared to BioBERT's 84.70.
  • Validation of Pretraining Choices: The paper scrutinizes various pretraining approaches and their effectiveness:
    • Whole-word masking (WWM) leads to consistent performance improvements across tasks compared to masking individual word pieces (see the masking sketch after this list).
    • PubMedBERT outperforms its continual-pretraining counterparts even when using only 50% of their pretraining compute, further validating the domain-specific pretraining approach.
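
As a concrete illustration of whole-word masking, the sketch below groups word pieces back into whole words and masks every piece of each sampled word together, rather than masking pieces independently. The bert-base-uncased tokenizer and the 15% masking budget are standard-BERT assumptions, not the paper's exact pretraining configuration.

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Naloxone blocks opioid receptors"
pieces = tokenizer.tokenize(text)           # e.g. ['nal', '##ox', '##one', 'blocks', ...]

# Group word-piece indices into whole words: a '##' piece continues the previous word.
words = []
for i, piece in enumerate(pieces):
    if piece.startswith("##") and words:
        words[-1].append(i)
    else:
        words.append([i])

# Sample whole words (not individual pieces) until roughly 15% of the pieces are masked.
random.shuffle(words)
budget = max(1, int(0.15 * len(pieces)))
masked, count = list(pieces), 0
for group in words:
    if count >= budget:
        break
    for i in group:                          # mask every piece of the selected word
        masked[i] = tokenizer.mask_token
    count += len(group)

print(pieces)
print(masked)
```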

Implications and Future Directions

Practically, adopting a domain-specific pretraining approach could lead to more efficient and effective deployment of language models in specialized fields such as biomedicine. From a theoretical perspective, this work prompts a re-evaluation of the transfer-learning assumption that more text is always better, highlighting the need to consider how relevant the pretraining corpus is to the target application.

The findings have significant implications for future developments in AI and NLP:

  • Increased Focus on Domain-Specific Pretraining: Encouraging more targeted pretraining efforts to harness the strengths of domain-specific data effectively.
  • Expansion of BLURB: Adding further datasets and tasks to BLURB to cover more of the breadth of biomedical NLP and ensure comprehensive benchmarking.
  • Cross-Domain Comparisons: Exploring whether these findings generalize to other high-value domains with abundant unlabeled text, such as finance and law.

In conclusion, the paper provides a compelling argument for reevaluating pretraining methodologies for domain-specific tasks. Its results, supported by substantial empirical evidence, advocate for a shift toward domain-specific pretraining and contribute significantly to the advancement of NLP applications in specialized domains like biomedicine.

Access the BLURB benchmark and leaderboard at https://aka.ms/BLURB for detailed experimental results and to track ongoing research efforts.

Authors (9)
  1. Yu Gu
  2. Robert Tinn
  3. Hao Cheng
  4. Michael Lucas
  5. Naoto Usuyama
  6. Xiaodong Liu
  7. Tristan Naumann
  8. Jianfeng Gao
  9. Hoifung Poon
Citations (1,522)