Don't Stop Pretraining: Adapt Language Models to Domains and Tasks (2004.10964v3)

Published 23 Apr 2020 in cs.CL and cs.LG

Abstract: Language models pretrained on text from a wide variety of sources form the foundation of today's NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task's unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

Overview of "Don't Stop Pretraining: Adapt LLMs to Domains and Tasks"

The paper "Don't Stop Pretraining: Adapt LLMs to Domains and Tasks," authored by Gururangan et al., investigates the efficacy of continuing the pretraining process of LLMs such as RoBERTa on domain-specific and task-specific datasets. This research addresses a pertinent question in the field of NLP: despite the broad success of LLMs pretrained on vast, heterogeneous corpora, is further adaptation to narrower domains and specific tasks beneficial?

The paper spans four distinct domains (biomedical and computer science publications, news articles, and product reviews) and evaluates performance across eight classification tasks. The key findings from these experiments indicate that additional domain-adaptive pretraining (DAPT) and task-adaptive pretraining (TAPT) lead to significant performance improvements in both high- and low-resource settings.

Domain-Adaptive Pretraining (DAPT)

The authors extend the pretraining of RoBERTa on large, unlabeled datasets specific to each domain. They demonstrate that this domain-focused adaptation, even when the original pretraining corpus is extensive and diverse, consistently enhances task performance. The gains from DAPT are particularly pronounced for tasks where the target domain is markedly different from the sources in the original pretraining corpus, such as biomedical and computer science texts.
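
In practice, DAPT amounts to simply continuing RoBERTa's masked-language-model training on a new unlabeled corpus. The minimal sketch below uses the Hugging Face transformers and datasets libraries; the corpus file, output directory, and hyperparameters are placeholders rather than the paper's settings, and the same recipe serves for TAPT by pointing it at the task's unlabeled text instead.

```python
# Hypothetical DAPT sketch: continued masked-LM training of RoBERTa on an
# in-domain corpus. "domain_corpus.txt" (one document per line), the output
# directory, and all hyperparameters are placeholders, not the paper's values.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Load raw in-domain text and tokenize it into model-ready inputs.
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-dapt",
                           per_device_train_batch_size=8,
                           num_train_epochs=1,
                           learning_rate=5e-5),
    train_dataset=tokenized,
    # Dynamic 15% token masking, matching RoBERTa's pretraining objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()
trainer.save_model("roberta-dapt")        # starting point for a later TAPT phase
tokenizer.save_pretrained("roberta-dapt")
```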

Key Findings in DAPT

  • Performance Gains: Across all domains, DAPT improved task performance. For instance, in the biomedical domain, it resulted in performance increases on tasks like ChemProt and RCT classification.
  • Importance of Domain Relevance: A comparison with pretraining on irrelevant domains confirmed that the improvements are driven by the relevance of the domain-specific data.

Task-Adaptive Pretraining (TAPT)

TAPT involves further pretraining the language model on the unlabeled data specific to the task at hand. Given the smaller size of task-specific datasets compared to domain datasets, TAPT is less computationally expensive. The results indicated that TAPT alone can yield substantial improvements, often competitive with DAPT.

Key Findings in TAPT

  • Efficiency: TAPT offers a cost-effective approach to improve performance using much smaller and task-focused corpora.
  • High Performance: For certain tasks, such as those in the news and product review domains, TAPT yielded performance improvements as good as or better than DAPT.

Combined DAPT and TAPT

The research also explores applying DAPT and TAPT sequentially. This combined approach capitalizes on the strengths of both methods: broad domain relevance from DAPT and task-level specialization from TAPT. The combination consistently resulted in the best performance across all tasks.
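
To make the multi-phase pipeline concrete, the sketch below shows the final supervised step, assuming the two masked-LM phases have already been run with a recipe like the one above (first on the domain corpus, then on the task's unlabeled text). The checkpoint directory, CSV file, and label count are illustrative assumptions, not artifacts from the paper.

```python
# Hypothetical sketch of the final supervised step after DAPT and TAPT.
# "roberta-dapt-tapt" (an adapted checkpoint), "task_train.csv" (columns
# "text" and "label"), and num_labels=2 are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

ckpt = "roberta-dapt-tapt"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

# Tokenize the labeled task data; the "label" column is picked up by the Trainer.
train = load_dataset("csv", data_files={"train": "task_train.csv"})["train"]
train = train.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-final", num_train_epochs=3),
    train_dataset=train,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
).train()
```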

Automated Data Selection for TAPT

In scenarios where resources for full domain-adaptive pretraining are unavailable and the task's own unlabeled data is small, the authors explore automated data selection. They use embeddings from VAMPIRE, a lightweight variational-autoencoder language model, to retrieve task-relevant unlabeled text from the larger in-domain corpus and add it to the task corpus before task-adaptive pretraining.
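
A rough sketch of the selection procedure follows. The paper uses VAMPIRE embeddings with k-nearest-neighbour retrieval; this sketch substitutes TF-IDF vectors for those embeddings and uses toy in-memory corpora, so it illustrates the mechanism rather than reproducing the paper's setup.

```python
# Hypothetical sketch of nearest-neighbour data selection. The paper embeds
# text with VAMPIRE; plain TF-IDF vectors stand in for those embeddings here,
# and the tiny in-memory corpora and k=2 are purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

task_texts = [
    "chemical X inhibits protein Y",
    "gene Z is upregulated in tumour cells",
]
domain_pool = [
    "protein folding dynamics in the cell",
    "the inhibitor binds the kinase domain",
    "stock prices fell sharply on Monday",
    "tumour suppressor genes and their regulation",
]

# Embed task and domain text in a shared space (stand-in for VAMPIRE embeddings).
vectorizer = TfidfVectorizer().fit(task_texts + domain_pool)
task_vecs = vectorizer.transform(task_texts)
pool_vecs = vectorizer.transform(domain_pool)

# Retrieve each task example's k nearest neighbours from the in-domain pool.
k = 2
nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(pool_vecs)
_, neighbour_idx = nn.kneighbors(task_vecs)

# The union of retrieved sentences augments the task corpus before TAPT.
selected = sorted({domain_pool[j] for row in neighbour_idx for j in row})
print(selected)
```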

Key Findings in Automated Data Selection

  • Effective Selection: The nearest-neighbor based selection methods substantially improved performance over random selection, approximating the gains observed with large-scale DAPT but at a fraction of the computational cost.

Practical and Theoretical Implications

The implications of this research are manifold:

  • Practical: Practitioners can significantly enhance model performance by adopting multi-phase pretraining strategies. The strategies are cost-effective and adaptable to various resource constraints.
  • Theoretical: The results reinforce the hypothesis that the complexity and diversity inherent in a single domain cannot be entirely captured by a broadly pretrained language model. Multi-phase adaptation strategies are crucial for addressing domain-specific nuances.

Future Directions

The paper opens several pathways for further research:

  • Efficient Data Selection: Development of more sophisticated data selection techniques could further optimize the task-relevant pretraining process.
  • Curriculum Learning: Investigating methods to balance domain and task adaptation phases dynamically could present new opportunities for improving model efficiency and performance.
  • Benchmarking Generalization: Establishing benchmarks that evaluate the ability of models to generalize across multiple domains and tasks can drive advancements in this area.

Overall, Gururangan et al.'s work underscores the enduring significance of domain and task adaptation in NLP, providing a robust framework and concrete evidence for its advantages.

Authors (7)
  1. Suchin Gururangan (29 papers)
  2. Ana Marasović (27 papers)
  3. Swabha Swayamdipta (49 papers)
  4. Kyle Lo (73 papers)
  5. Iz Beltagy (39 papers)
  6. Doug Downey (50 papers)
  7. Noah A. Smith (224 papers)
Citations (2,190)