MDAPT: Multilingual Domain Adaptive Pretraining in a Single Model (2109.06605v1)

Published 14 Sep 2021 in cs.CL

Abstract: Domain adaptive pretraining, i.e. the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g. working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to both become domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model, and performs close to its monolingual counterpart. This finding holds across two different pretraining methods, adapter-based pretraining and full model pretraining.
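To make the core idea concrete, below is a minimal sketch of domain adaptive (continued) masked-language-model pretraining of a multilingual encoder on a mixed-language, domain-specific corpus. It is not the authors' implementation: it assumes Hugging Face Transformers and Datasets, and the model choice (`bert-base-multilingual-cased`), corpus file name, and hyperparameters are illustrative placeholders.

```python
# Sketch: continued (domain-adaptive) MLM pretraining of a multilingual
# encoder on domain-specific text pooled across several languages.
# Corpus path and hyperparameters are illustrative, not from the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-multilingual-cased"  # any multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Domain corpus mixing text from multiple languages, one line per example
# (hypothetical file name).
raw = load_dataset("text", data_files={"train": "domain_corpus_multilingual.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% masking objective for continued pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="mdapt-checkpoint",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

The paper's adapter-based variant would instead freeze the base model and train only small adapter modules during this continued-pretraining step; the full-model variant updates all weights as above.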

Authors (4)
  1. Rasmus Kær Jørgensen (1 paper)
  2. Mareike Hartmann (17 papers)
  3. Xiang Dai (18 papers)
  4. Desmond Elliott (53 papers)
Citations (11)