Domain-Adaptive Pretraining (DAP)
- Domain-Adaptive Pretraining (DAP) is a technique that continues pretraining on in-domain data using original self-supervised objectives to tailor models for specific applications.
- It employs methods like masked language modeling and causal prediction across various modalities including text, video, and images to enhance domain-specific representation.
- Empirical studies show that DAP significantly boosts task performance even with limited specialized data, while addressing challenges like catastrophic forgetting.
Domain-Adaptive Pretraining (DAP) is the continued optimization of a pretrained model on data drawn from a target domain, typically using the model’s original pretraining objective or a closely related self-supervised objective. In the language-model literature, DAP is also described as continued pretraining, post-training, domain-adaptive pretraining (DAPT), domain-adaptive continual pretraining (DACP), multilingual domain adaptive pretraining (MDAPT), or federated domain-adaptive pre-training (FDAPT), depending on the setting. The central rationale is consistent across these variants: adapt the representation space to a narrower distribution than the original pretraining mixture while preserving enough general competence to support downstream transfer (Gururangan et al., 2020, Salahuddin et al., 30 Jun 2025, Jørgensen et al., 2021).
1. Conceptual scope and historical development
A canonical formulation appeared in Gururangan et al.’s “Don’t Stop Pretraining,” which treated DAP as a second phase of masked language modeling on domain text before supervised fine-tuning, and showed gains across biomedical, computer science, news, and review tasks under both high- and low-resource conditions (Gururangan et al., 2020). Subsequent work generalized the same idea in several directions: multilingual settings in MDAPT (Jørgensen et al., 2021), dialogue systems with domain and task adaptive pretraining for GPT-2 (Zhang et al., 2021), continual and federated settings (Ke et al., 2023, Jiang et al., 2023), resource-constrained industrial deployment (Kim et al., 9 Jul 2025), and non-text modalities such as medical images, masked video pretraining, and point-cloud masked autoencoding (Mehmood et al., 2022, Mueller et al., 15 Sep 2025, Gao et al., 24 Oct 2025).
The literature distinguishes DAP from both ordinary fine-tuning and task-adaptive pretraining. In the process-industry study by Zhukova et al., DAPT is explicitly positioned between general-domain pretraining and task-specific fine-tuning: unlike fine-tuning, it leaves the original masked language modeling head intact and continues pretraining on unlabeled domain-related text (Zhukova et al., 28 Apr 2025). Gururangan et al. further separated DAP from task-adaptive pretraining (TAP), where the second pretraining phase uses the unlabeled training set of the downstream task rather than a broader domain corpus; their combined DAP+TAP setting gave the best score on every task reported in that study (Gururangan et al., 2020).
The term is no longer restricted to text. In primate behavior recognition, DAP meant continuing the self-supervised V-JEPA pretraining stage on unlabeled in-domain video while keeping the original masked-autoencoding loss unchanged (Mueller et al., 15 Sep 2025). In DAP-MAE, it denoted adaptive integration of cross-domain point-cloud data within masked autoencoder pretraining (Gao et al., 24 Oct 2025). In medical imaging, DAPT referred to a second pretraining stage on an intermediate medical dataset after ImageNet initialization, including partial, hybrid, and simplified-architecture variants designed to reduce compute (Mehmood et al., 2022). This broad usage suggests that “domain adaptation by continued pretraining” has become a modality-agnostic design pattern rather than a single NLP-specific technique.
2. Training objectives and optimization regimes
Despite the breadth of applications, DAP usually preserves the backbone’s original pretraining criterion. For masked language models, the objective is the standard MLM loss. In standard notation, the formulation given in the process-industry study is

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log P\left(x_i \mid x_{\setminus M}\right)$$

where $M$ is the set of masked positions (Zhukova et al., 28 Apr 2025). Gururangan et al. used the same basic setup with 15% masking for RoBERTa-based adaptation (Gururangan et al., 2020).
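The masked-language-modeling objective described above can be illustrated with a small, framework-free sketch. It assumes the model has already produced, for every position, the probability it assigns to the original token. The function names, the fixed seed, and the choice to average rather than sum over masked positions are illustrative assumptions, not details taken from the cited papers.

```python
import math
import random


def choose_masked_positions(num_tokens, mask_rate=0.15, seed=0):
    """Pick a random subset of positions to mask (at least one position)."""
    rng = random.Random(seed)
    k = min(num_tokens, max(1, round(num_tokens * mask_rate)))
    return sorted(rng.sample(range(num_tokens), k))


def mlm_loss(true_token_probs, masked_positions):
    """Mean negative log-likelihood over the masked positions only.

    true_token_probs[i] is the probability the model assigns to the
    original token at position i, given the corrupted input. Unmasked
    positions do not contribute to the loss.
    """
    total = -sum(math.log(true_token_probs[i]) for i in masked_positions)
    return total / len(masked_positions)


if __name__ == "__main__":
    positions = choose_masked_positions(20)  # 15% of 20 tokens
    print(positions, mlm_loss([0.5] * 20, positions))
```

Averaging and summing differ only by a constant factor when the number of masked positions is fixed, so either convention yields the same gradient direction.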
For decoder-only LLMs, the objective is causal language modeling. In the cybersecurity study, Domain-Adaptive Continuous Pretraining minimizes the next-token negative log-likelihood, written in standard notation as

$$\mathcal{L}_{\mathrm{CLM}} = -\sum_{t=1}^{T} \log P_\theta\left(x_t \mid x_{<t}\right)$$

with perplexity monitored during training (Salahuddin et al., 30 Jun 2025). The same autoregressive continuation principle also underlies the GPT-2 dialog adaptation pipeline, where DAPT and TAPT both use the original next-token language-modeling loss on dialog utterances (Zhang et al., 2021).
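As a minimal sketch of how the causal objective relates to the perplexity monitored during training, the snippet below computes the average next-token negative log-likelihood and exponentiates it. The inputs are assumed to be the probabilities a model assigns to each observed next token; the function names are illustrative.

```python
import math


def causal_nll(next_token_probs):
    """Average negative log-likelihood per token.

    next_token_probs[t] is the probability the model assigns to the
    observed token at step t given all earlier tokens.
    """
    return -sum(math.log(p) for p in next_token_probs) / len(next_token_probs)


def perplexity(next_token_probs):
    """Perplexity is the exponential of the average per-token NLL."""
    return math.exp(causal_nll(next_token_probs))


if __name__ == "__main__":
    # A model that is uniformly unsure among four options at every step.
    print(perplexity([0.25] * 8))
```

A falling perplexity on held-out domain text indicates that the model is fitting the target distribution; it says nothing by itself about retention of general-domain ability.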
For self-supervised video models, the objective can remain a masked prediction loss rather than token prediction. In V-JEPA-based primate behavior recognition, the DAP stage keeps the masked-autoencoding loss identical to that of the original pretraining stage, and no extra regularizers are added (Mueller et al., 15 Sep 2025).
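For continual industrial DACP, replay is introduced explicitly into the loss, which combines the autoregressive loss on domain data with the same loss on replayed general-domain data; the full-scale experiments reported for telco and finance adaptation use a fixed mixing ratio (Kim et al., 9 Jul 2025).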
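One way to picture replay mixing is as a convex combination of a domain loss and a replay loss, paired with a data stream that alternates domain and general-domain examples. The sketch below is an assumption-laden illustration: the parameter `replay_weight`, both function names, and the strict alternation scheme are hypothetical, and the 1:1 default simply mirrors the replay ratio this article reports for industrial DACP.

```python
from itertools import chain


def replay_mixed_loss(domain_nll, replay_nll, replay_weight=0.5):
    """Convex combination of the domain loss and the replay loss.

    replay_weight is a hypothetical mixing coefficient; 0.5 corresponds
    to weighting domain and replay data equally.
    """
    if not 0.0 <= replay_weight <= 1.0:
        raise ValueError("replay_weight must lie in [0, 1]")
    return (1.0 - replay_weight) * domain_nll + replay_weight * replay_nll


def mix_one_to_one(domain_examples, replay_examples):
    """Interleave domain and replay examples 1:1, truncating the longer list."""
    n = min(len(domain_examples), len(replay_examples))
    pairs = zip(domain_examples[:n], replay_examples[:n])
    return list(chain.from_iterable(pairs))


if __name__ == "__main__":
    print(mix_one_to_one(["d1", "d2", "d3"], ["g1", "g2"]))
    print(replay_mixed_loss(2.0, 4.0))
```

Mixing at the data level and weighting at the loss level are interchangeable in expectation when batches are equally sized, which is why the two views are shown together.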
| Setting | Objective retained during DAP | Representative paper |
|---|---|---|
| Encoder LMs | Masked language modeling | (Gururangan et al., 2020) |
| Decoder-only LMs | Causal next-token prediction | (Salahuddin et al., 30 Jun 2025) |
| Video self-supervision | V-JEPA masked-autoencoding loss | (Mueller et al., 15 Sep 2025) |
| Continual industrial sLLMs | Autoregressive NLL with replay mixing | (Kim et al., 9 Jul 2025) |
| Point-cloud MAE | Reconstruction plus domain contrastive loss | (Gao et al., 24 Oct 2025) |
A practical consequence of this design choice is that DAP usually changes the data distribution more than it changes the optimization problem. This is one reason the method composes naturally with later stages such as instruction tuning, RLHF, or task-specific fine-tuning. Several studies make this sequencing explicit rather than treating DAP as a complete end-to-end solution (Salahuddin et al., 30 Jun 2025, Kim et al., 9 Jul 2025).
3. Domain corpora, data selection, and tokenization
The quality and composition of the adaptation corpus are decisive. Gururangan et al. used large domain corpora: 7.55B BioMed tokens from S2ORC full-text papers, 8.10B CS tokens, 6.66B News tokens, and 2.11B Review tokens (Gururangan et al., 2020). MDAPT instead assembled a fixed 10 million-sentence budget under three mixing strategies—ED, Mp+ED, and Mp+MWIKI—to adapt mBERT into a single multilingual domain-specific model for biomedical named entity recognition and financial sentence classification across seven languages (Jørgensen et al., 2021).
The cybersecurity study provides a contrasting data-efficiency profile. It curated a corpus of approximately 126 million words and approximately 132 million tokens from standards and regulations, academic literature, and technical books, with ablations at approximately 1M, 50M, and 119M tokens (Salahuddin et al., 30 Jun 2025). The authors explicitly argue that effective specialization can be achieved with orders-of-magnitude less data than full pretraining, and report competitive results using 118.8 million tokens versus 2.77 billion tokens in prior specialized work (Salahuddin et al., 30 Jun 2025).
A second line of work reduces corpus size by selecting or augmenting text rather than maximizing raw volume. “TextGram” filters RealNews by top in-domain IMDb bigrams, builds a sentence-sentence similarity graph, applies weighted PageRank, and keeps the top 250,000 out-domain sentences (25% of the original 1M RealNews pool) for DAP (Hiwarkhedkar et al., 2024). Zhukova et al.’s ICL-APT starts from 32K target shift-log paragraphs, retrieves top-$k$ nearest neighbors from domain-related and in-domain repositories using cosine similarity, concatenates the three closest neighbors from each source with the seed text, and then performs continual MLM on the resulting augmented contexts (Zhukova et al., 28 Apr 2025). Their reported ICL-APT setup uses 0.12 GB of augmented data over 20 epochs and yields mean IR 21.81, exceeding both DAPT on 10.30 GB and DAPT+TAPT on 10.31 GB (Zhukova et al., 28 Apr 2025).
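The retrieval-based augmentation step can be sketched as cosine-similarity ranking followed by concatenation. This is an illustration under stated assumptions rather than the authors' implementation: embeddings are taken as precomputed plain lists of floats, the function names are hypothetical, and the order in which neighbors are appended to the seed text is a guess.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbors(query_vec, repository, k=3):
    """Return the k texts whose vectors are most similar to query_vec.

    repository is a list of (text, vector) pairs.
    """
    ranked = sorted(repository, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def augment(seed_text, seed_vec, repositories, k=3, sep="\n"):
    """Concatenate the seed text with its k nearest neighbors from each repository."""
    parts = [seed_text]
    for repository in repositories:
        parts.extend(top_k_neighbors(seed_vec, repository, k))
    return sep.join(parts)


if __name__ == "__main__":
    repo = [("pump log", [1.0, 0.0]), ("invoice", [0.0, 1.0]), ("valve log", [1.0, 1.0])]
    print(augment("seed shift log", [1.0, 0.0], [repo], k=2, sep=" | "))
```

The augmented strings would then be fed to the ordinary masked-language-modeling loop, so the retrieval step changes only the training data, not the objective.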
Tokenizer design can also be part of DAP. IGOT constructs a domain-specific tokenizer by selecting new tokens according to an information-gain criterion and continuing pretraining with the customized vocabulary. On OpenLane text, IGOT with LLaMA-7B reports 11.9% token saving, 12.2% training time saving, and 5.8% maximum GPU VRAM usage saving; with T5, the reported training-time saving reaches 31.5% (Feng et al., 2024). By contrast, several other studies deliberately keep the tokenizer fixed: the cybersecurity work makes no tokenizer changes and freezes the embedding layer, and the gaming-toxicity study retains the standard distilRoBERTa byte-pair tokenizer except when adding explicit separator or metadata tokens (Salahuddin et al., 30 Jun 2025, Schurger-Foy et al., 2 Apr 2025).
These results indicate that DAP corpus design is not a single recipe. Large in-domain corpora, multilingual mixtures, retrieved augmentations, graph-ranked subsets, and tokenizer adaptation all appear as viable ways of aligning the training distribution with a target domain.
4. Retention, efficiency, and deployment variants
A central engineering problem in DAP is how to inject domain knowledge without excessive forgetting, memory cost, or wall-clock time. One conservative strategy is parameter and schedule restriction. In the cybersecurity study, knowledge retention is encouraged by freezing the embedding layer, using a very small learning rate, limiting training to 1–3 passes through the domain corpus, and making no vocabulary or tokenizer changes (Salahuddin et al., 30 Jun 2025). Industrial DACP adopts a different protection mechanism: a 1:1 replay mixture of domain and public general-domain data, followed by instruction-tuning to restore instruction-following behavior (Kim et al., 9 Jul 2025).
Other variants alter the optimization topology itself. FDAPT formulates domain-adaptive pretraining as a federated objective over clients, with FedAvg aggregation over local MLM or NSP losses, and introduces Frozen Federated Domain-Adaptive Pre-Training (FFDAPT), where each client freezes a consecutive block of layers at each communication round. Empirically, FFDAPT improves computational efficiency by 12.1% on average while remaining within 1% of vanilla FDAPT on downstream performance (Jiang et al., 2023).
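The federated variant with frozen layers can be sketched in a few lines. The snippet below is a toy illustration, not the FDAPT or FFDAPT implementation: parameters are plain lists keyed by hypothetical layer names, the local step is ordinary gradient descent, and the set of frozen layers is supplied by the caller rather than derived from any schedule in the paper.

```python
def local_update(weights, grads, frozen_layers, lr=0.1):
    """One local gradient step that leaves frozen layers untouched."""
    updated = {}
    for name, values in weights.items():
        if name in frozen_layers:
            updated[name] = list(values)  # frozen: copy unchanged
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


def fedavg(client_weights, client_sizes):
    """FedAvg: average client parameters, weighted by local dataset size."""
    total = sum(client_sizes)
    averaged = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


if __name__ == "__main__":
    clients = [{"layer0": [1.0, 2.0]}, {"layer0": [3.0, 4.0]}]
    print(fedavg(clients, client_sizes=[1, 3]))
```

Because frozen layers receive no local update, their averaged value equals the previous global value; this is where the compute saving comes from in this toy setting.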
Continual DAP introduces a different failure mode: catastrophic forgetting across a sequence of domains. The DAS method uses a soft-masking mechanism based on unit-importance estimates, a robustness-based proxy to preserve general knowledge in the original model, and a contrastive loss that integrates previously learned domain knowledge with the current full network. On six domains, DAS achieves average macro-F1 77.93% versus 76.40% for DAP-RoBERTa and 76.36% for naïve continual learning, with negative forgetting rates reported in macro-F1 (Ke et al., 2023).
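A soft mask over gradients can be sketched as follows. The snippet assumes each unit has an importance score in [0, 1] and scales that unit's gradient by one minus the score, so important units change less. This scaling rule and the function name are illustrative assumptions about how such a mask might be applied; the snippet does not cover how DAS estimates importance, nor its contrastive loss.

```python
def soft_mask_gradients(grads, importance):
    """Scale each unit's gradient by (1 - importance).

    importance values are clipped to [0, 1]; a score of 1.0 blocks the
    update entirely, while 0.0 leaves the gradient unchanged.
    """
    if len(grads) != len(importance):
        raise ValueError("grads and importance must have the same length")
    masked = []
    for g, imp in zip(grads, importance):
        clipped = min(1.0, max(0.0, imp))
        masked.append(g * (1.0 - clipped))
    return masked


if __name__ == "__main__":
    print(soft_mask_gradients([1.0, 2.0, 4.0], [0.0, 0.5, 1.0]))
```

Unlike a hard binary mask, this keeps every unit trainable to some degree, which lets the network reuse capacity across domains.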
Parameter-efficient adaptation has also become prominent. DoMIX trains separate LoRA modules for each domain in parallel, concatenates the domain subspaces, inserts a diagonal bridge matrix, freezes part of the concatenated LoRA matrices, and then fine-tunes only the remaining adapter parameters and the task head. On continual DAP benchmarks, DoMIX reaches average accuracy/F1 81.67/77.84, exceeding Full RoBERTa, LoRA-full, and DAS-full baselines while reporting 87% lower DAP GPU memory than DAS and 58% lower DAP training time (Kim et al., 3 Jul 2025).
Medical imaging supplies yet another efficiency axis. Full DAPT on ResNet50 uses 23.5M trainable parameters, 215.9 training hours, and 94.9 kWh; the proposed partial, hybrid, and simplified-architecture variants reduce training hours and energy, with hybrid L2SB-PFT achieving 179.0 hours and 78.6 kWh while outperforming full DAPT on the reported development and external AUCs (Mehmood et al., 2022). The common principle across these otherwise different techniques is selective updating: freeze some parameters, replay some data, or route domain information through small modules rather than rewriting the full model.
5. Empirical effects across modalities and tasks
The empirical case for DAP was established early in NLP. Gururangan et al. reported that RoBERTa on ACL-ARC improved from 63.0 to 75.4 with CS-domain DAP, and to 75.6 with DAP+TAP; on SciERC the progression was 77.3 to 80.8 to 81.3; on ChemProt it was 81.9 to 84.2 to 84.4 (Gururangan et al., 2020). Their control experiments with irrelevant-domain DAP showed that the gains were domain-specific rather than an effect of additional pretraining alone (Gururangan et al., 2020).
Low-resource and multilingual settings show similar behavior. Yaseen and Langer pretrained XLM-RoBERTa on approximately 30,314 in-domain sentences across six languages for multilingual acronym extraction and obtained macro-F1 improvements from 0.854 to 0.866 after one pretraining epoch and to 0.868 after three epochs, with especially large gains in Persian (+0.03 F1) and Vietnamese (+0.07 F1) (Yaseen et al., 2022). MDAPT likewise found that a single multilingual domain-specific model can outperform the general multilingual model and perform close to monolingual counterparts across nine biomedical and financial datasets (Jørgensen et al., 2021).
Cybersecurity adaptation provides one of the clearest recent demonstrations of large-model specialization under constrained data. After DAP, LLaMA-3.3-70B reached 0.7184 on CTI-MCQ, 0.9330 on CyberMetric, and 0.8638 on SecEval, compared with 0.7052, 0.9260, and 0.5081 before adaptation; the same DAP model exceeded the reported Llama-Primus-Base scores of 0.667, 0.866, and 0.500 while using 118.8 million tokens versus 2.77 billion tokens (Salahuddin et al., 30 Jun 2025).
DAP effects are not confined to text understanding. In primate behavior recognition, V-JEPA + DAP improved PanAf500 Top-1 accuracy from 83.68% to 87.24%, a +3.56 percentage-point gain over plain V-JEPA and +6.1 percentage points over the previous best published model; on ChimpACT, mAP rose from 26.0 to 30.7, a +4.7 percentage-point gain over plain V-JEPA and +6.3 percentage points over the best prior model (Mueller et al., 15 Sep 2025). In point-cloud learning, DAP-MAE reported 95.18% object classification on ScanObjectNN and 88.45% facial expression recognition on Bosphorus after a single cross-domain pretraining phase (Gao et al., 24 Oct 2025).
Applied moderation and industrial service tasks show more heterogeneous outcomes, but still support the underlying premise. In gaming toxicity detection, pure DAP raised DOTA 2 balanced accuracy from 0.65 to 0.71, and metadata-enhanced DAP raised it to 0.75; MWIII results were more mixed, with pure DAP decreasing balanced accuracy from 0.72 to 0.70 and separator-token variants giving only marginal changes (Schurger-Foy et al., 2 Apr 2025). In industrial sLLMs, DACP produced telco-domain average gains of +41% to +69%, finance-domain average gains of +49% and +31% on two smaller backbones, and finance retrieval MRR improvement from 47.6 to 73.6, while general-domain changes remained within ±7% (Kim et al., 9 Jul 2025).
Taken together, these results support a recurring pattern already emphasized in the foundational study: the largest gains typically occur when the target distribution is materially different from the original pretraining mixture, or when downstream supervision is limited (Gururangan et al., 2020).
6. Misconceptions, limitations, and open directions
A common misconception is that DAP necessarily requires very large in-domain corpora. Several studies argue against this. The cybersecurity work reports state-of-the-art cybersecurity benchmark scores with approximately 118.8M tokens and notes that even 1M tokens can yield modest gains on smaller models (Salahuddin et al., 30 Jun 2025). TextGram reports IMDb F1 91.02% after DAP on a 250K-sentence selected RealNews subset, compared with 90.36% when pretraining on the full 1M RealNews corpus without selection (Hiwarkhedkar et al., 2024). ICL-APT similarly suggests that relevance-oriented augmentation can outperform large-scale classical DAPT with far less GPU time (Zhukova et al., 28 Apr 2025). This suggests that corpus relevance, mixing strategy, and model capacity can matter as much as sheer token count.
A second misconception is that DAP is equivalent to downstream fine-tuning. Multiple papers explicitly separate the two stages. DAPT in Zhukova et al. preserves the MLM head and uses unlabeled text (Zhukova et al., 28 Apr 2025); the DSTC-9 dialog system performs DAPT and TAPT before end-to-end multi-task fine-tuning (Zhang et al., 2021); industrial DACP adds standard instruction-tuning after continual pretraining to restore instruction-following (Kim et al., 9 Jul 2025). In the cybersecurity study, future work explicitly proposes instruction tuning or RLHF after DAP to recover any instruction-following capabilities (Salahuddin et al., 30 Jun 2025).
The main limitations are also consistent across studies. Catastrophic forgetting remains a concern: the cybersecurity work did not directly evaluate general benchmarks such as SuperGLUE or MMLU because of compute constraints (Salahuddin et al., 30 Jun 2025), and the industrial DACP paper notes replay-corpus mismatch, residual forgetting of approximately 5% on smaller models, and non-trivial compute requirements (Kim et al., 9 Jul 2025). Context limits can constrain adaptation quality: the cybersecurity models were truncated to 1,024–2,048 tokens (Salahuddin et al., 30 Jun 2025), while industrial DACP uses 32K contexts but still relies on fixed replay mixing rather than a dynamic curriculum (Kim et al., 9 Jul 2025). Some domains exhibit weaker or more uneven gains, as seen in MWIII toxicity detection and in the smallest cybersecurity model on complex SecEval tasks (Schurger-Foy et al., 2 Apr 2025, Salahuddin et al., 30 Jun 2025).
Current research directions therefore concentrate on more selective and more stable adaptation. Reported proposals include adapter-based or prefix-tuning hybrids, knowledge distillation during DAP, dynamic replay scheduling, multi-domain continual strategies, automated domain selection, vocabulary expansion, tokenizer optimization, and stronger evaluation of general-language retention (Salahuddin et al., 30 Jun 2025, Kim et al., 9 Jul 2025, Feng et al., 2024, Kim et al., 3 Jul 2025). A plausible implication is that DAP is converging toward a family of modular post-pretraining recipes—full-model, replay-based, federated, multilingual, retrieval-augmented, and parameter-efficient—rather than a single canonical pipeline.