Domain-Adaptive Continuous Pretraining (DAP)
- DAP is a framework where pre-trained language models are further trained on domain-specific corpora using self-supervised objectives like masked language modeling (MLM) to capture specialized features.
- It employs strategies such as LoRA modules, knowledge integration, and replay buffering to enhance downstream task performance while minimizing catastrophic forgetting.
- Continual DAP leverages modular adapters and parallel adaptation to efficiently manage multi-domain scenarios with reduced computational costs.
Domain-Adaptive Continuous Pretraining (DAP), sometimes also termed Domain-Adaptive Pretraining (DAPT) or Domain-Adaptive Continual Pretraining (DACP), is a framework in which a pre-trained LLM is further pretrained on large, unlabeled domain-specific corpora. This continued adaptation enables the model to acquire specialized, domain-relevant representations that significantly enhance downstream supervised performance on tasks from the target domain, while minimizing catastrophic forgetting of general language knowledge. The formalism, methodologies, variants, and empirical characteristics of DAP—and its efficient and continual extensions—are detailed below, including objectives, algorithmic formulations, downstream effects, and limitations.
1. Formal Objective and Core Methodology
The canonical DAP protocol operates as follows: a base pretrained LLM with parameters $\theta_0$, originally trained on a broad, generic corpus (e.g., Wikipedia, CommonCrawl), is exposed to a large, unlabeled corpus $\mathcal{D}_{\text{dom}}$ drawn from the intended domain of specialization (e.g., biomedical literature, legal documents, customer service dialogues).
The adaptation consists of continued optimization for a self-supervised objective. Most commonly, this is masked language modeling (MLM) for Transformer encoder models (e.g., BERT, RoBERTa), or autoregressive next-token prediction for decoder models (e.g., GPT, LLaMA):
$$\mathcal{L}_{\text{MLM}}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}_{\text{dom}}} \sum_{i \in M(x)} \log p_\theta\!\left(x_i \mid x_{\setminus M(x)}\right)$$
where $M(x)$ denotes the set of masked positions in input $x$ (Gururangan et al., 2020). The parameters $\theta$ are initialized from $\theta_0$ and optimized over $\mathcal{D}_{\text{dom}}$ for a fixed number of steps or epochs.
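To make the objective concrete, here is a minimal numpy sketch of the masked-LM loss, summing the negative log-likelihood over masked positions only. The vocabulary size, sequence length, random logits, and fixed mask positions are illustrative assumptions, not details from any cited paper.

```python
import numpy as np

def mlm_loss(logits, targets, masked_positions):
    """Masked-LM negative log-likelihood, summed over the masked positions M(x).

    logits: (seq_len, vocab) unnormalized scores from the model
    targets: (seq_len,) original (unmasked) token ids
    masked_positions: indices of the masked tokens in the input
    """
    # numerically stable log-softmax over the vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # -sum_{i in M(x)} log p_theta(x_i | x_{\ M(x)})
    return -log_probs[masked_positions, targets[masked_positions]].sum()

rng = np.random.default_rng(0)
seq_len, vocab = 8, 50                      # toy sizes for illustration
logits = rng.normal(size=(seq_len, vocab))  # stand-in for model outputs
targets = rng.integers(0, vocab, size=seq_len)
masked = np.array([1, 4, 6])                # fixed mask here; ~15% random in practice
loss = mlm_loss(logits, targets, masked)
```

In a real DAP run the same loss is simply minimized over batches from the domain corpus, continuing from the pretrained checkpoint rather than a random initialization.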
Variants targeting instruction-following LLMs or task-oriented dialogue leverage the same principle but switch between next-token, MLM, or hybrid objectives depending on the model class (Kim et al., 9 Jul 2025, Zhang et al., 2021).
For models in resource-constrained or continual settings, DAP may be adapted using parameter-efficient tuning modules (e.g., LoRA) (Kim et al., 3 Jul 2025), auxiliary data selection, replay-based corpus balancing (Kim et al., 9 Jul 2025), or kNN-based in-context augmentation (Zhukova et al., 28 Apr 2025).
2. Continual and Multi-Domain DAP
Standard DAP operates over a single domain corpus. However, application environments often present models with a sequence of domains (e.g., incrementally arriving industry datasets). Continual DAP (also "continual DAPT", "DACP", or "multi-domain DAP") generalizes the objective to a sequence of domain corpora $\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_T$:
$$\min_{\theta_t} \; \mathcal{L}_{\text{MLM}}(\theta_t; \mathcal{D}_t), \qquad \theta_t \text{ initialized from } \theta_{t-1}, \quad t = 1, \dots, T$$
Naïvely updating $\theta$ sequentially with each $\mathcal{D}_t$ leads to catastrophic forgetting and sensitivity to domain order (Ke et al., 2023). Solutions include:
- Soft-masking and knowledge integration: Employing per-unit importance scores and gradient soft-masks to protect parameters critical to previously learned domains, combined with contrastive losses to promote domain-unique representations (Ke et al., 2023).
- LoRA-based modularization (DoMIX): Attaching separate low-rank adapters for each domain, learning them in parallel with the base model frozen ($\theta_0$ fixed). At fine-tuning, domain adapters are concatenated and mixed with a trainable bridge matrix, maintaining both modularity and transfer (Kim et al., 3 Jul 2025).
- Corpus replay and mixture scheduling: In DACP, a 50–50 mixture of domain and general corpora is maintained within each batch, mitigating forgetting and balancing specialized versus general capabilities. Empirically, this achieves only 1–2% absolute degradation in general benchmarks after multi-domain adaptation (Kim et al., 9 Jul 2025).
Continual DAP methods markedly outperform monolithic, sequential tuning in both robustness to data order and avoidance of catastrophic forgetting (Kim et al., 3 Jul 2025, Ke et al., 2023).
3. Efficiency and Parameter-Efficient DAP
The high computational burden and memory footprint of conventional DAP, especially when sequentially adapting large models to many domains, motivates parameter-efficient approaches:
- LoRA modules: For each domain $t$, small low-rank matrices $A_t$, $B_t$ (rank $r \ll d$) are inserted into all linear layers, with $\theta_0$ frozen. Each domain is adapted in parallel, independently, and model parameters grow only linearly with the number of domains (Kim et al., 3 Jul 2025).
- Bridge module for knowledge mixing: At fine-tuning, adapters are concatenated and mixed via a trainable bridge matrix, implementing soft selection of relevant domains for each downstream task (Kim et al., 3 Jul 2025).
- Resource savings: DoMIX achieves an 87% reduction in DAP-stage peak memory and a 58% reduction in training time compared to prior continual DAP methods (e.g., DAS), while growing trainable parameters by only ~30 MB per domain (about 3.35% of base model parameters) (Kim et al., 3 Jul 2025).
For LLM-scale models (e.g., Llama3-8B, Gemma2-9B), modular DAP further reduces GPU memory usage by 13–18% and training time by 36–42% compared to baseline PEFTs, without compromising downstream accuracy (Kim et al., 3 Jul 2025).
4. Knowledge Integration and Tailored Model Construction
A key innovation in efficient or continual DAP is the explicit retention and exploitation of domain-specific knowledge modules:
- Domain adapters: Instead of merging all knowledge into a single monolithic model, DAP frameworks such as DoMIX maintain separate modules for each domain (Kim et al., 3 Jul 2025).
- Task-specific mixing: At downstream fine-tuning, modules can be composed and weighted (via a bridge matrix) to produce models tailored to the input distribution of each task, unlike standard continual learning methods which produce a single generalized model (Kim et al., 3 Jul 2025).
- Ablation studies: Removal or freezing of the mixing module degrades downstream accuracy by >10 points, confirming the necessity of learnable composition for maximally leveraging accumulated domain knowledge (Kim et al., 3 Jul 2025).
This modular strategy enhances both specialization and domain transfer, circumventing the collapse of domain-unique representations typical in monolithic DAP.
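One plausible reading of the concatenate-and-mix step is that the stacked adapter factors are joined by a square bridge matrix that learns to weight and combine the domain subspaces. The block below sketches that layout with toy shapes; the exact placement of the bridge and its identity initialization are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 32, 4, 3   # hidden size, LoRA rank, number of domains (toy values)

# Per-domain adapters, learned independently during parallel DAP
As = [rng.normal(size=(r, d)) * 0.1 for _ in range(T)]
Bs = [rng.normal(size=(d, r)) * 0.1 for _ in range(T)]

# Concatenation: A_cat is (T*r, d), B_cat is (d, T*r)
A_cat = np.concatenate(As, axis=0)
B_cat = np.concatenate(Bs, axis=1)

# Trainable bridge mixing the domain subspaces; identity here for illustration
W_bridge = np.eye(T * r)

def bridged_delta(x):
    """Mixed adapter update B_cat @ W_bridge @ A_cat applied to input x."""
    return B_cat @ (W_bridge @ (A_cat @ x))

x = rng.normal(size=d)
out = bridged_delta(x)
# With an identity bridge, the mixed update equals the sum of the
# per-domain updates; training W_bridge reweights and cross-mixes them.
ref = sum(B @ (A @ x) for A, B in zip(As, Bs))
```

Fine-tuning only `W_bridge` (with the adapters and base model frozen) is what lets each downstream task softly select the domain knowledge it needs.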
5. Empirical Results and Comparative Evaluation
Benchmarks in continual domain classification demonstrate that:
- DoMIX: Achieves average accuracy 81.67% and F1 77.84% under LoRA fine-tuning, outperforming all tested baselines including Separate LoRA, Joint LoRA, NCL, EWC, KD, and DAS (Kim et al., 3 Jul 2025).
- Variance to domain order: DoMIX is statistically order-insensitive, with near-zero accuracy variance across the tested random orderings of the six domains (out of $6!$ possible permutations), whereas other methods vary by 3–5% (Kim et al., 3 Jul 2025).
- Task-specific gains: Modular DAP architectures achieve state-of-the-art performance on commonsense and arithmetic reasoning benchmarks (BoolQ, PIQA, SIQA, GSM8K, AQuA), matching or exceeding advanced LoRA variants (DoRA, MoSLoRA) (Kim et al., 3 Jul 2025).
- Parameter and cost efficiency: Across domains, DoMIX requires trainable parameters totaling only about 0.03× the base model size, maintaining peak GPU memory at 4.235 GiB (vs. 6.25 GiB for full fine-tuning) (Kim et al., 3 Jul 2025).
6. Limitations, Scalability, and Open Directions
Known limitations and outstanding challenges for DAP include:
- Linear parameter growth: Modular methods (e.g., DoMIX) experience linear growth in LoRA parameters with the number of domains, though each is small. All modules must be stored for future task exploitation (Kim et al., 3 Jul 2025).
- Module redundancy: Future research may focus on pruning or merging redundant domain subspaces to further control memory and parameter costs without degrading accuracy (Kim et al., 3 Jul 2025).
- Replay corpus construction: In continual DAP and DACP settings, approximating the base model's original pretraining distribution with a replay buffer is nontrivial, as open-weight models rarely disclose their pretraining data sources (Kim et al., 9 Jul 2025).
- Domain boundaries: Practical extension to interleaved or unbounded domain evolutions—a scenario common in industrial and real-world NLP deployments—remains incompletely addressed. Automated module management, online domain drift detection, and theoretical criteria for subspace importance are open problems.
7. Schematic Summary of DoMIX for Continual DAP
The following table summarizes key DoMIX workflow elements as reported by (Kim et al., 3 Jul 2025):
| Phase | Step | Parameter Update Scope |
|---|---|---|
| Parallel DAP | For each domain $\mathcal{D}_t$: attach LoRA $(A_t, B_t)$, optimize on $\mathcal{D}_t$ | $A_t, B_t$ only (base $\theta_0$ frozen) |
| Exploitation | Concatenate $\{(A_t, B_t)\}_{t=1}^{T}$, insert bridge matrix, fine-tune on task | bridge matrix (all else frozen) |
Domain adapters are learned independently, ensuring both task-adaptive exploitation and robustness to domain order.
For comprehensive details on methodology, empirical outcomes, and ablation findings, see (Kim et al., 3 Jul 2025).