Domain-Adaptive Pre-Training

Updated 17 November 2025
  • Domain-adaptive pre-training is a strategy that further pre-trains generic models on target-domain data to align internal representations with in-domain semantics.
  • It employs techniques like keyword-centric masking and modified loss functions to enhance performance on tasks such as radiology summarization, classification, and structured conversations.
  • The method preserves the original model architecture while addressing challenges like catastrophic forgetting and computational efficiency, making it a practical precursor to fine-tuning.

Domain-adaptive pre-training (DAPT) is an intermediate adaptation strategy in which a pre-trained model—often language, vision, or multimodal—is further pre-trained on unlabeled data from a target domain before supervised fine-tuning. This process injects domain knowledge into the model’s representations, aligning them with in-domain semantics and shifting the model’s prior in favor of the target data distribution while retaining generalization capabilities from the broad original pre-training. DAPT variants span supervised, self-supervised, and instruction-tuned settings, have been applied to text, images, and audio, and employ both task-agnostic (e.g., masked language modeling) and task-centric (e.g., keyword-centric masking, domain-specific heads, or continual instruction) objectives. Notable advances address catastrophic forgetting, computational efficiency, and integration with instruction tuning, with strong empirical performance on radiology report summarization, domain-specific classification, and structured business conversations.

1. Foundations and Objectives

In DAPT, a generic pre-trained model is exposed to unlabeled, task-relevant data from the target domain to minimize the mismatch between pre-training (source) and deployment (target) distributions without supervised labels from the target domain. The objective function is typically inherited from the original pre-training—causal language modeling (for GPT/BLOOM-style transformers), masked language modeling (for BERT-style), or masked image modeling (for ViT derivatives)—but applied to the in-domain corpus exclusively:

\mathcal{L}_{\mathrm{DAPT}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

for auto-regressive models, or

\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p_\theta(x_i \mid x_{\setminus M})

for masked objectives.
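
To make the two objectives concrete, the following is a minimal PyTorch sketch of how each loss reduces to token-level cross-entropy; the random logits and tokens are placeholders standing in for a real model's output head and an in-domain batch, not any specific architecture.

```python
import torch
import torch.nn.functional as F

# Toy shapes standing in for a real model's output head and an in-domain batch.
batch, seq_len, vocab_size = 2, 8, 100
logits = torch.randn(batch, seq_len, vocab_size)          # model scores for p_theta
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # in-domain token ids

# Causal LM loss: predict x_t from x_{<t}, so targets are the inputs shifted by one.
# (cross_entropy averages over tokens; the sum in the formula differs only by a factor.)
causal_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)

# Masked LM loss: only the randomly chosen masked positions M contribute.
mask = torch.rand(batch, seq_len) < 0.15   # standard 15% masking ratio
mask[0, 0] = True                          # ensure at least one masked position in this toy example
mlm_loss = F.cross_entropy(logits[mask], tokens[mask])

print(float(causal_loss), float(mlm_loss))
```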

The DAPT phase leverages three main design elements:

  1. Domain corpus construction: Filtering, deduplication, and curation of relevant unlabeled in-domain data (e.g., radiology reports (Karn et al., 2023), medical images (Roth et al., 21 Oct 2024), annotated excerpts).
  2. Pre-training objectives: The standard loss can be modified for domain-specialization (e.g., masking only domain keywords (Golchin et al., 2023), inclusion of contrastive or distillation terms (Ke et al., 2023)).
  3. Integration with prompt or instruction tuning: DAPT may follow instruction-tuning stages, especially in LLMs (e.g., BLOOM → BLOOMz → RadBloomz (Karn et al., 2023)).

The key aim is to align internal representations and tokenizer vocabularies with domain-specific concepts, terms, and structure, thereby improving downstream transfer, often in zero-shot or low-shot regimes.
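
One concrete way to align the tokenizer vocabulary is to add frequent domain terms as whole tokens and resize the embedding matrix, as in this hedged Hugging Face Transformers sketch; the checkpoint and term list are illustrative only, and some DAPT setups (Section 2.1) instead keep the tokenizer untouched.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloomz-560m"   # illustrative base model, not the paper's 7B variant
domain_terms = ["pneumothorax", "cardiomegaly", "IMPRESSION:"]  # hypothetical term list

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Register domain terms that the base vocabulary would otherwise split into
# many sub-tokens, then grow the embedding matrix to the new vocabulary size.
num_added = tokenizer.add_tokens(domain_terms)
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} domain tokens")
```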

2. Methodological Variants and Optimization Strategies

2.1 Architecture-Preserving Adaptation

DAPT is typically implemented without architectural modifications. Pre-trained weights are updated, but the model structure, including the tokenizer, positional encodings, and head configuration, remains fixed. For instance, the instruction-tuned BLOOMz-7b1 is further pre-trained on MIMIC-IV radiology reports, using special section-marker tokens but no layer changes (Karn et al., 2023).
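
A hedged sketch of this architecture-preserving recipe with the Hugging Face Trainer: the instruction-tuned checkpoint is simply continued on unlabeled in-domain text under the causal LM objective, with no structural changes. The checkpoint name, file path, and hyperparameters are placeholders, not the exact RadBloomz configuration.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "bigscience/bloomz-560m"                      # placeholder instruction-tuned base
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)   # architecture left untouched

# Unlabeled in-domain documents, one per line (hypothetical file path).
raw = load_dataset("text", data_files={"train": "radiology_reports.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# mlm=False -> plain causal language modeling, i.e. the original pre-training objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="dapt-checkpoint",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=1,
    bf16=True,
)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```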

2.2 Data Preprocessing and Curation

  • Radiology DAPT: Extraction and cleaning of 1.4M reports, section-based segmentation, and discarding of long or incomplete documents (Karn et al., 2023).
  • Keyword-centric masking: In-domain terms are extracted with KeyBERT via contextualized-embedding cosine similarity and filtered globally by frequency elbows, restricting masking to domain-carrying words (Golchin et al., 2023); see the extraction sketch after this list.
  • Medical image DAPT: Class-stratified splits, patient-ID–preserving protocol, and dynamic class remapping support fine-grained adaptation (Roth et al., 21 Oct 2024).
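
The sketch below illustrates the keyword-centric extraction step referenced above: KeyBERT scores candidate terms by cosine similarity to the document embedding, and a simple corpus-frequency cutoff stands in for the global frequency-elbow filter described in the paper. The documents and threshold are toy placeholders.

```python
from collections import Counter
from keybert import KeyBERT

# Toy in-domain corpus standing in for a curated report collection.
docs = [
    "Chest radiograph shows no focal consolidation or pneumothorax.",
    "Mild cardiomegaly with small bilateral pleural effusions.",
    "No pneumothorax or pleural effusion identified.",
]

kw_model = KeyBERT()   # backed by a sentence-transformers embedding model

# Score per-document keywords by cosine similarity to the document embedding,
# then count how often each term surfaces across the corpus.
counts = Counter()
for doc in docs:
    for term, _score in kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), top_n=5):
        counts[term] += 1

# Crude global cutoff standing in for the frequency-elbow criterion.
domain_keywords = {term for term, count in counts.items() if count >= 2}
print(sorted(domain_keywords))
```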

2.3 Loss Modifications and Advanced Objectives

  • Causal/Masked Language Modeling: The original objective is applied directly to in-domain sequences, with masking ratios of 15–75% (higher when keywords are prioritized).
  • Selective Masking: Masking probability p=0.75 on in-domain keywords versus 0.15 elsewhere yields consistent gains of roughly 0.3–1.2 pp in accuracy and F1 over random-masking baselines (Golchin et al., 2023); a masking sketch follows this list.
  • Contrastive/Distillation Terms: KL-divergence between teacher and student features (used to distill from semantically-superior or more general models (Roth et al., 21 Oct 2024)) is weighted along with the main reconstruction error.
  • Layer Freezing for Catastrophic Forgetting: When fine-tuning RadBloomz on task-specific pairs, only the final transformer block is unfrozen to preserve acquired knowledge (Karn et al., 2023).
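
A minimal sketch of the selective-masking rule referenced above, written as plain token-level logic (in practice this would live inside the MLM data collator): keywords are masked with probability 0.75, all other tokens at the standard 0.15. The mask token string and example sentence are illustrative.

```python
import random

MASK_TOKEN = "[MASK]"

def selective_mask(tokens, keywords, p_key=0.75, p_other=0.15, seed=0):
    """Mask domain keywords aggressively and remaining tokens at the standard rate."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        p = p_key if tok.lower() in keywords else p_other
        if rng.random() < p:
            masked.append(MASK_TOKEN)
            labels.append(tok)    # target the model must reconstruct
        else:
            masked.append(tok)
            labels.append(None)   # position ignored by the MLM loss
    return masked, labels

tokens = "mild cardiomegaly with small bilateral pleural effusions".split()
keywords = {"cardiomegaly", "pleural", "effusions"}
print(selective_mask(tokens, keywords))
```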

2.4 Training Schedule Considerations

  • Step count and validation: Zero-shot performance on radiology summarization saturates within ≈24k DAPT steps, with incremental value quickly diminishing thereafter (Karn et al., 2023).
  • Hardware/Parallelism: Multi-GPU (8×A100 80GB), DeepSpeed ZeRO-3 optimization, and BF16 precision are used for RadBloomz; large batch sizes (64–128 sequences) are common in all studies (Karn et al., 2023, Roth et al., 21 Oct 2024).
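
A hedged sketch of how such a schedule might be expressed with TrainingArguments plus a minimal DeepSpeed ZeRO-3 config; the values mirror the ranges reported above (8 GPUs, BF16, ~24k steps) and are illustrative rather than the exact RadBloomz settings. Training would still be launched with the deepspeed or torchrun launcher across the GPUs.

```python
from transformers import TrainingArguments

# Minimal DeepSpeed ZeRO-3 config; "auto" lets the Trainer fill in matching values.
ds_zero3 = {
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="dapt-zero3",
    per_device_train_batch_size=8,   # 8 GPUs x 8 sequences -> effective batch of 64
    gradient_accumulation_steps=1,
    max_steps=24_000,                # gains saturate around this point (Karn et al., 2023)
    learning_rate=1e-5,
    bf16=True,
    deepspeed=ds_zero3,              # accepts a dict or a path to a JSON config file
)
```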

3. Empirical Performance and Comparative Analysis

DAPT consistently enhances performance on specialized downstream tasks, often even without supervised fine-tuning:

  • Radiology Impression Generation (RadBloomz):
    • Zero-shot: BLEU-4=16.49, ROUGE-L=35.25, BERTScore=57.29, F1-RadGraph=31.12 (open test, MIMIC-III).
    • Supervised fine-tuning: BLEU-4=25.32 (+8.83), ROUGE-L=47.48 (+12.23), BERTScore=63.61 (+6.32), F1-RadGraph=49.00 (+17.88).
    • Surpasses all pretrain-and-finetune baselines in zero-shot mode (Karn et al., 2023).
  • Keyword-enriched DAPT:
    • BERT-Large: Absolute gains of up to 1.0–1.2 pp over random masking in accuracy/F1, statistically significant in most benchmarks (p ≤ 0.05 in 83% of tests) (Golchin et al., 2023).
  • Medical Image DAPT (GIE, EVA-02):
    • Macro AUC=0.762, Balanced ACC=37.1% (Capsule Endoscopy 2024), substantially ahead of strong non-adapted baselines (Roth et al., 21 Oct 2024).
    • DAPT yields 5–7 pp gains over ImageNet-only pre-training, reflected in downstream classification improvements (ACC: 0.810 → 0.893).
  • Comparison and Saturation:
    • In both language and vision, incremental DAPT leads to rapid performance saturation; major gains are realized within a relatively small number of adaptation steps when starting from instruction-tuned or strong general-domain checkpoints (Karn et al., 2023, Roth et al., 21 Oct 2024).

4. Practical Implementation Guidelines

| Step | Recommendation | Reference |
| --- | --- | --- |
| Corpus curation | Deduplicate, clean headers/footers, filter incomplete/long docs | (Karn et al., 2023) |
| Tokenization | Domain-specific vocabularies; retain de-identified tokens | (Karn et al., 2023) |
| Special tokens | Use section markers, custom delimiters for structural cues | (Karn et al., 2023) |
| Masking strategy | In critical domains, mask high-frequency domain terms, p≈0.75 | (Golchin et al., 2023) |
| Loss function | Standard cross-entropy; mix in distillation or keyword loss | (Roth et al., 21 Oct 2024) |
| Training config | Batch size 64–128; AdamW optimizer; LR 1e-6–3e-5; 1–3 epochs | (Roth et al., 21 Oct 2024) |
| Catastrophic forgetting | Freeze all but the final layer during downstream supervised fine-tuning | (Karn et al., 2023) |
| Compute/hardware | Large-scale (A100-class GPUs), BF16, DeepSpeed, multi-GPU | (Karn et al., 2023) |

DAPT is generally recommended as a drop-in step between general pre-training (on web-scale or generic corpora) and task-specific fine-tuning, particularly for domains with scarce label resources but abundant raw text or images.
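
A hedged sketch of the layer-freezing recommendation above, assuming a BLOOM-style model whose decoder blocks are exposed as model.transformer.h in Hugging Face Transformers (other architectures name this differently); the checkpoint is a small placeholder rather than the 7B model used in the paper.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")  # placeholder

# Freeze everything, then re-enable gradients only for the final transformer
# block to limit catastrophic forgetting during supervised fine-tuning.
for param in model.parameters():
    param.requires_grad = False

for param in model.transformer.h[-1].parameters():   # last decoder block only
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```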

5. Strengths, Limitations, and Future Directions

Strengths

  • Zero-shot superiority: RadBloomz’s zero-shot performance exceeds all pretrain-then-finetune competitors in radiology summarization (Karn et al., 2023).
  • Task-agnostic efficiency: In-domain keyword masking applies regardless of the downstream task, making the DAPT workflow easily reusable across settings (Golchin et al., 2023).
  • Minimal architectural changes: The transformer architecture and tokenizer are left intact, reducing risk and implementation complexity (Karn et al., 2023, Roth et al., 21 Oct 2024).
  • Scaling properties: DAPT benefits grow with domain data size but saturate rapidly when starting from a strong base model.

Limitations

  • Domain and demographic bias: Medical DAPT experiments are English- and adult-centric; domain shifts across institutions, writing styles, and patient demographics are not addressed (Karn et al., 2023).
  • Model size constraints: The 7B-parameter regime is computationally burdensome and not immediately amenable to edge deployment without further model compression (Karn et al., 2023).
  • Absence of style modeling/uncertainty: Variation in radiologist style, specificity, and report structuring is not explicitly modeled.
  • Generalization to multilingual/multi-institution settings: DAPT methods require explicit adaptation to exploit non-English corpora and alternative report conventions.

Future Work

  • Exploration of quantization, pruning, and distillation for resource-constrained deployment (Karn et al., 2023).
  • Extension of DAPT to non-English, multi-institutional, and multi-modal settings (structured+unstructured, imaging+text) (Roth et al., 21 Oct 2024).
  • Automatic inference or synthesis of document structure (for improved section identification and context segmentation).
  • Integration of uncertainty quantification, style transfer, and interpretability modules.

6. Best Practices and Workflow Summary

  1. Base model selection: Begin with an instruction-tuned or powerful general-purpose checkpoint to maximize adaptation efficiency.
  2. Corpus assembly: Filter and deduplicate in-domain data; enforce presence of critical sections if applicable (e.g., FINDINGS/IMPRESSIONS).
  3. Tokenization: Use domain-augmented vocabularies, retaining domain-unique markers (de-identification, section headers).
  4. Sectional/structural markers: Prepend explicit delimiters to inputs to direct attention to key domain context.
  5. Masking and loss: For highly structured domains, prioritize masking domain-specific tokens over random masking for improved learning efficiency and downstream gains.
  6. Training and tuning: Optimize with standard or slightly reduced learning rates, use moderate batch sizes, and employ layer freezing or regularization during supervised fine-tuning to mitigate forgetting.
  7. Evaluation: Employ both exact-match and embedding-based metrics (BLEU, ROUGE, BERTScore, domain-specific F1), and validate on hidden test or out-of-domain splits for full robustness assessment.
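
A hedged sketch of the metric computation in step 7 using the Hugging Face evaluate library; the prediction/reference pair is a toy example, and domain-specific scores such as F1-RadGraph require their own tooling not shown here.

```python
import evaluate

predictions = ["no acute cardiopulmonary process"]
references = ["no acute cardiopulmonary abnormality"]

# Exact-match style n-gram overlap.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=predictions, references=references))

# Embedding-based similarity.
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```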

In summary, domain-adaptive pre-training is a robust, highly effective, and computationally practical step to endow large pre-trained models with domain knowledge, yielding state-of-the-art performance—often in a zero-shot setting—and establishing a paradigm for Data-Centric AI in specialized application domains (Karn et al., 2023, Golchin et al., 2023, Roth et al., 21 Oct 2024).
