
Domain Adaptation via Prompt Learning (DAPL)

  • The paper introduces DAPL, embedding domain and class cues into learnable prompt vectors to bypass semantic distortions in unsupervised domain adaptation.
  • It leverages frozen vision-language backbones like CLIP while optimizing only lightweight prompt vectors for rapid, parameter-efficient adaptation.
  • Empirical results on Office-Home and VisDA-2017 demonstrate DAPL's superiority, achieving up to 4.2% higher accuracy compared to traditional feature alignment methods.

Domain Adaptation via Prompt Learning (DAPL) is a paradigm in unsupervised domain adaptation wherein domain information is embedded into prompt representations, leveraging the capabilities of large-scale pre-trained vision-language models such as CLIP. Unlike classical methods that aim for feature-space alignment, often at the cost of semantic distortion and lost discriminability, DAPL dynamically adapts the classifier to each domain by learning distinct prompts, thereby circumventing the limitations of forced feature-level invariance (Ge et al., 2022).

1. Motivation and Theoretical Rationale

Prevailing unsupervised domain adaptation (UDA) methodologies, including statistical discrepancy minimization (MMD, CMD) and adversarial training, explicitly align source and target feature manifolds. While these approaches encourage domain invariance, they risk distorting the semantic feature structure and losing class discriminability, especially when class and domain structure are entangled. DAPL capitalizes on prompt learning's context-injection mechanism, extensively utilized in NLP and more recently in vision-language models, to achieve domain-specific adaptation by encoding context directly in input prompts. This strategy avoids the inherent trade-off in feature alignment between preserving semantics and enforcing invariance.

2. Vision-Language Foundation Model Integration

DAPL operates atop pre-trained vision-language networks, specifically CLIP, which couples a frozen image encoder (e.g., ResNet or ViT) with a transformer-based text encoder. During adaptation, only the prompt context vectors are optimized, while the underlying model weights remain fixed. This approach benefits from robust feature representations acquired via large-scale pre-training and enables efficient adaptation through parameter-efficient prompt tuning, as sketched below the table.

Component     | Functionality                 | Training Status
Image encoder | Extracts visual features      | Frozen (pre-trained)
Text encoder  | Embeds prompt representations | Frozen (pre-trained)
Prompts       | Encode domain/class context   | Learnable
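
This division of labor is easy to express in code. Below is a minimal PyTorch sketch, assuming OpenAI's open-source `clip` package (the `token_embedding` attribute follows that implementation) and illustrative context sizes; only the prompt vectors receive gradients:

```python
import torch
import clip  # OpenAI's CLIP: https://github.com/openai/CLIP

# Load a pre-trained CLIP backbone; both encoders stay frozen throughout.
model, preprocess = clip.load("ViT-B/16", device="cpu")
for p in model.parameters():
    p.requires_grad_(False)

# Learnable prompt context vectors (sizes M1 = M2 = 4 are illustrative).
M1, M2 = 4, 4
dim = model.token_embedding.embedding_dim
shared_ctx = torch.nn.Parameter(0.02 * torch.randn(M1, dim))     # domain-agnostic
domain_ctx = torch.nn.Parameter(0.02 * torch.randn(2, M2, dim))  # index 0: source, 1: target

# Only the prompt parameters are handed to the optimizer.
optimizer = torch.optim.SGD([shared_ctx, domain_ctx], lr=3e-3, momentum=0.9)
```

Because the optimizer only ever sees the two small context tensors, checkpointing and per-domain adaptation stay cheap regardless of backbone size.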

3. Embedding Domain and Class Information in Prompts

DAPL formulates each prompt as a sequence of learnable context vectors combined with a class token:

$$p^d_k = [c_1][c_2]\ldots[c_{M_1}][c^d_1][c^d_2]\ldots[c^d_{M_2}][\mathrm{CLASS}]_k$$

  • $[c_1]\ldots[c_{M_1}]$: domain-agnostic context, shared across all domains.
  • $[c^d_1]\ldots[c^d_{M_2}]$: domain-specific context, unique to each domain $d$ (e.g., source or target).
  • $[\mathrm{CLASS}]_k$: class identifier token for class $k$.

This explicit decomposition allows DAPL to decouple class and domain cues. Domain-specific tokens characterize unique aspects of each domain (e.g., lighting, object backgrounds), while the agnostic context ensures generalization across domains.
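
Continuing the sketch above, composing $p^d_k$ amounts to splicing the shared and domain-specific context vectors into the embedded class-name sequence, CoOp-style; the helper below is illustrative rather than the paper's released code:

```python
def build_prompt(class_name: str, d: int) -> torch.Tensor:
    """Compose p^d_k = [c_1..c_M1][c^d_1..c^d_M2][CLASS]_k as an embedding sequence."""
    # Embed the tokenized class name with CLIP's (frozen) token embedding table.
    token_ids = clip.tokenize(class_name)                 # (1, 77) ids: [SOS] ... [EOS]
    class_emb = model.token_embedding(token_ids)          # (1, 77, dim)
    # Splice the context vectors in after [SOS], before the class-name tokens.
    ctx = torch.cat([shared_ctx, domain_ctx[d]], dim=0).unsqueeze(0)  # (1, M1+M2, dim)
    sos, rest = class_emb[:, :1, :], class_emb[:, 1:, :]
    keep = 77 - 1 - ctx.shape[1]                          # respect CLIP's context length
    return torch.cat([sos, ctx, rest[:, :keep, :]], dim=1)  # (1, 77, dim)
```

The resulting sequence is passed through CLIP's frozen text transformer (with its positional embeddings), and the end-of-text token's projection serves as the prompt feature $g(p^d_k)$.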

4. Training Algorithm and Loss Structure

The training process is organized as follows:

  1. Prompt Initialization: Random Gaussian initialization of context vectors.
  2. Source Supervised Training: For labeled source instances $(x^s_i, y^s_i)$, compute prompt-aligned classification probabilities:

$$P(\hat{y}^s_i = k \mid x^s_i) = \frac{\exp\left(\langle g(p^s_k), f(x^s_i)\rangle / T\right)}{\sum_{d \in \{s,u\}} \sum_{j=1}^{K} \exp\left(\langle g(p^d_j), f(x^s_i)\rangle / T\right)}$$

where $g$ and $f$ denote the frozen text and image encoders, respectively, and $T$ is the temperature hyperparameter.

  3. Target Pseudo-Labeling: For each unlabeled target instance $x^u$, assign the pseudo-label $y^u$ if the model confidence satisfies $P(\hat{y}^u = y^u \mid x^u) \geq \tau$, and incorporate it into the target loss $\mathcal{L}_u$ via an indicator function.
  4. Loss Optimization: Minimize the total loss with respect to the prompt parameters:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_s + \mathcal{L}_u$$

where

$$\mathcal{L}_s = -\frac{1}{N_s} \sum_{i=1}^{N_s} \log P(\hat{y}^s_i = y^s_i)$$

$$\mathcal{L}_u = -\frac{1}{N_u} \sum_{i=1}^{N_u} \mathbb{I}\left\{P(\hat{y}^u_i = y^u_i \mid x^u_i) \geq \tau\right\} \log P(\hat{y}^u_i = y^u_i \mid x^u_i)$$

This yields prompt vectors specialized to both source and target (pseudo-labeled) data distributions, with all backbone parameters frozen.
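
The two losses can be written compactly once image and prompt features are extracted. The sketch below assumes L2-normalized features and that `text_features` stacks $g(p^d_j)$ for $d \in \{s, u\}$ (source first) and all $K$ classes; names and shapes are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def dapl_losses(feats_s, labels_s, feats_u, text_features, T=0.01, tau=0.8):
    """
    feats_s: (Ns, dim) L2-normalized source image features f(x^s)
    feats_u: (Nu, dim) L2-normalized target image features f(x^u)
    text_features: (2, K, dim) L2-normalized prompt features g(p^d_k); row 0 = source prompts
    """
    _, K, dim = text_features.shape
    prompts = text_features.reshape(2 * K, dim)

    # Temperature-scaled cosine similarities; the softmax runs over all 2K prompts,
    # matching the double sum in the denominator of the class-probability equation.
    logits_s = feats_s @ prompts.t() / T                   # (Ns, 2K)
    loss_s = F.cross_entropy(logits_s, labels_s)           # labels index source prompts 0..K-1

    # Pseudo-label each target image with its most confident target-domain class,
    # keeping it only if the confidence clears the threshold tau.
    probs_u = (feats_u @ prompts.t() / T).softmax(dim=-1)  # (Nu, 2K)
    conf, pseudo = probs_u[:, K:].max(dim=-1)              # target prompts: columns K..2K-1
    mask = (conf >= tau).float()
    picked = probs_u[torch.arange(len(pseudo)), K + pseudo]
    loss_u = -(mask * torch.log(picked + 1e-8)).sum() / feats_u.shape[0]

    return loss_s + loss_u
```

Note that source and target prompts compete in the same softmax for every image, so the prompts themselves absorb the domain shift rather than the (frozen) feature extractor.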

5. Efficiency and Practical Considerations

DAPL is computationally and memory efficient: only the prompt vectors (on the order of $10^3$ parameters) are optimized. Training is several times faster than feature-alignment-based approaches owing to the lightweight parameterization. Implementation is also simple, since prompts are inserted as input tokens to the text encoder; this removes the need for adversarial discriminators or complex statistical loss balancing and makes the method widely compatible across vision-language infrastructures.

Method          | Trainable Params | Training Time (VisDA-2017) | Backbone Weights
DAPL            | ~5K              | 5.3h                       | Frozen
MCD, DANN, etc. | >10M             | 13.4h–38.3h                | Fine-tuned
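
A quick back-of-the-envelope check of the trainable-parameter count, using illustrative token counts ($M_1 = M_2 = 4$) and CLIP's 512-dimensional text embedding:

```python
# Back-of-the-envelope trainable-parameter count (illustrative configuration).
M1, M2, num_domains, dim = 4, 4, 2, 512  # shared tokens, per-domain tokens, domains, embed dim
trainable = (M1 + num_domains * M2) * dim
print(trainable)  # 6144, i.e. on the order of the ~5K figure quoted above
```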

6. Quantitative Results and Comparisons

On standard cross-domain benchmarks such as Office-Home and VisDA-2017, DAPL demonstrates consistent superiority over both feature alignment methods and prompt-tuned CLIP baselines. It attains a mean accuracy of 74.5% on Office-Home and 86.9% on VisDA-2017, outperforming SOTA alternatives by 2.5–4.2 percentage points. Specific class-level improvements (e.g., up to +14% for "knife" and "plant" in VisDA) highlight DAPL's discriminative strength in challenging transfer scenarios. Ablation studies reveal the critical impact of domain-specific tokens; omitting them results in marked accuracy drops.

7. Role in Unsupervised Domain Adaptation Methodology

DAPL establishes a distinct paradigm within UDA, leveraging prompt learning rather than explicit feature alignment. By dynamically constructing prompts that encode both domain and class information, DAPL maintains semantic structure and class discriminability even as data distributions shift. Its efficiency, effectiveness, and ease of implementation represent salient advantages and set a new benchmark in parameter-efficient domain adaptation strategies.

Summary Table: DAPL Core Features and Equations

Aspect                     | Form / Description
Prompt composition         | $p^d_k = [c_1]\ldots[c_{M_1}][c^d_1]\ldots[c^d_{M_2}][\mathrm{CLASS}]_k$
Class probability          | $P(\hat{y}^s_i = k \mid x^s_i)$: temperature-scaled softmax over prompt–image cosine similarities
Training loss              | $\mathcal{L}_{\text{total}} = \mathcal{L}_s + \mathcal{L}_u$
Efficiency                 | Only prompt vectors optimized; frozen backbone; rapid convergence
Empirical result highlight | +4.2% accuracy over previous SOTA (VisDA-2017); effective pseudo-labeling for the target domain

DAPL redefines unsupervised domain adaptation through prompt learning, preserving semantic integrity and achieving state-of-the-art results on major benchmarks with highly compact parameterization (Ge et al., 2022).

References

Ge, C., Huang, R., Xie, M., Lai, Z., Song, S., Li, S., & Huang, G. (2022). Domain Adaptation via Prompt Learning. arXiv:2202.06687.
