Prmpt2Adpt: Zero-Shot Domain Adaptation
- The paper introduces the Prmpt2Adpt framework, which uses prompt-driven feature steering to enable zero-shot domain adaptation in resource-constrained scenarios.
- It employs prompt-driven instance normalization with a distilled CLIP backbone to align source features using textual prompts, bypassing the need for target images.
- The framework achieves up to 7x faster adaptation and 5x faster inference, making it ideal for real-time applications on devices like drones.
The Prmpt2Adpt (Prompt-to-Adapter) framework is a lightweight, zero-shot domain adaptation pipeline optimized for resource-constrained scenarios, notably real-time vision systems such as drones that face rapid distributional shifts and have limited memory or computational headroom (Farrukh et al., 20 Jun 2025). It advances unsupervised domain adaptation (UDA) by relying on prompt-driven semantic alignment rather than access to target-domain images (or the full source set) during adaptation. At its core, Prmpt2Adpt combines prompt-based feature steering through instance normalization, a distilled CLIP backbone, and teacher-student transfer via pseudo-labeling, achieving significant speed and hardware-efficiency advantages over prior work.
1. Architectural Overview
Prmpt2Adpt employs a three-stage pipeline integrating vision-language semantics and rapid detector adaptation:
- Distilled CLIP Backbone: A compact, distilled CLIP model (TinyCLIP [Wu et al., ICCV 2023]) is fine-tuned as the backbone for a Faster R-CNN teacher detector. The CLIP image encoder remains frozen post fine-tuning.
- Prompt-driven Instance Normalization (PIN): Source domain features from a minimal image cache are modulated by learned normalization parameters optimized to align with a CLIP embedding of a domain prompt that describes the unseen target scenario.
- Teacher-Student Paradigm: The teacher, adapted only via the detection head, generates high-quality pseudo-labels for the lightweight YOLOv11 nano student model, which is rapidly fine-tuned for inference in the target domain.
This workflow bypasses the need for target-domain images at adaptation time, relying exclusively on prompted domain descriptions and a handful of cached source image features.
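A structural sketch of this workflow is shown below; every callable is an illustrative placeholder supplied by the caller, and none of the names come from the paper.

```python
def prmpt2adpt_pipeline(cached_feats, prompt, text_encoder, pin_adapt,
                        finetune_teacher_head, teacher_predict, student_update,
                        target_stream):
    """Three-stage sketch: prompt-driven steering, teacher-head fine-tuning,
    and online pseudo-label transfer to the student (placeholder interfaces)."""
    text_emb = text_encoder(prompt)                              # frozen CLIP text encoder
    steered = [pin_adapt(f, text_emb) for f in cached_feats]     # Stage 1: PIN steering
    finetune_teacher_head(steered)                               # Stage 2: detection head only
    for frame in target_stream:                                  # Stage 3: deployment-time transfer
        student_update(frame, teacher_predict(frame))            # pseudo-labels supervise the student
```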
2. Prompt-driven Semantic Alignment
Text-Guided Feature Steering
Prmpt2Adpt addresses domain shift through direct CLIP-mediated semantic alignment using Prompt-driven Instance Normalization (PIN). This module adjusts the low-level source features by optimizing channel-wise mean and variance parameters such that the resultant feature embedding aligns (in CLIP's joint vision-language space) with the target domain prompt embedding.
Given:
- $f_s$: source features from Layer1 of the frozen CLIP image encoder,
- $T$: target domain description (natural language prompt),
- $t = E_{\mathrm{text}}(T)$: CLIP text embedding of $T$,

PIN computes:

$$\mathrm{PIN}(f_s; \mu, \sigma) = \sigma \odot \frac{f_s - \mu(f_s)}{\sigma(f_s)} + \mu,$$

where $\mu(f_s)$ and $\sigma(f_s)$ are the channel-wise mean and standard deviation of $f_s$, and $\mu, \sigma$ are learnable target-style statistics.

The adaptation loss is the cosine distance between the CLIP embedding of the transformed feature and the prompt embedding:

$$\mathcal{L}(\mu, \sigma) = 1 - \frac{\bar{E}\big(\mathrm{PIN}(f_s; \mu, \sigma)\big) \cdot t}{\big\lVert\bar{E}\big(\mathrm{PIN}(f_s; \mu, \sigma)\big)\big\rVert\,\lVert t\rVert},$$

where $\bar{E}$ denotes the remaining frozen CLIP image-encoder stages and projection. The parameters $\mu, \sigma$ are iteratively optimized per feature map using only $t$; no target images are required.
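A minimal PyTorch sketch of this optimization is given below. It assumes a frozen callable `clip_remaining_stages` that maps a steered Layer1 feature map into CLIP's joint embedding space; the function name, iteration count, and learning rate are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def pin_adapt(f_s, text_emb, clip_remaining_stages, steps=100, lr=1.0):
    """Optimize target-style statistics (mu, sigma) so that the PIN-steered
    feature embeds close to the target-prompt text embedding.

    f_s:                   source feature map of shape (C, H, W) from CLIP Layer1 (frozen)
    text_emb:              CLIP text embedding of the target-domain prompt, shape (D,)
    clip_remaining_stages: frozen callable mapping a (C, H, W) feature map to a (D,) embedding
    """
    # Channel-wise statistics of the source feature map.
    src_mu = f_s.mean(dim=(1, 2), keepdim=True)            # (C, 1, 1)
    src_sigma = f_s.std(dim=(1, 2), keepdim=True) + 1e-5    # (C, 1, 1)

    # Learnable target-style statistics, initialized from the source statistics.
    mu = src_mu.clone().requires_grad_(True)
    sigma = src_sigma.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([mu, sigma], lr=lr)

    text_emb = F.normalize(text_emb, dim=0)
    for _ in range(steps):
        # PIN: whiten with source statistics, then re-style with (mu, sigma).
        steered = sigma * (f_s - src_mu) / src_sigma + mu
        img_emb = F.normalize(clip_remaining_stages(steered), dim=0)
        loss = 1.0 - (img_emb * text_emb).sum()              # cosine distance to the prompt
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        return sigma * (f_s - src_mu) / src_sigma + mu
```

In practice this steering is applied to each of the handful of cached source feature maps, and the resulting features are then used to fine-tune only the teacher's detection head, as described next.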
Efficient Feature Rewriting
A small cache of source features (e.g., five images) suffices for PIN adaptation. The resultant PIN-steered features are used to fine-tune only the detection head of the teacher detector.
3. CLIP Distillation and Captioning Protocol
To ensure hardware compatibility, the CLIP backbone is pre-compressed using the TinyCLIP protocol. This involves distillation to ResNet-19M (image encoder) and Text-19M (text encoder) scales, followed by domain-specific fine-tuning on aerial and drone imagery with prompt-enhanced captions.
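As a rough illustration of the distillation stage, the sketch below shows a generic affinity-style objective in which a compact student image encoder learns to reproduce the teacher CLIP's image-to-text similarity structure. This is a simplified stand-in for the TinyCLIP recipe (which additionally uses weight inheritance and staged compression); the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def affinity_distillation_loss(student_img_emb, teacher_img_emb, text_emb, tau=0.07):
    """Match the student's image-to-text similarity distribution to the teacher's.

    student_img_emb: (B, D) image embeddings from the compact student encoder
    teacher_img_emb: (B, D) image embeddings from the full CLIP teacher (frozen)
    text_emb:        (B, D) text embeddings of the paired captions (frozen)
    """
    s = F.normalize(student_img_emb, dim=-1)
    t = F.normalize(teacher_img_emb, dim=-1)
    x = F.normalize(text_emb, dim=-1)

    # Row-wise image-to-text similarity logits for student and teacher.
    student_logits = s @ x.t() / tau
    teacher_logits = t @ x.t() / tau

    # KL divergence between teacher and student similarity distributions.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
```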
An automated captioning pipeline (built on LLaMA-3.2 Vision) systematically drafts standardized, environment/objective-centric prompts, e.g., "urban aerial scene, rainy, night," thereby enforcing semantic fidelity for zero-shot adaptation.
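For illustration only, a domain prompt in that standardized form could be composed as follows; the exact field set and wording produced by the authors' captioning pipeline are not reproduced here.

```python
def make_domain_prompt(scene: str, weather: str, time_of_day: str) -> str:
    """Compose an environment-centric domain description,
    e.g. 'urban aerial scene, rainy, night'."""
    return ", ".join([scene, weather, time_of_day])

# Example: the prompt that would drive PIN-based steering for a rainy night scenario.
print(make_domain_prompt("urban aerial scene", "rainy", "night"))
```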
4. Teacher-Student Knowledge Transfer
After prompt-driven adaptation, the teacher (Faster R-CNN with its detection head fine-tuned on PIN-steered features) is applied to incoming target-domain images during deployment; its predictions serve as pseudo-labels for the YOLOv11 nano student. The student is fine-tuned online on these pseudo-labels, then deployed for independent target-domain inference.
This mechanism enables rapid, real-time adaptation and inference with minimal resource demand—critical for embedded/edge systems.
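A minimal sketch of the pseudo-label transfer step is shown below, assuming the teacher returns boxes, class labels, and confidence scores as tensors; the confidence threshold and the `student.finetune_step` interface are illustrative assumptions, not the authors' API.

```python
import torch

def filter_pseudo_labels(boxes, labels, scores, score_thresh=0.7):
    """Keep only high-confidence teacher detections to supervise the student.

    boxes:  (N, 4) predicted boxes from the adapted teacher
    labels: (N,)   predicted class indices
    scores: (N,)   detection confidences
    """
    keep = scores >= score_thresh
    return boxes[keep], labels[keep]

# Usage sketch: during deployment, each incoming target-domain frame is
# pseudo-labeled by the adapted teacher and immediately used to fine-tune
# the YOLOv11-nano student online (interfaces are illustrative).
#
# for frame in target_stream:
#     boxes, labels, scores = teacher.predict(frame)
#     p_boxes, p_labels = filter_pseudo_labels(boxes, labels, scores)
#     student.finetune_step(frame, p_boxes, p_labels)
```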
5. Experimental Results and Efficiency Metrics
Accuracy and Resource Trade-offs
On the MDS-A dataset (domains: "rain", "snow", "fog", "dust", "leaves"), Prmpt2Adpt demonstrates:
- Competitive mAP@50 vs. SOTA methods PODA and ULDA, with only a 3–4% average drop in mean Average Precision:
| Method | Snow | Dust | Leaves | Rain | Fog |
|---|---|---|---|---|---|
| PODA (ResNet-50) | 65.7 | 67.4 | 44.1 | 64.0 | 55.9 |
| ULDA (ResNet-50) | 67.3 | 70.1 | 45.6 | 66.2 | 58.1 |
| Prmpt2Adpt (YOLO) | 63.6 | 64.5 | 41.2 | 62.8 | 53.0 |
- Adaptation time: Up to 7x faster than PODA/ULDA.
- Inference speed: Up to 5x faster, enabled by the lightweight YOLOv11 nano student and the distilled TinyCLIP backbone.
- Data efficiency: PIN adaptation operates on five cached source images; PODA/ULDA require full source sets.
- Memory footprint: Substantially reduced relative to the ResNet-50-based baselines.
Practical Significance
Prmpt2Adpt enables real-time, on-device adaptation for low-power platforms, in regimes where typical UDA approaches are inapplicable due to resource constraints or the unavailability of target-domain data.
6. Comparison to Prior Work
Contrasted with prompt-driven DA baselines that rely on heavyweight VLMs and require access to full source image datasets [PODA, ULDA], Prmpt2Adpt achieves comparable semantic generalization using only a small cache of source features and a textual prompt: PIN enables purely text-driven steering of source features, and downstream transfer is handled by lightweight pseudo-labeling.
7. Deployment and Limitations
Prmpt2Adpt is well suited to fielded AI systems requiring robust adaptation without target-domain data collection and with only a handful of cached source features, for example drones facing operational shifts (weather, lighting, geography). The semantic generalization offered by textual prompts creates a practical path for human-in-the-loop domain steering.
A notable limitation is a modest accuracy drop relative to larger SOTA models; however, the framework's balance of efficiency, speed, and competitive accuracy sets a new state of the art for hardware-constrained UDA.
Prmpt2Adpt establishes prompt-driven feature alignment with PIN as a viable, efficient, and domain-agnostic adaptation mechanism, extending zero-shot domain adaptation to operationally relevant, real-world, low-power platforms for vision tasks (Farrukh et al., 20 Jun 2025).