PromptFuseNL: Unified Few-Shot Adaptation
- PromptFuseNL is a unified framework that integrates predictive prompt tuning, dual-branch learning, and unsupervised reweighting for both vision-language and language-only scenarios.
- It utilizes a frozen CLIP backbone with lightweight adaptations to refine class prototypes efficiently and enhance generalization even under label noise.
- The framework delivers state-of-the-art few-shot accuracy and superior training efficiency, supporting multi-modal adaptation and zero-shot prompt discovery.
PromptFuseNL is a unified framework for robust few-shot adaptation and cross-model prompt optimization in both vision-language and language-only scenarios. It combines predictive prompt tuning, dual-branch positive and negative learning, unsupervised instance reweighting, and zero-shot adapter mechanisms, enabling efficient and accurate generalization even under noisy support data and mismatched tokenizers. The approach yields state-of-the-art few-shot accuracy and training efficiency, supporting multi-modal adaptation, prompt injection fuzzing, and zero-shot prompt discovery across diverse architectures (Mandalika, 16 May 2025, Yu et al., 23 Sep 2024, Williams et al., 9 Aug 2024).
1. Architectural Foundation and Task-Conditioned Residuals
PromptFuseNL builds upon a frozen CLIP backbone, keeping its visual and text encoders fixed and augmenting them with lightweight textual and visual branches. The central construct is the refinement of class prototypes through learned, task-conditioned residuals added to the frozen text embeddings. Visual support features are further transformed by a small per-class linear projection, and the refined textual and visual prototypes are fused into a single class prototype per episode. This structure allows discriminative adaptation per episode while maintaining backbone parameter efficiency (roughly 0.1% overhead) (Mandalika, 16 May 2025).
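The following PyTorch-style sketch illustrates this refinement-and-fusion path under stated assumptions: the residual predictor (`residual_mlp`), the per-class projections (`per_class_proj`), and the fusion weight `alpha` are illustrative names and design choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeRefiner(nn.Module):
    """Hypothetical sketch: refine frozen CLIP class prototypes with
    task-conditioned residuals and fuse the textual and visual branches."""

    def __init__(self, dim: int, num_classes: int, alpha: float = 0.5):
        super().__init__()
        # Lightweight residual predictor conditioned on the episode (assumed MLP).
        self.residual_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # Small per-class linear projections for the visual branch.
        self.per_class_proj = nn.Parameter(
            torch.stack([torch.eye(dim) for _ in range(num_classes)])
        )
        self.alpha = alpha  # fusion weight (illustrative)

    def forward(self, text_protos, support_feats):
        # text_protos: [C, D] frozen CLIP text embeddings per class
        # support_feats: [C, K, D] frozen CLIP visual features of the support set
        task_ctx = support_feats.mean(dim=(0, 1))             # episode-level context
        residual = self.residual_mlp(text_protos + task_ctx)  # task-conditioned residual
        refined_text = F.normalize(text_protos + residual, dim=-1)

        visual_proto = support_feats.mean(dim=1)              # [C, D] class means
        visual_proto = torch.einsum('cij,cj->ci', self.per_class_proj, visual_proto)
        visual_proto = F.normalize(visual_proto, dim=-1)

        # Late fusion of textual and visual prototypes.
        fused = self.alpha * refined_text + (1 - self.alpha) * visual_proto
        return F.normalize(fused, dim=-1)
```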
2. Dual-Branch Losses and Semantic Negative Mining
PromptFuseNL introduces a dual-objective training regime:
- Positive alignment: Queries are attracted to their correct class prototype by a temperature-scaled cosine classifier, i.e., a softmax over cosine similarities between the query embedding and the fused class prototypes.
- Negative repulsion: A hard negative mining procedure selects the prototypes most confusable with each class, based on their similarity to the mean support embedding.
Each mined negative is processed analogously to the positive branch, and a hinge (margin) loss penalizes queries that remain too close to these negatives.
The final classification loss combines the positive alignment term, the negative repulsion term, and an L2 regularization on the attention parameters. This arrangement substantially enhances fine-grained class separation and out-of-domain generalization (Mandalika, 16 May 2025).
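A minimal sketch of this dual objective, assuming a standard cross-entropy form for the temperature-scaled cosine classifier and a hinge margin for the negative branch; the function name and the `tau`, `margin`, and `lam` values are illustrative, not the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def dual_branch_loss(query, prototypes, labels, neg_protos,
                     tau=0.07, margin=0.2, lam=1e-4, attn_params=None):
    """Illustrative dual-branch objective: positive cosine alignment,
    hinge-style repulsion from mined hard negatives, and L2 regularization."""
    q = F.normalize(query, dim=-1)                       # [B, D] query embeddings
    p = F.normalize(prototypes, dim=-1)                  # [C, D] fused class prototypes
    logits = q @ p.t() / tau                             # cosine similarities / temperature
    loss_pos = F.cross_entropy(logits, labels)           # positive alignment

    n = F.normalize(neg_protos, dim=-1)                  # [B, M, D] hard negatives per query
    sim_neg = torch.einsum('bd,bmd->bm', q, n)           # query-negative similarities
    sim_pos = (q * p[labels]).sum(dim=-1, keepdim=True)  # similarity to correct prototype
    loss_neg = F.relu(margin + sim_neg - sim_pos).mean() # push negatives below positives

    reg = lam * sum((w ** 2).sum() for w in attn_params) if attn_params else 0.0
    return loss_pos + loss_neg + reg
```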
3. Multi-Stage Cross-Modal Coordination
PromptFuseNL coordinates information in four cascading stages:
- Predictive Prompt Tuning: A small MLP predicts attention logits over a pool of learnable prompt candidates from the episode context, yielding a prompt token that conditions the textual branch.
- Cross-Modal Attention: Each text prototype is refined by cross-attention over the support visual features, producing a visually grounded textual prototype.
- Visual Prototype Adaptation: Support examples are weighted by their reliability scores (instance reweighting), averaged, augmented with a learned residual, and projected through the per-class linear map.
- Late Fusion: The refined textual and visual prototypes are fused into the final class prototype used for classification.
This stratified coordination maximizes discriminative fusion and adaptation, leveraging both contextual and episodic information (Mandalika, 16 May 2025).
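A compact sketch of the first two stages is given below; the prompt-pool size, the scorer MLP, and the use of `nn.MultiheadAttention` for the cross-modal step are assumptions made for illustration (stages 3 and 4 are covered by the sketches in Sections 1 and 4). The embedding dimension is assumed divisible by the head count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalCoordinator(nn.Module):
    """Hypothetical sketch of stages 1-2: an MLP scores a pool of learnable prompt
    candidates (predictive prompt tuning), and each text prototype is refined by
    cross-attention over its support visual features."""

    def __init__(self, dim: int, num_prompts: int = 8, num_heads: int = 4):
        super().__init__()
        self.prompt_pool = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, num_prompts)
        )
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_protos, support_feats):
        # text_protos: [C, D]; support_feats: [C, K, D]
        episode_ctx = support_feats.mean(dim=(0, 1))            # episode summary
        attn_logits = self.scorer(episode_ctx)                  # [num_prompts]
        prompt_token = F.softmax(attn_logits, dim=-1) @ self.prompt_pool  # soft prompt selection
        conditioned = text_protos + prompt_token                # condition the textual branch

        # Cross-modal attention: each class prototype attends over its support visuals.
        refined, _ = self.cross_attn(conditioned.unsqueeze(1), support_feats, support_feats)
        return refined.squeeze(1)                               # [C, D] visually grounded prototypes
```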
4. Unsupervised Instance Reweighting and Label Noise Robustness
To address label noise and outlier contamination in support sets, PromptFuseNL assigns each support example a soft reliability score based on its agreement with the mean visual embedding and the adapted class prototype. Instances receiving higher scores contribute more to prototype construction, while unreliable or mislabeled examples are suppressed. This strategy removes the need for auxiliary labels or explicit structural modifications, and empirically delivers accuracy gains under up to 50% support label corruption (Mandalika, 16 May 2025).
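A minimal sketch of one plausible form of this reweighting, assuming cosine-similarity agreement terms and softmax normalization; the `temperature` and the exact agreement expression are assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def reliability_scores(support_feats, class_proto, temperature=10.0):
    """Sketch of unsupervised instance reweighting: score each support example by
    its cosine agreement with the class mean and the adapted prototype, then
    softmax-normalize so unreliable or mislabeled examples are down-weighted."""
    feats = F.normalize(support_feats, dim=-1)           # [K, D] support embeddings (one class)
    mean_emb = F.normalize(feats.mean(dim=0), dim=-1)    # mean visual embedding
    proto = F.normalize(class_proto, dim=-1)             # adapted class prototype
    agreement = feats @ mean_emb + feats @ proto         # per-instance agreement
    return F.softmax(temperature * agreement, dim=0)     # soft reliability scores, sum to 1

# Usage: a reweighted prototype where reliable instances dominate.
# weights = reliability_scores(feats, proto)
# proto_new = (weights[:, None] * feats).sum(dim=0)
```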
5. Cross-Tokenizer Prompt Discovery via FUSE
PromptFuseNL incorporates FUSE (Flexible Unification of Semantic Embeddings) to support zero-shot prompt optimization across models with mismatched tokenizers and embedding spaces (Williams et al., 9 Aug 2024). This is accomplished by representing each model's vocabulary as a third-order tensor whose modes correspond to the word-level vocabulary size, the per-word sub-token count, and the embedding dimension.
Adapter computation proceeds as follows:
- For each word length (number of sub-tokens), compute a tensor pseudo-inverse of the source model's vocabulary tensor and derive the corresponding adapter map into the target model's embedding space.
- The forward pass for prompt optimization maps prompt embeddings from one model's space into the other's through this adapter.
- Backpropagation supports prompt search and editing by transferring gradients back through the adapter into the source embedding space.
A PromptFuseNL pipeline can thus pair any two models, a generator and an evaluator, by precomputing the word-adapter tensors, initializing prompt beams, computing a joint loss, and propagating the evaluator's gradients for discrete prompt search, all with fixed reference adapters and no model retraining or fine-tuning (Williams et al., 9 Aug 2024).
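A simplified sketch of the adapter construction follows, under the strong assumption that both tokenizers split a given word into the same number of sub-tokens L (the actual FUSE adapters handle mismatched per-word lengths); the function names and least-squares formulation are illustrative.

```python
import torch

def fuse_adapter(vocab_tensor_src, vocab_tensor_tgt):
    """Illustrative FUSE-style adapter for one fixed word length L: flatten each
    model's [V, L, D] vocabulary tensor over the (L, D) modes and solve a
    least-squares map from source to target space via the pseudo-inverse."""
    V, L, Ds = vocab_tensor_src.shape
    Dt = vocab_tensor_tgt.shape[-1]
    A = vocab_tensor_src.reshape(V, L * Ds)   # source word representations
    B = vocab_tensor_tgt.reshape(V, L * Dt)   # target word representations
    # Adapter M minimizes ||A @ M - B||_F.
    M = torch.linalg.pinv(A) @ B              # [(L*Ds), (L*Dt)]
    return M

def map_prompt(prompt_emb_src, M, L, Dt):
    """Forward pass: map source-model prompt embeddings into the target space;
    gradients flow back through this fixed linear map during prompt search."""
    W = prompt_emb_src.reshape(prompt_emb_src.shape[0], -1)   # [num_words, L*Ds]
    return (W @ M).reshape(-1, L, Dt)                         # target sub-token embeddings
```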
6. Benchmark Results and Efficiency Profile
Across 15 major few-shot vision-language benchmarks and several domain generalization tasks, PromptFuseNL demonstrates superior accuracy and resource efficiency (Mandalika, 16 May 2025):
| Method | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot |
|---|---|---|---|---|---|
| SimNL (prior SOTA) | 67.5% | 70.1% | 72.4% | 75.1% | 77.8% |
| PromptFuseNL | 74.3% | 78.6% | 81.5% | 85.1% | 88.8% |
PromptFuseNL achieves up to 300× faster training (episodes/sec) and 1000× lower compute per episode versus full prompt tuning, facilitated by its low-overhead modules and frozen backbone. Domain generalization (ImageNet V2/Sketch/A/R) shows 50.8% mean accuracy (vs. 45.3% for SimNL), and robustness to substantial label noise is observed without architecture modification or explicit regularization. These results substantiate the framework's scalability and generalization capacity.
7. Applications, Limitations, and Future Directions
PromptFuseNL supports broad application profiles: robust cross-modal few-shot learning, adversarial injection fuzzing, efficient prompt discovery across model/tokenizer boundaries, and scalable adaptation for deployed and research settings (Mandalika, 16 May 2025, Yu et al., 23 Sep 2024, Williams et al., 9 Aug 2024).
Limitations include reliance on frozen backbones, sensitivity to vocabulary selection in FUSE, and the approximation inherent in tensor-based adapters. Expected future work includes:
- Expansion to dialogue-level and multi-turn prompt morphisms
- Incorporation of “web injection” scenarios via external content retrieval
- Iterative red-teaming and fine-tuning loops for adversarial robustness
- Transfer of prompt optimization across non-English and highly morphologically diverse languages
This suggests PromptFuseNL can serve as a foundational prompt optimization and adaptation platform for heterogeneous, multi-model, and adversarially resistant learning environments.