
PromptFuseNL: Unified Few-Shot Adaptation

Updated 25 November 2025
  • PromptFuseNL is a unified framework that integrates predictive prompt tuning, dual-branch learning, and unsupervised reweighting for both vision-language and language-only scenarios.
  • It utilizes a frozen CLIP backbone with lightweight adaptations to refine class prototypes efficiently and enhance generalization even under label noise.
  • The framework delivers state-of-the-art few-shot accuracy and superior training efficiency, supporting multi-modal adaptation and zero-shot prompt discovery.

PromptFuseNL is a unified framework for robust few-shot adaptation and cross-model prompt optimization in both vision-language and language-only scenarios. It combines predictive prompt tuning, dual-branch positive and negative learning, unsupervised instance reweighting, and zero-shot adapter mechanisms, enabling efficient and accurate generalization even under noisy support data and mismatched tokenizers. The approach yields state-of-the-art few-shot accuracy and training efficiency, supporting multi-modal adaptation, prompt injection fuzzing, and zero-shot prompt discovery across diverse architectures (Mandalika, 16 May 2025, Yu et al., 23 Sep 2024, Williams et al., 9 Aug 2024).

1. Architectural Foundation and Task-Conditioned Residuals

PromptFuseNL builds upon a frozen CLIP backbone featuring a visual encoder $f_v$ and a text encoder $f_t$, augmented with lightweight textual and visual branches. The central construct is the refinement of class prototypes through learned, task-conditioned residuals:

$$t_c' = t_c + \Delta_\theta^t(t_c), \qquad v_c' = v_c + \Delta_\theta^v(v_c),$$

where $t_c = f_t(\text{“class name } c\text{”}) \in \mathbb{R}^d$ and $v_c = f_v(\text{template } x_c) \in \mathbb{R}^d$. Visual features are further transformed as

$$\tilde v_c = W_v\bigl(\mathrm{LayerNorm}(v_c')\bigr)$$

using a small per-class linear projection $W_v$. Textual and visual prototypes are finally fused:

$$z^+_c = \lambda\,(\tilde t_c + \tilde v_c) + (1-\lambda)\,(t_c + v_c), \qquad \lambda \in [0,1].$$

This structure allows discriminative adaptation per episode while maintaining backbone parameter efficiency ($\ll 0.1\%$ overhead) (Mandalika, 16 May 2025).
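The residual refinement and late fusion can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for frozen CLIP embeddings, the residual modules $\Delta_\theta^t$ and $\Delta_\theta^v$ are modeled as single small linear maps (`A_t`, `A_v`, both hypothetical), and because the text does not define $\tilde t_c$, the sketch takes it to be the refined prototype $t_c'$.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def layer_norm(x, eps=1e-5):
    """Parameter-free LayerNorm over the feature axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def fuse_prototypes(t, v, A_t, A_v, W_v, lam=0.5):
    """Return fused prototypes z^+_c for all classes, shape (N, d)."""
    t_ref = t + t @ A_t                      # t'_c = t_c + Delta^t(t_c)
    v_ref = v + v @ A_v                      # v'_c = v_c + Delta^v(v_c)
    # Per-class projection W_v applied to the normalized visual prototype.
    v_tilde = np.einsum("cij,cj->ci", W_v, layer_norm(v_ref))
    t_tilde = t_ref                          # ASSUMPTION: tilde t_c := t'_c
    return lam * (t_tilde + v_tilde) + (1.0 - lam) * (t + v)

rng = np.random.default_rng(0)
N, d = 5, 16                                 # illustrative sizes only
t = l2_normalize(rng.normal(size=(N, d)))    # stand-in for f_t outputs
v = l2_normalize(rng.normal(size=(N, d)))    # stand-in for f_v outputs
A_t = 0.01 * rng.normal(size=(d, d))         # hypothetical residual weights
A_v = 0.01 * rng.normal(size=(d, d))
W_v = np.stack([np.eye(d)] * N)              # per-class projection, identity init

z_pos = fuse_prototypes(t, v, A_t, A_v, W_v, lam=0.5)
print(z_pos.shape)
```

Setting `lam=0` recovers the unadapted sum $t_c + v_c$, which makes the role of $\lambda$ as an interpolation between frozen and adapted prototypes easy to check.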

2. Dual-Branch Losses and Semantic Negative Mining

PromptFuseNL introduces a dual-objective training regime:

  • Positive alignment: Queries $q$ are attracted to their correct class prototype $z^+_y$ using a cosine classifier with temperature scaling:

$$\mathcal L_{\rm pos} = -\log \frac{\exp(\cos(q, z^+_y)/\tau)}{\sum_{c=1}^N \exp(\cos(q, z^+_c)/\tau)}$$

  • Negative repulsion: A hard negative mining procedure selects the $K$ most confusable prototypes based on similarity to the mean support embedding $\bar s$:

$$\mathcal N = \mathrm{TopK}\left\{c \notin \text{support} \mid \cos(\bar s, z_c)\right\}$$

Each $n \in \mathcal N$ is processed analogously, and a hinge loss is imposed:

$$\mathcal L_{\rm neg} = \frac{1}{|\mathcal N|} \sum_{n\in\mathcal N} \max\bigl(0,\, \tau - \cos(q,\, z^-_n)\bigr).$$

The final classification loss combines the positive term, the negative term, and L2 regularization of the attention parameters:

$$\mathcal L = \mathcal L_{\rm pos} + \mathcal L_{\rm neg} + \gamma\,\|\theta_{\rm attn}\|_2^2$$

This arrangement substantially enhances fine-grained class separation and out-of-domain generalization (Mandalika, 16 May 2025).
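The three loss terms can be written out directly from the formulas above. The following NumPy sketch evaluates them for a single query and is illustrative only. It makes two assumptions: the mined negatives are used as-is for $z^-_n$ (the "processed analogously" step is skipped), and `tau` serves as both temperature and hinge margin, exactly as the formulas are written. All function names are hypothetical.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a (m, d) and b (n, d)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return a @ b.T

def positive_loss(q, z_pos, y, tau):
    """Cross-entropy over temperature-scaled cosine logits (L_pos)."""
    logits = cosine(q[None, :], z_pos)[0] / tau
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[y]

def mine_hard_negatives(s_bar, z_cand, k):
    """Indices of the k candidates most similar to the mean support embedding."""
    sims = cosine(s_bar[None, :], z_cand)[0]
    return np.argsort(-sims)[:k]

def negative_loss(q, z_neg, tau):
    """Hinge term max(0, tau - cos(q, z^-_n)), averaged over negatives (L_neg)."""
    sims = cosine(q[None, :], z_neg)[0]
    return np.maximum(0.0, tau - sims).mean()

def total_loss(q, y, z_pos, z_neg, theta_attn, tau=0.1, gamma=1e-4):
    """L = L_pos + L_neg + gamma * ||theta_attn||_2^2."""
    reg = gamma * np.sum(theta_attn ** 2)
    return positive_loss(q, z_pos, y, tau) + negative_loss(q, z_neg, tau) + reg

rng = np.random.default_rng(0)
N, d, K = 5, 16, 2
z_pos = rng.normal(size=(N, d))       # stand-in fused prototypes
z_cand = rng.normal(size=(8, d))      # stand-in prototypes of non-support classes
q = rng.normal(size=d)                # one query embedding
s_bar = rng.normal(size=d)            # stand-in mean support embedding
theta_attn = rng.normal(size=32)      # stand-in attention parameters

neg_idx = mine_hard_negatives(s_bar, z_cand, K)
print(total_loss(q, 2, z_pos, z_cand[neg_idx], theta_attn))
```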

3. Multi-Stage Cross-Modal Coordination

PromptFuseNL coordinates information in four cascading stages:

  1. Predictive Prompt Tuning: Compute attention logits with a small MLP $f_\phi$:

$$\tilde\alpha_c = f_\phi(t_c), \qquad \alpha_{c,i} = \frac{\exp(\tilde\alpha_{c,i})}{\sum_{j=1}^S \exp(\tilde\alpha_{c,j})}$$

yielding a prompt token $p_c = \sum_i \alpha_{c,i}\, s_i$ that modifies $t_c'$.

  2. Cross-Modal Attention: Refine the text prototype by cross-attention over support visuals, producing $\hat t_c$ via $\mathrm{CrossAttn}(t_c', V, V)$.
  3. Visual Prototype Adaptation: Weight support examples using $w_i$ (instance reweighting), average and add the residual $r_c$, then project via $W_v$.
  4. Late Fusion: Fuse the refined textual and visual prototypes into $z^+_c$.

This stratified coordination maximizes discriminative fusion and adaptation, leveraging both contextual and episodic information (Mandalika, 16 May 2025).
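The first two stages can be sketched in a few lines of NumPy. The text leaves several details open, so this sketch fills them with labeled assumptions: the $s_i$ are treated as a bank of $S$ prompt basis vectors (`S_bank`), the prompt token modifies $t_c'$ by simple addition, $f_\phi$ is a one-hidden-layer ReLU MLP, and cross-attention is single-head scaled dot-product without learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def predictive_prompt(t, W1, W2, S_bank):
    """Stage 1: alpha_c = softmax(f_phi(t_c)); p_c = sum_i alpha_{c,i} s_i."""
    hidden = np.maximum(0.0, t @ W1)         # ASSUMPTION: one ReLU hidden layer
    alpha = softmax(hidden @ W2, axis=-1)    # (N, S) attention weights
    return alpha, alpha @ S_bank             # prompt tokens, shape (N, d)

def cross_attention(queries, keys, values):
    """Stage 2: single-head scaled dot-product attention."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return attn @ values, attn

rng = np.random.default_rng(0)
N, d, S, M, H = 4, 8, 3, 10, 16              # classes, dim, prompts, supports, hidden
t = rng.normal(size=(N, d))                  # stand-in text prototypes t_c
t_ref = t + 0.01 * rng.normal(size=(N, d))   # stand-in refined prototypes t'_c
W1 = rng.normal(size=(d, H))                 # hypothetical MLP weights
W2 = rng.normal(size=(H, S))
S_bank = rng.normal(size=(S, d))             # ASSUMPTION: the s_i as basis vectors
V = rng.normal(size=(M, d))                  # stand-in support visual embeddings

alpha, p = predictive_prompt(t, W1, W2, S_bank)
t_prompted = t_ref + p                       # ASSUMPTION: additive modification
t_hat, attn = cross_attention(t_prompted, V, V)
print(alpha.shape, t_hat.shape)
```

Each row of `alpha` is a distribution over the $S$ prompt vectors, and each row of `attn` is a distribution over the support examples, so both stages produce convex combinations of their inputs.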

4. Unsupervised Instance Reweighting and Label Noise Robustness

To address label noise and outlier contamination in support sets, PromptFuseNL assigns each example a soft reliability score:

$$w_i = \tfrac12\left[\cos(x_i, \bar s) + \cos(x_i, z^+_{y_i})\right]$$

where $\bar s$ is the mean visual embedding and $z^+_{y_i}$ is the adapted prototype. Instances receiving higher scores contribute more to prototype construction, while unreliable or mislabeled examples are suppressed. This strategy removes the need for auxiliary labels or explicit structural modifications, and empirically delivers $+0.3$ to $+0.9$ points in accuracy under up to 50% support label corruption (Mandalika, 16 May 2025).
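A toy example shows how the reliability score down-weights a mislabeled support example. The sketch below is a hedged illustration, not the paper's code. It assumes $\bar s$ is the mean over the whole support set, and it assumes that weights are clipped at zero and normalized per class when building prototypes, a step the text does not specify.

```python
import numpy as np

def cos_to(X, u, eps=1e-8):
    """Cosine similarity between each row of X and u (one vector, or one per row)."""
    num = np.sum(X * u, axis=-1)
    den = np.linalg.norm(X, axis=-1) * np.linalg.norm(u, axis=-1) + eps
    return num / den

def reliability_weights(X, y, z_pos):
    """w_i = 0.5 * [cos(x_i, s_bar) + cos(x_i, z^+_{y_i})]."""
    s_bar = X.mean(axis=0)                   # ASSUMPTION: mean over all supports
    return 0.5 * (cos_to(X, s_bar) + cos_to(X, z_pos[y]))

def weighted_prototypes(X, y, w, n_classes, eps=1e-8):
    """Per-class weighted mean of support embeddings."""
    w = np.clip(w, 0.0, None)                # ASSUMPTION: negative scores count as zero
    protos = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        mask = y == c
        if mask.any():
            wc = w[mask]
            protos[c] = (wc[:, None] * X[mask]).sum(axis=0) / (wc.sum() + eps)
    return protos

# Three supports labeled class 0; the third one actually looks like class 1.
X = np.array([[1.0, 0.1], [0.9, 0.0], [-0.2, 1.0]])
y = np.array([0, 0, 0])
z_pos = np.eye(2)                            # stand-in adapted prototypes

w = reliability_weights(X, y, z_pos)
protos = weighted_prototypes(X, y, w, n_classes=2)
print(np.round(w, 2))
```

In this toy episode the mislabeled third example receives a much smaller weight than the two clean ones, so the weighted class-0 prototype stays closer to the clean direction than the plain mean does.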

5. Cross-Tokenizer Prompt Discovery via FUSE

PromptFuseNL incorporates FUSE (Flexible Unification of Semantic Embeddings) to support zero-shot prompt optimization across models with mismatched tokenizers and embedding spaces (Williams et al., 9 Aug 2024). This is accomplished by representing each model's vocabulary as a third-order tensor $\tilde V \in \mathbb{R}^{|W| \times \ell \times d}$, where $|W|$ is the word vocabulary size, $\ell$ is the sub-token count, and $d$ is the embedding dimension.

Adapter computation proceeds as follows:

  • For each word length $\ell$, compute the tensor pseudo-inverse $V_i^+$ and the adapter map $M[\ell] = V_i^+ * V_j$.
  • The forward pass for prompt optimization swaps embeddings across models with:

$$\tilde E_j \approx \tilde E_i * (V_i^+ * V_j)$$

  • Backpropagation supports prompt search/editing by transferring gradients:

$$\nabla_{E_i} L_j \approx \mathrm{merge}\left((V_i^+ * V_j) * \mathrm{split}(\nabla_{E_j} L_j)\right)$$

A PromptFuseNL pipeline can thus utilize any two models $A$ (generation) and $B$ (evaluator) by precomputing the word-adapter tensors, initializing prompt beams, quantifying joint loss, and propagating $B$'s gradients for discrete prompt search, all with fixed reference adapters and no model retraining or fine-tuning (Williams et al., 9 Aug 2024).
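The adapter idea is easiest to see in the special case $\ell = 1$, where every word maps to a single token and the tensors reduce to ordinary matrices. The sketch below covers only that simplified case and is not the FUSE implementation: it replaces the tensor pseudo-inverse and tensor product with `np.linalg.pinv` and matrix multiplication, omits the split/merge steps, and transfers gradients by the ordinary chain rule for $E_j = E_i M$. Function names are hypothetical.

```python
import numpy as np

def build_adapter(V_i, V_j):
    """Least-squares map M = V_i^+ V_j from model i's space to model j's.

    V_i: (|W|, d_i) and V_j: (|W|, d_j), rows aligned on a shared word list.
    """
    return np.linalg.pinv(V_i) @ V_j

def map_embeddings(E_i, M):
    """Forward pass: approximate model-j embeddings from model-i embeddings."""
    return E_i @ M

def pull_back_gradient(grad_E_j, M):
    """Backward pass: chain rule for E_j = E_i @ M."""
    return grad_E_j @ M.T

rng = np.random.default_rng(0)
n_words, d_i, d_j = 50, 6, 4                 # illustrative sizes only
V_i = rng.normal(size=(n_words, d_i))        # stand-in vocabulary of model i
R = rng.normal(size=(d_i, d_j))              # ground-truth map for this toy case
V_j = V_i @ R                                # model j's vocabulary, exactly linear

M = build_adapter(V_i, V_j)
E_i = rng.normal(size=(3, d_i))              # a 3-token prompt in model i's space
E_j = map_embeddings(E_i, M)

target = rng.normal(size=(3, d_j))           # toy loss: 0.5 * ||E_j - target||^2
grad_E_i = pull_back_gradient(E_j - target, M)
print(np.allclose(M, R), grad_E_i.shape)
```

Because the toy vocabularies are related by an exact linear map, the pseudo-inverse recovers it. With real models the relation is only approximate, which is the approximation error noted among the limitations in Section 7.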

6. Benchmark Results and Efficiency Profile

Across 15 major few-shot vision-language benchmarks and several domain generalization tasks, PromptFuseNL demonstrates superior accuracy and resource efficiency (Mandalika, 16 May 2025):

| Method | 1-shot | 2-shot | 4-shot | 8-shot | 16-shot |
|---|---|---|---|---|---|
| SimNL (prior SOTA) | 67.5% | 70.1% | 72.4% | 75.1% | 77.8% |
| PromptFuseNL | 74.3% | 78.6% | 81.5% | 85.1% | 88.8% |

PromptFuseNL achieves up to 300× faster training (episodes/sec) and 1000× lower compute per episode versus full prompt tuning, facilitated by its low-overhead modules and frozen backbone. Domain generalization (ImageNet $\rightarrow$ V2/Sketch/A/R) shows 50.8% mean accuracy (vs. 45.3% for SimNL), and robustness to substantial label noise is observed without architecture modification or explicit regularization. These results substantiate the framework's scalability and generalization capacity.
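The size of the reported margin is easier to read as per-shot differences. The short script below only does arithmetic on the figures quoted above. It introduces no new measurements, and the variable names are illustrative.

```python
# Accuracies (%) copied from the table above.
shots = [1, 2, 4, 8, 16]
simnl = [67.5, 70.1, 72.4, 75.1, 77.8]
promptfusenl = [74.3, 78.6, 81.5, 85.1, 88.8]

# Absolute gain in percentage points at each shot count.
gains = [round(p - s, 1) for p, s in zip(promptfusenl, simnl)]
mean_gain = round(sum(gains) / len(gains), 2)

# Domain generalization figures quoted in the paragraph above.
dg_gain = round(50.8 - 45.3, 1)

for k, g in zip(shots, gains):
    print(f"{k:>2}-shot: +{g} points")
print("mean few-shot gain:", mean_gain)
print("domain generalization gain:", dg_gain)
```

On these figures the gap over SimNL grows with the number of shots, from 6.8 points at 1-shot to 11.0 points at 16-shot.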

7. Applications, Limitations, and Future Directions

PromptFuseNL supports broad application profiles: robust cross-modal few-shot learning, adversarial injection fuzzing, efficient prompt discovery across model/tokenizer boundaries, and scalable adaptation for deployed and research settings (Mandalika, 16 May 2025, Yu et al., 23 Sep 2024, Williams et al., 9 Aug 2024).

Limitations include reliance on frozen backbones, sensitivity to vocabulary selection in FUSE, and the approximation inherent in tensor-based adapters. Expected future work includes:

  • Expansion to dialogue-level and multi-turn prompt morphisms
  • Incorporation of “web injection” scenarios via external content retrieval
  • Iterative red-teaming and fine-tuning loops for adversarial robustness
  • Transfer of prompt optimization across non-English and highly morphologically diverse languages

This suggests PromptFuseNL can serve as a foundational prompt optimization and adaptation platform for heterogeneous, multi-model, and adversarially resistant learning environments.
