Retrieval-augmented Prompt Learning for Pre-trained Foundation Models

Published 23 Dec 2025 in cs.CL, cs.AI, cs.CV, cs.IR, and cs.LG | (2512.20145v1)

Abstract: The pre-trained foundation models (PFMs) have become essential for facilitating large-scale multimodal learning. Researchers have effectively employed the "pre-train, prompt, and predict" paradigm through prompt learning to induce improved few-shot performance. However, prompt learning approaches for PFMs still follow a parametric learning paradigm. As such, the stability of generalization in memorization and rote learning can be compromised. More specifically, conventional prompt learning might face difficulties in fully utilizing atypical instances and avoiding overfitting to shallow patterns with limited data during the process of fully-supervised training. To overcome these constraints, we present our approach, named RetroPrompt, which aims to achieve a balance between memorization and generalization by decoupling knowledge from mere memorization. Unlike traditional prompting methods, RetroPrompt leverages a publicly accessible knowledge base generated from the training data and incorporates a retrieval mechanism throughout the input, training, and inference stages. This enables the model to actively retrieve relevant contextual information from the corpus, thereby enhancing the available cues. We conduct comprehensive experiments on a variety of datasets across natural language processing and computer vision tasks to demonstrate the superior performance of our proposed approach, RetroPrompt, in both zero-shot and few-shot scenarios. Through detailed analysis of memorization patterns, we observe that RetroPrompt effectively reduces the reliance on rote memorization, leading to enhanced generalization.


Summary

The paper introduces a retrieval-augmented approach that decouples knowledge acquisition from rote memorization in PFMs.
It employs a dynamic knowledge-store with dense retrieval, kNN-guided training, and output interpolation to improve zero-shot and few-shot performance.
Empirical evaluations on NLP and vision benchmarks reveal enhanced robustness, reduced performance variance, and improved handling of long-tail data.

Retrieval-Augmented Prompt Learning for Pre-trained Foundation Models

Motivation and Context

Pre-trained foundation models (PFMs) are central to multimodal learning, supporting diverse domains including NLP and computer vision. The prompt learning paradigm enables effective task adaptation, particularly in data-scarce settings. However, conventional prompt learning is fundamentally parametric, which can destabilize generalization and encourage rote memorization: with limited data, the model struggles to make full use of atypical instances and tends to overfit shallow patterns.

To address these limitations, this paper presents a retrieval-augmented framework, RetroPrompt, designed to decouple knowledge from mere memorization. The method introduces a knowledge-store (an indexed key-value database constructed from training data embeddings) that is accessible across input, training, and inference. By retrieving semantically similar instances on demand, the system balances memorization and generalization.

Figure 1: The framework decouples knowledge from memorization by constructing a retrievable knowledge-store, enabling dynamic reference during training and inference.

Methodological Framework

The core architecture consists of three components: a dense retriever for semantic indexing, $k$NN-guided training, and $k$NN-output interpolation for inference. The knowledge-store utilizes contextually encoded key-value tuples, where keys are prompt-based embeddings from the PFM encoder (updated dynamically as model parameters change) and values are associated label words.
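To make the key-value layout concrete, the following is a minimal, dependency-free sketch of such a store. It is an illustration under stated assumptions, not the paper's implementation: the class name `KnowledgeStore`, its method names, the `temperature` parameter, and the brute-force inner-product scan are all hypothetical, and a real system would delegate the search to an index library.

```python
import math
from collections import defaultdict


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


class KnowledgeStore:
    """Toy key-value store: keys are embeddings, values are label words."""

    def __init__(self):
        self.keys = []
        self.values = []

    def add(self, key, label_word):
        self.keys.append(list(key))
        self.values.append(label_word)

    def refresh(self, inputs, encode):
        """Re-encode all keys after the encoder changes.

        `inputs` must be in the same order as the stored values.
        """
        self.keys = [list(encode(x)) for x in inputs]

    def search(self, query, k):
        """Return the k (score, label_word) pairs with the largest inner product."""
        scored = [(dot(query, key), label)
                  for key, label in zip(self.keys, self.values)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    def knn_distribution(self, query, k, temperature=1.0):
        """Softmax over neighbor scores, summed per label word (assumed form)."""
        neighbors = self.search(query, k)
        if not neighbors:
            return {}
        top = neighbors[0][0]  # subtract the max score for numerical stability
        mass = defaultdict(float)
        for score, label in neighbors:
            mass[label] += math.exp((score - top) / temperature)
        total = sum(mass.values())
        return {label: m / total for label, m in mass.items()}


# Usage: two "great" neighbors outweigh one distant "terrible" neighbor.
store = KnowledgeStore()
store.add([1.0, 0.0], "great")
store.add([0.9, 0.1], "great")
store.add([0.0, 1.0], "terrible")
distribution = store.knn_distribution([1.0, 0.0], k=3)
```

The `refresh` method mirrors the statement that keys are updated as model parameters change; how often the store is rebuilt is not specified here and is left to the caller.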

For efficient lookup, the method uses scalable inner-product search via FAISS for fast neighbor retrieval. During training, $k$NN-derived difficulty signals reweight the cross-entropy loss in a focal-style manner: the weight depends on the softmax-normalized neighbor distribution and on whether the neighbors classify the instance correctly. During inference, the prediction interpolates between the parametric model output and the non-parametric nearest-neighbor distribution, with a scaling factor controlling the degree of retrieval reliance.
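The two uses of the neighbor distribution described above can be sketched as follows. This is one plausible reading, not the paper's exact formulation: the weighting form `(1 + beta * difficulty)`, the names `knn_guided_loss`, `interpolate`, `beta`, and `lam`, and the default values are assumptions made for illustration.

```python
import math


def knn_guided_loss(p_model, p_knn, gold, beta=1.0, eps=1e-12):
    """Cross-entropy scaled up when retrieved neighbors disagree with the gold label.

    p_model, p_knn: probability lists over the same label set.
    gold: index of the correct label.
    beta: assumed hyperparameter controlling how strongly difficulty is weighted.
    """
    cross_entropy = -math.log(max(p_model[gold], eps))
    # Difficulty is zero when the neighbors put all their mass on the gold label.
    difficulty = -math.log(max(p_knn[gold], eps))
    return (1.0 + beta * difficulty) * cross_entropy


def interpolate(p_model, p_knn, lam=0.5):
    """Mix parametric and nearest-neighbor distributions; lam is the retrieval weight."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * k + (1.0 - lam) * m for m, k in zip(p_model, p_knn)]


# Usage: the same model output is penalized more when neighbors point elsewhere.
easy_loss = knn_guided_loss([0.6, 0.4], [1.0, 0.0], gold=0)
hard_loss = knn_guided_loss([0.6, 0.4], [0.2, 0.8], gold=0)
final_distribution = interpolate([0.6, 0.4], [0.2, 0.8], lam=0.25)
```

With `lam=0` the prediction reduces to the parametric model alone, and with `lam=1` it relies entirely on the retrieved neighbors, which is the trade-off the scaling factor controls.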

For NLP, neural demonstrations (weighted aggregates of neighbor embeddings) are concatenated at the embedding layer to support analogical learning across classes. In computer vision, CLIP-based prompt learning is augmented with retrieval over the few-shot labeled examples, and training is restricted to prompt optimization rather than full encoder retraining.
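A rough sketch of how neural demonstrations might be formed and attached to the input is given below. The similarity-softmax weighting, the one-vector-per-class layout, the alphabetical class ordering, and the function names are assumptions for illustration; the paper's exact construction may differ.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neural_demonstration(query, class_neighbors):
    """Collapse one class's retrieved neighbor embeddings into a single vector.

    Neighbors closer to the query (larger inner product) get larger weights.
    """
    weights = softmax([dot(query, key) for key in class_neighbors])
    return [
        sum(w * key[d] for w, key in zip(weights, class_neighbors))
        for d in range(len(query))
    ]


def augment_input(token_embeddings, query, neighbors_by_class):
    """Append one demonstration vector per class to the token embedding sequence."""
    demos = [
        neural_demonstration(query, neighbors_by_class[label])
        for label in sorted(neighbors_by_class)  # fixed order keeps positions stable
    ]
    return list(token_embeddings) + demos


# Usage: a two-token input gains one extra "virtual token" per class.
query = [1.0, 0.0]
tokens = [[0.1, 0.2], [0.3, 0.4]]
neighbors = {"pos": [[1.0, 0.0], [0.8, 0.2]], "neg": [[0.0, 1.0]]}
augmented = augment_input(tokens, query, neighbors)
```

Because the demonstrations live in embedding space rather than as extra text, the input grows by one vector per class regardless of how many neighbors are retrieved.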

Figure 2: Schematic illustration of the retrieval-augmented prompting mechanism and flow through dense retriever, $k$NN training guidance, and $k$NN output interpolation.

Empirical Evaluation and Results

Extensive experiments demonstrate the efficacy of the proposed approach in both zero-shot and few-shot contexts over a suite of NLP and CV benchmarks:

Few-shot Language Understanding: On tasks ranging from single-sentence classification (SST-2, MR, CR) to multi-class information extraction (FewNERD, SemEval, TACRED), retrieval-augmented prompt learning consistently surpasses state-of-the-art prompt-tuning baselines (KnowPrompt, LM-BFF, KPT). Notably, the method exhibits reduced performance variance, improved generalization, and better handling of long-tail samples.
Few-shot Visual Understanding: On nine public image classification datasets, the approach outperforms both CoOp and zero-shot CLIP, demonstrating clear advantages in scenarios with extremely limited training data. For example, the method achieves +10.48% over CoOp on FGVCAircraft and +2.67% on Food101 with only two training shots.
Figure 3: Comparative results on nine image classification datasets for zero-shot and few-shot settings, demonstrating gains from $k$NN-guided training and inference.
Zero-shot Performance: The system exhibits robust zero-shot transfer using unlabeled training data and pseudo-labeling for datastore construction, outperforming dedicated self-training approaches (LOTClass) and knowledge-augmented methods (KPT).
Fully-supervised Regimes: Retrieval-augmented prompting maintains competitive accuracy even in fully-supervised settings with pronounced long-tail distributions, which suggests that it makes better use of atypical instances while reducing overfitting to shallow patterns.
Figure 4: Performance comparison across fully-supervised paradigms, highlighting improved instance assimilation in long-tailed datasets.
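The zero-shot setting in the list above builds its datastore from unlabeled data with pseudo-labels. A minimal sketch of that idea follows, assuming the model's own zero-shot prediction supplies the label. The function names, the toy encoder and classifier, and the optional `min_confidence` filter are hypothetical and not taken from the paper.

```python
def build_pseudo_labeled_store(unlabeled, encode, predict, min_confidence=0.0):
    """Pair each unlabeled example's embedding with its most likely label.

    encode:  maps an example to an embedding (list of floats).
    predict: maps an example to a dict of label -> probability (zero-shot model).
    min_confidence: assumed filter; examples below it are left out of the store.
    """
    store = []
    for example in unlabeled:
        probabilities = predict(example)
        label, confidence = max(probabilities.items(), key=lambda item: item[1])
        if confidence >= min_confidence:
            store.append((encode(example), label))
    return store


# Toy stand-ins for a real encoder and zero-shot classifier (illustration only).
def toy_encode(text):
    return [float(len(text)), float(text.count("!"))]


def toy_predict(text):
    p_positive = 0.9 if "good" in text else 0.4
    return {"positive": p_positive, "negative": 1.0 - p_positive}


# Usage: keep every example, or only those the model is confident about.
corpus = ["good film!", "dull plot", "good"]
full_store = build_pseudo_labeled_store(corpus, toy_encode, toy_predict)
confident_store = build_pseudo_labeled_store(
    corpus, toy_encode, toy_predict, min_confidence=0.8
)
```

Filtering by confidence trades datastore coverage for label quality; whether and how the original method filters pseudo-labels is not stated in this summary.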

Analysis of Memorization and Generalization

To probe deeper into the memorization mechanisms of PFMs, the paper incorporates influence function-based measures to quantify the memorization score for training samples. Results indicate a disproportionately higher memorization score for atypical examples, suggesting that conventional prompting and fine-tuning strategies favor recall over associative generalization. The retrieval-augmented method, by contrast, yields substantially lower memorization scores, indicating successful decoupling and superior generalization behavior.

Further, case analyses on the ImageNet dataset visualize retrieval impacts, confirming that $k$NN consistently augments both prediction correctness and grounding for hard instances. Instances where $k$NN predictions fail exert minimal adverse impact due to high underlying ground-truth probabilities in neighbor distributions.

Figure 5: Visualization of top-5 retrieved neighbors and prediction probability shifts on ImageNet samples.

Ablation and Component Analysis

Component ablations demonstrate that each retrieval mechanism ($k$NN in training, $k$NN in inference, neural demonstrations, and dynamic knowledge-store refresh) contributes additively to overall model performance. Prompt-based key features combined with inner-product similarity give the best $k$NN retrieval results compared to CLS-token or BM25-based alternatives.

Implications and Outlook

This retrieval-augmented prompting strategy offers a promising direction for improving generalization, particularly under resource-constrained or distribution-shifting conditions. It suggests that explicitly caching training signals and retrieving them as context helps PFMs move beyond rote memorization, improving adaptability in both language and vision domains.

Practically, the method can be integrated into existing prompt-tuning pipelines, lowers the barrier to adapting multimodal architectures, and is especially advantageous in settings where knowledge is updated or extended frequently. The retrieval step, however, adds computational cost and may raise scalability issues for extreme-scale PFMs, which calls for further systems-level optimization.

Theoretically, the decoupling of knowledge and memorization via non-parametric retrieval mechanisms suggests new research in hybrid learning architectures and the development of robust analogical reasoning modules across modalities.

Conclusion

Retrieval-augmented prompt learning provides a principled solution for decoupling knowledge from memorization in PFMs. By integrating dynamic retrieval during input, training, and inference, this framework consistently enhances generalization across NLP and CV tasks in zero-shot, few-shot, and fully-supervised regimes. Future studies may extend the approach to generative modeling, investigate scalability for ultra-large foundation models, and evaluate cross-lingual and multi-modal generalization.

