Text-Prompted Foundation Models

Updated 2 December 2025
  • Text-Prompted Foundation Models are large-scale neural networks controllable via discrete, continuous, or hybrid text prompts, enabling zero-shot and few-shot adaptation.
  • They employ domain-general pretraining on massive multimodal datasets with optimized prompt interfaces to steer model performance without full retraining.
  • Empirical evidence shows that refined prompt engineering improves robustness and accuracy across various tasks in vision, NLP, audio, and graph applications.

A text-prompted foundation model is a large, pre-trained neural model whose behavior is controllable via text prompts—discrete, continuous, or hybrid textual inputs—enabling zero-shot, few-shot, or weakly supervised adaptation to new tasks, modalities, and semantics without full model retraining. This paradigm has become central to contemporary approaches across vision, language, audio, and multimodal domains, leveraging pretraining on massive data and intricate alignment between modalities, with model outputs steered by the design and optimization of text prompts. Text-prompted foundation models are distinguished by their ability to (1) leverage domain-general pretraining through carefully engineered prompt interfaces; (2) exhibit versatility across modalities; and (3) facilitate rapid adaptation or robust generalization (with or without gradient-based prompt tuning) (Mittal et al., 2023, Li et al., 17 Feb 2025, Zhu et al., 14 Oct 2024, Luo et al., 13 Jun 2025, Yu et al., 2022, Stewart et al., 26 Aug 2024).

1. Foundational Principles of Text-Prompted Models

Text-prompted foundation models derive their name from the fact that their inference and, in some cases, training behavior can be influenced or controlled directly via prompts—free-form or template-based text inputs. Foundation models (FMs) are trained on web-scale multimodal or monomodal data, acquiring a combination of contrastive, generative, and alignment capabilities. Prompting, as a paradigm, encompasses:

  • Discrete prompts: strings or token sequences used as hard instructions, few-shot exemplars, or question templates.
  • Continuous (soft) prompts: learnable embedding vectors prepended or injected into model inputs, optimizing task utility without updating the core model parameters (Li et al., 17 Feb 2025, Mittal et al., 2023).
  • Hybrid prompts: combinations of discrete tokens and continuous vectors.

Formally, prompt engineering is cast as an optimization problem:

$$P^* = \arg\max_{P \in \mathcal{P}} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \bigl[ g(f_\theta(P(x)), y) \bigr]$$

where $f_\theta$ is the frozen FM, $P$ is a prompt drawn from the search space $\mathcal{P}$, and $g$ is a task metric (Li et al., 17 Feb 2025).
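
As a concrete illustration of this objective, the following sketch scores a handful of candidate prompt templates against a frozen model on a small labelled development set and keeps the best one. It is a minimal, generic example: `frozen_model`, `metric`, and the candidate templates are placeholders, not components of any specific system from the cited papers.

```python
from typing import Callable, Sequence, Tuple

def select_prompt(
    frozen_model: Callable[[str], str],     # f_theta: prompted input -> prediction
    candidates: Sequence[str],              # prompt search space P (here: format strings)
    dev_set: Sequence[Tuple[str, str]],     # labelled samples (x, y) ~ D
    metric: Callable[[str, str], float],    # task metric g(prediction, y); higher is better
) -> str:
    """Return P* = argmax_P E[g(f_theta(P(x)), y)] over the candidate templates."""
    best_prompt, best_score = candidates[0], float("-inf")
    for template in candidates:
        scores = [metric(frozen_model(template.format(x=x)), y) for x, y in dev_set]
        avg_score = sum(scores) / len(scores)
        if avg_score > best_score:
            best_prompt, best_score = template, avg_score
    return best_prompt
```

Black-box prompt optimizers (Section 3) differ mainly in how they propose candidates; the selection criterion remains this expected-metric objective.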

Prompt design steers the model by conditioning its input space, with semantics, format, and specificity of the prompt mapping onto observed adaptation and performance. Text prompts may specify classes, slot templates, reasoning chains, or domain-specific context, acting as latent programmatic queries to the FM (Yu et al., 2022, Mittal et al., 2023).

2. Canonical Architectures and Cross-Modal Designs

Text-prompted FMs subsume a variety of architectures enabling prompt-driven adaptation:

  • Dual-encoder models: These encode text and another modality (image, graph, audio) into a shared embedding space, aligned by contrastive training. Zero-shot classification is then performed by comparing the embedding of a text prompt ("a photo of a [CLASS]") with the embedding of the query image or graph. This design underpins CLIP, CoCa, and GraphCLIP (Zhu et al., 14 Oct 2024, Yu et al., 2022); a minimal zero-shot classification sketch follows this list.
  • Encoder-decoder models: These treat text prompts as conditioning input to multimodal decoders, which may generate captions, answers, or segmentation masks (Yu et al., 2022, Luo et al., 13 Jun 2025). Hybrid architectures (e.g., TAViS) use text as a cross-modal bridge, aligning audio, image, and segmentation spaces via pseudo-text embeddings.
  • Zero-shot segmentation and detection pipelines: In Text2Seg, text prompts drive instance- or class-specific mask generation in remote-sensing imagery by fusing output from visual FMs (Grounding DINO, CLIP Surgery) and mask generators (SAM), with post-hoc filtering or template selection by CLIP (Zhang et al., 2023). Prompt templates (single-word vs. tailored multi-term) are empirically shown to control class selectivity and performance in text-driven dataset mining workflows (El-Hajj et al., 2023).
  • Graph-prompted models: TAG FMs such as GraphCLIP are trained by constructing graph–summary pairs using LLMs and then aligning their representations with a contrastive loss. At inference, textual class prompts ("This paper belongs to {class}") guide zero-shot or few-shot classification of graph nodes (Zhu et al., 14 Oct 2024).
  • Diffusion models for generation: Text-to-image FMs such as CosmicMan use dense text prompts decomposed semantically by region or attribute, aligning detailed description groups with spatial regions through modified attention decomposition and refocusing objectives (Li et al., 1 Apr 2024).
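
As one concrete instantiation of the dual-encoder pattern mentioned above, the sketch below performs prompt-driven zero-shot image classification through the Hugging Face transformers CLIP interface; the checkpoint name, label set, and image path are placeholder choices, not details from the cited works.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["cat", "dog", "airplane"]                     # placeholder label set
prompts = [f"a photo of a {c}" for c in classes]         # the prompt template defines the label space
image = Image.open("query.jpg")                          # placeholder query image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=-1)[0]
print({c: round(float(p), 3) for c, p in zip(classes, probs)})
```

Swapping the template (e.g., "a satellite image of a {c}") is often enough to repurpose the same frozen weights for a new domain, which is the control lever exploited by the segmentation and dataset-mining pipelines above.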

3. Prompt Optimization Methods: Taxonomy and Practices

Prompt engineering in FMs splits into several orthogonal optimization regimes (Li et al., 17 Feb 2025):

Method Paradigm | Search Space | Core Algorithm
FM-based black-box | Discrete | Prompt-as-input, LLM-based generation, ranking
Evolutionary algorithms | Discrete | Population search, crossover, mutation
Gradient-based | Discrete/Continuous | Gradients for embeddings or masked inference
Reinforcement learning | Discrete/Hybrid | Prompt edit actions as RL policy

  • Black-box search uses the FM itself as an optimizer: iterative prompt refinement, meta-prompting, and response-based selection.
  • Evolutionary schemes introduce population-based exploration of prompt spaces via genetic operations.
  • Gradient-based methods apply (where model access allows) direct tuning of continuous soft tokens or, in discrete spaces, surrogate gradient tricks.
  • RL-based prompt engineering models prompt sequence edits as decision processes, rewarding improved task performance.

Empirical benchmarks demonstrate that soft prompt tuning, in particular, can recover nearly full fine-tuning performance with orders of magnitude fewer updated parameters, and that optimized prompts can yield 2–10% improvements over manual prompt baselines across text, vision, and graph tasks (Li et al., 17 Feb 2025, Mittal et al., 2023, Zhu et al., 14 Oct 2024, Li et al., 1 Apr 2024).
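
To make the soft-prompt regime concrete, here is a minimal PyTorch sketch of the general idea: a small matrix of learnable vectors is prepended to a frozen model's input embeddings, and only those vectors receive gradient updates. The module name, dimensions, and commented training loop are illustrative assumptions rather than any particular method from the cited surveys.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to a frozen model's input embeddings."""

    def __init__(self, num_tokens: int, embed_dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the frozen model's embedding layer
        batch_size = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Training sketch: the backbone stays frozen; only the prompt parameters are optimized.
# frozen_encoder.requires_grad_(False)
# soft_prompt = SoftPrompt(num_tokens=20, embed_dim=768)
# optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
# hidden = frozen_encoder(inputs_embeds=soft_prompt(embedding_layer(input_ids)))
```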

4. Application Domains and Case Studies

Text-prompted foundation models are deployed across a range of domains:

  • Vision: CoCa enables zero-shot classification, retrieval, and captioning from prompt templates without further training (Yu et al., 2022). Text2Seg and Prompt me a Dataset illustrate segmentation and historical object extraction in remote sensing and document images, respectively, with prompt vocabulary and length acting as principal control levers (Zhang et al., 2023, El-Hajj et al., 2023). CosmicMan demonstrates dense, region-aligned captioning for high-precision generation of human images (Li et al., 1 Apr 2024).
  • Graphs: GraphCLIP pioneers zero/few-shot node classification by prompting with class-defining summaries, leveraging LLM-augmented summaries as training targets for a contrastive dual encoder. No label data is needed for pretraining, and prompt tuning with crafted label templates delivers top cross-domain transferability (Zhu et al., 14 Oct 2024); a schematic sketch of this prompt-scoring recipe appears after this list.
  • Audio-Visual Segmentation: TAViS bridges feature spaces between foundation models for audio-visual alignment, using text-based prototype representations as the connecting hub and leveraging cross-entropy-based alignment supervision (Luo et al., 13 Jun 2025).
  • Natural Language Processing: Policy violation detection with minimal supervision leverages a hard prompt specifying extractive and generative sub-tasks (keywords, citations, explanations) with soft prompt tuning for rapid adaptation, requiring only tens to thousands of labeled examples (Mittal et al., 2023). Optimization surveys catalog state-of-the-art soft, hard, and hybrid prompt engineering across standard NLP benchmarks (Li et al., 17 Feb 2025).
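
As a rough illustration of the graph-prompting recipe referenced above (not GraphCLIP's actual code), zero-shot node classification reduces to scoring class-prompt embeddings against a node's graph embedding; the encoder objects and the template string below are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def zero_shot_node_classification(
    graph_encoder,      # hypothetical: maps a node's subgraph and text attributes to a 1-D embedding
    text_encoder,       # hypothetical: maps a class prompt string to a 1-D embedding
    node_input,
    class_names,
    template="This paper belongs to {cls}.",
):
    """Pick the class whose prompt embedding is most similar to the node embedding."""
    node_emb = F.normalize(graph_encoder(node_input), dim=-1)
    prompt_embs = F.normalize(
        torch.stack([text_encoder(template.format(cls=c)) for c in class_names]), dim=-1
    )
    scores = prompt_embs @ node_emb   # cosine similarities between each class prompt and the node
    return class_names[int(scores.argmax())]
```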

5. Stability, Robustness, and Prompt Sensitivity

Prompted FMs are known to exhibit marked instability: small, semantically neutral shifts in prompt wording can substantially degrade accuracy and consistency (Stewart et al., 26 Aug 2024). Instability can be formalized as the expected change in loss or performance under prompt perturbations:

$$\mathcal{I}(\theta) = \mathbb{E}_{(m, t, y),\, \delta} \Bigl[ L\bigl(f(m, t; \theta), y\bigr) - L\bigl(f(m, t + \delta; \theta), y\bigr) \Bigr]$$

Experiments (e.g., on OFASys and Unified-IO) show up to a 50% BLEU reduction on multimodal VQA when the prompt is paraphrased. Augmenting training data with diverse, semantically filtered paraphrases, selected via text or modality similarity, dramatically improves both mean performance and stability (e.g., +0.3 to +0.6 BLEU), with no catastrophic tradeoff on original prompts (Stewart et al., 26 Aug 2024).
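
The sketch below shows how such an instability estimate could be computed empirically, following the sign convention of the formula above; `model`, `loss_fn`, and `paraphrase` are illustrative placeholders rather than the evaluation code of Stewart et al.

```python
from statistics import mean
from typing import Any, Callable, Sequence, Tuple

def prompt_instability(
    model: Callable[[Any, str], Any],         # f(m, t; theta): (modality input, text prompt) -> prediction
    loss_fn: Callable[[Any, Any], float],     # L(prediction, y)
    paraphrase: Callable[[str], str],         # draws a semantically neutral perturbation t + delta
    dataset: Sequence[Tuple[Any, str, Any]],  # samples (m, t, y)
) -> float:
    """Estimate I(theta) as the mean of L(f(m,t),y) - L(f(m,t+delta),y) over the dataset."""
    diffs = []
    for m, t, y in dataset:
        original = loss_fn(model(m, t), y)
        perturbed = loss_fn(model(m, paraphrase(t)), y)
        diffs.append(original - perturbed)    # values far from zero indicate prompt sensitivity
    return mean(diffs)
```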

In prompt-sensitive applications, even formatting choices (spacing, tag syntax) and rare extreme-exemplar effects can alter calibration, as observed in policy violation FMs (Mittal et al., 2023). For vision, prompt length and specificity are shown to disambiguate class mappings and reduce semantic drift (El-Hajj et al., 2023).

6. Limitations, Challenges, and Open Research Directions

Despite their versatility, text-prompted FMs face fundamental challenges:

  • Semantic granularity limit: Zero-shot text–image models often fail to distinguish fine-grained or rare classes, especially when prompts are too vague or out-of-distribution (El-Hajj et al., 2023).
  • Stability and robustness: Prompt fragility remains, especially under paraphrasing, rephrasing, or unseen prompt structures (Stewart et al., 26 Aug 2024), suggesting that broadening the prompt distribution seen during training via paraphrase augmentation is necessary for stable deployment.
  • Interpretability of soft/hybrid prompts: Soft prompt vectors are effective but opaque, raising concerns for transparency and auditability in critical applications (Li et al., 17 Feb 2025).
  • Optimization constraints: Constrained prompt optimization, Pareto frontier objective balancing (accuracy/fairness/robustness), and online adaptation under distributional shift remain unsolved (Li et al., 17 Feb 2025).
  • Cross-modal generalization: Methods for hybrid text–image/audio/graph prompt strategies are less mature than for single-modality text FMs, with emerging evidence that text-bridged approaches (as in TAViS) can serve as a general solution (Luo et al., 13 Jun 2025).
  • Bias and safety: Textually steered models can inherit or amplify pretraining biases; specialized datasets and annotation loops (as in CosmicMan) can mitigate, but not fully solve, such issues (Li et al., 1 Apr 2024).
  • Scalability of pipeline inference: In end-to-end vision pipelines, use of large foundational backbones leads to steep computational cost per input, which may limit their practicality for production (El-Hajj et al., 2023).

Key open research areas include constrained prompt design, agent-oriented prompt planning in stateful environments, zero-shot compositionality, and robust online and continual prompting schemes (Li et al., 17 Feb 2025).

7. Quantitative Performance, Empirical Highlights, and Ablation Insights

Model performance gains and ablation highlights include:

  • CoCa: 86.3% ImageNet zero-shot, 90.6% frozen-encoder, 91.0% full finetuned; state-of-the-art or competitive on video, VQA, retrieval, and captioning (Yu et al., 2022).
  • GraphCLIP: 70.19% zero-shot node accuracy (WikiCS), +15.3 pp over ZeroG, with invariant contrastive objective and LLM-generated summaries central to transferability (Zhu et al., 14 Oct 2024).
  • TAViS: 𝓙=84.8, 𝓕=0.912 on AVSBench (object); ablations reveal text-bridged alignment and prompting each contribute 1-2% gains; text alignment is essential for cross-modal semantic transfer (Luo et al., 13 Jun 2025).
  • Text2Seg: Relative improvement in zero-shot segmentation compared to SAM between 31% and 225% across benchmarks, albeit with only qualitative evaluation reported; prompt orchestration and model fusion by logical union/thresholds (Zhang et al., 2023).
  • CosmicMan: FID reduces by 27% over prior baseline, semantic alignment improves by 10% absolute, with preference scores >80% over SDXL and DALL·E 3 in user studies; Daring refocusing loss and dense attribute-conditioned prompts are key (Li et al., 1 Apr 2024).
  • Prompt instability mitigation: BLEU improvements of up to ×8 on audio QA and ×2–3 on image/video tasks are realized by prompt perturbation retraining (Stewart et al., 26 Aug 2024).

Ablation studies confirm that prompt specificity, alignment losses, invariant objectives, and LLM-generated prompt/summary supervision underlie the performance and generalization properties of state-of-the-art text-prompted FMs (Zhu et al., 14 Oct 2024, Luo et al., 13 Jun 2025, Li et al., 1 Apr 2024, Li et al., 17 Feb 2025).


References:

(Zhang et al., 2023) Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models
(El-Hajj et al., 2023) Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models
(Luo et al., 13 Jun 2025) TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models
(Yu et al., 2022) CoCa: Contrastive Captioners are Image-Text Foundation Models
(Stewart et al., 26 Aug 2024) Surprisingly Fragile: Assessing and Addressing Prompt Instability in Multimodal Foundation Models
(Zhu et al., 14 Oct 2024) GraphCLIP: Enhancing Transferability in Graph Foundation Models for Text-Attributed Graphs
(Mittal et al., 2023) Using Foundation Models to Detect Policy Violations with Minimal Supervision
(Li et al., 1 Apr 2024) CosmicMan: A Text-to-Image Foundation Model for Humans
(Li et al., 17 Feb 2025) A Survey of Automatic Prompt Engineering: An Optimization Perspective
