CoPS: Conditional Prompt Synthesis
- Conditional Prompt Synthesis (CoPS) is a family of methods that dynamically generates instance-specific prompts to guide frozen deep models using minimal trainable parameters.
- It utilizes techniques such as expert pooling, soft routing, and attention mechanisms to adaptively merge input features and auxiliary modalities for improved task performance.
- Empirical studies show that CoPS achieves robust transfer learning results, balancing parameter efficiency with significant accuracy gains in multi-modal and zero-/few-shot scenarios.
Conditional Prompt Synthesis (CoPS) refers to a family of parameter-efficient approaches for synthesizing adaptive, input- or context-conditioned prompts that steer large, typically frozen, deep models (such as LLMs, vision models, or multimodal transformers) toward instance- or task-specific behavior. CoPS exploits conditionality—via learnable functions, routing mechanisms, or attention over expert pools—to move beyond static prompt templates and achieve strong performance and generalization with minimal trainable parameters, particularly in transfer (zero- and few-shot), multi-modal, and compositional settings.
1. Definitions and Conceptual Foundations
Conditional Prompt Synthesis is formally characterized by the goal of learning a prompt-synthesizing function (with trainable parameters) that maps instance-specific information or task metadata into a continuous prompt, which is then injected into a frozen, pre-trained backbone model to adapt its predictions. The core feature distinguishing CoPS from vanilla prompting is that the prompts are not fixed or purely learned per class/task, but instead dynamically derived from input or side information—potentially including auxiliary modalities, semantic priors, learned expert pools, or structured rule systems.
Variants of CoPS exist in multiple domains:
- In language modeling, CoPS modules can transform task instructions or input tags into differentiable prompt vectors that enable task-specific outputs from a fixed LLM (Pilault et al., 2023).
- In vision and multimodal problems, CoPS mechanisms synthesize text and/or visual prompts by conditioning on visual features, semantic prototypes, or even other modalities, as in prompt fusion or anomaly detection (Jiang et al., 2023, Yang et al., 11 Jul 2025, Chen et al., 5 Aug 2025).
Common architectural design principles of CoPS include:
- Pooling a set of learnable "prompt experts" that are mixed via soft gating or routing;
- Using lightweight selector/routing networks (MLP, attention, production system modules) that output instance-dependent mixture weights;
- Integrating the conditional prompt into the model input, attention layers, or feature stages to bias computation toward relevant semantics (see the sketch after this list).
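The following minimal sketch illustrates the generic pattern behind these principles: a small trainable network maps pooled instance features to a continuous prompt that is prepended to the token embeddings of a frozen backbone. The class name, dimensions, and MLP synthesizer are illustrative assumptions, not any specific paper's implementation.

```python
import torch
import torch.nn as nn

class ConditionalPromptInjector(nn.Module):
    """Toy instance-conditioned prompt synthesizer (hypothetical interface)."""

    def __init__(self, feat_dim: int, hidden_dim: int, prompt_len: int, model_dim: int):
        super().__init__()
        self.prompt_len = prompt_len
        self.model_dim = model_dim
        # Lightweight synthesizer: pooled features -> prompt tokens.
        self.synthesizer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, prompt_len * model_dim),
        )

    def forward(self, pooled_feats: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # pooled_feats: (B, feat_dim); token_embeds: (B, T, model_dim) from the frozen backbone.
        prompt = self.synthesizer(pooled_feats).view(-1, self.prompt_len, self.model_dim)
        # Prepend the instance-conditioned prompt to the input sequence.
        return torch.cat([prompt, token_embeds], dim=1)

# Only the injector's parameters are trained; the backbone stays frozen.
injector = ConditionalPromptInjector(feat_dim=512, hidden_dim=256, prompt_len=8, model_dim=768)
extended = injector(torch.randn(4, 512), torch.randn(4, 32, 768))  # (4, 40, 768)
```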
2. Architectural Methodologies
2.1 Expert Pools and Soft Routing
A prevalent CoPS formulation employs a pool of $N$ learnable prompt experts $\{P_1, \dots, P_N\}$, each $P_i \in \mathbb{R}^{L \times d}$, where $L$ is the prompt token count and $d$ the backbone hidden dimension. For a given input $x$, a selector network synthesizes a weighted mixture:

$$P(x) = \sum_{i=1}^{N} \alpha_i(x)\, P_i, \qquad \alpha(x) = \mathrm{softmax}\big(g(f(x))\big)$$

Here, $f(\cdot)$ is an input embedding extractor (e.g., pooled features), and $g$ and the experts $\{P_i\}$ are learned (Wang et al., 2023). This prompt is then concatenated or injected into the main model at the input or early transformer/CNN stages.
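A minimal sketch of this expert-pool formulation follows, assuming a pooled feature vector as the conditioning signal and a two-layer MLP selector (both assumptions; SCP's exact architecture may differ):

```python
import torch
import torch.nn as nn

class PromptExpertPool(nn.Module):
    """Sketch of P(x) = sum_i alpha_i(x) P_i with alpha(x) = softmax(g(f(x)))."""

    def __init__(self, num_experts: int, prompt_len: int, model_dim: int, feat_dim: int):
        super().__init__()
        # Pool of learnable prompt experts P_i in R^{L x d}.
        self.experts = nn.Parameter(torch.randn(num_experts, prompt_len, model_dim) * 0.02)
        # Selector g(.) operating on pooled input features f(x).
        self.selector = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(),
            nn.Linear(feat_dim // 2, num_experts),
        )

    def forward(self, pooled_feats: torch.Tensor) -> torch.Tensor:
        # pooled_feats: (B, feat_dim) -> routing weights alpha(x): (B, N).
        alpha = torch.softmax(self.selector(pooled_feats), dim=-1)
        # Weighted mixture over experts -> conditional prompt of shape (B, L, d).
        return torch.einsum("bn,nld->bld", alpha, self.experts)

pool = PromptExpertPool(num_experts=16, prompt_len=8, model_dim=768, feat_dim=512)
prompt = pool(torch.randn(4, 512))  # (4, 8, 768), injected into the frozen backbone
```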
2.2 Mixture of Prompt Experts (MoPE)
Conditional Prompt Tuning for multimodal fusion (Jiang et al., 2023) proposes an extension where, for each instance $x$ and each transformer layer $l$, a dynamic prompt is constructed as a soft mixture of prompt experts $\{E^{(l)}_1, \dots, E^{(l)}_N\}$, with soft routing scores $r^{(l)}_i(x)$ (dependent on a prior modality):

$$P^{(l)}(x) = \sum_{i=1}^{N} r^{(l)}_i(x)\, E^{(l)}_i$$
To balance expert utilization, an "importance loss" regularizes the distribution of routing weights across a mini-batch, incentivizing balanced expert usage and preventing degeneracy.
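As an illustration, a batch-level importance regularizer can be written as the squared coefficient of variation of per-expert routing mass, a common form in mixture-of-experts training; whether MoPE uses exactly this form is an assumption of the sketch.

```python
import torch

def importance_loss(routing_weights: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Penalize unbalanced expert usage across a mini-batch (CV^2 form, assumed)."""
    # routing_weights: (B, N) softmax scores over N prompt experts.
    importance = routing_weights.sum(dim=0)                    # per-expert mass over the batch
    cv_squared = importance.var(unbiased=False) / (importance.mean() ** 2 + eps)
    return cv_squared                                          # small when usage is balanced

weights = torch.softmax(torch.randn(32, 16), dim=-1)
reg = importance_loss(weights)  # added to the task loss with a small coefficient
```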
2.3 Production-System Modules
In language modeling, CoPS can be realized via differentiable production systems, as in PRopS (Pilault et al., 2023), where a set of rule modules (attention or MLP blocks) is selected and composed according to the input/task condition. A Gumbel-softmax-based selector induces sparsity in module choice:

$$\mathcal{S}(x) = \operatorname{top-}k\big(\mathrm{GumbelSoftmax}(g(c(x)))\big), \qquad P(x) = \mathrm{Compose}\big(\{M_j\}_{j \in \mathcal{S}(x)}\big)$$

where $c(x)$ embeds the input/task, $\{M_j\}$ are the rule modules, and $\mathcal{S}(x)$ is the selected set of $k$ modules.
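A hedged sketch of sparse, differentiable module selection using PyTorch's Gumbel-softmax relaxation with a straight-through top-k mask; the scorer, temperature, and straight-through trick are generic choices, not necessarily PRopS' exact selector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseModuleSelector(nn.Module):
    """Select k of R rule modules from a condition embedding c(x) (illustrative)."""

    def __init__(self, cond_dim: int, num_modules: int, k: int, tau: float = 1.0):
        super().__init__()
        self.k, self.tau = k, tau
        self.scorer = nn.Linear(cond_dim, num_modules)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        # cond: (B, cond_dim) embedding of the input/task condition.
        logits = self.scorer(cond)
        soft = F.gumbel_softmax(logits, tau=self.tau, hard=False)  # relaxed samples
        topk = soft.topk(self.k, dim=-1).indices
        mask = torch.zeros_like(soft).scatter_(-1, topk, 1.0)      # discrete k-hot mask
        # Straight-through: k-hot mask on the forward pass, soft gradients backward.
        return mask + soft - soft.detach()

selector = SparseModuleSelector(cond_dim=256, num_modules=12, k=3)
gate = selector(torch.randn(4, 256))  # (4, 12), exactly 3 active modules per row
```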
2.4 Cross-Modal Conditional Prompting
Recent VLM research (Yang et al., 11 Jul 2025) synthesizes both text and visual conditional prompts via mutual guidance. Semantic prompts are extracted with a multi-modal LLM (MLLM) using attention over the MLLM’s decoder cache, followed by adaptation into VLM space. Visual prompts are then constructed by mutually guiding visual and semantic features through self- and cross-attention (AMG module).
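The mutual-guidance step can be pictured as visual tokens first self-attending and then cross-attending to semantic prompt tokens. The sketch below uses standard attention layers and illustrative dimensions, and is only loosely inspired by the AMG module rather than a reproduction of it.

```python
import torch
import torch.nn as nn

class MutualGuidance(nn.Module):
    """Toy self- plus cross-attention block guiding visual tokens with semantic prompts."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # visual: (B, Tv, dim) patch tokens; semantic: (B, Ts, dim) semantic prompts.
        v, _ = self.self_attn(visual, visual, visual)
        v = self.norm1(visual + v)
        # Visual tokens attend to semantic prompts, yielding conditioned visual prompts.
        out, _ = self.cross_attn(v, semantic, semantic)
        return self.norm2(v + out)

amg_like = MutualGuidance(dim=512)
visual_prompts = amg_like(torch.randn(2, 49, 512), torch.randn(2, 8, 512))  # (2, 49, 512)
```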
2.5 Prototype and Semantic Token Enhancement
For zero-shot anomaly detection, CoPS (Chen et al., 5 Aug 2025) leverages explicit state prototypes (extracted via cross-attention over patch features) and implicit semantic class tokens (sampled via VAE from global image features) to assemble context-rich, state-aware prompts. A spatially-aware alignment module further refines prompt effectiveness for both image-level and pixel-level detection.
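The implicit semantic token can be understood as a VAE-style sample conditioned on global image features; the head below uses the standard reparameterization trick with illustrative dimensions, and the actual CoPS anomaly-detection module is more elaborate.

```python
import torch
import torch.nn as nn

class SemanticTokenSampler(nn.Module):
    """Sample an implicit semantic class token from global features (illustrative VAE head)."""

    def __init__(self, feat_dim: int, token_dim: int):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, token_dim)
        self.to_logvar = nn.Linear(feat_dim, token_dim)

    def forward(self, global_feat: torch.Tensor):
        # global_feat: (B, feat_dim) pooled image representation.
        mu, logvar = self.to_mu(global_feat), self.to_logvar(global_feat)
        token = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        # KL term keeps the latent close to a standard normal prior.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return token, kl

sampler = SemanticTokenSampler(feat_dim=768, token_dim=512)
token, kl = sampler(torch.randn(4, 768))  # token: (4, 512), assembled into the prompt
```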
3. Training Objectives and Regularization
CoPS methods generally follow the "frozen backbone, trainable prompt" paradigm, optimizing the conditional prompt parameters along with the selector/routing modules via a standard downstream task loss, often cross-entropy or binary cross-entropy, with added regularization for balanced expert usage, prompt diversity, or compositional sparsity.
For instance, (Jiang et al., 2023) minimizes a combined objective of the form

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda\, \mathcal{L}_{\text{importance}},$$

while (Chen et al., 5 Aug 2025) optimizes joint modular losses for state prototype alignment, variational class sampling, and spatial text-image alignment, i.e., an objective of the form

$$\mathcal{L} = \mathcal{L}_{\text{proto}} + \mathcal{L}_{\text{var}} + \mathcal{L}_{\text{align}}.$$
Contrastive learning objectives are common for multimodal settings, with additional regularizers enforcing consistent prompt usage or feature alignment with augmentations.
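A compact sketch of the training setup described in this section, with a stand-in linear "backbone" and a single-token synthesizer; the fusion step, loss, and module names are illustrative assumptions, the point being that gradients reach only the prompt synthesizer and task head while the backbone stays frozen.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Linear(512, 768)                  # stand-in for a frozen pre-trained model
for p in backbone.parameters():
    p.requires_grad_(False)                     # backbone stays frozen

synthesizer = nn.Linear(512, 768)               # trainable conditional prompt generator
head = nn.Linear(768, 10)                       # trainable task head
optimizer = torch.optim.AdamW(
    list(synthesizer.parameters()) + list(head.parameters()), lr=1e-3
)

x, y = torch.randn(4, 512), torch.randint(0, 10, (4,))
with torch.no_grad():
    feats = backbone(x)                         # frozen feature extraction
prompt = synthesizer(x)                         # instance-conditioned prompt (toy: one token)
logits = head(feats + prompt)                   # toy fusion of prompt and frozen features
loss = F.cross_entropy(logits, y)               # standard downstream task loss
loss.backward()                                 # gradients reach only synthesizer and head
optimizer.step()
```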
4. Applications and Empirical Impact
Conditional Prompt Synthesis methods have been validated across a diverse range of applications:
| Domain | CoPS Implementation (Source) | Representative Gains |
|---|---|---|
| Video Action Recognition | Soft Conditional Prompt Learning (SCP) (Wang et al., 2023) | +3.17–10.2% accuracy on Okutama, NECDrone, SSV2 |
| Multimodal Fusion | MoPE-based conditional tuning (Jiang et al., 2023) | SOTA with 0.7% params, matches or exceeds fine-tuning |
| Zero-shot Anomaly Detection | CoPS (Chen et al., 5 Aug 2025) | +2.5 pp AUROC vs. prior SOTA (92.5% vs. 90.0%) |
| Vision-LLMs | MuGCP (Yang et al., 11 Jul 2025) | +2.01% (few-shot HM metric) vs. previous best |
| LLM Adaptation | PRopS (Pilault et al., 2023) | +15.5% compositional EM accuracy over baseline |
In all settings, CoPS techniques consistently outperform non-conditional prompt methods or static prompt baselines—particularly in settings with limited data, task composition, or where parameter efficiency is critical.
5. Comparative Analysis and Ablations
Empirical analyses reveal several robust properties of CoPS designs:
- Expressivity: Instance-conditional prompt synthesis, especially via expert pooling or mixture modules, scales more effectively than simply enlarging prompt length (Jiang et al., 2023).
- Generalization: Compositional/gated prompt systems (e.g., PRopS) enable zero- and few-shot transfer by reusing learned “subprompts” for novel input combinations, yielding sample-efficient generalization (Pilault et al., 2023).
- Parameter Efficiency: Across benchmarks, CoPS approaches achieve high accuracy with 1–10% (often <1%) of the trainable parameters required for full fine-tuning or adapter-based transfer.
- Ablations: Removing dynamic routing, regularization, or mutual-attention modules causes significant drops in performance and generalization (e.g., importance loss prevents expert collapse, full AMG and multi-prompt fusion boost generalization in MuGCP (Yang et al., 11 Jul 2025)).
- Prompt Diversity: Balanced utilization of prompt experts (encouraged via importance loss or similar terms) prevents routings from collapsing onto a few experts and improves robustness to data scaling (Jiang et al., 2023).
6. Limitations and Open Directions
Known constraints of CoPS methodologies include:
- Selector Design Sensitivity: Effectiveness depends on the capacity, architecture, and regularization of the selector/router. Overly simplistic routings cannot capture complex input variability; poorly regularized selectors collapse onto a small subset of experts.
- Prompt Interpretation: Learned prompt experts or composed modules do not always correspond to semantically interpretable factors or tasks; understanding prompt semantics remains an open problem (Pilault et al., 2023).
- Resource Overhead: Some advanced schemes (e.g., MuGCP (Yang et al., 11 Jul 2025)) require substantial compute and memory due to reliance on MLLM decoders, offline caching, or multiple attention modules.
- Domain/Task Portability: Optimal pool size, token counts, and fusion topology are task-dependent and may require extensive ablation.
- Noise and Overfitting: External priors (such as MLLM-generated semantic embeddings) can encode irrelevant or spurious context, necessitating future research into content filtering and adaptive knowledge distillation (Yang et al., 11 Jul 2025).
Directions for future exploration include: lightweight distillation of prompt knowledge, dynamic gating, efficient memory management for prompt caches, and extension of CoPS principles to detection, segmentation, or video-language alignment (Yang et al., 11 Jul 2025).
7. Theoretical Properties and Interpretability
Initial theoretical evidence (e.g., (Pilault et al., 2023)) suggests that CoPS-style modular prompt systems retain favorable sample complexity compared to monolithic prompt learning, provided module selection is sparse and compositionally structured. Proposition 1 in (Pilault et al., 2023) formalizes that, under compositional reuse and sufficient expressivity, risk can be made arbitrarily close to the Bayes risk with polynomially fewer samples relative to the number of modules and composed subtasks. This suggests that prompt libraries can span large task spaces while preserving compactness and adaptation speed. Moreover, gating scores or module activations in these systems provide an interpretable basis for analyzing input-to-prompt mappings, though the alignment to human-interpretable subtasks is not always guaranteed.
References
- "SCP: Soft Conditional Prompt Learning for Aerial Video Action Recognition" (Wang et al., 2023)
- "Conditional Prompt Tuning for Multimodal Fusion" (Jiang et al., 2023)
- "CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection" (Chen et al., 5 Aug 2025)
- "On Conditional and Compositional LLM Differentiable Prompting" (Pilault et al., 2023)
- "Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-LLMs" (Yang et al., 11 Jul 2025)