
Cross-Model Prompting

Updated 10 September 2025
  • Cross-model prompting is an approach that employs structured, hierarchical prompts to steer diverse models across modalities and tasks.
  • It integrates learned, hand-crafted, and dynamically generated prompts to inject rich contextual cues for improved domain transfer and few-shot performance.
  • Architectural innovations like depth-partitioned prompting and multi-prompt ensembles drive enhanced generalization and robustness across multi-modal benchmarks.

Cross-model prompting is a family of methods designed to steer, adapt, or coordinate the behavior of one or more models—potentially spanning different modalities, domains, or tasks—by means of carefully constructed prompts. Leveraging recent advances in vision-language foundation models, LLMs, and multi-modal pretraining, cross-model prompting exploits learnable, structured, or dynamically generated instructions to activate model capabilities beyond traditional fine-tuning paradigms. It encompasses methods such as depth-partitioned prompting, hierarchical prompt ensembles, cross-lingual/contextual prompting, dynamic cross-modal interaction, and semantic task-guided prompting.

1. Architectural Foundations and Multi-Prompt Designs

Conventional soft prompt learning typically introduces a single sequence of learnable tokens as model-guiding context, either concatenated with class tokens (in vision-language models) or prepended to the input in transformer-based architectures. Recent advances highlight limits of this approach, particularly in capturing diverse or hierarchical features in multi-modal tasks. The Partitioned Multi-modal Prompt (PMPO) framework exemplifies a transition to multi-prompt and depth-partitioned strategies: transformer layers in the visual encoder are partitioned such that each learnable prompt is mapped, via linear projections, into distinct layer depths. Formally, a set of N prompts \{p^n\}_{n=1}^N is projected across the D transformer block depths as:

V_1, V_2, \ldots, V_D = g(p^1, \ldots, p^N)

and, at each block L_i,

[x_i, E_i] = L_i([x_{i-1}, V_{i-1}, E_{i-1}])

where x_i is the [CLS] token embedding and E_i are the remaining tokens. Multi-prompt ensembles are then used to compute class-wise text features and predictions via, e.g., cosine similarity.

This architecture ensures prompts capture hierarchical and complementary information not accessible to “uni-prompt” designs. Multi-level, learnable prompts can be aligned with the hierarchical nature of representation learning in transformers, thereby activating features distributed across network depths (Tian et al., 2023).
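
The depth-partitioned injection described above can be sketched as follows. This is a minimal illustration, not PMPO's actual implementation: the dimensions, the random stand-in prompts, the partition rule, and the toy `encoder_block` (a real block uses attention and an MLP) are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration.
N_PROMPTS, DEPTH, PROMPT_LEN, D_MODEL = 4, 8, 4, 64

# N learnable prompts p^1..p^N (random stand-ins here).
prompts = rng.normal(size=(N_PROMPTS, PROMPT_LEN, D_MODEL))

# Linear "bridges" implementing g(.): each depth d gets a projection
# of the prompt assigned to its partition.
bridges = rng.normal(size=(DEPTH, D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)

def partitioned_prompts(prompts, bridges):
    """Return V_1..V_D: per-depth prompt tokens, one partition per prompt."""
    depth = len(bridges)
    per_prompt = depth // len(prompts)          # layers covered by each prompt
    V = []
    for d in range(depth):
        p = prompts[min(d // per_prompt, len(prompts) - 1)]
        V.append(p @ bridges[d])                # project into layer-d space
    return V

def encoder_block(x, V_d):
    # Stand-in for transformer block L_i: prepend depth-specific prompts
    # to [CLS] and patch tokens, then apply a toy nonlinearity.
    tokens = np.concatenate([x[:1], V_d, x[1:]], axis=0)  # [x, V_d, E]
    return np.tanh(tokens)

V = partitioned_prompts(prompts, bridges)
x = rng.normal(size=(1 + 16, D_MODEL))          # [CLS] + 16 patch tokens
for d in range(DEPTH):
    x = encoder_block(x, V[d])
    # keep [CLS] and patch tokens only; the next depth injects fresh V_{d+1}
    x = np.concatenate([x[:1], x[1 + PROMPT_LEN:]], axis=0)

print(x.shape)  # (17, 64): [CLS] plus patches after D prompted blocks
```

The key structural point is that each depth receives its own projected prompt tokens rather than reusing one shared soft prompt throughout the stack.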

2. Hierarchical, Cross-Modal, and Semantic Prompt Integration

Integration of prior, task-specific, or cross-modal information is central to robust cross-model prompting. PMPO demonstrates the use of blended ensembles: individually trained multi-prompts are combined with hand-crafted templates (e.g., “a photo of [class]”). The encoded text embeddings from both are ensembled by mean aggregation:

T^*_i = \text{mean}\left( T(t^i_1), \ldots, T(t^i_N), T(t^i_{\text{prior}}) \right)

where T(\cdot) is the text encoder output and t^i_n are the prompts for the i-th class. This architectural motif generalizes well: semantic prompting (as in CRISP-SAM2) replaces geometric prompts with rich natural language descriptions, infusing neural segmentation models with domain-specific contextual semantics via progressive cross-modal interaction (Yu et al., 29 Jun 2025).
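
The mean-aggregation ensemble above reduces to a few lines; the sketch below uses random stand-ins for the text-encoder outputs T(t^i_n), and the class count, prompt count, and embedding size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, N, D = 3, 4, 32   # classes, learned prompts per class, embed dim (toy)

# Stand-ins for encoder outputs T(t^i_n) of N learned prompts plus one
# hand-crafted template ("a photo of [class]") per class.
learned = rng.normal(size=(C, N, D))
prior = rng.normal(size=(C, 1, D))

def ensemble_text_features(learned, prior):
    """T*_i = mean over learned-prompt and prior-template embeddings."""
    return np.concatenate([learned, prior], axis=1).mean(axis=1)

def classify(image_feat, text_feats):
    """Predict the class by cosine similarity against ensembled features."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

T_star = ensemble_text_features(learned, prior)   # (C, D)
pred = classify(rng.normal(size=D), T_star)
print(T_star.shape, pred)
```

Blending the fixed template into the mean is what lets the hand-crafted prior regularize the learned prompts at inference time with no extra parameters.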

In multi-organ medical segmentation, this approach allows hierarchical transformers to be driven by descriptive cues, with cascaded cross-attention between visual features F_V and textual features F_T, summarized by interaction modules:

F_{VT} = \text{CrossAttn}(F_T, F_V), \quad F_{TV} = \text{CrossAttn}(F_V, F_T)

followed by concatenation and second-stage cross-attention, yielding contextually enriched features that ultimately control prompt token computation and mask generation.
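
A minimal sketch of the two-stage bidirectional cross-attention is given below. It assumes single-head attention, random stand-in features, and a particular query/key-value reading of CrossAttn(·,·) (first argument as queries); CRISP-SAM2's actual module is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32
F_V = rng.normal(size=(49, D))   # visual tokens (e.g. a 7x7 feature map)
F_T = rng.normal(size=(12, D))   # text tokens from the description

def cross_attn(q, kv):
    """Single-head cross-attention: queries q attend over keys/values kv."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)      # row-wise softmax
    return w @ kv

# First stage: each modality attends over the other.
F_VT = cross_attn(F_T, F_V)     # (12, D): text queries, visual context
F_TV = cross_attn(F_V, F_T)     # (49, D): visual queries, text context

# Second stage: concatenate with a further cross-attention pass to get
# contextually enriched per-location features.
fused = np.concatenate([F_TV, cross_attn(F_TV, F_VT)], axis=-1)  # (49, 2D)
print(F_VT.shape, F_TV.shape, fused.shape)
```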

3. Robustness, Generalization, and Domain Transfer

The explicit design of cross-model prompting addresses limitations in generalization and robustness—crucial in transfer, zero-shot, and few-shot scenarios. Partitioned and multi-prompt frameworks, such as PMPO, show improved harmonic mean accuracy on both seen and unseen classes (e.g., 79.28% harmonic mean across 11 datasets, a +7.62-point gain over CoOp). Evaluation paradigms span:

  • New class generalization (training on base classes, testing on unseen classes)
  • Cross-dataset evaluation (training on a canonical dataset like ImageNet, evaluating elsewhere)
  • Domain generalization (robustness to shifts in data properties, e.g., ImageNet variations)

Hierarchical prompting, prompt-depth partitioning, and ensembling prevent overfitting and leverage both learned and human-curated contexts, resulting in more balanced adaptation performance and increased resilience to domain shifts (Tian et al., 2023, Yu et al., 29 Jun 2025).
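
The harmonic-mean metric used in base-to-new evaluation is simple to compute; the accuracies below are illustrative placeholders, not results from any paper.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy (in %): the
    balanced metric reported for base-to-new generalization, which
    penalizes a model that sacrifices one side for the other."""
    return 2 * base_acc * new_acc / (base_acc + new_acc)

# Illustrative numbers only:
print(round(harmonic_mean(82.0, 76.0), 2))  # → 78.89
```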

4. Connections to Cross-Modal and Cross-Task Prompting

Cross-model prompting provides a general methodology for aligning and combining information across models, tasks, or modalities. In cross-task prompting approaches such as CroPrompt, outputs from one task are injected into the prompt context of a follow-on task, creating explicit cross-task dependencies:

  • Initial prompt elicits intent detection.
  • The predicted intent is transferred via injection to a slot filling prompt, narrowing the candidate label set and boosting accuracy.

Multi-task self-consistency mechanisms—where multiple reasoning routes and output voting are performed at sentence- and token-level—can further suppress error propagation. Although primarily discussed in the context of SLU (Qin et al., 15 Jun 2024), these approaches generalize to broader cross-model settings, offering strategies to connect heterogeneous models or multi-modal branches (e.g., vision, language, audio).
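
The intent-to-slot injection plus self-consistency voting can be sketched as below. The prompt template, the stubbed model samples, and the function names are all hypothetical; a real pipeline would replace the stubs with sampled LLM calls.

```python
from collections import Counter

def build_slot_prompt(utterance, predicted_intent):
    """Inject the upstream intent prediction into the slot-filling prompt,
    narrowing the candidate slot-label set (hypothetical template)."""
    return (f"Utterance: {utterance}\n"
            f"Intent: {predicted_intent}\n"
            f"Fill the slots consistent with this intent:")

def self_consistent(samples):
    """Majority vote over multiple sampled reasoning routes."""
    return Counter(samples).most_common(1)[0][0]

# Stage 1: intent detection -- stand-in samples for several LLM runs.
intent = self_consistent(["book_flight", "book_flight", "book_hotel"])

# Stage 2: slot filling conditioned on the voted intent.
prompt = build_slot_prompt("fly me to Boston on Friday", intent)
print(intent)
print(prompt)
```

Voting before injection matters: an erroneous intent prediction would otherwise propagate directly into the downstream prompt.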

5. Distinctions from Related Prompting Paradigms

Cross-model prompting shares surface similarities with other paradigms but exhibits distinct structural innovations:

  • In contrast to static soft prompt learning, depth-partitioned or multi-prompt ensembles capitalize on the hierarchical and distributed nature of representation learning.
  • Rather than treating all prompts as equivalent, these methods allocate prompts to distinct network depths or tasks, enabling role- or context-specific control.
  • Cross-modal prompting strategies (e.g., semantic text prompts for vision models) move beyond geometric or fixed prompt schemes and condition representations on high-level descriptors rather than direct sensor-level observations.
  • Integration of prior, manually crafted templates with learnable prompts enables a balance between model flexibility and robust out-of-domain generalization.

6. Mathematical and Practical Implementation Considerations

Implementing cross-model prompting frameworks involves:

  • Allocating transformer depth partitions and assigning prompts to specific subsets of layers.
  • Designing multi-level prompt linear projections (bridges) to adapt prompt representations for integration at designated depths.
  • Combining learned prompts with fixed templates via mean or attention-based strategies.
  • Managing parameter and computational efficiency: multi-prompt architectures can be tuned with minimal additional parameters, facilitating deployment in scenarios requiring fast adaptation or memory efficiency.
  • Ensuring consistent inference time by leveraging prompt ensembling and minimizing dynamic reconfiguration.
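
A back-of-envelope check of the parameter-efficiency point: counting only the learnable prompt tokens against a frozen backbone. The configuration below (prompt counts, ViT-B-scale parameter total) is illustrative, not taken from any paper.

```python
def prompt_overhead(n_prompts, prompt_len, d_model, backbone_params):
    """Added learnable-token parameters relative to the frozen backbone:
    a rough check that multi-prompt tuning stays lightweight."""
    added = n_prompts * prompt_len * d_model
    return added, added / backbone_params

# Toy configuration (illustrative numbers):
added, frac = prompt_overhead(n_prompts=4, prompt_len=4, d_model=768,
                              backbone_params=86_000_000)  # ~ViT-B scale
print(added)          # 12288 extra parameters
print(f"{frac:.4%}")  # well under 0.1% of the backbone
```

Any projection bridges add further parameters, but the tuned state remains a small fraction of full fine-tuning, which is what enables fast adaptation and cheap per-task storage.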

Performance metrics in evaluated frameworks typically include task/class accuracy, harmonic mean accuracy, cross-domain transferability, as well as ablation studies highlighting the additive benefit of manual template integration and learned prompt diversity.

7. Prospective Extensions and Broader Implications

The fundamental architectural motif—specialized, hierarchically structured, or cross-modal prompts controlling pre-trained models—has implications for a broad array of tasks:

  • Vision-language reasoning, video analysis, and multi-modal fusion can benefit from hierarchical, adaptive prompt injection aligned with transformer depth or modality-specific encoders.
  • Future research may generalize partitioned prompting to additional modalities (e.g., point clouds, audio) or to systems with multiple encoders/decoders across tasks.
  • Semantic prompting, via detailed human-like descriptions (rather than geometric constraints), offers a path to reduced annotation cost, improved interpretability, and greater alignment with natural downstream applications.
  • Incorporation of domain knowledge and hierarchical contextual splitting can yield robust transferability even in the presence of distributional shifts and novel categories.

In summary, cross-model prompting constitutes an important technical evolution—extending prompting beyond the surface context into deeply integrated, multi-level, and cross-modal control. Such methods offer expanded generalization, robust adaptation, and greater compositionality across the growing universe of large-scale, modality-rich foundation models (Tian et al., 2023, Yu et al., 29 Jun 2025).
