Modality-Specific Prompting (MSP)
- Modality-Specific Prompting (MSP) is a training paradigm that assigns dedicated, learnable prompts to each modality, reducing the prompt count from exponential to linear in the number of modalities.
- It integrates the modality-specific prompts with frozen pretrained architectures, enabling robust handling of missing modalities and promoting effective cross-modal fusion.
- The approach leverages an orthogonality loss and selective training protocols to ensure parameter efficiency, enhanced robustness, and scalable continual learning.
Modality-Specific Prompting (MSP) is an architectural and training paradigm for parameter-efficient and robust adaptation of large multimodal models to variable or missing modality scenarios. MSP assigns a dedicated, learnable prompt to each modality, integrating these prompts into frozen pretrained architectures to encode modality-unique information, address missing-modality cases without exponential prompt explosion, and enhance robustness and modularity in fusion and continual learning settings. The approach combines parameter efficiency with mechanisms to promote cross-modal diversity and task stability, enabling large-scale transformers to generalize in complex real-world multimodal environments.
1. Foundational Principles of Modality-Specific Prompting
MSP replaces the classic “missing-aware prompt” (MAP) paradigm, in which one learns a distinct prompt for every possible missing-modality pattern, amounting to $2^M - 1$ prompts for $M$ modalities. In contrast, MSP allocates a single prompt per modality, reducing the count to $M$ and thus converting exponential growth to linear, critically improving scalability. Each modality-specific prompt $P_m$ (for modality $m$) is trained to capture the unique cues characteristic of its input domain, for example, image texture or linguistic tokens.
Prompt integration follows an aggregation scheme: for a sample with present modalities $\mathcal{M}_{\text{obs}} \subseteq \{1, \dots, M\}$, the composite prompt is $P = [P_m]_{m \in \mathcal{M}_{\text{obs}}}$, i.e., the aggregation (e.g., concatenation) of the prompts of the observed modalities, which is then concatenated with the task-specific data and fed into the frozen backbone transformer. Absent modalities use placeholder (dummy) tokens to preserve sequence shape, ensuring architectural invariance regardless of input sparsity (Jang et al., 2023).
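This composition step can be sketched in a few lines of PyTorch; the function name, dictionary layout, and dummy-token handling below are illustrative assumptions, not taken from the cited implementation.

```python
import torch

def build_input_sequence(prompts, embeddings, present, dummy_tokens):
    """Concatenate the prompts of observed modalities with per-modality tokens.

    prompts:      dict modality -> (L_p, d) learnable prompt tensor
    embeddings:   dict modality -> (N_m, d) token embeddings of observed inputs
    present:      modalities actually observed for this sample
    dummy_tokens: dict modality -> (N_m, d) placeholders for missing inputs
    """
    # Composite prompt: only the prompts of the observed modalities.
    composite = torch.cat([prompts[m] for m in present], dim=0)

    # Data tokens: real embeddings where available, dummy tokens otherwise,
    # so the data portion of the sequence keeps a fixed layout.
    tokens = [embeddings[m] if m in present else dummy_tokens[m] for m in prompts]

    return torch.cat([composite] + tokens, dim=0)  # (L_total, d)
```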
The core motivations are:
- Exponential prompt reduction: from $2^M - 1$ to $M$ prompts.
- Robustness to arbitrary missing patterns: Each modality’s prompt is always trained when its modality is present, so new test-time missingness patterns are guaranteed to be covered.
- Parameter and compute efficiency: Only a small set of trainable vectors for prompts and minimal task heads are required; encoder weights are frozen.
2. Prompt Architecture, Orthogonality, and Loss Formulation
Prompts are high-dimensional trainable tokens. For each modality $m$, define $P_m \in \mathbb{R}^{L_p \times d}$, where $L_p$ is the prompt length and $d$ the model’s hidden state dimension. Input token embeddings for each modality are concatenated after the composite prompt, producing an input sequence $[P;\, E_{m_1}; \dots; E_{m_M}]$, where $E_m$ denotes the token embeddings (or placeholder tokens) of modality $m$.
To ensure modality-specificity (i.e., minimal redundancy between prompts), an orthogonality constraint is imposed. For two modalities (e.g., image and text) with prompts $P_{\text{img}}$ and $P_{\text{txt}}$, flatten each prompt into a vector ($p_{\text{img}}, p_{\text{txt}} \in \mathbb{R}^{L_p d}$) and calculate $\mathcal{L}_{\text{orth}} = \frac{\lvert p_{\text{img}}^{\top} p_{\text{txt}} \rvert}{\lVert p_{\text{img}} \rVert\, \lVert p_{\text{txt}} \rVert + \epsilon}$, with a small $\epsilon$ for numerical stability. This loss is backpropagated jointly with the main task loss (cross-entropy or binary cross-entropy), producing the total training objective $\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda\, \mathcal{L}_{\text{orth}}$, where $\lambda$ determines the regularization strength (Jang et al., 2023). This constraint ensures diversity in learned feature representations across modalities.
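A minimal PyTorch sketch of this regularizer, treating the penalty as the absolute cosine similarity between flattened prompts as written above; `eps` and `lam` are illustrative values.

```python
import torch

def orthogonality_loss(prompt_a, prompt_b, eps=1e-8):
    """Absolute cosine similarity between two flattened modality prompts."""
    a, b = prompt_a.flatten(), prompt_b.flatten()
    return (a @ b).abs() / (a.norm() * b.norm() + eps)

def total_loss(task_loss, prompt_img, prompt_txt, lam=0.1):
    """Joint objective: task loss plus weighted prompt-orthogonality penalty."""
    return task_loss + lam * orthogonality_loss(prompt_img, prompt_txt)
```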
3. Training Protocols and Optimization
Backbones (image/text encoders, transformer/fusion modules) remain frozen during prompt training. Only the prompts, sequence pooler, and the classifier head are learnable, substantially reducing parameter count. AdamW is typically used (with the learning rate, weight decay, and batch size of 6 reported in (Jang et al., 2023)), and curriculum/augmentation includes randomly “dropping” modalities per mini-batch, ensuring uniform training coverage over all missingness patterns. Batches are constructed to contain fully observed samples as well as all single-modality and mixture cases.
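The selective-training protocol can be outlined as follows; the `model` with `backbone`, `prompts`, `pooler`, and `classifier` attributes and the drop probability are assumptions made for illustration, not the published recipe.

```python
import random
import torch

def configure_optimizer(model):
    # Freeze the pretrained encoders and fusion transformer.
    for p in model.backbone.parameters():
        p.requires_grad = False
    # Only prompts, the sequence pooler, and the classifier head are trained.
    trainable = (list(model.prompts.parameters())
                 + list(model.pooler.parameters())
                 + list(model.classifier.parameters()))
    return torch.optim.AdamW(trainable)  # set lr / weight decay per the chosen recipe

def drop_modalities(present, drop_prob=0.3):
    """Randomly hide modalities for a mini-batch sample, keeping at least one."""
    kept = [m for m in present if random.random() > drop_prob]
    return kept or [random.choice(list(present))]
```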
Prompt ablations confirm robustness to hyperparameters: prompt lengths of 8–36 suffice, multi-layer injection (early in the transformer) outperforms single-layer injection, and even short prompts deliver significant robustness gains (Lee et al., 2023, Liang et al., 2022).
4. Comparative Analysis: MSP vs. MAP and Parameter-Efficient Alternatives
A crucial distinction between MAPs and MSPs lies in scaling and generalization:
- Prompt count: MAPs introduce $2^M - 1$ prompts, becoming prohibitive with increased modalities; MSPs grow only linearly ($M$ prompts).
- Robustness: MAPs fail on unseen missingness patterns, as only those encountered during training correspond to trained prompts. MSPs inherently cover the space of any observable modalities.
- Parameter efficiency: Directly, MSPs’ prompt-parameter cost is $O(M \cdot L_p \cdot d)$, versus $O((2^M - 1) \cdot L_p \cdot d)$ for MAPs. Evidence-based parameter-efficient alternatives such as EPE-P further factorize or share prompts across missing-modality cases, reducing the prompt-parameter cost below that of maintaining independent prompts (Chen et al., 23 Dec 2024).
Empirically, MSPs show consistent improvements (1–6 points absolute over MAPs, 4–11 over frozen baselines), and reduced performance variance (increased robustness) across a range of missing-modality scenarios (Jang et al., 2023, Chen et al., 23 Dec 2024).
| Approach | Prompt Count | Robustness to Unseen Missing | Example (M=3) |
|---|---|---|---|
| MAP | $2^M - 1$ | No | 7 prompts |
| MSP | $M$ | Yes | 3 prompts |
| EPE-P | shared / factorized | Yes | shared params |
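The scaling gap in the table can be reproduced with a one-line count per scheme:

```python
# Prompt counts per scheme as the number of modalities M grows.
for M in range(2, 7):
    print(f"M={M}: MAP needs {2 ** M - 1} prompts, MSP needs {M}")
```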
5. Advanced MSP Mechanisms and Applications
The MSP framework is versatile across domains and tasks:
- Continual Learning with Catastrophic Forgetting Mitigation: Prompt partitioning into modality-specific, task-aware, and task-specific prompts, with contrastive losses for inter-task interaction, supports robust continual adaptation without replay or catastrophic forgetting. Prompt freezing after each task isolates adaptation for new streams (Guo et al., 1 Mar 2025).
- Cross-Modal Generalization and Fusion: Per-modality prompts can be integrated with modality-common prompts and dynamic inter-layer (correlated) prompts to allow both sample-specific and global fusion, utilizing regularized architectures that maintain parameter efficiency while capturing complementary cross-modal semantics (Hu et al., 9 Oct 2024).
- Task- and Instance-Conditional Adaptation: By leveraging dynamic routing, mapped prompts, and mixtures of prompt experts, the architecture can produce highly expressive, instance- and context-adaptive fusion (as in Conditional Prompt Tuning; (Jiang et al., 2023)).
- Medical Image Translation and Domain Generalization: Content-conditioned prompt extraction and fusion blocks (e.g., in MedPrompt) dynamically assemble per-input prompt tensors, yielding state-of-the-art image synthesis across MRI/CT/PET modalities (Chen et al., 2023).
- Class-Incremental Recognition with Analytic Models: Prompt pools per modality, with an analytic (recursive least-squares) solution for the linear classifier on top of the frozen backbone, enable robust multi-modal CIL even with missing modalities at test time (Yue et al., 16 Jan 2025); a generic form of this update is sketched below.
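For the analytic-classifier variant, a standard Woodbury-based recursive least-squares update for a linear head on frozen features looks roughly as follows; this is a generic sketch, not the exact formulation of Yue et al., and `reg` and the shapes are illustrative.

```python
import torch

def rls_init(feat_dim, num_classes, reg=1.0):
    """Start with W = 0 and R = (reg * I)^-1, the regularized inverse autocorrelation."""
    W = torch.zeros(feat_dim, num_classes)
    R = torch.eye(feat_dim) / reg
    return W, R

def rls_update(W, R, X, Y):
    """Absorb a new batch of frozen features X (n, d) and one-hot labels Y (n, c)."""
    K = torch.eye(X.shape[0]) + X @ R @ X.T                 # (n, n) gain matrix
    R_new = R - R @ X.T @ torch.linalg.solve(K, X @ R)      # Woodbury identity
    W_new = W + R_new @ X.T @ (Y - X @ W)                   # closed-form correction
    return W_new, R_new
```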
6. Representative Results, Empirical Observations, and Limitations
MSP demonstrates clear empirical advantages. On MM-IMDb (70% missing), MSP yields F1-macro of 38.34 vs. MAPs at 36.89 and ViLT at 34.26. MSP’s robustness is evidenced by smaller standard deviations across all missing-pattern evaluations. Orthogonality regularization further improves results—e.g., F1 rises from 34.90 to 36.97 with regularization (Jang et al., 2023).
Ablations and additional studies indicate:
- Prompt diversity via orthogonality is critical.
- Even without large-scale finetuning, MSPs can match or exceed fine-tuned performance in low-resource regimes (Liang et al., 2022).
- Prompt length and layer placement are robust hyperparameters.
- Limitations include the need to generalize inter-prompt orthogonality beyond two modalities, tuning of prompt length and regularization weights, and occasional underperformance in large-data regimes compared to full-parameter finetuning.
Open questions involve scaling MSPs to larger numbers of modalities, optimal curriculum strategies for missingness, and multi-modal dynamic prompt selection (Jang et al., 2023).
7. Implementation Guidance and Practical Impact
Implementing MSP in a modern transformer-based multimodal system involves the following steps (a consolidated code sketch follows the list):
- Declaring trainable prompt tensors for each modality.
- Constructing the input sequence as $[P_{m_1}; \dots; P_{m_k};\, E_{m_1}; \dots; E_{m_M}]$, i.e., the concatenated prompts of the observed modalities followed by the per-modality token embeddings.
- Handling missing modalities via dummy/placeholder tokens.
- Adding and optimizing an orthogonality loss $\mathcal{L}_{\text{orth}}$, weighted by $\lambda$, alongside the task loss.
- Training with only the prompts and the top-level classifier/fusion head updatable.
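Putting these steps together, a compact end-to-end sketch is given below, assuming a generic frozen `backbone(seq)` callable that returns token states with a CLS-style first token; all names and shapes are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class MSPClassifier(nn.Module):
    def __init__(self, backbone, modalities, prompt_len, hidden_dim, num_classes):
        super().__init__()
        self.backbone = backbone                      # frozen pretrained transformer
        for p in self.backbone.parameters():
            p.requires_grad = False
        # One learnable prompt tensor per modality (linear in M).
        self.prompts = nn.ParameterDict({
            m: nn.Parameter(torch.randn(prompt_len, hidden_dim) * 0.02)
            for m in modalities
        })
        # Dummy tokens stand in for the embeddings of missing modalities.
        self.dummy = nn.ParameterDict({
            m: nn.Parameter(torch.zeros(1, hidden_dim)) for m in modalities
        })
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, embeddings, present):
        # embeddings: dict modality -> (N_m, hidden_dim); present: observed modalities
        composite = torch.cat([self.prompts[m] for m in present], dim=0)
        tokens = [embeddings[m] if m in present else self.dummy[m]
                  for m in self.prompts]
        seq = torch.cat([composite] + tokens, dim=0).unsqueeze(0)  # (1, L, d)
        pooled = self.backbone(seq)[:, 0]             # simplified CLS-style pooling
        return self.classifier(pooled)
```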
MSP is a highly parameter- and compute-efficient solution for real-world, unreliable, and missing-modality settings. Its principles underlie state-of-the-art robust adaptation and continual learning methods in multimodal classification, image translation, video tracking, and more, with strong empirical support across diverse benchmarks (Jang et al., 2023, Chen et al., 23 Dec 2024, Hu et al., 9 Oct 2024, Chen et al., 2023, Guo et al., 1 Mar 2025, Liang et al., 2022, Lee et al., 2023).
References
- Towards Robust Multimodal Prompting With Missing Modalities (Jang et al., 2023)
- EPE-P: Evidence-based Parameter-efficient Prompting for Multimodal Learning with Missing Modalities (Chen et al., 23 Dec 2024)
- Modular and Parameter-Efficient Multimodal Fusion with Prompting (Liang et al., 2022)
- MedPrompt: Cross-Modal Prompting for Multi-Task Medical Image Translation (Chen et al., 2023)
- Efficient Prompting for Continual Adaptation to Missing Modalities (Guo et al., 1 Mar 2025)
- Deep Correlated Prompting for Visual Recognition with Missing Modalities (Hu et al., 9 Oct 2024)
- Multimodal Prompting with Missing Modalities for Visual Recognition (Lee et al., 2023)
- Conditional Prompt Tuning for Multimodal Fusion (Jiang et al., 2023)