
Modality-Common Prompting (MCP)

Updated 20 November 2025
  • MCP is an architectural framework that dynamically isolates modality-common features from modality-specific cues to improve cross-modal learning.
  • It employs techniques such as instance normalization, adaptive masking, and shared prompt matrices to enhance generalization and mitigate catastrophic forgetting.
  • MCP offers a parameter-efficient alternative for robust applications in multimodal tasks like visible-infrared person re-identification and scenarios with missing modalities.

Modality-Common Prompting (MCP) is an architectural and algorithmic framework designed to efficiently extract, inject, and preserve information shared across multiple input modalities in deep learning systems, particularly transformer-based networks. MCP aims to maximize cross-modal generalization and mitigate catastrophic forgetting by dynamically isolating and highlighting features that are genuinely invariant or common across data modalities, while suppressing or disentangling modality-specific cues. This approach underpins recent advances in fields such as lifelong multimodal learning, visible-infrared person re-identification, and multimodal robustness with missing data, offering a parameter-efficient alternative to traditional modality-specific or missing-pattern-specific prompting.

1. Conceptual Overview and Motivation

MCP explicitly targets the problem of inter-modal knowledge interference that occurs when modality-common (e.g., shape, pose) and modality-specific (e.g., color, texture) information is entangled within learned representations. In sequential or lifelong settings—such as visible-infrared person re-identification (VI-LReID)—existing prompt-based approaches often blend these sources of information into a monolithic parameter pool, exacerbating catastrophic forgetting and degrading performance when modalities shift or are missing. MCP corrects for this by learning a dedicated, dynamically purified set of prompt vectors (or a prompt matrix), designed to exclusively encode features that co-exist across all relevant modalities, ensuring that the backbone model is steered by stable, invariant factors regardless of modality drift (Cui et al., 19 Nov 2025, Chen et al., 23 Dec 2024, Liang et al., 2022).

In multimodal fusion and robustness to missing modalities, MCP’s unified treatment—using a single global prompt space, optionally modulated by small per-modality weights—reduces parameter redundancy and improves generalization, as the entire prompt subspace is exposed to the joint distribution of all available modalities during training (Chen et al., 23 Dec 2024, Liang et al., 2022).

2. MCP Architectures Across Modalities and Tasks

MCP has been instantiated in several influential paradigms:

  • In CKDA for VI-LReID, MCP sits before a ViT-B/16 backbone and processes each input image through tokenization, instance normalization (to erase style), adaptive masking via two-layer MLP gates, and fusion of the normalized and original features, finally writing the result to a set of learnable prompt tokens $\bm{k}_{com}$. These are then injected into the transformer's attention stream as common-factor tokens (Cui et al., 19 Nov 2025).
  • In EPE-P for multimodal learning with missing modalities, MCP maintains a single global prompt matrix $B$, partitioned into $m \times m$ blocks for $m$ modalities. When modalities are missing, small per-modality prompt-weight matrices $A_{M_i}$ select (via block-wise multiplication) the relevant sub-prompt for the current missing-pattern scenario, yielding a parameter-efficient construction that precludes the exponential blow-up of missing-aware prompts (Chen et al., 23 Dec 2024).
  • In PromptFuse/BlindPrompt, MCP is realized as a shared prompt embedding matrix $P$ concatenated with frozen per-modality encoder outputs before being processed by a downstream LLM. No per-modality parameters are introduced, and only $P$ is trained, ensuring extreme modularity and parameter economy (Liang et al., 2022).
| MCP Variant | Parameterization | Core Injection Point |
|---|---|---|
| CKDA MCP | Instance norm + adaptive fusion + MLP | Prepend to ViT backbone tokens |
| EPE-P MCP | Global prompt + per-modality weights | Prepend to ViLT/transformer layers |
| PromptFuse MCP | Shared prompt matrix $P$ | Prepend to PLM and modality tokens |

3. Mathematical Formulation and Feature Isolation

The essence of MCP is the explicit mathematical disentanglement of modality-common information.

For CKDA:

  • After tokenization and embedding, a feature $\bm{x}_{ori}$ is instance-normalized to yield $\bm{x}_{in}$. MLP-based masks $e^o$ and $e^i$ are computed:

$$e^o = \sigma\big(W_2^o\,\delta(W_1^o \bm{x}_{ori})\big), \qquad e^i = \sigma\big(W_2^i\,\delta(W_1^i \bm{x}_{in})\big)$$

  • Fused output:

$$\bm{x}_{com} = e^o \odot \bm{x}_{ori} + (1 - e^o) \odot \big(e^i \odot \bm{x}_{in}\big)$$

  • Decoded to final prompt tokens:

$$\bm{k}_{com} = \mathcal{E}_{pc}\big(\delta(\text{patch}(\bm{x}_{com}))\big)$$
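
The following is a minimal PyTorch sketch of this pipeline. The layer widths, the pooling step, and the stand-in for the prompt encoder $\mathcal{E}_{pc}$ are hypothetical; CKDA's exact module may differ (Cui et al., 19 Nov 2025).

```python
import torch
import torch.nn as nn


class ModalityCommonPrompt(nn.Module):
    """Sketch of an MCP-style common-feature prompt extractor (hypothetical sizes)."""

    def __init__(self, dim=768, hidden=192, num_prompts=4):
        super().__init__()
        # Instance normalization removes per-sample "style" (color, sensor statistics).
        self.inorm = nn.InstanceNorm1d(dim, affine=False)
        # Two-layer MLP gates e^o and e^i with sigmoid outputs (adaptive masks).
        self.gate_o = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, dim), nn.Sigmoid())
        self.gate_i = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, dim), nn.Sigmoid())
        # Stand-in for the prompt encoder E_pc: maps the fused feature to prompt tokens.
        self.prompt_enc = nn.Linear(dim, num_prompts * dim)
        self.num_prompts, self.dim = num_prompts, dim

    def forward(self, x_ori):                                      # x_ori: (B, N, D) patch embeddings
        x_in = self.inorm(x_ori.transpose(1, 2)).transpose(1, 2)   # style-erased features x_in
        e_o = self.gate_o(x_ori)                                   # adaptive mask on the original branch
        e_i = self.gate_i(x_in)                                    # adaptive mask on the normalized branch
        x_com = e_o * x_ori + (1 - e_o) * (e_i * x_in)             # fusion, as in the equation above
        pooled = torch.relu(x_com.mean(dim=1))                     # collapse patches before decoding
        k_com = self.prompt_enc(pooled).view(-1, self.num_prompts, self.dim)
        return k_com                                               # common prompt tokens to prepend


k_com = ModalityCommonPrompt()(torch.randn(2, 196, 768))           # -> (2, 4, 768)
```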

For EPE-P:

  • For the missing-modality set $\mathcal{M}_{miss}$, sum the prompt weights of the missing modalities:

$$A_{\text{sum}} = \sum_{i=1}^{m} \mathbf{1}_{\mathcal{M}_{\text{miss}}}(M_i)\, A_{M_i}$$

  • Use block-wise Kronecker-like multiplication:

$$P = A_{\text{sum}} \divideontimes B$$

where $B$ is block-decomposed and each block factorized as $B_{ij} = u_{ij} v_{ij}^\top$.

  • The resulting prompt $P$ is prepended to the transformer encoder's input sequence.
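
A hedged PyTorch sketch of this composition is given below. It treats each entry of $A_{\text{sum}}$ as a scalar weight on the corresponding rank-1 block of $B$, a simplification of the paper's block-wise operator $\divideontimes$; the shapes and block layout are assumptions (Chen et al., 23 Dec 2024).

```python
import torch
import torch.nn as nn


class EPEPromptComposer(nn.Module):
    """Sketch of EPE-P-style prompt composition from a factorized global prompt (assumed shapes)."""

    def __init__(self, num_modalities=2, block_len=4, dim=768, rank=1):
        super().__init__()
        m = num_modalities
        # Each of the m*m blocks of B is stored in factorized form B_ij = u_ij v_ij^T.
        self.u = nn.Parameter(torch.randn(m, m, block_len, rank) * 0.02)
        self.v = nn.Parameter(torch.randn(m, m, dim, rank) * 0.02)
        # One small weight matrix A_{M_i} per modality, used when modality M_i is missing.
        self.A = nn.Parameter(torch.randn(m, m, m) * 0.02)
        self.m, self.block_len, self.dim = m, block_len, dim

    def forward(self, missing):                       # missing: list of missing-modality indices
        # A_sum = sum of the weight matrices of all currently missing modalities.
        A_sum = self.A[missing].sum(dim=0) if missing else torch.ones(self.m, self.m)
        blocks = []
        for i in range(self.m):
            row = []
            for j in range(self.m):
                B_ij = self.u[i, j] @ self.v[i, j].T  # (block_len, dim) low-rank block
                row.append(A_sum[i, j] * B_ij)        # block-wise weighting (simplified)
            blocks.append(torch.cat(row, dim=0))
        return torch.cat(blocks, dim=0)               # prompt P to prepend to the encoder


P = EPEPromptComposer()(missing=[1])                  # prompt for "modality 1 missing"
```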

For PromptFuse:

  • A shared prompt matrix $P \in \mathbb{R}^{N \times d}$ is concatenated as

$$H^0 = \text{concat}(P, E_{m_1}, \dots, E_{m_K})$$

where each $E_{m_i}$ is an embedding sequence from the frozen encoder $f_{m_i}$.
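
A minimal sketch of this fusion step, assuming a shared embedding width $d$ across encoders; the encoder outputs below are random stand-ins for the frozen $f_{m_i}$ (Liang et al., 2022).

```python
import torch
import torch.nn as nn

# Only the shared prompt P is trainable; the per-modality encoders and the PLM stay frozen.
N, d = 20, 768
P = nn.Parameter(torch.randn(N, d) * 0.02)          # shared, modality-common prompt

def fuse(modality_embeddings):
    """Prepend the shared prompt to the frozen encoders' output sequences (H^0)."""
    batch = modality_embeddings[0].shape[0]
    prompt = P.unsqueeze(0).expand(batch, -1, -1)    # (B, N, d)
    return torch.cat([prompt] + list(modality_embeddings), dim=1)

E_vision = torch.randn(2, 50, d)                     # stand-in for a frozen vision encoder output
E_text = torch.randn(2, 32, d)                       # stand-in for a frozen text encoder output
H0 = fuse([E_vision, E_text])                        # (2, 102, d), fed to the frozen PLM
```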

This formalism ensures that only modality-invariant factors are represented in $\bm{k}_{com}$ or $P$, allowing multi-source data to be leveraged with minimal cross-modality interference.

4. Optimization Strategies and Loss Formulations

MCP parameters are integrated into global training objectives alongside other module parameters:

  • CKDA Optimization: The overall loss comprises standard classification and triplet components ($\mathcal{L}_{ce}$, $\mathcal{L}_{trip}$), a prompting loss $\mathcal{L}_p$ that penalizes deviation of the current prompt from its earlier stage (to reduce forgetting),

$$\mathcal{L}_p = \big\|\bm{k}_p^{m,(s)} - \bm{k}_p^{m,(s-1)}\big\|_1$$

and alignment losses from the CKA module that propagate gradients through the backbone into MCP (Cui et al., 19 Nov 2025).
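
A small sketch of the prompting loss follows, assuming the previous-stage prompt is kept frozen as the anchor; the weighting of the full CKDA objective is not reproduced here.

```python
import torch

def prompting_loss(k_curr, k_prev):
    """L_p = ||k^(s) - k^(s-1)||_1 : penalize drift of the prompt across lifelong stages."""
    return (k_curr - k_prev.detach()).abs().sum()

k_prev = torch.randn(4, 768)                                       # prompt frozen after stage s-1
k_curr = torch.nn.Parameter(k_prev + 0.01 * torch.randn(4, 768))   # prompt trained at stage s
l_p = prompting_loss(k_curr, k_prev)
# total = L_ce + L_trip + l_p + L_align   (terms combined per the CKDA objective above)
```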

  • EPE-P Optimization: The total loss is an Evidence-based Loss

$$L = (1-\lambda)\, L_{eb} + \lambda\, L_{KL}$$

combining Dirichlet-based evidence regularization and KL-divergence to mitigate overconfident predictions under uncertainty induced by missing modalities (Chen et al., 23 Dec 2024).
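
As a concrete but hedged illustration of this form, the sketch below uses a standard evidential-deep-learning construction: softplus evidence parameterizes a Dirichlet, the expected cross-entropy serves as $L_{eb}$, and a KL term to the uniform Dirichlet regularizes misleading evidence. EPE-P's exact $L_{eb}$ and $L_{KL}$ may differ from this stand-in.

```python
import torch
import torch.nn.functional as F

def kl_to_uniform_dirichlet(alpha):
    """KL( Dir(alpha) || Dir(1) ): penalizes evidence placed on non-target classes."""
    K = alpha.shape[-1]
    S = alpha.sum(-1, keepdim=True)
    return (torch.lgamma(S.squeeze(-1)) - torch.lgamma(alpha).sum(-1)
            - torch.lgamma(torch.tensor(float(K)))
            + ((alpha - 1.0) * (torch.digamma(alpha) - torch.digamma(S))).sum(-1))

def evidence_based_loss(logits, y_onehot, lam=0.1):
    """(1 - lam) * L_eb + lam * L_KL on Dirichlet parameters derived from the logits."""
    alpha = F.softplus(logits) + 1.0                    # evidence -> Dirichlet parameters
    S = alpha.sum(-1, keepdim=True)
    l_eb = (y_onehot * (torch.digamma(S) - torch.digamma(alpha))).sum(-1)
    alpha_tilde = y_onehot + (1.0 - y_onehot) * alpha   # keep only non-target evidence for the KL
    l_kl = kl_to_uniform_dirichlet(alpha_tilde)
    return ((1.0 - lam) * l_eb + lam * l_kl).mean()

loss = evidence_based_loss(torch.randn(8, 3), F.one_hot(torch.randint(0, 3, (8,)), 3).float())
```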

  • PromptFuse Optimization: Standard cross-entropy loss is used; only the common prompt is trainable, with frozen backbone and per-modality encoders (Liang et al., 2022).

5. Empirical Results and Parameter Efficiency

Across benchmarks and scenarios, MCP exhibits robust performance improvement and model efficiency.

  • CKDA Ablation (VI-LReID, Table 3):

| Configuration | mAP | R-1 |
|---|---|---|
| Base (no prompts) | 31.8 | 33.9 |
| + MCP only | 33.4 | 35.2 |
| + MCP + MSP | 34.6 | 37.4 |
| + CKA only | 34.9 | 37.9 |
| Full CKDA | 36.3 | 39.4 |

A gain of +1.6 mAP and +1.3 R-1 over the base configuration is attributable to MCP's explicit separation of common features (Cui et al., 19 Nov 2025).

  • EPE-P (50–60% random missing modalities):

| Method | MM-IMDb F1-Macro | Hateful Memes AUROC |
|---|---|---|
| ViLT | ∼39–42 | ∼60–63 |
| MAP | ∼43–44 | ∼62–65 |
| EPE-P (full) | ∼46–48 | ∼64–67 |

EPE-P attains improvements of +2.60 and +3.23 points as the missing rate increases, with additional gains from its evidence-based loss (Chen et al., 23 Dec 2024).

  • PromptFuse Parameter Comparison:

| Fusion Method | # Trainable Parameters |
|---|---|
| Finetune VE | 86M |
| JointProj | 1M |
| PromptFuse | 15K |
| BlindPrompt | 15K |

MCP’s parameter overhead is minimal, often less than 0.1% of baseline finetuning or dense adapters, while supporting modular expansion (new modalities added with no retraining of previous prompts).
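
As a quick sanity check on that figure, the arithmetic below assumes a prompt of roughly 20 tokens in a 768-dimensional model, which is consistent with the ~15K row above; the exact prompt length is an assumption.

```python
# Back-of-the-envelope overhead of a shared prompt versus full finetuning.
N, d = 20, 768                           # assumed prompt length and embedding width
prompt_params = N * d                    # 15,360 ~= the "15K" row in the table
backbone_params = 86_000_000             # "Finetune VE" row (ViT-B-sized encoder)
print(prompt_params / backbone_params)   # ~0.00018, i.e. well under 0.1 % of finetuning
```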

6. Interactions with Complementary Modules and Practical Variants

In composite architectures, MCP typically interfaces with modality-specific prompting (MSP) and cross-modal alignment (CKA) modules:

  • CKDA: MCP yields $\bm{k}_{com}$, MSP yields per-modality prompts $\bm{k}_{spe}^m$; the sum is injected as the full prompt, enabling both common feature learning and modality adaptation (see the injection sketch after this list). CKA separately aligns these representations in prototype space to maintain independence and consistency (Cui et al., 19 Nov 2025).
  • Scalability and Robustness: MCP has been demonstrated across 2- and 3-modality problems (vision-language, vision-language-audio) and is robust to varied prompt lengths, injection positions, and PLM backbones. Qualitative analysis demonstrates that MCP tokens strongly attend to modality-invariant attributes (contours, body shape), whereas MSP or per-modality parameters capture style, color, or sensor artifacts (Cui et al., 19 Nov 2025, Liang et al., 2022).
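
A minimal sketch of how such a composite injection might look, with illustrative tensor shapes; the number of prompt tokens and the exact ordering relative to the [CLS] token are assumptions.

```python
import torch

k_com = torch.randn(2, 4, 768)              # modality-common prompt tokens from MCP
k_spe = torch.randn(2, 4, 768)              # modality-specific prompt tokens from MSP
patch_tokens = torch.randn(2, 197, 768)     # [CLS] + 196 patch embeddings from the backbone

full_prompt = k_com + k_spe                 # "the sum is injected as the full prompt"
vit_input = torch.cat([full_prompt, patch_tokens], dim=1)   # (2, 201, 768) attention stream
```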

7. Limitations, Open Challenges, and Future Directions

Despite its efficiency and generality, MCP has recognized limitations:

  • Scalability to many modalities: EPE-P's unified prompt extraction and PromptFuse's single-prompt instantiation have been empirically validated primarily for bi-modal and tri-modal tasks. Extension to a larger number of modalities ($m > 2$) is a prominent direction (Chen et al., 23 Dec 2024).
  • Prompt composition: Current block-wise multiplication and weight summation schemes may be suboptimal for complex modality-absent patterns. Future work suggests learned attention or gating could generalize prompt selection beyond linear or additive mechanisms (Chen et al., 23 Dec 2024).
  • Uncertainty modeling: The adoption of evidence-based losses formalizes uncertainty calibration under missing information but remains limited. Richer uncertainty quantification (e.g., hierarchical Dirichlet, meta-learning to adapt prompt usage rates) offers further promise (Chen et al., 23 Dec 2024).

A plausible implication is that, as prompt-based fusion and disentanglement spread to higher-order multimodal settings and continual learning, MCP will require integration with more dynamic prompt selection, alignment strategies, and advanced regularization to fully realize its parameter and generalization advantages across broader machine perception tasks.


References:

  • Cui et al. (19 Nov 2025). CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification.
  • Chen et al. (23 Dec 2024). EPE-P: Evidence-based Parameter-efficient Prompting for Multimodal Learning with Missing Modalities.
  • Liang et al. (2022). Modular and Parameter-Efficient Multimodal Fusion with Prompting.
