
Modality-Common Prompting (MCP)

Updated 20 November 2025
  • MCP is an architectural framework that dynamically isolates modality-common features from modality-specific cues to improve cross-modal learning.
  • It employs techniques such as instance normalization, adaptive masking, and shared prompt matrices to enhance generalization and mitigate catastrophic forgetting.
  • MCP offers a parameter-efficient alternative for robust applications in multimodal tasks like visible-infrared person re-identification and scenarios with missing modalities.

Modality-Common Prompting (MCP) is an architectural and algorithmic framework designed to efficiently extract, inject, and preserve information shared across multiple input modalities in deep learning systems, particularly transformer-based networks. MCP aims to maximize cross-modal generalization and mitigate catastrophic forgetting by dynamically isolating and highlighting features that are genuinely invariant or common across data modalities, while suppressing or disentangling modality-specific cues. This approach underpins recent advances in fields such as lifelong multimodal learning, visible-infrared person re-identification, and multimodal robustness with missing data, offering a parameter-efficient alternative to traditional modality-specific or missing-pattern-specific prompting.

1. Conceptual Overview and Motivation

MCP explicitly targets the problem of inter-modal knowledge interference that occurs when modality-common (e.g., shape, pose) and modality-specific (e.g., color, texture) information is entangled within learned representations. In sequential or lifelong settings—such as visible-infrared person re-identification (VI-LReID)—existing prompt-based approaches often blend these sources of information into a monolithic parameter pool, exacerbating catastrophic forgetting and degrading performance when modalities shift or are missing. MCP corrects for this by learning a dedicated, dynamically purified set of prompt vectors (or a prompt matrix), designed to exclusively encode features that co-exist across all relevant modalities, ensuring that the backbone model is steered by stable, invariant factors regardless of modality drift (Cui et al., 19 Nov 2025, Chen et al., 23 Dec 2024, Liang et al., 2022).

In multimodal fusion and robustness to missing modalities, MCP’s unified treatment—using a single global prompt space, optionally modulated by small per-modality weights—reduces parameter redundancy and improves generalization, as the entire prompt subspace is exposed to the joint distribution of all available modalities during training (Chen et al., 23 Dec 2024, Liang et al., 2022).

2. MCP Architectures Across Modalities and Tasks

MCP has been instantiated in several influential paradigms:

  • In CKDA for VI-LReID, MCP sits before a ViT-B/16 backbone and processes each input image through tokenization, instance normalization (to erase style), adaptive masking via two-layer MLP gates, and fusion of the normalized and original features, finally writing the result to a set of learnable prompt tokens $\bm{k}_{com}$. These are then injected into the transformer's attention stream as common-factor tokens (Cui et al., 19 Nov 2025).
  • In EPE-P for multimodal learning with missing modalities, MCP maintains a single global prompt matrix $B$, partitioned into $m \times m$ blocks for $m$ modalities. When modalities are missing, small per-modality prompt-weight matrices $A_{M_i}$ select (via block-wise multiplication) the relevant sub-prompt for the current missing-pattern scenario, yielding a parameter-efficient construction that precludes the exponential blow-up of missing-aware prompts (Chen et al., 23 Dec 2024).
  • In PromptFuse/BlindPrompt, MCP is realized as a shared prompt embedding matrix $P$ concatenated with frozen per-modality encoder outputs before being processed by a downstream LLM. No per-modality parameters are introduced, and only $P$ is trained, ensuring extreme modularity and parameter economy (Liang et al., 2022).
| MCP Variant | Parameterization | Core Injection Point |
|---|---|---|
| CKDA MCP | Instance norm + adaptive fusion + MLP | Prepend to ViT backbone tokens |
| EPE-P MCP | Global prompt + per-modality weights | Prepend to ViLT/transformer layers |
| PromptFuse MCP | Shared prompt matrix $P$ | Prepend to PLM and modality tokens |

3. Mathematical Formulation and Feature Isolation

The essence of MCP is the explicit mathematical disentanglement of modality-common information.

For CKDA:

  • After tokenization and embedding, a feature $\bm{x}_{ori}$ is instance-normalized to yield $\bm{x}_{in}$. MLP-based masks $e^o$ and $e^i$ are computed:

$$e^o = \sigma\big(W_2^o\,\delta(W_1^o \bm{x}_{ori})\big), \qquad e^i = \sigma\big(W_2^i\,\delta(W_1^i \bm{x}_{in})\big)$$

  • Fused output:

$$\bm{x}_{com} = e^o \odot \bm{x}_{ori} + (1 - e^o) \odot \big(e^i \odot \bm{x}_{in}\big)$$

  • Decoded to final prompt tokens:

$$\bm{k}_{com} = \mathcal{E}_{pc}\big(\delta(\text{patch}(\bm{x}_{com}))\big)$$
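
The following is a minimal PyTorch sketch of this pipeline. The layer widths, the pooling step, and the stand-in for the prompt encoder $\mathcal{E}_{pc}$ are hypothetical; CKDA's exact module may differ (Cui et al., 19 Nov 2025).

```python
import torch
import torch.nn as nn


class ModalityCommonPrompt(nn.Module):
    """Sketch of an MCP-style common-feature prompt extractor (hypothetical sizes)."""

    def __init__(self, dim=768, hidden=192, num_prompts=4):
        super().__init__()
        # Instance normalization removes per-sample "style" (color, sensor statistics).
        self.inorm = nn.InstanceNorm1d(dim, affine=False)
        # Two-layer MLP gates e^o and e^i with sigmoid outputs (adaptive masks).
        self.gate_o = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, dim), nn.Sigmoid())
        self.gate_i = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, dim), nn.Sigmoid())
        # Stand-in for the prompt encoder E_pc: maps the fused feature to prompt tokens.
        self.prompt_enc = nn.Linear(dim, num_prompts * dim)
        self.num_prompts, self.dim = num_prompts, dim

    def forward(self, x_ori):                                      # x_ori: (B, N, D) patch embeddings
        x_in = self.inorm(x_ori.transpose(1, 2)).transpose(1, 2)   # style-erased features x_in
        e_o = self.gate_o(x_ori)                                   # adaptive mask on the original branch
        e_i = self.gate_i(x_in)                                    # adaptive mask on the normalized branch
        x_com = e_o * x_ori + (1 - e_o) * (e_i * x_in)             # fusion, as in the equation above
        pooled = torch.relu(x_com.mean(dim=1))                     # collapse patches before decoding
        k_com = self.prompt_enc(pooled).view(-1, self.num_prompts, self.dim)
        return k_com                                               # common prompt tokens to prepend


k_com = ModalityCommonPrompt()(torch.randn(2, 196, 768))           # -> (2, 4, 768)
```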

For EPE-P:

  • For the missing-modality set $\mathcal{M}_{miss}$, sum the prompt weights of the missing modalities:

$$A_{\text{sum}} = \sum_{i=1}^{m} \mathbf{1}_{\mathcal{M}_{\text{miss}}}(M_i)\, A_{M_i}$$

  • Use block-wise Kronecker-like multiplication:

$$P = A_{\text{sum}} \divideontimes B$$

where $B$ is block-decomposed and each block factorized as $B_{ij} = u_{ij} v_{ij}^\top$.

  • The resulting prompt $P$ is prepended to the transformer encoder's input sequence.
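
A hedged PyTorch sketch of this composition is given below. It treats each entry of $A_{\text{sum}}$ as a scalar weight on the corresponding rank-1 block of $B$, a simplification of the paper's block-wise operator $\divideontimes$; the shapes and block layout are assumptions (Chen et al., 23 Dec 2024).

```python
import torch
import torch.nn as nn


class EPEPromptComposer(nn.Module):
    """Sketch of EPE-P-style prompt composition from a factorized global prompt (assumed shapes)."""

    def __init__(self, num_modalities=2, block_len=4, dim=768, rank=1):
        super().__init__()
        m = num_modalities
        # Each of the m*m blocks of B is stored in factorized form B_ij = u_ij v_ij^T.
        self.u = nn.Parameter(torch.randn(m, m, block_len, rank) * 0.02)
        self.v = nn.Parameter(torch.randn(m, m, dim, rank) * 0.02)
        # One small weight matrix A_{M_i} per modality, used when modality M_i is missing.
        self.A = nn.Parameter(torch.randn(m, m, m) * 0.02)
        self.m, self.block_len, self.dim = m, block_len, dim

    def forward(self, missing):                       # missing: list of missing-modality indices
        # A_sum = sum of the weight matrices of all currently missing modalities.
        A_sum = self.A[missing].sum(dim=0) if missing else torch.ones(self.m, self.m)
        blocks = []
        for i in range(self.m):
            row = []
            for j in range(self.m):
                B_ij = self.u[i, j] @ self.v[i, j].T  # (block_len, dim) low-rank block
                row.append(A_sum[i, j] * B_ij)        # block-wise weighting (simplified)
            blocks.append(torch.cat(row, dim=0))
        return torch.cat(blocks, dim=0)               # prompt P to prepend to the encoder


P = EPEPromptComposer()(missing=[1])                  # prompt for "modality 1 missing"
```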

For PromptFuse:

  • A shared prompt matrix $P \in \mathbb{R}^{N \times d}$ is concatenated as

$$H^0 = \text{concat}(P, E_{m_1}, \dots, E_{m_K})$$

where each $E_{m_i}$ is an embedding sequence from the frozen encoder $f_{m_i}$.
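
A minimal sketch of this fusion step, assuming a shared embedding width $d$ across encoders; the encoder outputs below are random stand-ins for the frozen $f_{m_i}$ (Liang et al., 2022).

```python
import torch
import torch.nn as nn

# Only the shared prompt P is trainable; the per-modality encoders and the PLM stay frozen.
N, d = 20, 768
P = nn.Parameter(torch.randn(N, d) * 0.02)          # shared, modality-common prompt

def fuse(modality_embeddings):
    """Prepend the shared prompt to the frozen encoders' output sequences (H^0)."""
    batch = modality_embeddings[0].shape[0]
    prompt = P.unsqueeze(0).expand(batch, -1, -1)    # (B, N, d)
    return torch.cat([prompt] + list(modality_embeddings), dim=1)

E_vision = torch.randn(2, 50, d)                     # stand-in for a frozen vision encoder output
E_text = torch.randn(2, 32, d)                       # stand-in for a frozen text encoder output
H0 = fuse([E_vision, E_text])                        # (2, 102, d), fed to the frozen PLM
```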

This formalism ensures that only modality-invariant factors are represented in $\bm{k}_{com}$ or $P$, allowing multi-source data to be leveraged with minimal cross-modality interference.

4. Optimization Strategies and Loss Formulations

MCP parameters are integrated into global training objectives alongside other module parameters:

  • CKDA Optimization: The overall loss comprises standard classification and triplet components ($\mathcal{L}_{ce}$, $\mathcal{L}_{trip}$), a prompting loss $\mathcal{L}_p$ that penalizes deviation of the current prompt from its earlier stage (to reduce forgetting),

$$\mathcal{L}_p = \big\|\bm{k}_p^{m,(s)} - \bm{k}_p^{m,(s-1)}\big\|_1$$

and alignment losses from the CKA module that propagate gradients through the backbone into MCP (Cui et al., 19 Nov 2025).
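
A small sketch of the prompting loss follows, assuming the previous-stage prompt is kept frozen as the anchor; the weighting of the full CKDA objective is not reproduced here.

```python
import torch

def prompting_loss(k_curr, k_prev):
    """L_p = ||k^(s) - k^(s-1)||_1 : penalize drift of the prompt across lifelong stages."""
    return (k_curr - k_prev.detach()).abs().sum()

k_prev = torch.randn(4, 768)                                       # prompt frozen after stage s-1
k_curr = torch.nn.Parameter(k_prev + 0.01 * torch.randn(4, 768))   # prompt trained at stage s
l_p = prompting_loss(k_curr, k_prev)
# total = L_ce + L_trip + l_p + L_align   (terms combined per the CKDA objective above)
```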

  • EPE-P Optimization: The total loss is an Evidence-based Loss

$$L = (1-\lambda)\, L_{eb} + \lambda\, L_{KL}$$

combining Dirichlet-based evidence regularization and KL-divergence to mitigate overconfident predictions under uncertainty induced by missing modalities (Chen et al., 23 Dec 2024).
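
As a concrete but hedged illustration of this form, the sketch below uses a standard evidential-deep-learning construction: softplus evidence parameterizes a Dirichlet, the expected cross-entropy serves as $L_{eb}$, and a KL term to the uniform Dirichlet regularizes misleading evidence. EPE-P's exact $L_{eb}$ and $L_{KL}$ may differ from this stand-in.

```python
import torch
import torch.nn.functional as F

def kl_to_uniform_dirichlet(alpha):
    """KL( Dir(alpha) || Dir(1) ): penalizes evidence placed on non-target classes."""
    K = alpha.shape[-1]
    S = alpha.sum(-1, keepdim=True)
    return (torch.lgamma(S.squeeze(-1)) - torch.lgamma(alpha).sum(-1)
            - torch.lgamma(torch.tensor(float(K)))
            + ((alpha - 1.0) * (torch.digamma(alpha) - torch.digamma(S))).sum(-1))

def evidence_based_loss(logits, y_onehot, lam=0.1):
    """(1 - lam) * L_eb + lam * L_KL on Dirichlet parameters derived from the logits."""
    alpha = F.softplus(logits) + 1.0                    # evidence -> Dirichlet parameters
    S = alpha.sum(-1, keepdim=True)
    l_eb = (y_onehot * (torch.digamma(S) - torch.digamma(alpha))).sum(-1)
    alpha_tilde = y_onehot + (1.0 - y_onehot) * alpha   # keep only non-target evidence for the KL
    l_kl = kl_to_uniform_dirichlet(alpha_tilde)
    return ((1.0 - lam) * l_eb + lam * l_kl).mean()

loss = evidence_based_loss(torch.randn(8, 3), F.one_hot(torch.randint(0, 3, (8,)), 3).float())
```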

  • PromptFuse Optimization: Standard cross-entropy loss is used; only the common prompt is trainable, with frozen backbone and per-modality encoders (Liang et al., 2022).

5. Empirical Results and Parameter Efficiency

Across benchmarks and scenarios, MCP exhibits robust performance improvement and model efficiency.

  • CKDA Ablation (VI-LReID, Table 3):

| Configuration | mAP | R-1 |
|---|---|---|
| Base (no prompts) | 31.8 | 33.9 |
| + MCP only | 33.4 | 35.2 |
| + MCP + MSP | 34.6 | 37.4 |
| + CKA only | 34.9 | 37.9 |
| Full CKDA | 36.3 | 39.4 |

A gain of +1.6 mAP and +1.3 R-1 over the base configuration is attributable to MCP's explicit separation of common features (Cui et al., 19 Nov 2025).

  • EPE-P (50–60% random missing modalities):

| Method | MM-IMDb F1-Macro | Hateful Memes AUROC |
|---|---|---|
| ViLT | ∼39–42 | ∼60–63 |
| MAP | ∼43–44 | ∼62–65 |
| EPE-P (full) | ∼46–48 | ∼64–67 |

EPE-P attains improvements of +2.60 and +3.23 points as the missing rate increases, with additional gains from its evidence-based loss (Chen et al., 23 Dec 2024).

  • PromptFuse Parameter Comparison:

| Fusion Method | # Trainable Parameters |
|---|---|
| Finetune VE | 86M |
| JointProj | 1M |
| PromptFuse | 15K |
| BlindPrompt | 15K |

MCP’s parameter overhead is minimal, often less than 0.1% of baseline finetuning or dense adapters, while supporting modular expansion (new modalities added with no retraining of previous prompts).
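
As a quick sanity check on that figure, the arithmetic below assumes a prompt of roughly 20 tokens in a 768-dimensional model, which is consistent with the ~15K row above; the exact prompt length is an assumption.

```python
# Back-of-the-envelope overhead of a shared prompt versus full finetuning.
N, d = 20, 768                           # assumed prompt length and embedding width
prompt_params = N * d                    # 15,360 ~= the "15K" row in the table
backbone_params = 86_000_000             # "Finetune VE" row (ViT-B-sized encoder)
print(prompt_params / backbone_params)   # ~0.00018, i.e. well under 0.1 % of finetuning
```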

6. Interactions with Complementary Modules and Practical Variants

In composite architectures, MCP typically interfaces with modality-specific prompting (MSP) and cross-modal alignment (CKA) modules:

  • CKDA: MCP yields $\bm{k}_{com}$, MSP yields per-modality prompts $\bm{k}_{spe}^m$; the sum is injected as the full prompt, enabling both common feature learning and modality adaptation (see the injection sketch after this list). CKA separately aligns these representations in prototype space to maintain independence and consistency (Cui et al., 19 Nov 2025).
  • Scalability and Robustness: MCP has been demonstrated across 2- and 3-modality problems (vision-language, vision-language-audio) and is robust to varied prompt lengths, injection positions, and PLM backbones. Qualitative analysis demonstrates that MCP tokens strongly attend to modality-invariant attributes (contours, body shape), whereas MSP or per-modality parameters capture style, color, or sensor artifacts (Cui et al., 19 Nov 2025, Liang et al., 2022).
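
A minimal sketch of how such a composite injection might look, with illustrative tensor shapes; the number of prompt tokens and the exact ordering relative to the [CLS] token are assumptions.

```python
import torch

k_com = torch.randn(2, 4, 768)              # modality-common prompt tokens from MCP
k_spe = torch.randn(2, 4, 768)              # modality-specific prompt tokens from MSP
patch_tokens = torch.randn(2, 197, 768)     # [CLS] + 196 patch embeddings from the backbone

full_prompt = k_com + k_spe                 # "the sum is injected as the full prompt"
vit_input = torch.cat([full_prompt, patch_tokens], dim=1)   # (2, 201, 768) attention stream
```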

7. Limitations, Open Challenges, and Future Directions

Despite its efficiency and generality, MCP has recognized limitations:

  • Scalability to many modalities: EPE-P's unified prompt extraction and PromptFuse's single-prompt instantiation have been empirically validated primarily for bi-modal and tri-modal tasks. Extension to a larger number of modalities ($m > 2$) is a prominent direction (Chen et al., 23 Dec 2024).
  • Prompt composition: Current block-wise multiplication and weight summation schemes may be suboptimal for complex modality-absent patterns. Future work suggests learned attention or gating could generalize prompt selection beyond linear or additive mechanisms (Chen et al., 23 Dec 2024).
  • Uncertainty modeling: The adoption of evidence-based losses formalizes uncertainty calibration under missing information but remains limited. Richer uncertainty quantification (e.g., hierarchical Dirichlet, meta-learning to adapt prompt usage rates) offers further promise (Chen et al., 23 Dec 2024).

A plausible implication is that, as prompt-based fusion and disentanglement spread to higher-order multimodal settings and continual learning, MCP will require integration with more dynamic prompt selection, alignment strategies, and advanced regularization to fully realize its parameter and generalization advantages across broader machine perception tasks.


References:

  • Cui et al. (19 Nov 2025). CKDA: Cross-modality Knowledge Disentanglement and Alignment for Visible-Infrared Lifelong Person Re-identification.
  • Chen et al. (23 Dec 2024). EPE-P: Evidence-based Parameter-efficient Prompting for Multimodal Learning with Missing Modalities.
  • Liang et al. (2022). Modular and Parameter-Efficient Multimodal Fusion with Prompting.
