Partitioned Multi-modal Prompting (PMPO)

Updated 7 April 2026

PMPO is a parameter-efficient prompt-based adaptation technique that partitions prompts across modalities, depth, and clients to mitigate exponential parameter growth.
It injects independent, orthogonally regularized prompt matrices into frozen transformer backbones, dynamically composing them to handle missing modalities and diverse task requirements.
Reported results show improved accuracy with fewer trainable prompt parameters on several benchmarks, along with robust performance in federated and multi-task settings.

Partitioned Multi-modal Prompting (PMPO) refers to a class of parameter-efficient prompt-based adaptation techniques designed for multimodal architectures, especially vision-language and multimodal transformers. PMPO aims to address three core challenges in scalable, robust multimodal adaptation: efficient prompt parameterization in the presence of an exponential number of modality/missingness cases, capture of diverse modality- and layer-specific cues, and transferability across varied downstream tasks and data regimes. Central to PMPO is the notion of “partitioning” along the modality, feature, task, depth, or client axis, with independent or orthogonally regularized prompts that can be dynamically assembled or composed at inference.

1. Core Formalism and Architectural Principles

PMPO operates on a backbone consisting of frozen multimodal transformer stacks (e.g., ViLT, CLIP). Let $M$ denote the number of modalities. The partitioned prompt strategy assigns a distinct learnable prompt matrix $P_i\in\mathbb{R}^{L\times d}$ for each modality $i\in\{1,\dots,M\}$, with $L$ the prompt length and $d$ the hidden size (Jang et al., 2023). At inference:

If only modality $i$ is present, $P_i$ is prepended to the corresponding token stream.
If multiple modalities are present, a “complete” prompt is constructed by elementwise summation of present modality prompts:

$P_{\rm complete} = \sum_{i\in \text{present}} P_i.$

Only the prompt embeddings and a lightweight classifier or projection head are updated during adaptation; all backbone and modality encoders are frozen.
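The composition rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the modality names, the sizes `L` and `d`, and the helper `compose_prompt` are all hypothetical, and random matrices stand in for learned prompts and token embeddings.

```python
import numpy as np

def compose_prompt(prompts, present):
    """Sum the prompt matrices of the modalities that are present."""
    if not present:
        raise ValueError("at least one modality must be present")
    return np.sum([prompts[m] for m in present], axis=0)

L, d = 4, 8  # hypothetical prompt length and hidden size
rng = np.random.default_rng(0)

# One learnable (here: random) prompt matrix per modality.
prompts = {name: rng.normal(size=(L, d)) for name in ("image", "text")}

# A single present modality uses its own prompt unchanged.
p_text_only = compose_prompt(prompts, ["text"])

# Several present modalities are combined by elementwise summation.
p_complete = compose_prompt(prompts, ["image", "text"])

# The composed prompt is prepended to the (frozen) token stream.
tokens = rng.normal(size=(10, d))
model_input = np.concatenate([p_complete, tokens], axis=0)
```

Because the composed prompt always has shape $L\times d$, the input length seen by the frozen backbone does not depend on how many modalities are present.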

Alternative partitioning schemes apply prompts by depth (assigning a unique prompt to each block or stage within the encoder) (Tian et al., 2023), by client (federated PMPO with inter/intra-client pooling) (Phung et al., 6 Feb 2026), or by semantic factor (scene/distortion in Blind IQA) (Pan et al., 2024).

Orthogonality regularization is frequently imposed between modality-specific prompt matrices to enhance diversity and ensure non-redundant representation:

$L_{\rm ortho} = \sum_{i<j} \frac{|f(P_i)\cdot f(P_j)|}{\max(\|f(P_i)\|_2 \|f(P_j)\|_2,\;\epsilon)},$

where $f$ denotes vectorization and $\epsilon$ is a small constant (Jang et al., 2023).
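As a rough sketch, the orthogonality penalty can be computed as the sum of absolute cosine similarities between flattened prompt matrices. The function name `ortho_loss` and the default value of `eps` are assumptions made for illustration; a training implementation would use an autodiff framework rather than plain NumPy.

```python
import numpy as np
from itertools import combinations

def ortho_loss(prompts, eps=1e-8):
    """Sum of |cosine similarity| over all pairs i < j of flattened prompts."""
    total = 0.0
    for p_i, p_j in combinations(prompts, 2):
        v_i, v_j = p_i.ravel(), p_j.ravel()  # f(P): vectorization
        denom = max(np.linalg.norm(v_i) * np.linalg.norm(v_j), eps)
        total += abs(float(v_i @ v_j)) / denom
    return total

a = np.array([[1.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 1.0], [0.0, 0.0]])

loss_orthogonal = ortho_loss([a, b])  # 0.0: no shared direction
loss_identical = ortho_loss([a, a])   # 1.0: fully redundant pair
```

The `eps` clamp in the denominator only matters when a prompt is (near) zero; it keeps the ratio finite in that case.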

2. Comparison to Missing-Aware and Parameter-Intensive Prompting

Traditional multimodal prompt-tuning, such as “missing-aware” designs, allocates a dedicated prompt for every non-empty subset of available modalities, leading to a parameter count of $(2^M-1)\,L\,d$, growing exponentially with $M$ (Jang et al., 2023). PMPO’s partitioning into $M$ modality-specific prompts results in linear scaling ($M\,L\,d$), reducing the parameter count to a fraction $M/(2^M-1)$ of the missing-aware total.

For instance, with three modalities the number of prompts drops from 7 to 3, a saving of roughly 57%. In federated contexts, prompt pools are partitioned for inter- vs. intra-client patterns and iteratively aggregated or clustered, further optimizing efficiency (Phung et al., 6 Feb 2026).
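The scaling argument is easy to check numerically. The sketch below assumes that a missing-aware design needs one prompt per non-empty subset of modalities, which matches the 7-versus-3 figure quoted above. The function names are illustrative only.

```python
def missing_aware_prompts(m):
    """One prompt per non-empty subset of m modalities."""
    return 2 ** m - 1

def partitioned_prompts(m):
    """One prompt per modality."""
    return m

def savings(m):
    """Fraction of prompts saved by partitioning."""
    return 1 - partitioned_prompts(m) / missing_aware_prompts(m)

for m in (2, 3, 4):
    full, part = missing_aware_prompts(m), partitioned_prompts(m)
    print(f"M={m}: {full} -> {part} prompts ({savings(m):.0%} fewer)")
# M=2: 3 -> 2 prompts (33% fewer)
# M=3: 7 -> 3 prompts (57% fewer)
# M=4: 15 -> 4 prompts (73% fewer)
```

Since every prompt has the same $L\times d$ shape, the same ratios apply to the number of prompt parameters.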

3. Integrations and Extensions: Depth, Task, and Semantic Partitioning

PMPO extends beyond modality-based partitioning:

Depth partitioning: Distinct prompts are injected at each stage of the visual encoder (e.g. ViT blocks) (Tian et al., 2023). This enables hierarchical conditioning, with lower-layer prompts encoding fine texture and higher-layer prompts enriching semantic abstraction.
Task and group partitioning: For multi-task settings, prompts are partitioned by grouped task affinity (via gradient clustering) and further fine-tuned per task, with alignment losses enforcing consistent representations across vision and language (Xin et al., 2023).
Semantic partitioning: In BIQA, separate text-branch prompts are optimized for scene versus distortion recognition, and visual deep prompts are learned per layer (Pan et al., 2024). This structure promotes both specificity (scene/distortion cues) and compositionality (joint fusion).
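Depth partitioning can be illustrated with a toy forward pass in which each frozen layer receives its own prompt. This is a sketch under stated assumptions, not the method of any cited paper: the `make_frozen_layer` stand-in replaces real transformer blocks, and discarding the prompt outputs after every layer (so that each stage sees only its own prompt) is a design choice borrowed from deep prompt tuning generally.

```python
import numpy as np

def make_frozen_layer(weight):
    """Toy frozen block: mean-pooled context mixing plus a fixed projection."""
    def layer(x):
        context = x.mean(axis=0, keepdims=True)  # crude stand-in for attention
        return np.tanh((x + context) @ weight)
    return layer

def forward_with_depth_prompts(tokens, layers, depth_prompts):
    """Run frozen layers, prepending a layer-specific prompt before each one."""
    if len(layers) != len(depth_prompts):
        raise ValueError("need exactly one prompt per layer")
    n_tokens = tokens.shape[0]
    x = tokens
    for layer, prompt in zip(layers, depth_prompts):
        x = layer(np.concatenate([prompt, x], axis=0))
        x = x[-n_tokens:]  # drop prompt outputs; the next layer gets its own prompt
    return x

L, d, depth = 2, 8, 3  # hypothetical sizes
rng = np.random.default_rng(0)
layers = [make_frozen_layer(rng.normal(size=(d, d)) / np.sqrt(d)) for _ in range(depth)]
depth_prompts = [rng.normal(size=(L, d)) for _ in range(depth)]  # the only trainable part
tokens = rng.normal(size=(10, d))

out = forward_with_depth_prompts(tokens, layers, depth_prompts)
```

Only `depth_prompts` would receive gradients in training; the layer weights stay fixed, in line with the frozen-backbone setup described in this article.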

4. Training Protocols, Regularization, and Optimization

Across PMPO instances, the backbone (visual, textual encoders, and transformer blocks) remains frozen throughout fine-tuning, with only prompt parameters and output heads trained (Jang et al., 2023, Tian et al., 2023, Pan et al., 2024).

Objectives combine:

Classification or regression loss (e.g., cross-entropy, L1/smooth L1 for quality scores),
Orthogonality or alignment constraints on prompts (to enforce diversity or semantic closeness),
Task/group alignment (e.g., via Euclidean or cosine similarity between visual/language features in shared embedding space).

Sampling protocols during training frequently include random modality-dropping, promoting inference robustness to arbitrary missingness (Jang et al., 2023).
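Random modality-dropping can be sketched as a small sampling helper. The drop probability `p_drop`, the rule of always keeping at least one modality, and the sample layout are assumptions made for illustration; the cited work may use a different schedule.

```python
import random

def drop_modalities(sample, p_drop=0.3, rng=random):
    """Randomly drop modalities from a sample, always keeping at least one."""
    names = list(sample)
    kept = [m for m in names if rng.random() >= p_drop]
    if not kept:
        kept = [rng.choice(names)]  # never produce an empty sample
    return {m: sample[m] for m in kept}

rng = random.Random(0)  # seeded for reproducibility
sample = {"image": "img_0.png", "text": "a caption"}  # hypothetical sample
augmented = [drop_modalities(sample, p_drop=0.7, rng=rng) for _ in range(5)]
```

Each element of `augmented` contains a non-empty subset of the original modalities, so during training the model sees the same example under different missingness patterns.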

In federated PMPO, partitioned prompt pools are updated locally; inter-client prompts are clustered/aligned, and intra-client prompts are globally averaged (FedAvg). Client-side selection is via learned query/key metrics, ensuring only the relevant prompt subset is used for each instance (Phung et al., 6 Feb 2026).
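The two federated ingredients named above, FedAvg-style averaging of a prompt pool and query/key prompt selection, can be sketched as follows. This is an illustrative simplification: the pool layout, the cosine-similarity scoring rule, and the `top_k` parameter are assumptions, and the clustering/alignment step for the other pool is omitted.

```python
import numpy as np

def fedavg(client_pools, client_sizes):
    """Average per-client prompt pools, weighted by client dataset size."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    stacked = np.stack(client_pools)               # (clients, pool, L, d)
    return np.tensordot(weights, stacked, axes=1)  # (pool, L, d)

def select_prompts(query, keys, top_k=2):
    """Indices of the top_k prompts whose keys are most cosine-similar to the query."""
    q = query / max(np.linalg.norm(query), 1e-12)
    k = keys / np.maximum(np.linalg.norm(keys, axis=1, keepdims=True), 1e-12)
    scores = k @ q
    return np.argsort(-scores)[:top_k]

# Two clients, each holding a pool of 3 prompts of shape (L=2, d=4).
pool_a = np.ones((3, 2, 4))
pool_b = 3.0 * np.ones((3, 2, 4))
global_pool = fedavg([pool_a, pool_b], client_sizes=[1, 3])  # every entry is 2.5

# One key per prompt in the pool; the query would come from the input instance.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
chosen = select_prompts(np.array([1.0, 0.1]), keys, top_k=2)  # prompts 0 and 2
```

Selecting by key similarity means each instance only pays for the `top_k` prompts it uses, rather than for the whole pool.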

5. Empirical Results and Comparative Benchmarks

PMPO variants are reported to outperform both standard frozen-backbone and missing-aware prompt-tuning baselines:

Task/Domain	Baseline	PMPO Variant	Metric / Gain
MM-IMDb, 70% missing	Frozen ViLT: 34.8	PMPO: 39.4	F1-Macro, +4.6 (avg)
UPMC Food-101, 70% missing	Frozen ViLT: 53.9	PMPO: 65.3	Accuracy, +11.4% (avg)
Office-Home (MTL)	Full FT: 86.1%	Group+Task-PMPO: 86.5%	Acc, only 0.09% parameters
Live IQA (PLCC/SRCC)	DBCNN: 0.971/0.968	PMPO: 0.980/0.978	SOTA on all standard splits
Federated UPMC Food-101	Centralized: 68.2	FED-PRIME: 80.4	Miss-both test acc, +64.48%
Multi-dataset Harmonic Mean	CoOp: 71.66	PMPO: 79.27	New-class gen., +7.62%

Consistent robustness is observed in held-out and unseen missing-combination splits, with missing-aware baselines degrading sharply outside trained missing cases (Jang et al., 2023, Phung et al., 6 Feb 2026). In BIQA with partitioned prompts, ablations demonstrate clear performance drops when omitting scene/distortion/visual prompt branches (Pan et al., 2024).

6. Design Ablations, Limitations, and Future Directions

Ablation studies show that:

Orthogonality and group/task alignment losses yield incremental but consistent improvements in accuracy and generalization (Jang et al., 2023, Xin et al., 2023).
Increasing the number of prompts by depth or semantics yields gains that plateau, or slightly overfit, beyond a certain threshold for PMPO in vision-language transfer (Tian et al., 2023).
Hierarchical or deep visual/text prompting offers tangible benefit in domain and cross-dataset generalization; omitting deep prompts reduces both in-domain and cross-domain scores.

Limitations include linear computational cost in the number of prompt partitions, increased inference latency with depth or semantic partitioning, and a need for more shots in the few-shot regime to robustly learn multiple prompts (Tian et al., 2023). Current methods treat prompt assignment statically; future methods may explore dynamic or adaptive prompt selection per instance, as well as distillation into compact expert-prompt sets.

PMPO stands as a scalable alternative to exhaustive prompt enumeration and a functional enhancement of soft-prompt transfer learning for multimodal transformers. It has been extended to federated and multi-task learning via prompt pools, combinatorial alignment, and clustering. Its principles are consistent with information-theoretic arguments favoring diversity and non-redundancy in representational bottlenecks, and its empirical gains support the conjecture that structured prompt partitioning can extract richer and more robust cross-modal representations (Jang et al., 2023, Xin et al., 2023, Phung et al., 6 Feb 2026, Tian et al., 2023, Pan et al., 2024).

A plausible implication is that the partitioned prompt paradigm is foundational for scalable and robust parameter-efficient adaptation in the presence of modality missingness, cross-task transfer, and federated heterogeneity.