
Modality Experience Pool (MEP)

Updated 15 November 2025
  • Modality Experience Pool (MEP) is a dynamic key–value memory module that archives past text and visual embeddings with rationales to enhance stance detection.
  • It employs cosine similarity-based retrieval and dual-process reasoning to selectively fuse modality signals using parameters like α and threshold τ.
  • Its update mechanism uses in-context chain-of-thought prompting to iteratively refine stored heuristics, improving multimodal decision-making.

The Modality Experience Pool (MEP) is a central component within the ReMoD framework for multimodal stance detection, functioning as a learned, dynamically updating key–value memory that encodes “what worked in the past” when combining multiple modalities (primarily text and vision) for stance analysis. MEP is specifically designed to optimize the weighting of modalities by leveraging historical experience, thus allowing the stance detection model to adaptively amplify or attenuate signals from different modalities based on contextual reliability, as learned through prior correct decisions.

1. Structural Definition and Embedding Scheme

At any stage of training or inference, the MEP consists of M entries, each represented as E_j = (k_u^j, k_V^j; v_j). Here, k_u^j \in \mathbb{R}^d denotes the embedding produced by the BGE-M3 text encoder for the text modality, k_V^j \in \mathbb{R}^d denotes the output of the MLLM image encoder for the vision modality, and v_j is a natural-language rationale summarizing the heuristic or reasoning that led to a correct stance judgment in a historical instance. The dimensionality d aligns with the respective backbone encoders, typically 1,024 or 2,048, thus maintaining architectural interoperability and minimizing additional projection overhead.

This structure ensures that each past instance contributing to the pool is directly indexed by its semantic and visual key representations, along with an LLM-generated rationale that captures domain-specific decision strategies beyond simple label imitation.
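As a concrete illustration, an MEP entry can be modeled as a small key–value record; the field names, the 1,024-dimensional keys, and the example rationale below are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Experience:
    """One MEP entry E_j = (k_u^j, k_V^j; v_j)."""
    k_u: np.ndarray  # text-key embedding (e.g. from BGE-M3), shape (d,)
    k_V: np.ndarray  # vision-key embedding (e.g. from the MLLM image encoder), shape (d,)
    v: str           # natural-language rationale from a past correct decision

d = 1024  # assumed to match the backbone encoder dimensionality
pool: list[Experience] = []

# Archive one past decision as a new entry.
pool.append(Experience(
    k_u=np.zeros(d, dtype=np.float32),
    k_V=np.zeros(d, dtype=np.float32),
    v="When text and image conflict, prioritize the broader negative context.",
))
```

The rationale string `v` is what later gets spliced into in-context prompts, while the two key vectors are used only for retrieval.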

2. Role in Dual-Process Reasoning: Intuitive and Reflective Stages

The MEP operationalizes the “experience-driven intuitive reasoning” phase of ReMoD’s dual-reasoning paradigm. For a new input comprising text u and segmented visual cues V = \{\bar{v}_1, \ldots, \bar{v}_N\}, the model first encodes

q_u = f_u(u) \in \mathbb{R}^d, \quad q_V = \frac{1}{N} \sum_{i=1}^N f_V(\bar{v}_i) \in \mathbb{R}^d,

where f_u and f_V are the BGE-M3 and MLLM encoders, respectively. Each historical experience E_j in the MEP is scored via a weighted blend of cosine similarities:

S_u^j = \cos(q_u, k_u^j), \quad S_V^j = \cos(q_V, k_V^j), \quad S(E_j) = \alpha S_u^j + (1 - \alpha) S_V^j,

with \alpha \in [0, 1] balancing textual versus visual similarity. All E_j with S(E_j) \geq \tau are retained, and the top-k among them are selected.
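A minimal NumPy sketch of this retrieval step, assuming dictionary entries and toy 4-dimensional keys; the variable names mirror the formulas above but are not the authors' code:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity with a small epsilon for numerical safety."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(q_u, q_V, pool, alpha=0.7, tau=0.8, k=3):
    """Score each entry by S(E_j) = alpha*S_u^j + (1-alpha)*S_V^j,
    keep entries with S(E_j) >= tau, and return the top-k by score."""
    scored = [
        (alpha * cos(q_u, e["k_u"]) + (1 - alpha) * cos(q_V, e["k_V"]), e)
        for e in pool
    ]
    kept = [(s, e) for s, e in scored if s >= tau]
    kept.sort(key=lambda t: t[0], reverse=True)
    return [e for _, e in kept[:k]]

# Toy pool of 5 entries with random 4-dimensional keys.
rng = np.random.default_rng(0)
pool = [
    {"k_u": rng.standard_normal(4), "k_V": rng.standard_normal(4), "v": f"rule {i}"}
    for i in range(5)
]
# A query identical to entry 2 scores S = 1.0 and passes any threshold.
top = retrieve(pool[2]["k_u"], pool[2]["k_V"], pool, tau=0.99)
```

With a high threshold such as τ = 0.99, only near-duplicates of stored keys survive; lowering τ trades precision for recall over the pool.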

The values v_{j_1}, \ldots, v_{j_k} of the retrieved experiences are used as in-context exemplars for an LLM agent A_R during chain-of-thought (CoT) prompting, yielding initial stance hypotheses:

(\hat{y}_u, r_u), \quad (\hat{y}_V, r_V), \quad (\hat{y}_{uV}, r_{uV}),

for unimodal and fused contexts.
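The retrieved rationales can be spliced into the CoT prompt roughly as follows; the template wording, the `build_prompt` helper, and the use of an image caption as the visual input are illustrative placeholders, not the paper's actual prompts:

```python
def build_prompt(text: str, image_caption: str, rationales: list[str], context: str) -> str:
    """Assemble a chain-of-thought prompt for one context c in {u, V, uV}."""
    exemplars = "\n".join(f"- {r}" for r in rationales)
    parts = ["Past experience that led to correct stance judgments:", exemplars]
    if context in ("u", "uV"):
        parts.append(f"Text: {text}")
    if context in ("V", "uV"):
        parts.append(f"Image: {image_caption}")
    parts.append("Think step by step, then output a stance label and a rationale.")
    return "\n".join(parts)

prompt = build_prompt(
    text="I stan Putin... what a cringe move",
    image_caption="crowd at a street protest",
    rationales=["Prioritize negative terms over surface endorsements."],
    context="uV",
)
```

Running the agent once per context (u, V, and uV) with the same exemplars yields the three hypothesis–rationale pairs above.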

The reflective (“deliberate reflective reasoning”) phase operates through the Modality-CoT chain: once the ground-truth label y is obtained, a distilled Modality Insight I_m is produced, i.e. a concise summary of which modality was most reliable for this instance. The MEP is then updated in a manner that preserves or generalizes prior rationales based on their relevance and the new insight, always mediated by prompting rather than gradient-based updates.

3. Update Mechanisms and Memory Dynamics

The MEP update occurs as follows:

  • Novel instance (no prior match with S(E_j) \geq \tau):

Add a new entry (k_u = q_u, k_V = q_V; v = I_m) to the pool.

  • Existing Match:

For each E_{j_i} in the retrieved subset, generate an updated value v'_{j_i} by prompting the LLM:

v'_{j_i} = A_R(\text{“Given old experience } v_{j_i} \text{ and new insight } I_m\text{, rewrite a more general heuristic.”})

The key embeddings remain fixed; only the rationale is replaced.

This update mechanism eschews direct gradient-based modification of MEP content (there is no dedicated pool loss), decoupling pool-memory evolution from the end-to-end loss landscape. Cross-entropy supervision applies only to the final stance prediction layer.
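Putting the two update branches together, a minimal sketch might look like the following; the `rewrite` callable is a stub standing in for the LLM agent A_R, and the dictionary layout is an assumption:

```python
def update_pool(pool, q_u, q_V, retrieved, insight, rewrite):
    """MEP update: append a new entry when nothing relevant was retrieved,
    otherwise rewrite the rationales of the retrieved entries in place.
    `rewrite(old_v, insight)` stands in for the LLM agent A_R."""
    if not retrieved:
        # Novel instance: archive the query embeddings with the new insight.
        pool.append({"k_u": q_u, "k_V": q_V, "v": insight})
    else:
        # Existing match: keys stay fixed; only rationales are generalized.
        for entry in retrieved:
            entry["v"] = rewrite(entry["v"], insight)
    return pool

# Toy rewriter: simple concatenation in place of an LLM call.
merge = lambda old, new: f"{old} | {new}"

pool = [{"k_u": [1.0], "k_V": [0.0], "v": "trust the image"}]
update_pool(pool, [0.0], [1.0], retrieved=[], insight="trust negation cues", rewrite=merge)
update_pool(pool, [1.0], [0.0], retrieved=[pool[0]], insight="unless text is sarcastic", rewrite=merge)
```

The first call exercises the novel-instance branch (a new entry is appended); the second exercises the existing-match branch (the original rationale is generalized while its keys are untouched).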

High-Level Pseudocode

Let MEP ← ∅
for epoch = 1…E do
  for each training example (u, V, y) do
    // Encode & retrieve
    q_u ← f_u(u),  q_V ← (1/N) Σ_i f_V(v̄_i)
    compute S(E_j) = α·S_u^j + (1−α)·S_V^j for all E_j ∈ MEP
    R ← top-k of {E_j : S(E_j) ≥ τ}

    // Intuitive reasoning
    for c ∈ {u, V, (u, V)} do
      (ŷ_c, r_c) ← A_R(u, V, {v_j : E_j ∈ R}, context = c)
    end for

    // Reflective Modality-CoT
    derive Modality Insight I_m from {(ŷ_c, r_c)} versus ground truth y

    // Update MEP
    if R = ∅ then
      MEP.append((q_u, q_V; I_m))
    else
      for each E_j ∈ R do
        v_j ← A_R(“Merge old v_j with new insight I_m.”)
      end for
  end for
end for

At inference, only retrieval and LLM-driven hypothesis generation are performed; pool updates are omitted.

4. Hyperparameterization and Operational Behavior

Several hyperparameters govern MEP retrieval and update:

  • α (modality-fusion weight in S(E_j)): α = 0.7 is optimal; larger values overweight text, smaller values admit visual noise.
  • τ (relevance threshold for exemplar selection): ≈0.8 is optimal; below 0.5 it admits noise, above 0.9 it yields too few examples.
  • k (maximum number of retrieved exemplars): k = 3 is used; larger k increases coverage but yields diminishing returns and longer LLM prompts.

Tuning these values directly affects the model’s ability to recall useful prior experiences, filter noise, and maintain compact, informative in-context prompts.

5. Mechanism for Robust Modality Weighting

The core function of the MEP is to enable ReMoD to learn and recall nuanced heuristics that govern the relative reliability and expressive power of each modality. Through iterative retrieval and rationale updating, the framework adapts its stance inference not merely on current data but via historical reasoning sequences that have proven effective when text-visual cue conflicts arise.

For example, in a case where a tweet contains both apparent endorsement language (“I stan Putin”) and negative sentiment (“what a cringe move”), paired with a protest image, a traditional CoT may interpret the text as pro-Putin. By contrast, MEP retrieval surfaces prior resolutions where textual contradiction (“stan…cringe”) was best resolved by down-weighting superficial positive cues in favor of the broader negative context. The Modality Insight generated (e.g., “prioritize negative terms over surface endorsements”) is then generalized into the pool, so similar contradictions are robustly handled in future cases.

6. Deployment, Efficiency, and Limitations

During inference, the trained MEP acts as a static key–value store. Querying is computationally efficient, requiring only encoding new data into quq_u, qVq_V, and computing cosine similarities. Prompted LLM reasoning adapts based on retrieved exemplars, obviating the need for further training or explicit error correction on new data.

Limitations of MEP include reliance on the diversity and coverage of stored experiences; under-represented modality combinations may not yield robust generalizations. Since updates are LLM-mediated rather than gradient-based, the pace of adaptation is governed by in-context learning dynamics and the quality of rationale synthesis, rather than explicit optimization objectives. This architecture reflects a trade-off: improved interpretability and “meta-reasoned” adaptation, at the expense of slower convergence to optimal heuristics in domains with rapidly shifting modality distributions.

7. Significance in Multimodal Stance Detection

The MEP advances the state of the art in adaptive fusion for multimodal stance detection by realizing a feedback loop where model reasoning about modality contributions is continuously codified, expanded, and leveraged. It departs from prior approaches that treat modality fusion as static or uniformly weighted, enabling real-time rebalancing based on context-sensitive prior knowledge embodied in chain-of-thought rationales. Empirical results on the MMSD benchmark demonstrate significant performance gains and generalization capability, indicating the value of dynamic modality experience pooling in mitigating stance misunderstanding noise stemming from naïve modality combination (Wang et al., 8 Nov 2025).

A plausible implication is that MEP-like constructs could generalize to other tasks requiring dynamic modality integration, notably where ground-truth cues for modality reliability shift across contexts or over time.

