
Modality Experience Pool (MEP)

Updated 15 November 2025
  • Modality Experience Pool (MEP) is a dynamic key–value memory module that archives past text and visual embeddings with rationales to enhance stance detection.
  • It employs cosine similarity-based retrieval and dual-process reasoning to selectively fuse modality signals using parameters like α and threshold τ.
  • Its update mechanism uses in-context chain-of-thought prompting to iteratively refine stored heuristics, improving multimodal decision-making.

The Modality Experience Pool (MEP) is a central component within the ReMoD framework for multimodal stance detection, functioning as a learned, dynamically updating key–value memory that encodes “what worked in the past” when combining multiple modalities (primarily text and vision) for stance analysis. MEP is specifically designed to optimize the weighting of modalities by leveraging historical experience, thus allowing the stance detection model to adaptively amplify or attenuate signals from different modalities based on contextual reliability, as learned through prior correct decisions.

1. Structural Definition and Embedding Scheme

At any stage of training or inference, the MEP consists of M entries, each represented as E_j = (k_u^j, k_V^j; v_j). Here, k_u^j \in \mathbb{R}^d denotes the embedding produced by the BGE-M3 text encoder for the text modality, k_V^j \in \mathbb{R}^d denotes the output of the MLLM image encoder for the vision modality, and v_j is a natural-language rationale summarizing the heuristic or reasoning that led to a correct stance judgment in a historical instance. The dimensionality d aligns with the respective backbone encoders, typically 1,024 or 2,048, thus maintaining architectural interoperability and minimizing additional projection overhead.

This structure ensures that each past instance contributing to the pool is directly indexed by its semantic and visual key representations, along with an LLM-generated rationale that captures domain-specific decision strategies beyond simple label imitation.
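As a concrete illustration, an MEP entry can be modeled as a small key–value record; the field names, the 1,024-dimensional keys, and the example rationale below are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Experience:
    """One MEP entry E_j = (k_u^j, k_V^j; v_j)."""
    k_u: np.ndarray  # text-key embedding (e.g. from BGE-M3), shape (d,)
    k_V: np.ndarray  # vision-key embedding (e.g. from the MLLM image encoder), shape (d,)
    v: str           # natural-language rationale from a past correct decision

d = 1024  # assumed to match the backbone encoder dimensionality
pool: list[Experience] = []

# Archive one past decision as a new entry.
pool.append(Experience(
    k_u=np.zeros(d, dtype=np.float32),
    k_V=np.zeros(d, dtype=np.float32),
    v="When text and image conflict, prioritize the broader negative context.",
))
```

The rationale string `v` is what later gets spliced into in-context prompts, while the two key vectors are used only for retrieval.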

2. Role in Dual-Process Reasoning: Intuitive and Reflective Stages

The MEP operationalizes the “experience-driven intuitive reasoning” phase of ReMoD’s dual-reasoning paradigm. For a new input comprising text u and segmented visual cues V = \{\bar{v}_1, \ldots, \bar{v}_N\}, the model first encodes

q_u = f_u(u) \in \mathbb{R}^d, \quad q_V = \frac{1}{N} \sum_{i=1}^N f_V(\bar{v}_i) \in \mathbb{R}^d,

where f_u and f_V are the BGE-M3 and MLLM encoders, respectively. Each historical experience E_j in the MEP is scored via a weighted blend of cosine similarities:

S_u^j = \cos(q_u, k_u^j), \quad S_V^j = \cos(q_V, k_V^j), \quad S(E_j) = \alpha S_u^j + (1 - \alpha) S_V^j,

with \alpha \in [0, 1] balancing textual versus visual similarity. All E_j with S(E_j) \geq \tau are retained, and the top-k among them are selected.
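A minimal NumPy sketch of this retrieval step, assuming dictionary entries and toy 4-dimensional keys; the variable names mirror the formulas above but are not the authors' code:

```python
import numpy as np

def cos(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity with a small epsilon for numerical safety."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(q_u, q_V, pool, alpha=0.7, tau=0.8, k=3):
    """Score each entry by S(E_j) = alpha*S_u^j + (1-alpha)*S_V^j,
    keep entries with S(E_j) >= tau, and return the top-k by score."""
    scored = [
        (alpha * cos(q_u, e["k_u"]) + (1 - alpha) * cos(q_V, e["k_V"]), e)
        for e in pool
    ]
    kept = [(s, e) for s, e in scored if s >= tau]
    kept.sort(key=lambda t: t[0], reverse=True)
    return [e for _, e in kept[:k]]

# Toy pool of 5 entries with random 4-dimensional keys.
rng = np.random.default_rng(0)
pool = [
    {"k_u": rng.standard_normal(4), "k_V": rng.standard_normal(4), "v": f"rule {i}"}
    for i in range(5)
]
# A query identical to entry 2 scores S = 1.0 and passes any threshold.
top = retrieve(pool[2]["k_u"], pool[2]["k_V"], pool, tau=0.99)
```

With a high threshold such as τ = 0.99, only near-duplicates of stored keys survive; lowering τ trades precision for recall over the pool.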

The values v_{j_1}, \ldots, v_{j_k} of the retrieved experiences are used as in-context exemplars for an LLM agent A_R during chain-of-thought (CoT) prompting, yielding initial stance hypotheses:

(\hat{y}_u, r_u), \quad (\hat{y}_V, r_V), \quad (\hat{y}_{uV}, r_{uV}),

for unimodal and fused contexts.
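The retrieved rationales can be spliced into the CoT prompt roughly as follows; the template wording, the `build_prompt` helper, and the use of an image caption as the visual input are illustrative placeholders, not the paper's actual prompts:

```python
def build_prompt(text: str, image_caption: str, rationales: list[str], context: str) -> str:
    """Assemble a chain-of-thought prompt for one context c in {u, V, uV}."""
    exemplars = "\n".join(f"- {r}" for r in rationales)
    parts = ["Past experience that led to correct stance judgments:", exemplars]
    if context in ("u", "uV"):
        parts.append(f"Text: {text}")
    if context in ("V", "uV"):
        parts.append(f"Image: {image_caption}")
    parts.append("Think step by step, then output a stance label and a rationale.")
    return "\n".join(parts)

prompt = build_prompt(
    text="I stan Putin... what a cringe move",
    image_caption="crowd at a street protest",
    rationales=["Prioritize negative terms over surface endorsements."],
    context="uV",
)
```

Running the agent once per context (u, V, and uV) with the same exemplars yields the three hypothesis–rationale pairs above.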

The reflective (“deliberate reflective reasoning”) phase operates through the Modality-CoT chain: once the ground-truth label y is obtained, a distilled Modality Insight I_m is produced, i.e. a concise summary of which modality was most reliable for this instance. The MEP is then updated in a manner that preserves or generalizes prior rationales based on their relevance and the new insight, always mediated by prompting rather than gradient-based updates.

3. Update Mechanisms and Memory Dynamics

The MEP update occurs as follows:

  • Novel instance (no prior match with S(E_j) \geq \tau):

Add a new entry (k_u = q_u, k_V = q_V; v = I_m) to the pool.

  • Existing Match:

For each E_{j_i} in the retrieved subset, generate an updated value v'_{j_i} by prompting the LLM:

v'_{j_i} = A_R(\text{“Given old experience } v_{j_i} \text{ and new insight } I_m\text{, rewrite a more general heuristic.”})

The key embeddings remain fixed; only the rationale is replaced.

This update mechanism eschews direct gradient-based modification of MEP content (there is no dedicated pool loss), decoupling pool-memory evolution from the end-to-end loss landscape. Cross-entropy supervision applies only to the final stance prediction layer.
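Putting the two update branches together, a minimal sketch might look like the following; the `rewrite` callable is a stub standing in for the LLM agent A_R, and the dictionary layout is an assumption:

```python
def update_pool(pool, q_u, q_V, retrieved, insight, rewrite):
    """MEP update: append a new entry when nothing relevant was retrieved,
    otherwise rewrite the rationales of the retrieved entries in place.
    `rewrite(old_v, insight)` stands in for the LLM agent A_R."""
    if not retrieved:
        # Novel instance: archive the query embeddings with the new insight.
        pool.append({"k_u": q_u, "k_V": q_V, "v": insight})
    else:
        # Existing match: keys stay fixed; only rationales are generalized.
        for entry in retrieved:
            entry["v"] = rewrite(entry["v"], insight)
    return pool

# Toy rewriter: simple concatenation in place of an LLM call.
merge = lambda old, new: f"{old} | {new}"

pool = [{"k_u": [1.0], "k_V": [0.0], "v": "trust the image"}]
update_pool(pool, [0.0], [1.0], retrieved=[], insight="trust negation cues", rewrite=merge)
update_pool(pool, [1.0], [0.0], retrieved=[pool[0]], insight="unless text is sarcastic", rewrite=merge)
```

The first call exercises the novel-instance branch (a new entry is appended); the second exercises the existing-match branch (the original rationale is generalized while its keys are untouched).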

High-Level Pseudocode

Let MEP ← ∅
for epoch = 1…E do
  for each training example (u, V, y) do
    // Encode & retrieve
    q_u ← f_u(u),  q_V ← (1/N) Σ_i f_V(v̄_i)
    compute S(E_j) = α·S_u^j + (1−α)·S_V^j for all E_j ∈ MEP
    R ← top-k of {E_j : S(E_j) ≥ τ}

    // Intuitive reasoning
    for c ∈ {u, V, (u, V)} do
      (ŷ_c, r_c) ← A_R(u, V, {v_j : E_j ∈ R}, context = c)
    end for

    // Reflective Modality-CoT
    derive Modality Insight I_m from {(ŷ_c, r_c)} versus ground truth y

    // Update MEP
    if R = ∅ then
      MEP.append((q_u, q_V; I_m))
    else
      for each E_j ∈ R do
        v_j ← A_R(“Merge old v_j with new insight I_m.”)
      end for
  end for
end for

At inference, only retrieval and LLM-driven hypothesis generation are performed; pool updates are omitted.

4. Hyperparameterization and Operational Behavior

Several hyperparameters govern MEP retrieval and update:

  • α (modality-fusion weight in S(E_j)): α = 0.7 is optimal; larger values overweight text, smaller values admit visual noise.
  • τ (relevance threshold for exemplar selection): ≈0.8 is optimal; below 0.5 it admits noise, above 0.9 it yields too few examples.
  • k (maximum number of retrieved exemplars): k = 3 is used; larger k increases coverage but yields diminishing returns and longer LLM prompts.

Tuning these values directly affects the model’s ability to recall useful prior experiences, filter noise, and maintain compact, informative in-context prompts.

5. Mechanism for Robust Modality Weighting

The core function of the MEP is to enable ReMoD to learn and recall nuanced heuristics that govern the relative reliability and expressive power of each modality. Through iterative retrieval and rationale updating, the framework adapts its stance inference not merely on current data but via historical reasoning sequences that have proven effective when text-visual cue conflicts arise.

For example, in a case where a tweet contains both apparent endorsement language (“I stan Putin”) and negative sentiment (“what a cringe move”), paired with a protest image, a traditional CoT may interpret the text as pro-Putin. By contrast, MEP retrieval surfaces prior resolutions where textual contradiction (“stan…cringe”) was best resolved by down-weighting superficial positive cues in favor of the broader negative context. The Modality Insight generated (e.g., “prioritize negative terms over surface endorsements”) is then generalized into the pool, so similar contradictions are robustly handled in future cases.

6. Deployment, Efficiency, and Limitations

During inference, the trained MEP acts as a static key–value store. Querying is computationally efficient, requiring only encoding new data into quq_u, qVq_V, and computing cosine similarities. Prompted LLM reasoning adapts based on retrieved exemplars, obviating the need for further training or explicit error correction on new data.

Limitations of MEP include reliance on the diversity and coverage of stored experiences; under-represented modality combinations may not yield robust generalizations. Since updates are LLM-mediated rather than gradient-based, the pace of adaptation is governed by in-context learning dynamics and the quality of rationale synthesis, rather than explicit optimization objectives. This architecture reflects a trade-off: improved interpretability and “meta-reasoned” adaptation, at the expense of slower convergence to optimal heuristics in domains with rapidly shifting modality distributions.

7. Significance in Multimodal Stance Detection

The MEP advances the state of the art in adaptive fusion for multimodal stance detection by realizing a feedback loop where model reasoning about modality contributions is continuously codified, expanded, and leveraged. It departs from prior approaches that treat modality fusion as static or uniformly weighted, enabling real-time rebalancing based on context-sensitive prior knowledge embodied in chain-of-thought rationales. Empirical results on the MMSD benchmark demonstrate significant performance gains and generalization capability, indicating the value of dynamic modality experience pooling in mitigating stance misunderstanding noise stemming from naïve modality combination (Wang et al., 8 Nov 2025).

A plausible implication is that MEP-like constructs could generalize to other tasks requiring dynamic modality integration, notably where ground-truth cues for modality reliability shift across contexts or over time.

