Modality Experience Pool (MEP)
- Modality Experience Pool (MEP) is a dynamic key–value memory module that archives past text and visual embeddings with rationales to enhance stance detection.
- It employs cosine similarity-based retrieval and dual-process reasoning to selectively fuse modality signals using parameters like α and threshold τ.
- Its update mechanism uses in-context chain-of-thought prompting to iteratively refine stored heuristics, improving multimodal decision-making.
The Modality Experience Pool (MEP) is a central component within the ReMoD framework for multimodal stance detection, functioning as a learned, dynamically updating key–value memory that encodes “what worked in the past” when combining multiple modalities (primarily text and vision) for stance analysis. MEP is specifically designed to optimize the weighting of modalities by leveraging historical experience, thus allowing the stance detection model to adaptively amplify or attenuate signals from different modalities based on contextual reliability, as learned through prior correct decisions.
1. Structural Definition and Embedding Scheme
At any stage of training or inference, the MEP consists of entries, each represented as E_j = (k_j^u, k_j^V, v_j). Here, k_j^u denotes the embedding produced by the BGE-M3 text encoder for the text modality, k_j^V denotes the output of the MLLM image encoder for the vision modality, and v_j is a natural-language rationale summarizing the heuristic or reasoning that led to a correct stance judgment in a historical instance. The dimensionality of the key embeddings aligns with the respective backbone encoders, typically 1,024 or 2,048, thus maintaining architectural interoperability and minimizing additional projection overhead.
This structure ensures that each past instance contributing to the pool is directly indexed by its semantic and visual key representations, along with an LLM-generated rationale that captures domain-specific decision strategies beyond simple label imitation.
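The entry layout above can be sketched as a simple record type; the field names below are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

# Hypothetical sketch of one MEP entry: two key embeddings plus a
# natural-language rationale (the value). Field names are illustrative.
@dataclass
class MEPEntry:
    k_text: list[float]   # BGE-M3 text embedding (key)
    k_vis: list[float]    # MLLM image embedding (key)
    rationale: str        # LLM-generated heuristic (value)

# The pool itself is just an append-only collection of such entries.
pool: list[MEPEntry] = []
pool.append(MEPEntry(k_text=[0.1, 0.9], k_vis=[0.4, 0.6],
                     rationale="Prefer text when the image is generic."))
```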
2. Role in Dual-Process Reasoning: Intuitive and Reflective Stages
The MEP operationalizes the “experience-driven intuitive reasoning” phase of ReMoD’s dual-reasoning paradigm. For a new input comprising text u and segmented visual cues V, the model first encodes

q_u = f_u(u),  q_V = f_V(V),

where f_u and f_V are the respective BGE-M3 and MLLM encoders. Each historical experience E_j in the MEP is scored via a weighted blend of cosine similarities:

S(E_j) = α · cos(q_u, k_j^u) + (1 − α) · cos(q_V, k_j^V),

with α ∈ [0, 1] balancing textual versus visual similarity. All E_j with S(E_j) ≥ τ are retained, and the top-k among them are selected as the retrieved set R.
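A minimal sketch of this retrieval step in plain Python, assuming list-of-tuples storage and the α/τ/k hyperparameters described below (function names are illustrative):

```python
import math

def cos(a, b):
    # Cosine similarity between two vectors; 0.0 for a zero vector.
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def retrieve(pool, q_text, q_vis, alpha=0.7, tau=0.8, k=5):
    # S(E_j) = alpha*cos(q_u, k_u) + (1-alpha)*cos(q_V, k_V),
    # keep entries with S >= tau, return the top-k by score.
    scored = [(alpha * cos(q_text, kt) + (1 - alpha) * cos(q_vis, kv), v)
              for kt, kv, v in pool]
    kept = sorted((sv for sv in scored if sv[0] >= tau), reverse=True)
    return kept[:k]

# Usage: the first entry matches both query modalities, the second neither.
pool = [([1, 0], [0, 1], "rationale A"), ([0, 1], [1, 0], "rationale B")]
hits = retrieve(pool, [1, 0], [0, 1], alpha=0.7, tau=0.8, k=1)
```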
The values v_j of the retrieved experiences are used as in-context exemplars for an LLM agent A_R during chain-of-thought (CoT) prompting, yielding initial stance hypotheses

(ŷ_c, r_c) = A_R(u, V, {v_j : E_j ∈ R}, context = c)

for the unimodal contexts c = u and c = V and the fused context c = (u, V).
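One way the retrieved rationales could be assembled into an in-context CoT prompt; the template wording and function name are assumptions, as the paper's exact prompts are not reproduced here:

```python
def build_cot_prompt(text, image_desc, rationales, context):
    # context is one of "text", "vision", "fused"; rationales come from
    # MEP retrieval. The prompt wording is a hypothetical template.
    header = "Past heuristics that led to correct stance judgments:\n"
    exemplars = "\n".join("- " + r for r in rationales)
    evidence = {
        "text": text,
        "vision": image_desc,
        "fused": text + "\n[image] " + image_desc,
    }[context]
    return (header + exemplars
            + "\n\nEvidence (" + context + "):\n" + evidence
            + "\nThink step by step, then output a stance label and rationale.")

prompt = build_cot_prompt("I stan Putin... what a cringe move",
                          "crowd at a street protest",
                          ["prioritize negative terms over surface endorsements"],
                          "fused")
```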
The reflective (“deliberate reflective reasoning”) phase operates through the Modality-CoT chain: once ground-truth is obtained, a distilled Modality Insight I_m, a concise summary of which modality was most reliable for this instance, is produced. The MEP is then updated in a manner that preserves or generalizes prior rationales based on their relevance and the new insight, always mediated by prompting rather than gradient-based updates.
3. Update Mechanisms and Memory Dynamics
The MEP update occurs as follows:
- Novel Instance (no close prior match, R = ∅):
Append (q_u, q_V; I_m) to the pool as a new entry.
- Existing Match:
For each E_j in the retrieved subset R, generate an updated value by prompting the LLM:

v_j ← A_R(“Merge the old rationale v_j with the new insight I_m.”)

The key embeddings remain fixed; only the rationale v_j is replaced.
This update mechanism eschews direct gradient-based modification of MEP content (“pool-loss” is absent), decoupling pool memory evolution from the end-to-end loss landscape. Supervision via cross-entropy only applies to the final stance prediction layer.
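The two update branches can be sketched as follows, with `merge_fn` standing in for the LLM merge prompt (names are illustrative):

```python
def update_pool(pool, retrieved, q_text, q_vis, insight, merge_fn=None):
    """Pool update after the reflective phase (illustrative sketch).

    retrieved: indices of the matched entries (empty means R = ∅);
    merge_fn stands in for the LLM prompt that merges an old rationale
    with the new Modality Insight.
    """
    if not retrieved:                       # novel instance: append new entry
        pool.append((q_text, q_vis, insight))
        return
    for i in retrieved:                     # keys fixed; rationale rewritten
        kt, kv, old = pool[i]
        pool[i] = (kt, kv, merge_fn(old, insight))

# Usage: first call adds a novel entry, second rewrites its rationale.
pool = []
update_pool(pool, [], [1, 0], [0, 1], "trust textual negation")
update_pool(pool, [0], [1, 0], [0, 1], "check for sarcasm",
            merge_fn=lambda old, new: old + "; " + new)
```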
High-Level Pseudocode
```
Let MEP ← ∅
for epoch = 1…E do
  for each training example (u, V, y) do
    // Encode & retrieve
    q_u ← f_u(u);  q_V ← f_V(V)
    compute S(E_j) for all E_j ∈ MEP
    R ← top-k {E_j : S(E_j) ≥ τ}
    // Intuitive reasoning
    for c ∈ {u, V, (u, V)} do
      (ŷ_c, r_c) ← A_R(u, V, {v_j : E_j ∈ R}, context = c)
    end for
    // Reflective Modality-CoT
    derive I_m from {(ŷ_c, r_c)} versus y
    // Update MEP
    if R = ∅ then
      MEP.append((q_u, q_V; I_m))
    else
      for each E_j ∈ R do
        v_j ← A_R("Merge old v_j with new insight I_m.")
      end for
    end if
  end for
end for
```
At inference, only retrieval and LLM-driven hypothesis generation are performed; pool updates are omitted.
4. Hyperparameterization and Operational Behavior
Several hyperparameters govern MEP retrieval and update:
| Parameter | Description | Effect/Optimal Value |
|---|---|---|
| α | Modality-fusion weight in S(E_j) | α ≈ 0.7 is optimal; larger values overweight text, smaller values admit visual noise |
| τ | Relevance threshold for exemplar selection | τ ≈ 0.8 is optimal; below ~0.5 noise is admitted, above ~0.9 too few exemplars survive |
| k | Maximum number of retrieved exemplars | a small fixed k is used; larger k increases coverage but yields diminishing returns and longer LLM prompts |
Tuning these values directly affects the model’s ability to recall useful prior experiences, filter noise, and maintain compact, informative in-context prompts.
5. Mechanism for Robust Modality Weighting
The core function of the MEP is to enable ReMoD to learn and recall nuanced heuristics that govern the relative reliability and expressive power of each modality. Through iterative retrieval and rationale updating, the framework adapts its stance inference not merely on current data but via historical reasoning sequences that have proven effective when text-visual cue conflicts arise.
For example, in a case where a tweet contains both apparent endorsement language (“I stan Putin”) and negative sentiment (“what a cringe move”), paired with a protest image, a traditional CoT may interpret the text as pro-Putin. By contrast, MEP retrieval surfaces prior resolutions where textual contradiction (“stan…cringe”) was best resolved by down-weighting superficial positive cues in favor of the broader negative context. The Modality Insight generated (e.g., “prioritize negative terms over surface endorsements”) is then generalized into the pool, so similar contradictions are robustly handled in future cases.
6. Deployment, Efficiency, and Limitations
During inference, the trained MEP acts as a static key–value store. Querying is computationally efficient, requiring only encoding new data into q_u and q_V and computing cosine similarities against the stored keys. Prompted LLM reasoning adapts based on retrieved exemplars, obviating the need for further training or explicit error correction on new data.
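Because the pool is frozen at inference, the stored keys can be L2-normalized once at load time so that each query reduces to plain dot products; a minimal illustrative sketch (class and method names are assumptions):

```python
import math

def _normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class FrozenMEP:
    """Read-only MEP index: keys are normalized once when loaded."""
    def __init__(self, entries):
        # entries: list of (text_key, vis_key, rationale)
        self.entries = [(_normalize(kt), _normalize(kv), r)
                        for kt, kv, r in entries]

    def score(self, q_text, q_vis, alpha=0.7):
        # Blended similarity S(E_j) for every stored entry.
        qt, qv = _normalize(q_text), _normalize(q_vis)
        return [alpha * _dot(qt, kt) + (1 - alpha) * _dot(qv, kv)
                for kt, kv, _ in self.entries]

# Usage: unnormalized keys still score correctly after indexing.
index = FrozenMEP([([2, 0], [0, 3], "rationale A")])
scores = index.score([1, 0], [0, 1], alpha=0.7)
```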
Limitations of MEP include reliance on the diversity and coverage of stored experiences; under-represented modality combinations may not yield robust generalizations. Since updates are LLM-mediated rather than gradient-based, the pace of adaptation is governed by in-context learning dynamics and the quality of rationale synthesis, rather than explicit optimization objectives. This architecture reflects a trade-off: improved interpretability and “meta-reasoned” adaptation, at the expense of slower convergence to optimal heuristics in domains with rapidly shifting modality distributions.
7. Significance in Multimodal Stance Detection
The MEP advances the state of the art in adaptive fusion for multimodal stance detection by realizing a feedback loop where model reasoning about modality contributions is continuously codified, expanded, and leveraged. It departs from prior approaches that treat modality fusion as static or uniformly weighted, enabling real-time rebalancing based on context-sensitive prior knowledge embodied in chain-of-thought rationales. Empirical results on the MMSD benchmark demonstrate significant performance gains and generalization capability, indicating the value of dynamic modality experience pooling in mitigating stance misunderstanding noise stemming from naïve modality combination (Wang et al., 8 Nov 2025).
A plausible implication is that MEP-like constructs could generalize to other tasks requiring dynamic modality integration, notably where ground-truth cues for modality reliability shift across contexts or over time.