Multimodal Ensemble Retriever (MER)
- The paper introduces MER, which decouples vision and text retrieval signals to robustly handle discordant image-text pairs.
- MER employs a bi-encoder architecture using CLIP and BGE, fusing cosine similarities through a tunable fusion weight to optimize exemplar selection.
- Empirical evaluations on the MUN and A-OKVQA benchmarks demonstrate significant improvements in retrieval performance for in-context learning tasks.
The Multimodal Ensemble Retriever (MER) is a dual-modality retrieval system developed to robustly identify semantically analogous cases in multimodal corpora, particularly when image and text pairs are intentionally mismatched or "discordant." Originally introduced in the context of the Multimodal UNcommonsense (MUN) benchmark to advance retrieval-based in-context learning (R-ICL), MER enables downstream vision-language models (VLMs) to perform abductive commonsense reasoning by supplying them with highly relevant exemplars selected from corpora where standard cross-modal alignment is systematically undermined (Son et al., 2 Feb 2026).
1. Motivation and Problem Context
MER addresses the retrieval challenge posed by the MUN benchmark, where inputs often consist of benign images with odd textual outcomes or ordinary text accompanying atypical images. Conventional cross-modal retrievers, which assume tight vision–language alignment, underperform in these contexts because their joint embedding spaces fail to represent meaningful analogies when modalities are discordant. MER mitigates this by decoupling the retrieval signals from vision and text, then fusing them in a manner that adapts flexibly to the signal strengths of each modality.
The principal objective within the R-ICL pipeline is to select the most informative image+text exemplars for a given query, enabling a smaller VLM to benefit from carefully curated in-context examples during abductive reasoning tasks.
2. Architecture and Retrieval Mechanism
MER utilizes a bi-encoder architecture, maintaining separate encoders for vision and text. Each encoder is frozen, with no end-to-end retraining:
- Vision-only branch: Employs a CLIP ViT-B/16 image encoder $E_I$, generating 512-dimensional embeddings $v_i = E_I(i)$.
- Text-only branch: Uses a BGE-Large-en encoder $E_T$, producing 1024-dimensional embeddings $v_t = E_T(t)$.
For a query $q = (q_i, q_t)$ and a database entry $d_j = (i_j, t_j)$, MER computes the cosine similarities:

$$s_{v,j} = \cos\big(E_I(q_i),\, E_I(i_j)\big), \qquad s_{t,j} = \cos\big(E_T(q_t),\, E_T(t_j)\big)$$

The ensemble score is then defined as:

$$S_j = \alpha\, s_{v,j} + (1 - \alpha)\, s_{t,j}$$

where $\alpha \in [0, 1]$ is a modality fusion weight controlling the relative contribution of the vision and text branches. The top-$k$ entries by $S_j$ are selected for downstream use.
A cross-modal retrieval head is a straightforward theoretical extension but is not employed in the present study.
Retrieval Pseudocode:
```
v_q_i = E_I(q_i)
v_q_t = E_T(q_t)
for each database entry j:
    s_vj = cos_sim(v_q_i, v_{i_j})
    s_tj = cos_sim(v_q_t, v_{t_j})
    S_j  = α * s_vj + (1 - α) * s_tj
return top-k entries sorted by S_j
```
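As a concrete sketch of the scoring loop above, assuming image and text embeddings have already been computed and are held in NumPy arrays (function and argument names are illustrative, not from the paper):

```python
import numpy as np

def mer_retrieve(q_img_emb, q_txt_emb, db_img_embs, db_txt_embs, alpha=0.5, k=5):
    """Ensemble retrieval over precomputed embeddings (illustrative sketch).

    q_img_emb:   (d_v,) query image embedding from the vision encoder
    q_txt_emb:   (d_t,) query text embedding from the text encoder
    db_img_embs: (N, d_v) database image embeddings
    db_txt_embs: (N, d_t) database text embeddings
    alpha:       fusion weight for the vision branch (default here is
                 illustrative; the paper tunes it on held-out data)
    """
    def cos_sim(q, M):
        # Cosine similarity between a query vector and each row of M.
        q = q / np.linalg.norm(q)
        M = M / np.linalg.norm(M, axis=1, keepdims=True)
        return M @ q

    s_v = cos_sim(q_img_emb, db_img_embs)   # vision-branch similarities s_{v,j}
    s_t = cos_sim(q_txt_emb, db_txt_embs)   # text-branch similarities s_{t,j}
    S = alpha * s_v + (1.0 - alpha) * s_t   # ensemble score S_j
    return np.argsort(-S)[:k]               # indices of the top-k entries
```

Because the two branches live in separate embedding spaces (512-d vs. 1024-d), similarities are computed per branch and only the scalar scores are fused.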
3. Calibration and Hyperparameter Tuning
The only tunable parameter in MER is the fusion weight $\alpha$. Ablation studies on a held-out MUN-vis subset (using the Phi-4-mm backbone) selected the win-rate-optimal value of $\alpha$ from a grid search over $[0, 1]$, and this value was used consistently throughout further evaluations. No further end-to-end learning or adaptation of the encoders was performed; both $E_I$ and $E_T$ remain fixed.
This suggests that a modest preference for the text branch ($\alpha < 0.5$) yields the best retrieval performance when visual and textual signals are not equally informative.
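The grid-search calibration can be sketched as follows; `eval_win_rate` is a hypothetical callable standing in for the held-out MUN-vis win-rate evaluation:

```python
import numpy as np

def tune_alpha(eval_win_rate, grid=np.linspace(0.0, 1.0, 11)):
    """Pick the fusion weight maximizing a held-out win-rate metric.

    eval_win_rate: callable alpha -> win rate on the validation split
                   (hypothetical stand-in for the MUN-vis evaluation)
    grid:          candidate alpha values (an 11-point grid is illustrative)
    """
    scores = [eval_win_rate(a) for a in grid]   # one evaluation per candidate
    best = int(np.argmax(scores))               # index of the best win rate
    return float(grid[best]), float(scores[best])
```

Since $\alpha$ is the only free parameter and each evaluation reuses frozen embeddings, this search is cheap relative to any end-to-end alternative.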
4. Implementation Details
MER is implemented as follows:
- Preprocessing: Images are resized to 512×512, center-cropped, normalized by ImageNet statistics, then encoded via CLIP ViT-B/16. Text data are lowercased, whitespace-normalized, and tokenized using the BGE-Large tokenizer.
- Backbones and Storage: Encoded vectors are stored as fixed-length dense embeddings in a vector database (FAISS Flat or HNSW), with each entry retaining both its image embedding $v_i$ and its text embedding $v_t$ for paired retrieval.
- Retrieval Library: The system leverages LangChain with a custom wrapper.
- Scalability: Embeddings are precomputed offline for approximately 700 corpus examples, permitting query times of $1$–$2$ ms per modality on an A6000 GPU.
- Corpus Size: Designed and benchmarked on small-to-moderate databases; for larger corpora, sharding by modality and cross-modal reranking is suggested as a scalable extension.
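As an illustration of the paired-storage scheme, here is a minimal NumPy stand-in for two flat inner-product indices (the role FAISS's `IndexFlatIP` would play) that share entry IDs; the class and method names are invented for this sketch:

```python
import numpy as np

class PairedIndex:
    """Illustrative stand-in for paired flat indices sharing entry IDs.

    Stores L2-normalized image and text embeddings per entry, so the
    inner product equals cosine similarity (the same trick used with
    FAISS IndexFlatIP). Not the paper's implementation.
    """
    def __init__(self):
        self.img, self.txt = [], []

    def add(self, img_emb, txt_emb):
        # Normalize at insertion time so search is a plain matrix product.
        self.img.append(img_emb / np.linalg.norm(img_emb))
        self.txt.append(txt_emb / np.linalg.norm(txt_emb))

    def search(self, q_img, q_txt, alpha, k):
        s_v = np.stack(self.img) @ (q_img / np.linalg.norm(q_img))
        s_t = np.stack(self.txt) @ (q_txt / np.linalg.norm(q_txt))
        S = alpha * s_v + (1.0 - alpha) * s_t   # fused score per entry
        top = np.argsort(-S)[:k]
        return top, S[top]
```

Keeping both embeddings under a single entry ID is what allows the fused score to be formed without any cross-index bookkeeping.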
5. Empirical Results and Ablations
Extensive experiments on the MUN and A-OKVQA benchmarks validate MER's effectiveness:
| Retrieval Setting | Win-rate (Phi-4-mm, MUN-vis) | Gain vs. Baseline |
|---|---|---|
| Zero-shot | 0.572 | – |
| Random 1-shot | 0.582 | – |
| MER 1-shot (R-ICL) | 0.618 | +3.6 ppt vs rand |
| MER 5-shot | 0.704 | +7.1 ppt vs rand5 |
Additional evaluation (FLASK skill metrics) across seven VLMs shows consistent gains in Logical Robustness, Correctness, Efficiency, and Commonsense. MER outperformed random-shot retrieval in $12/14$ model/task configurations. On A-OKVQA with Qwen-2.5-VL, MER improved 1-shot accuracy from $0.832$ (random) to $0.842$ (MER).
Ablation on $\alpha$ revealed that retrieval quality peaks at an intermediate value rather than at either single-modality extreme. Empirically, the MER ensemble mechanism systematically outperformed both zero-shot and random-shot baselines.
6. Limitations and Prospective Directions
Several limitations and proposed future enhancements are recognized:
- No end-to-end learning: With both encoders frozen and $\alpha$ manually tuned, MER does not fully adapt to task specifics. Learning $\alpha$, or introducing a gating network trained via a contrastive retrieval loss, represents a natural improvement pathway.
- Lack of explicit cross-modal scoring: The framework omits a cross-modal encoder, which could further enhance retrieval by modeling joint vision–language alignment, useful when both modalities contain complementary information.
- Scalability: The current approach is tractable for databases on the scale evaluated here (hundreds of examples). Scaling further may require hierarchical retrieval, such as initial modality-wise sharding with subsequent cross-modal reranking.
- Domain Shift: Transfer to specialized domains (e.g., medical imaging) may necessitate additional fine-tuning or backbone replacement to compensate for out-of-distribution characteristics.
- Dynamic weighting: The fixed parameter $\alpha$ could be upgraded to a query-dependent function $\alpha(q)$, allowing MER to dynamically adjust its reliance on modalities according to assessed concordance or discordance.
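One possible shape for such a query-dependent gating function, sketched under the assumption that a linear projection can align the two embedding spaces (the function name, the optional `proj` argument, and the `[lo, hi]` range are all illustrative, not from the paper):

```python
import numpy as np

def dynamic_alpha(q_img_emb, q_txt_emb, proj=None, lo=0.2, hi=0.8):
    """Hypothetical query-dependent fusion weight alpha(q).

    Maps an estimate of image-text concordance to a weight in [lo, hi]:
    when the query's modalities agree, lean more on the vision branch;
    when they are discordant, fall back toward the text branch. `proj`
    is an optional linear map aligning the two embedding spaces; with
    proj=None the dimensions must already match.
    """
    v = q_img_emb if proj is None else proj @ q_img_emb
    v = v / np.linalg.norm(v)
    t = q_txt_emb / np.linalg.norm(q_txt_emb)
    concordance = (v @ t + 1.0) / 2.0   # map cosine from [-1, 1] to [0, 1]
    return lo + (hi - lo) * concordance
```

A learned gating network trained with a retrieval loss would play the same role, replacing this hand-designed heuristic with a data-driven one.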
A plausible implication is that MER's design principles (modality-specific scoring with adaptive fusion) offer a general template for retrieval in tasks characterized by atypical or deliberately misaligned multimodal pairs.
7. Significance in Retrieval-Based In-Context Learning
Within the retrieval-based in-context learning paradigm, MER facilitates effective transfer of reasoning exemplars from larger models to smaller VLMs without necessitating retraining. Its bi-encoder design, combined with fusion via a single tuned hyperparameter, ensures robust performance even in the low-frequency, "uncommonsense" scenarios represented in the MUN benchmark. The method's simplicity and modularity, allowing substitution of encoders and reuse of industry-standard tools such as FAISS, CLIP, and BGE, promote reproducibility and ease of extension. MER's empirically demonstrated gains on both custom and established benchmarks highlight its utility as a generalizable component in multimodal retrieval and reasoning pipelines (Son et al., 2 Feb 2026).