Multi-modal Multi-turn Memory Benchmark (MMMB)
- Multi-modal Multi-turn Memory Benchmark (MMMB) is a standardized dataset and evaluation framework measuring multi-modal LLMs’ ability to encode, retain, and retrieve information across multi-turn dialogues.
- It integrates vision and text inputs, including noisy egocentric images, to simulate real-world conditions and assess cross-modal alignment and reasoning.
- MMMB employs rigorous metrics, structured tasks, and transparent APIs to evaluate long-term conversational memory and retrieval-based question answering performance.
A Multi-modal Multi-turn Memory Benchmark (MMMB) is a standardized dataset and evaluation framework designed to rigorously assess the ability of multi-modal LLMs (MM-LLMs) to encode, retain, and selectively retrieve information spanning multiple modalities (typically vision and text) across sustained, context-dependent dialogues. MMMBs measure not only single-turn reasoning over multi-modal inputs but, crucially, long-term conversational memory and reasoning, including the integration and recall of information distributed over multi-turn interaction histories. Benchmarks in this class are motivated by the limitations observed in current LVLMs and MM-LLMs with respect to visually-grounded dialogue, instruction following, conversational entity tracking, and retrieval-augmented generation (RAG) in real-world, noisy settings (Wang et al., 30 Oct 2025, Tong et al., 15 Oct 2025, Han et al., 21 Aug 2025).
1. Benchmark Objectives and Formal Definition
The primary goal of an MMMB is to quantify an MM-LLM’s multi-modal conversational memory. This entails measuring accuracy on final-turn questions that explicitly require retrieval and synthesis of facts or content introduced in distinct prior turns, often spanning both vision (images, video frames) and text. Formally, let the multi-turn history up to turn $T$ be

$$H_T = \{(I_1, u_1), (I_2, u_2), \ldots, (I_T, u_T)\},$$

where $I_t$ is an optional image and $u_t$ a user utterance. The model is expected to encode this history into an internal memory $M_T$, and answer the final question $q$ by generating

$$\hat{a} = f_\theta(q, M_T).$$
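A minimal sketch of this interface, assuming a generic MM-LLM wrapper with an explicit memory buffer (class and method names are illustrative, not part of any benchmark API):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Turn:
    image: Optional[bytes]            # I_t: optional image (None for text-only turns)
    utterance: str                    # u_t: user utterance for this turn

class MultiTurnMemoryModel:
    """Illustrative wrapper: encode the history H_T into memory, then answer q."""

    def __init__(self, mm_llm):
        self.mm_llm = mm_llm          # assumed to expose a generate(history, question) call
        self.memory: List[Turn] = []  # M_T: here simply the accumulated history

    def observe(self, turn: Turn) -> None:
        # Encode the new (I_t, u_t) pair into the conversational memory.
        self.memory.append(turn)

    def answer(self, question: str) -> str:
        # Generate a-hat = f_theta(q, M_T), conditioning on memory and the final question.
        return self.mm_llm.generate(history=self.memory, question=question)
```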
Success on such benchmarks requires both robust cross-modal alignment and the capacity to perform memory-based reasoning across multiple conversational turns. Critical evaluation axes typically include memory span (distance between relevant context and current turn), integration of multiple images/text, and the handling of temporally or thematically shifting dialogues (Tong et al., 15 Oct 2025).
2. Dataset Construction and Modal Content
State-of-the-art MMMBs are characterized by substantial scale and compositional diversity, as illustrated by the CRAG-MM and MMDR-Bench exemplars:
- Scale and Structure: CRAG-MM features 6,462 single-turn (image, question, answer) triplets and 1,956 multi-turn conversations (average 4.9 turns, range 2–6) across 13 domains. MMDR-Bench includes 300 expert-curated complex multi-turn scenarios (5–7 turns per dialogue), focusing on visually grounded tasks (Wang et al., 30 Oct 2025, Han et al., 21 Aug 2025).
- Modal Coverage: Egocentric (wearable-device-inspired) imagery is emphasized (6,248 egocentric vs. 1,695 normal out of 7,943 total images in CRAG-MM) (Wang et al., 30 Oct 2025). Dialogue histories may contain image-only, text-only, or mixed modality turns.
- Image Quality and Realism: To reflect practical deployment contexts, egocentric data is systematically injected with noise: low-light, blur, truncation, occlusion, and rotation (15% of egocentric portion). This stresses the model’s robustness under authentic sensor degradations (Wang et al., 30 Oct 2025).
- Question Typology: A balanced question taxonomy is enforced: simple recognition, simple knowledge, multi-hop, aggregation, comparison, and abstract reasoning. Furthermore, entity-popularity buckets (head/torso/tail) are sampled equally to probe model performance on both frequent and rare entities (Wang et al., 30 Oct 2025).
- Memory Types (InteractiveOmni): Test questions are stratified by memory challenge: text memory, image memory, or mixed (requiring the combination of content across modalities and turns) (Tong et al., 15 Oct 2025). An illustrative record combining these axes is sketched after this list.
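As a concrete illustration of how a single benchmark record can combine these axes, the sketch below shows a hypothetical multi-turn entry; the field names are illustrative and do not reproduce the released CRAG-MM or MMMB schemas:

```python
# Hypothetical multi-turn benchmark record; field names are illustrative only.
example_record = {
    "domain": "...",                      # one of the benchmark's domains
    "question_type": "multi-hop",         # recognition / knowledge / multi-hop / aggregation / comparison / abstract
    "entity_popularity": "tail",          # head / torso / tail bucket
    "memory_type": "mixed",               # text memory / image memory / mixed
    "turns": [
        {"image": "img_0001.jpg", "image_quality": "low-light", "question": "...", "answer": "..."},
        {"image": None,           "image_quality": None,        "question": "...", "answer": "..."},
    ],
    "final_question": "...",              # requires recalling content from earlier turns
    "reference_answer": "...",
}
```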
3. Task and API Design
MMMBs operationalize the evaluation via a staged set of tasks and enterprise-grade retrieval APIs:
- Single-source augmentation: Input is an image and question; retrieval is performed over an image-based knowledge graph (KG), with top-K similarity scoring using a precomputed vision encoder (e.g., CLIP ViT-L/14@336px embeddings and cosine similarity) (Wang et al., 30 Oct 2025); a minimal retrieval sketch appears after this list.
- Multi-source augmentation: Expands retrieval to both image-KG and a large web corpus (∼800K webpages in CRAG-MM), with queries constructed from current turn and dialogue context (Wang et al., 30 Oct 2025).
- Multi-turn dialogue: At each turn $t$, the conversational context $C_t = C_{t-1} \cup \{(I_t, q_t, a_t)\}$ is updated. The retrieval query may be a learned function of the current image, question, and hidden state, e.g. $r_t = g_\phi(I_t, q_t, h_{t-1})$. Retrieval hooks are provided for both image and web APIs, ensuring fair and reproducible result sets (Wang et al., 30 Oct 2025).
- Automated APIs: Python-style mock APIs standardize retrieval, with results providing indices, URLs, entity metadata, or text snippets (Wang et al., 30 Oct 2025):

```python
results = search_pipeline(image=I, k=30)   # image-KG search over the query image
results = search_pipeline(query=q, k=50)   # web/text search over the constructed query
```
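Beneath the mock API, the single-source and multi-source retrieval steps described above reduce to nearest-neighbour search over precomputed embeddings plus query construction from dialogue context. The sketch below assumes CLIP-style image embeddings are already computed and uses plain cosine similarity; `cosine_top_k` and `build_web_query` are illustrative helpers, not the benchmark's internals:

```python
import numpy as np

def cosine_top_k(query_emb: np.ndarray, kg_embs: np.ndarray, k: int = 30) -> np.ndarray:
    """Indices of the top-k image-KG entries by cosine similarity to the query image embedding.

    query_emb: (d,) embedding of the query image (e.g., from a CLIP ViT-L/14@336px encoder).
    kg_embs:   (N, d) matrix of precomputed embeddings for the image knowledge graph.
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    kg = kg_embs / (np.linalg.norm(kg_embs, axis=1, keepdims=True) + 1e-8)
    sims = kg @ q                        # cosine similarities, shape (N,)
    return np.argsort(-sims)[:k]         # k most similar KG entries

def build_web_query(question: str, history: list) -> str:
    # Multi-source / multi-turn case: naively fold recent dialogue context into the text query.
    return " ".join(history[-2:] + [question])
```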
4. Evaluation Protocols and Metrics
MMMBs employ rigorous, scenario-tailored metrics:
- Exact-Match Accuracy: Correctly answering the final-turn question yields a score of $+1$; “I don’t know” responses score $0$; incorrect or hallucinated responses score $-1$ (Wang et al., 30 Oct 2025, Tong et al., 15 Oct 2025). Truthfulness is reported as the mean per-example score, $\frac{1}{N}\sum_{i=1}^{N} s_i$, i.e., accuracy minus hallucination rate.
- Multi-turn Aggregation: An early-stop procedure triggers if two consecutive answers are missing or incorrect; turns thereafter are scored as 0. Multi-turn truthfulness aggregates across all sessions (Wang et al., 30 Oct 2025).
- Memory-specific Curves: To assess memory capacity, conditional recall curves are provided:
- $\mathrm{Acc}(d)$: accuracy as a function of memory distance $d$ (how far back in the dialogue context the necessary information appears)
- $\mathrm{Acc}(n_{\mathrm{img}})$: accuracy as a function of the number of historical images $n_{\mathrm{img}}$ required for the answer (Tong et al., 15 Oct 2025).
- Human Judgment: For models evaluated on MMDR-Bench, responses are scored by human annotators in six dimensions: entity tracking, dialogue consistency, reasoning depth, instruction adherence, error suppression, and fluency. Overall and per-dimension scores are averaged, and error rates are reported (Han et al., 21 Aug 2025).
- Additional Metrics: Hallucination rate (the fraction of responses scored $-1$), early-stop rate, and missing-answer fraction are standard (Wang et al., 30 Oct 2025). A sketch of these scoring rules follows this list.
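The following is a sketch of the scoring rules above, assuming per-example scores in {+1, 0, -1}; the early-stop and curve-grouping logic is a simplified reading of the protocol, not the official scorer:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def truthfulness(scores: List[int]) -> float:
    """Mean of per-example scores in {+1, 0, -1}: accuracy minus hallucination rate."""
    return sum(scores) / len(scores)

def hallucination_rate(scores: List[int]) -> float:
    return sum(1 for s in scores if s == -1) / len(scores)

def apply_early_stop(turn_scores: List[int]) -> List[int]:
    """Zero out every turn after two consecutive missing (0) or incorrect (-1) answers."""
    out, bad_streak = [], 0
    for s in turn_scores:
        if bad_streak >= 2:
            out.append(0)                # all turns after the trigger score 0
            continue
        bad_streak = bad_streak + 1 if s <= 0 else 0
        out.append(s)
    return out

def recall_curve(examples: List[Tuple[int, int]]) -> Dict[int, float]:
    """Acc(d): mean score grouped by memory distance d (or by number of required images)."""
    buckets = defaultdict(list)
    for score, distance in examples:
        buckets[distance].append(score)
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}
```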
5. Empirical Findings and Model Performance
Empirical analysis across CRAG-MM, MMMB, and MMDR-Bench reveals the following key observations:
- Baseline Performance: MM-LLMs without retrieval (best: Llama-3.2-90B-Vision-Instruct, GPT-5 Mini, Gemini-2.5-Flash) achieve 18% single-turn truthfulness, while basic retrieval-augmented pipelines reach 22.5–31.5%. Multi-turn truthfulness rises to 42.5% with retrieval but remains far from ceiling, and even industry SOTA systems achieve only 32–45% (Wang et al., 30 Oct 2025).
- Failure Modes: Entity recognition deteriorates sharply on low-quality egocentric images, complex question types (comparison, aggregation, multi-hop), tail-entity queries, and multi-turn sessions with domain shifts; truthfulness declines by up to 46% under low-light or occlusion (Wang et al., 30 Oct 2025). Multi-turn context tracking yields high early-stop rates (27–60%).
- Ablation and Modularity (MMDR-Bench): Removing the memory module from the CoLVLM Agent degrades the average human score from 4.03 to 3.44, underscoring the centrality of explicit memory modeling. CoLVLM also degrades less as dialogues lengthen than SOTA alternatives (a drop of 0.10 vs. 0.20–0.40) (Han et al., 21 Aug 2025).
- Memory Depth and Recall: Accuracy decays as information must be retrieved from earlier turns or when answers require synthesis across multiple images, as reflected in the conditional recall curves $\mathrm{Acc}(d)$ and $\mathrm{Acc}(n_{\mathrm{img}})$. Scoring typically relies on exact-match criteria, with automatic LLM judges (e.g., Gemini-2.5-Pro) used to maintain consistency (Tong et al., 15 Oct 2025, Wang et al., 30 Oct 2025).
6. Implications, Best Practices, and Open Challenges
Methodological recommendations for future MMMBs include:
- Axes of Balance: Datasets should address variability along image quality, question type, entity frequency, and conversation complexity. Realistic noise and domain shifts are essential for robust benchmarking (Wang et al., 30 Oct 2025).
- Staged Task Structure: Progressively staged pipelines (single-source, multi-source, multi-turn) facilitate diagnosis of model deficits across retrieval, memory, and cross-modal reasoning (Wang et al., 30 Oct 2025).
- Transparent APIs and Retrieval: Equal, public retrieval APIs and resources promote fair and reproducible comparisons, reducing confounds from proprietary systems (Wang et al., 30 Oct 2025).
- Unified Evaluation: Adoption of simple, interpretable scoring schemas (e.g., the $+1/0/-1$ exact-match scheme), with reporting of aggregate truthfulness, hallucination, missing, and early-stop rates, yields actionable diagnostic power (Wang et al., 30 Oct 2025).
- Memory Modeling: Robust memory representations, such as explicit dialogue-state concatenation or RNN/Transformer-based hidden states, are critical; for example, $C_t = C_{t-1} \cup \{(I_t, q_t, a_t)\}$ or $h_t = f_\theta(h_{t-1}, I_t, q_t)$ (Wang et al., 30 Oct 2025, Han et al., 21 Aug 2025). A minimal sketch of both strategies follows this list.
- Error Analysis: Fine-grained breakdowns by image-quality, question type, and entity rarity are necessary to pinpoint weaknesses and guide model or dataset improvement (Wang et al., 30 Oct 2025).
- Extension to Additional Modalities: Expansion to video, audio, and embodied/robotics environments is highlighted as an important future direction, necessitating advancements in temporal memory and cross-modal fusion (Han et al., 21 Aug 2025).
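A minimal sketch of the two memory strategies named above, contrasting explicit dialogue-state concatenation with a compressed recurrent summary; the update function is a generic placeholder, not the CoLVLM Agent implementation:

```python
class ConcatenationMemory:
    """Explicit dialogue state: C_t = C_{t-1} ∪ {(I_t, q_t, a_t)}."""

    def __init__(self):
        self.state = []

    def update(self, image, question, answer):
        self.state.append((image, question, answer))

    def as_context(self):
        return list(self.state)          # full history handed back to the model each turn

class RecurrentMemory:
    """Compressed summary h_t = f(h_{t-1}, I_t, q_t); f may be any learned RNN/Transformer update."""

    def __init__(self, update_fn, initial_state):
        self.f = update_fn               # assumed learned update function (placeholder)
        self.h = initial_state

    def update(self, image, question):
        self.h = self.f(self.h, image, question)
```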
7. Representative Benchmarks and Derived Protocols
Three leading benchmarks illustrate the current state of MMMB construction and evaluation:
| Benchmark | Modality Coverage | Scale/Content | Evaluation Protocols |
|---|---|---|---|
| CRAG-MM (Wang et al., 30 Oct 2025) | Visual (egocentric+normal images), text | 6,462 single, 1,956 multi-turn, 13 domains, injected noise | Exact-match, truthfulness, multi-turn, hallucination, early-stop |
| MMMB (InteractiveOmni) (Tong et al., 15 Oct 2025) | Images, text (audio/video extensible) | 300 dialogue groups, up to 15 turns, memory challenge stratification | Final-turn accuracy, $\mathrm{Acc}(d)$ and $\mathrm{Acc}(n_{\mathrm{img}})$ curves |
| MMDR-Bench (Han et al., 21 Aug 2025) | Real/synthetic images, text, segmentation overlays | 300 complex, multi-turn visually grounded scenarios | Six-way human scoring: entity tracking, reasoning, adherence, etc. |
Each implements multi-turn context with factual recall spanning both temporal (turn distance) and compositional (multi-image/text) axes. The modular, iterative cognitive cycles observed in agents such as CoLVLM delineate best practices in memory-perception-planning-execution (Han et al., 21 Aug 2025). Distinctive annotation protocols—scene description, intent simulation, dialogue simulation—ensure diversity and challenge (Han et al., 21 Aug 2025, Wang et al., 30 Oct 2025).
References
- CRAG-MM: “CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark” (Wang et al., 30 Oct 2025)
- InteractiveOmni: “InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue” (Tong et al., 15 Oct 2025)
- ContextualLVLM-Agent: “ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following” (Han et al., 21 Aug 2025)