MMMB: Multi-Modal Multi-Turn Memory Benchmark
- The Multi-Modal Multi-Turn Memory Benchmark (MMMB) is a formal protocol that tests how well multimodal LLMs retain and reason over extended textual and visual dialogue sequences.
- It uses evaluation metrics such as final-turn exact-match accuracy and forgetting curves to measure performance across text, image, and mixed memory tasks.
- MMMB enables comparative analysis of memory architectures and augmentation strategies, guiding improvements in memory retention for complex conversational models.
A Multi-Modal Multi-Turn Memory Benchmark (MMMB) is a formal evaluation protocol designed to probe the long-term memory capabilities of multimodal LLMs (MLLMs) in open-ended, visually grounded, multi-turn conversational tasks. The primary objective is to quantify an MLLM’s ability to encode, retrieve, and reason over accrued textual and visual information across extended, contextually rich dialogue sequences. MMMB isolates memory-dependent performance by focusing on tasks where correct answers depend on factual extraction, reasoning, or management of evolving multimodal information, often under severe context-length and distraction constraints. This enables diagnostic comparison across architectures, memory-augmentation strategies, and attention mechanisms.
1. Formal Definition and Core Objectives
An MMMB is constructed as a suite of multi-turn, multimodal dialogues, each composed of a sequence of turns $(t_1, t_2, \dots, t_n)$, where each turn $t_j$ includes a user prompt (text), one or more images, and, where relevant, an assistant response. Dialogues are sampled to span a range of memory dependencies:
- Text Memory: Correct responses depend solely on textual history.
- Image Memory: Correct responses depend on images (current or referenced).
- Mixed Memory: Correct responses require joint reasoning over historical text and images.
A canonical MMMB protocol evaluates the model’s answer to a designated “final question” in each dialogue, with the ground-truth answer strictly dependent on recalling or integrating information from previous turns. The evaluation metric is typically exact-match accuracy:
$$\mathrm{Acc} = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \mathbb{1}\left[\hat{a}_i = a_i\right],$$
where $\mathcal{D}$ is the set of dialogues, $a_i$ the reference answer, and $\hat{a}_i$ the model’s output for the $i$-th dialogue’s final turn (Tong et al., 15 Oct 2025).
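The final-turn exact-match metric can be sketched in a few lines. This is a minimal illustration, not the benchmark's official scorer: the `normalize` step (lowercasing, stripping punctuation, collapsing whitespace) is an assumption, since the source does not specify how answers are normalized before comparison.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed normalization)."""
    answer = answer.lower()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer).strip()


def exact_match_accuracy(predictions: list, references: list) -> float:
    """Fraction of dialogues whose final-turn output matches the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    preds = ["A red car.", "two"]
    refs = ["a red car", "three"]
    print(exact_match_accuracy(preds, refs))
```

A stricter scorer would skip normalization entirely; which variant a given benchmark uses should be checked against its released evaluation code.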
For deeper analysis, MMMB also measures forgetting curves—accuracy as a function of (a) the number of historical items/entities to be recalled, and (b) the distance in turns between critical evidence and the evaluation point (Tong et al., 15 Oct 2025). This quantifies context-length resilience and modality-conditioned degradation.
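A forgetting curve of this kind amounts to grouping per-dialogue correctness by a bucketing variable and averaging within each bucket. The sketch below assumes each evaluated dialogue has been reduced to a hypothetical `(bucket, is_correct)` pair, where the bucket is either the turn distance to the supporting evidence or the number of items to recall.

```python
from collections import defaultdict


def forgetting_curve(records):
    """Map each bucket (e.g. turn distance) to the accuracy observed in that bucket.

    `records` is an iterable of (bucket, is_correct) pairs; the pair layout is
    an illustrative assumption, not a format defined by the benchmark.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for bucket, correct in records:
        totals[bucket] += 1
        hits[bucket] += int(bool(correct))
    return {bucket: hits[bucket] / totals[bucket] for bucket in sorted(totals)}


if __name__ == "__main__":
    records = [(1, True), (1, True), (5, True), (5, False), (10, False)]
    for distance, acc in forgetting_curve(records).items():
        print(distance, acc)
```

Plotting the returned mapping with the bucket on the x-axis gives the degradation profile; running it separately per memory type gives the modality-conditioned view.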
2. Dataset Construction and Characteristics
Modern MMMBs employ carefully synthesized and/or real-world, human-verified dialogues to ensure dense multimodal dependencies and realistic memory demands. Key dataset characteristics include:
| Aspect | Value/Range |
|---|---|
| Dialogues | 300–5,000+ (e.g., MMMB: 300 (Tong et al., 15 Oct 2025); MMRC: 5,120 (Xue et al., 17 Feb 2025)) |
| Turns/dialogue | 2–22 (typical max: 15–22) |
| Images/dialogue | 2–5+ |
| Modalities | Text, Images (optionally Audio/Video) |
Each dialogue group is annotated for memory type (“text,” “image,” “mixed”) and curated such that the dependence structure is explicit and unambiguous. Images and text are sampled or composed so that memory-related questions cannot be answered from the last turn or image alone; rather, historical context is required. Systems like Mem-Gallery (Bei et al., 7 Jan 2026) further annotate each QA pair with precise supporting turns (“clues”) to enable fine-grained retrieval analysis.
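One way to picture such an annotated record is as a small data structure holding the turns, the memory-type label, the reference answer, and the supporting clue turns. The class and field names below (`Turn`, `Dialogue`, `clue_turns`, `requires_history`) are hypothetical and do not reflect the schema of any released dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    user_text: str
    image_ids: List[str] = field(default_factory=list)
    assistant_text: Optional[str] = None


@dataclass
class Dialogue:
    turns: List[Turn]
    memory_type: str                # "text", "image", or "mixed"
    reference_answer: str
    clue_turns: List[int] = field(default_factory=list)  # 0-based supporting turns

    def requires_history(self) -> bool:
        """True if at least one supporting clue lies before the final turn."""
        final_index = len(self.turns) - 1
        return any(0 <= i < final_index for i in self.clue_turns)


if __name__ == "__main__":
    dialogue = Dialogue(
        turns=[
            Turn("This is my dog Rex.", image_ids=["img_0"]),
            Turn("What is my dog's name?"),
        ],
        memory_type="mixed",
        reference_answer="Rex",
        clue_turns=[0],
    )
    print(dialogue.requires_history())
```

A check like `requires_history` is a cheap curation filter: items whose only clue is the final turn do not test memory and could be dropped.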
Recent benchmarks (e.g., Mem-Gallery, MMMT-IF (Epstein et al., 2024), MMRC (Xue et al., 17 Feb 2025)) introduce added complexity such as:
- Cross-session dependencies
- Interleaved global instructions
- Factual updates and corrections
- Realistic image quality variation (as in CRAG-MM (Wang et al., 30 Oct 2025))
- Instruction following under programmatic metrics (PIF) (Epstein et al., 2024)
3. Evaluation Protocols and Metrics
MMMBs are evaluated using deterministic or lightly randomized protocols to ensure replicability and diagnostic power:
- Primary Metric: Final-turn exact-match accuracy: ratio of correct answers to total test dialogues.
- Decomposed Memory Metrics: Accuracy stratified by memory type (Text, Image, Mixed), memory burden (#entities to recall), and turn distance from supporting evidence.
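- Programmatic Metrics: For instruction-following variants, the PIF (Programmatic Instruction Following) score quantifies the fraction of historical instructions satisfied by the response: $\mathrm{PIF}(c, r) = \frac{\#\{\text{instructions in } c \text{ satisfied by } r\}}{\#\{\text{instructions in } c\}}$, where $c$ is the context and $r$ is the response (Epstein et al., 2024).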
- Robustness Metrics: PIF-N-K measures how consistently multiple samples from a model satisfy all instructions (at least K of N sampled responses reaching a PIF score of 1).
- Auxiliary Metrics: BLEU-1 (for n-gram overlap), F1/Precision/Recall (token-level), LLM-based scoring for open-ended responses (Mem-Gallery).
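The programmatic and robustness metrics in this list can be illustrated with a small sketch. It assumes each instruction has been compiled into a Python predicate over the response string; the function names `pif` and `at_least_k_perfect` and the example checks are hypothetical and are not taken from the MMMT-IF codebase.

```python
from typing import Callable, Sequence

Check = Callable[[str], bool]  # one programmatic instruction check


def pif(response: str, checks: Sequence[Check]) -> float:
    """Fraction of instruction checks that the response satisfies."""
    if not checks:
        return 1.0  # no instructions to violate (assumed convention)
    satisfied = sum(bool(check(response)) for check in checks)
    return satisfied / len(checks)


def at_least_k_perfect(samples: Sequence[str], checks: Sequence[Check], k: int) -> bool:
    """True if at least k sampled responses satisfy every instruction."""
    perfect = sum(pif(sample, checks) == 1.0 for sample in samples)
    return perfect >= k


if __name__ == "__main__":
    checks = [
        lambda r: r.endswith("."),        # "end every answer with a period"
        lambda r: len(r.split()) <= 5,    # "use at most five words"
    ]
    print(pif("The cat is red.", checks))
    print(at_least_k_perfect(["A.", "B", "C."], checks, k=2))
```

Because every check is executable, the score needs no judge model, which is what makes this family of metrics deterministic and cheap to rerun.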
Advanced protocols may require:
- Early-stop decisions in multi-turn runs (e.g., ending a conversation after two consecutive failures, as in CRAG-MM).
- Memory “abstention” (answer refusal) for unreachable or unsupported queries (high AR precision).
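An early-stop rule based on consecutive failures can be sketched as a simple counter over per-turn outcomes. The function name and the representation of outcomes as booleans are assumptions for illustration; the threshold of two follows the example given in the list above.

```python
def turns_evaluated(turn_outcomes, max_consecutive_failures=2):
    """Count how many turns are scored before the early-stop rule ends the run.

    `turn_outcomes` is an iterable of booleans (True = turn answered correctly).
    The run stops right after the turn that completes a streak of
    `max_consecutive_failures` failures.
    """
    consecutive = 0
    evaluated = 0
    for ok in turn_outcomes:
        evaluated += 1
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= max_consecutive_failures:
            break
    return evaluated


if __name__ == "__main__":
    print(turns_evaluated([True, False, False, True]))   # stops after turn 3
    print(turns_evaluated([True, False, True, False]))   # never hits the streak
```

How the unevaluated remaining turns are scored (for example, counted as failures or excluded) is a separate protocol decision that this sketch leaves open.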
4. Baseline Model Results and Observed Limitations
Benchmarking MLLMs on MMMB-style tasks consistently reveals substantial memory degradation as the context grows, and profound differences between memory architectures.
| Model | Text Mem | Image Mem | Mixed Mem | Average Accuracy |
|---|---|---|---|---|
| GPT-4o-mini | 70.00 | 29.41 | 58.06 | 51.33 |
| Gemini-2.5-Flash | 75.76 | 40.19 | 70.97 | 60.84 |
| InteractiveOmni-4B | 70.71 | 30.39 | 59.68 | 52.47 |
| Qwen2.5-VL-7B | 35.35 | 13.73 | 27.42 | 25.10 |
All models experience pronounced decline as either the number of items to be recalled increases or the temporal distance between evidence and query grows. Open-source models (Qwen, InternVL, etc.) underperform proprietary systems by large margins, especially on image-memory and mixed-modality tasks (Tong et al., 15 Oct 2025). Multimodal retrieval-augmented systems (MuRAG, UniversalRAG) achieve the best overall scores, but still fail to exceed $0.70$ F1/accuracy on the most demanding mixed-memory cases (Bei et al., 7 Jan 2026).
Common failure modes include:
- Early forgetting of image features (especially with token-limited or naively concatenated memory)
- Inability to identify and reconcile updates/corrections across sessions or turns
- Hallucinated or refused answers under uncertainty
- Catastrophic context-window overflow in “unstructured” approaches
Emerging strategies such as structured external note-taking (MMRC) (Xue et al., 17 Feb 2025) and explicit memory annotation with retrieval augmentation (Mem-Gallery, MuRAG) are shown to increase performance, particularly for information extraction and memory recall subtasks.
5. Comparative Analysis with Related Benchmarks
The MMMB paradigm extends prior single-turn and unimodal memory benchmarks in scope and difficulty:
- MMRC (Xue et al., 17 Feb 2025): Focuses on six abilities (IE, CR, IU, IM, MR, AR) in unconstrained human–MLLM dialogue. It diagnoses memory recall decline, error propagation, and refusal reluctance in open-ended, real-user contexts.
- CRAG-MM (Wang et al., 30 Oct 2025): Targets retrieval-augmented generation (RAG) in egocentric, wearable scenarios; mixes image/text-web RAG with multi-turn support but emphasizes retrieval over sustained memory.
- Mem-Gallery (Bei et al., 7 Jan 2026): Systematic, multi-session, annotated memory evaluation with dense visual and textual interdependencies and retrieval-based organizational protocol analysis.
- MMMT-IF (Epstein et al., 2024): Specializes in multi-turn multimodal instruction following, introducing the PIF metric for execution-verified adherence under scattered instruction sets.
A distinguishing feature of MMMB in the sense of Tong et al. (15 Oct 2025) is the isolation of memory-dependent answerability apart from single-turn recognition or reasoning, and its inclusion of explicit memory-type annotation and “forgetting curve” diagnostics.
6. Open Challenges and Design Recommendations
Persistent limitations of current MLLMs include:
- Multimodal Retention: Text-only or caption-centric systems trail direct visual retrieval in accuracy across several benchmarks; raw image embeddings are essential (Bei et al., 7 Jan 2026).
- Memory Organization: Flat (retrieval-based) memory structures often outperform hierarchical or complex structured memories due to better noise management (Bei et al., 7 Jan 2026).
- Reasoning and Knowledge Management: Dynamic update detection and multi-entity/temporal reasoning remain bottlenecks, with F1/accuracy staying limited even in the best systems (Bei et al., 7 Jan 2026).
- Robustness to Instruction Spread: Retrieval of distributed constraints (instructions) is a major failure point; performance (PIF) recovers dramatically when all instructions are re-appended to the context (Epstein et al., 2024).
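To make the "flat (retrieval-based) memory" idea concrete, the following sketch stores one embedding per past turn in a single unstructured list and returns the top-k entries by cosine similarity. The `FlatMemory` class and the toy 2-dimensional embeddings are assumptions for illustration; a real system would obtain embeddings from a text or image encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class FlatMemory:
    """A single flat list of (embedding, payload) entries with top-k retrieval."""

    def __init__(self):
        self.items = []

    def add(self, embedding, payload):
        self.items.append((list(embedding), payload))

    def retrieve(self, query, k=3):
        ranked = sorted(self.items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [payload for _, payload in ranked[:k]]


if __name__ == "__main__":
    memory = FlatMemory()
    memory.add([1.0, 0.0], "turn 1: photo of a red car")
    memory.add([0.0, 1.0], "turn 2: user asks about lunch")
    memory.add([0.9, 0.1], "turn 3: user says the car is a rental")
    print(memory.retrieve([1.0, 0.0], k=2))
```

The design has no hierarchy or summarization step, so nothing is lost or distorted between writing and reading; the cost is that retrieval quality depends entirely on the embedding and the choice of k.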
Future MMMB design should incorporate:
- Explicit representation of raw, high-dimensional modalities (video, audio).
- Modularization of memory layers (short-term, long-term, and semantic buffers).
- Relevance-aware, possibly learned retrieval filtering to minimize token wastage and maximize Precision@K.
- Revision operators to recognize and update contradicting facts or evolving entities over long horizons.
- Benchmarks for cross-domain transfer and continual update robustness in real-world conversational flows (Wang et al., 30 Oct 2025).
- Programmatic instruction-following metrics and robustness evaluation under repeat sampling (Epstein et al., 2024).
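Two of these recommendations lend themselves to a short sketch: scoring retrieval with Precision@K against annotated supporting turns, and a revision operator that lets a later statement overwrite an earlier one. Both pieces are illustrative assumptions; the names `precision_at_k` and `FactStore` and the keep-the-latest-turn policy are hypothetical and are not prescribed by any of the cited benchmarks.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved turn ids that are annotated as relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for turn_id in retrieved[:k] if turn_id in relevant)
    return hits / k


class FactStore:
    """Keeps the most recent value per (entity, attribute), with its source turn."""

    def __init__(self):
        self._facts = {}

    def revise(self, entity, attribute, value, turn):
        """Record a fact; a later (or same-turn) statement replaces an earlier one."""
        key = (entity, attribute)
        current = self._facts.get(key)
        if current is None or turn >= current[1]:
            self._facts[key] = (value, turn)

    def lookup(self, entity, attribute):
        entry = self._facts.get((entity, attribute))
        return entry[0] if entry is not None else None


if __name__ == "__main__":
    print(precision_at_k([3, 7, 1, 9], relevant={1, 3}, k=2))
    store = FactStore()
    store.revise("meeting", "day", "Monday", turn=2)
    store.revise("meeting", "day", "Tuesday", turn=9)  # user corrects the date
    print(store.lookup("meeting", "day"))
```

Keying facts by entity and attribute makes contradictions detectable at write time, whereas an append-only memory leaves both versions retrievable and pushes the conflict onto the model.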
7. Significance and Research Impact
MMMBs form a critical infrastructure for diagnosing, quantifying, and accelerating progress in memory-augmented multimodal models. They provide a standardized, controlled setting to compare architectures, training strategies, and retrieval mechanisms under pressure from context-length, modality, and reasoned-update requirements.
By revealing sharp regime changes and failure points—e.g., token-limit drop-off, hallucinated updates, fact-overwrite collisions—MMMBs guide the field toward development of more robust, contextually aware, and memory-consistent agents.
Researchers leveraging MMMB protocols and datasets are now able to reproducibly evaluate incremental advances and investigate systemic sources of long-range memory failure, driving rapid iteration in both fundamental model architecture and external memory augmentation (Epstein et al., 2024, Xue et al., 17 Feb 2025, Tong et al., 15 Oct 2025, Bei et al., 7 Jan 2026).