M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Published 5 Jun 2026 in cs.CL | (2606.07402v1)

Abstract: Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multimodal file interaction nor the interpretation of concealed user information. We therefore introduce M$^3$Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation spanning cross-modal grounding and implicit information inference. Benchmarking MLLMs and memory systems reveals persistent gaps in cross-modal grounding, cross session reasoning, and the efficiency cost of accumulating multimodal context. We further propose M$^3$Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.


Authors (9)

Summary

The paper introduces M³Exam, a benchmark for evaluating long-term multimodal memory in realistic, multi-session user-agent interactions.
The methodology integrates staged LLM-driven dialogue synthesis with explicit modality bias detection and cascaded retrieval to assess cross-modal reasoning and implicit inference.
Results show that modality-aware retrieval in M³Proctor dramatically reduces token costs and improves accuracy compared to global retrieval methods.

Multimodal Conversational Memory Evaluation with M$^3$Exam

Motivation and Benchmark Construction

M$^3$Exam addresses a gap in evaluating long-term multimodal memory for task-oriented agents by establishing a benchmark for realistic, persistent, heterogeneous user–agent interactions. Prior work is text-only, static with sparse visuals, or limited to single-session or single-document memory scopes. M$^3$Exam instead models the organic accumulation and use of multimodal memory over extended, multi-session user histories (Figure 1).

Figure 1: Overview of M$^3$Exam, emphasizing its long-horizon, cross-modality conversational scope.

The M$^3$Exam benchmark is constructed to capture the following evaluation axes:

Content Complexity: Agents must sustain memory over 239 synthetic multi-session conversations spanning 15 distinct persona scenarios and covering 3,025 dialogue rounds, with 1,799 multimodal artifacts (images, PDFs, charts). The sessions are heterogeneously structured, with attached files ranging from technical documents to daily-life photos and domain-specific charts.
Temporal and Cross-Modal Reasoning: 5,150 evaluation questions are carefully balanced across single-session, multi-session, cross-modal, retrieval, and interpretation-oriented tasks. Ground-truth supporting-fact annotations pinpoint the cross-session span and modalities for each query.
Implicit Inference: Unlike benchmarks limited to direct QA, M$^3$Exam includes questions demanding inference from unstated, globally distributed, or persona-grounded cues, requiring agents to reconstruct the user’s latent intent or contextual state.
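To make the three axes concrete, here is a minimal sketch of how such a benchmark record could be represented. All class and field names (`Artifact`, `Session`, `SupportingFact`, `qtype`, and so on) are hypothetical and chosen for illustration; the source does not describe the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Artifact:
    artifact_id: str
    kind: str  # assumed labels, e.g. "image", "pdf", "chart"
    path: str


@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    text: str
    artifacts: List[Artifact] = field(default_factory=list)


@dataclass
class Session:
    session_id: str
    timestamp: str  # ISO-8601, so string order matches temporal order
    turns: List[Turn] = field(default_factory=list)


@dataclass
class SupportingFact:
    session_id: str
    modality: str  # assumed labels: "text" or "visual"


@dataclass
class Question:
    question_id: str
    text: str
    qtype: str
    answer: str
    supporting_facts: List[SupportingFact] = field(default_factory=list)


def memory_span(q: Question) -> int:
    """Number of distinct sessions touched by the annotated evidence."""
    return len({f.session_id for f in q.supporting_facts})


def is_cross_modal(q: Question) -> bool:
    """True when the annotated evidence mixes more than one modality."""
    return len({f.modality for f in q.supporting_facts}) > 1
```

With supporting facts stored per question, properties such as memory span and visual dependence can be derived from the annotations rather than labeled separately.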

The benchmark is synthesized through a staged pipeline (Figure 2) that combines LLM-driven event generation, event-to-session mapping, dialogue and artifact alignment, and question bank synthesis, with validation through self-checks and expert audits.

Figure 2: The M$^3$Exam construction pipeline: from persona and event seeds to synthetic multimodal dialogues and QA.

M$^3$Exam’s distribution is methodically balanced at the persona, question-type, and artifact levels (Figures 3, 4, 5):

Figure 3: Scaling properties per persona; coverage of dialogue size, QA instances, and file attachment across scenarios.

Figure 4: Properties such as memory look-back distributions, turn-length densities, QA-type per persona, and session timelines.

Figure 5: QA composition, with separation by difficulty, evidence load, memory span, and visual dependence.

Benchmark Protocol and Evaluation Metrics

M$^3$Exam defines the agent’s task as follows: given a temporally ordered multimodal history $\mathcal{H}$, answer query $q$ by retrieving relevant spans and artifacts and emulating human-like inference. Evaluation targets long-range memory and cross-modal reasoning, using the following metrics:

Exact Match (EM, including image identifier matching and multiple-choice)
F1 (token-level overlap)
BLEU-1
LLM-as-a-Judge (LLM-J): five-level rubric assessing semantic adequacy beyond lexical overlap
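The three lexical metrics can be sketched as below. This is an illustrative implementation under stated assumptions, not the benchmark's official scorer: the normalization (lowercasing, punctuation stripping, whitespace tokenization) and the single-reference BLEU-1 with a brevity penalty are common conventions that the source does not confirm.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace (assumed)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return text.split()


def exact_match(pred, gold):
    """1.0 when the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def bleu1(pred, gold):
    """Clipped unigram precision times a brevity penalty, single reference."""
    p, g = normalize(pred), normalize(gold)
    if not p:
        return 0.0
    clipped = sum((Counter(p) & Counter(g)).values())
    precision = clipped / len(p)
    brevity = 1.0 if len(p) >= len(g) else math.exp(1 - len(g) / len(p))
    return brevity * precision
```

For example, `token_f1("a b c", "a b d")` has two overlapping tokens out of three on each side, so precision and recall are both 2/3 and the F1 is 2/3. A paraphrased but correct answer can still score low on all three, which is why the LLM-J rubric matters for open-ended questions.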

All metrics are reported by question subtype and aggregated into a weighted overall score, with particular emphasis on the LLM-J metric for compositional, open-ended tasks. To separate reasoning quality from mere retrieval, the design provides “oracle” supporting facts during closed-book evaluation and contrasts this setting with long-context ingestion policies (Figure 6).
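A per-subtype report with a weighted overall score could be computed along these lines. The weighting scheme is an assumption: the source says only that the overall score is weighted, so this sketch defaults to weighting each subtype by its question count and accepts explicit weights as an alternative. The function name `aggregate` is hypothetical.

```python
from collections import defaultdict


def aggregate(records, weights=None):
    """Return (per-subtype mean scores, weighted overall score).

    records: iterable of (subtype, score) pairs, one per question.
    weights: optional {subtype: weight}; defaults to question counts (assumed).
    """
    buckets = defaultdict(list)
    for subtype, score in records:
        buckets[subtype].append(score)
    per_type = {t: sum(s) / len(s) for t, s in buckets.items()}
    if weights is None:
        weights = {t: len(s) for t, s in buckets.items()}
    total = sum(weights[t] for t in per_type)
    overall = sum(per_type[t] * weights[t] for t in per_type) / total
    return per_type, overall
```

Count-based weights reduce to a plain mean over all questions, while explicit weights let a report emphasize harder subtypes without changing the per-subtype numbers.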

Figure 6: Performance penalty when answering from the full conversation history versus the supporting sessions only, isolating the importance of targeted retrieval.

Agentic Memory Systems and Multimodal Modality-Aware Baseline

Baseline methods span text-only and multimodal retrieval-augmented architectures, including NaiveRAG, A-Mem, Mem0, MemoryOS, UniversalRAG, RAG-Anything, MIRIX, MemVerse, and NGM. Notably, M$^3$Exam exposes the brittleness of context-insensitive “global” retrieval for memory over heterogeneous, temporal dialogue.

The authors introduce M$^3$Proctor: the first open multimodal memory system that incorporates query modality bias, performing explicit bias detection (by instruction-tuned LLMs), re-ranking retrieval by modality, and modality-aware cascaded retrieval (Figure 7). The pipeline (indexing, retrieval, escalation) is optimized for minimal reliance on raw visual input unless necessary for answer quality, dramatically reducing token and compute costs.

Figure 7: Architecture of M$^3$Proctor, which detects query modality demand and escalates to raw artifacts only as required.

Cascade behaviors are measured in detail (Figure 8), demonstrating that visual input is consumed almost exclusively for those QA types that require it, while most queries are resolved in the text surrogate regime—preserving both efficiency and accuracy.
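The idea of a modality-aware cascade can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the keyword-based `detect_visual_bias` stands in for the instruction-tuned LLM detector, `lexical_score` stands in for a real retriever, and the escalation rule is one plausible reading of "raw artifacts only as required". It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class MemoryItem:
    item_id: str
    modality: str  # "text" or "visual"
    surrogate: str  # the text itself, or a caption/summary of a visual artifact
    raw_path: Optional[str] = None  # set only for visual artifacts


# Hypothetical cue list; a real system would use an LLM classifier instead.
VISUAL_CUES = ("image", "photo", "picture", "chart", "figure", "screenshot")


def detect_visual_bias(query: str) -> bool:
    """Keyword stand-in for query modality bias detection."""
    q = query.lower()
    return any(cue in q for cue in VISUAL_CUES)


def lexical_score(query: str, text: str) -> float:
    """Fraction of query words that also appear in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)


def retrieve(query: str, memory: List[MemoryItem], k: int = 3,
             modality_bonus: float = 0.5) -> List[MemoryItem]:
    """Rank by lexical score, boosting items whose modality matches the query bias."""
    wants_visual = detect_visual_bias(query)

    def score(item: MemoryItem) -> float:
        matches = (item.modality == "visual") == wants_visual
        return lexical_score(query, item.surrogate) + (modality_bonus if matches else 0.0)

    return sorted(memory, key=score, reverse=True)[:k]


def answer(query: str, memory: List[MemoryItem],
           answer_from_text: Callable[[str, List[str]], Optional[str]],
           answer_from_raw: Callable[[str, List[str]], str],
           k: int = 3) -> Tuple[Optional[str], str]:
    """Answer from text surrogates; touch raw artifacts only on demand."""
    hits = retrieve(query, memory, k)
    raw_paths = [h.raw_path for h in hits if h.modality == "visual" and h.raw_path]
    # Visually biased queries go to the raw artifacts when any were retrieved.
    if detect_visual_bias(query) and raw_paths:
        return answer_from_raw(query, raw_paths), "raw"
    # Otherwise stay in the cheap text-surrogate regime.
    draft = answer_from_text(query, [h.surrogate for h in hits])
    # Escalate only if the text stage abstains (returns None).
    if draft is None and raw_paths:
        return answer_from_raw(query, raw_paths), "raw"
    return draft, "text"
```

The second element of the returned tuple records which stage produced the answer, so escalation rates per question type can be tallied from logs.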

Figure 8: Ablation studies and efficiency/performance tradeoffs for M$^3$Proctor’s modality cascading.

Main Results and Analysis

M$^3$Exam’s evaluation reveals that even state-of-the-art MLLMs (GLM-5.1, Qwen3.6-Plus, GPT-5.4, Claude-Opus-4.6, Gemini-3.1-Pro, etc.) fail to exceed an overall semantic score of 0.55 in an oracle-evidence setting (Figure 9), while agentic-memory baselines without modality awareness fall even further short. The largest performance gaps appear for cross-modal grounding (mr- and fm-type queries) and implicit interpretation (thematic th, implicit-inference ii), with open-ended failures persisting even for long-context “buffer” policies.

Figure 9: Performance landscape for eight leading closed-source MLLMs, reporting per-type, per-metric breakdowns.

M$^3$Proctor achieves the highest score among open-weight agentic-memory systems across all backbones (Qwen-2.5-VL-7B, GPT-5-Nano, Doubao-Seed-2.0-Pro), closing the gap to frontier MLLMs primarily on cross-modal and implicit-inference categories. Critically, it does so with a 72x reduction in index construction time and a 3–10x reduction in per-query token use compared to indiscriminate multimodal memory architectures.

Performance is robust to backbone scale: M$^3$Proctor run on 3B or 7B backbones outperforms stronger baselines on larger models, underscoring that gains stem from modality-aware retrieval and not backbone size.

Analysis (Figure 10) reveals that performance lift from true multimodal evidence reaches +0.3 EM for mr-type and +0.116 for fm-type questions. The modality cascade triggers on 68% of fm and 51% of mr queries, confirming accurate detection and selective escalation.

Figure 10: Influence of multimodal evidence and capability breakdown across QA type: cross-modality and complex interpreting pose the largest challenge.

Emergent Findings: Memory as a Bottleneck

Even with oracle retrieval, frontier MLLMs are deficient on long-range, implicit, and multimodal grounding (max score ≈0.55).
Shortcomings are systematic across all advanced MLLMs, not explainable by a single model’s weaknesses.
Indiscriminate multimodal memory is both inefficient and less accurate than modality-aware approaches, with cascading and bias detection recapturing nearly all multimodal accuracy at text-only cost profiles.

Case studies (Figures 11–12) underscore the challenge posed by implicit-inference QA: only systems with sufficient, targeted evidence recovery and context calibration succeed.

Figure 11: Example implicit inference for a veterinary persona requires piecing together role from noncontiguous memory cues.

Figure 12: Example barista scenario illustrating requirements for multi-session, cross-modal memory plus implicit context reconstruction.

Implications and Future Directions

M$^3$Exam highlights a realistic ceiling on practical AI assistant deployment for cross-modal, long-term memory workloads. It demonstrates the necessity of explicit modality-sensitive memory systems for scalable, deployable agents and motivates continued research in the following directions:

More sophisticated query and memory summarization (dynamic summarization, graph-based memory, Zettelkasten methods)
Long-horizon, multi-turn dialogue involving memory evolution and adaptation
Deep user modeling and latent intent inference over extended multimodal histories

Improvements in open-weight, fine-tuned modality-aware memory baselines such as M$^3$Proctor provide a promising, resource-efficient path forward, yet truly robust multimodal reasoning still requires more advanced implicit inference and global context tracking. The field remains stalled well below human-level cross-modal conversational memory.

Conclusion

M$^3$Exam is a rigorous, challenge-oriented benchmark that exposes systematic limitations in the multimodal memory, reasoning, and interpretive capabilities of both current MLLMs and agentic memory systems (2606.07402). By introducing explicit query modality bias modeling and modality-aware retrieval escalation, M$^3$Proctor sets a new state of the art for scalable, efficient cross-modal conversational memory. Key gaps remain, especially in implicit inference, and these will likely drive innovations in representation, summarization, and user-model-driven memory architectures in future AI systems.