MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Published 14 May 2026 in cs.CV, cs.CL, and cs.IR | (2605.15128v1)

Abstract: Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.


Authors (17)

Summary

The paper introduces a two-dimensional evaluation protocol that measures visual evidence granularity and memory reasoning depth.
It demonstrates that visual memory outperforms caption-based text memory on instance- and pixel-level details, with Caption-Proof gains of up to +0.29 in oracle settings.
Empirical results underscore the need for hybrid, update-aware architectures to manage dynamic, long-horizon multimodal interactions.

MemEye: A Visual-Centric Framework for Multimodal Agent Memory Evaluation

Motivation and Context

Memory is a central bottleneck for long-horizon multimodal agent systems, especially given recent advances in VLMs and the increasing prevalence of multimodal conversational applications. Prior benchmarks typically focus on either linguistic memory or short-context image understanding, lacking rigorous appraisal of whether agents can retain and reason over fine-grained visual evidence throughout extended interactions. Existing benchmarks allow the decisive visual information to be replaced by captions or textual context and rarely challenge agents with temporally evolving visual memory. These limitations hinder the identification of core failure modes in practical agent deployments, such as state-dependent reasoning and the preservation or synthesis of pixel-level visual details.

The MemEye Framework

MemEye presents a two-dimensional evaluation protocol that introduces a taxonomy for multimodal memory challenges, focusing explicitly on visual evidence and memory-based reasoning requirements. The two axes are:

Visual Evidence Granularity (X-axis): Ranges from scene-level (X₁) through region-level (X₂) and instance-level (X₃) to pixel-level (X₄), measuring how precise the visual detail must be to answer the question.
Memory Reasoning Depth (Y-axis): Ranges from atomic retrieval (Y₁) through monotonic relational association (Y₂) to non-monotonic evolutionary synthesis (Y₃), reflecting increasing demands on evidence retrieval, integration, and temporal update resolution.

Each question in the benchmark is labeled with the $(X, Y)$ cell that corresponds to the minimal evidence granularity and reasoning depth needed to answer it.
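One way to picture the taxonomy is as a pair of ordered enumerations. The sketch below is illustrative only: the class and member names (`Granularity`, `ReasoningDepth`, `QuestionLabel`) are assumptions made here and are not taken from the MemEye release.

```python
from dataclasses import dataclass
from enum import IntEnum


class Granularity(IntEnum):
    """X-axis: how fine the decisive visual evidence is (hypothetical names)."""
    SCENE = 1      # X1
    REGION = 2     # X2
    INSTANCE = 3   # X3
    PIXEL = 4      # X4


class ReasoningDepth(IntEnum):
    """Y-axis: how the retrieved evidence must be used (hypothetical names)."""
    ATOMIC_RETRIEVAL = 1         # Y1
    RELATIONAL_ASSOCIATION = 2   # Y2
    EVOLUTIONARY_SYNTHESIS = 3   # Y3


@dataclass(frozen=True)
class QuestionLabel:
    """The (X, Y) cell assigned to one benchmark question."""
    x: Granularity
    y: ReasoningDepth

    def cell(self) -> str:
        # Render the label in the (Xi, Yj) notation used in the text.
        return f"(X{int(self.x)}, Y{int(self.y)})"


label = QuestionLabel(Granularity.PIXEL, ReasoningDepth.EVOLUTIONARY_SYNTHESIS)
print(label.cell())  # (X4, Y3)
```

Using `IntEnum` keeps both axes ordered, so comparisons such as `label.x >= Granularity.INSTANCE` select the fine-grained cells directly.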

Benchmark Construction

MemEye comprises 371 questions (both MCQ and open-ended) distributed across 221 sessions, 848 dialogue rounds, and 438 images, covering eight diverse life-scenario tasks (e.g., home renovation, brand recall, outdoor navigation). Stringent validation gates ensure:

Visual Necessity: Questions are only included if they cannot be solved from textual context or minimal captions and cannot be trivially answered by prior distributions or answer choices.
Shortcut Resistance: Only items passing multi-model, multi-rotation answer validation are retained.
Taxonomy Alignment: Each item is audited to confirm that it actually demands the annotated $(X, Y)$ challenge.
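The three gates can be read as a conjunction of filters applied to each candidate item. The following is a minimal sketch under that reading; the `GateResults` fields are hypothetical flags assumed to come from upstream ablation runs, and the summary does not describe how the authors implement the gates.

```python
from dataclasses import dataclass


@dataclass
class GateResults:
    # All fields are hypothetical outcomes of ablation runs on one item.
    solved_text_only: bool       # answered correctly from textual context alone
    solved_caption_only: bool    # answered correctly from minimal captions
    solved_by_prior: bool        # answered from answer choices or priors alone
    passed_rotation_check: bool  # survived multi-model, multi-rotation validation
    taxonomy_audited: bool       # audit confirmed the annotated (X, Y) label


def passes_gates(r: GateResults) -> bool:
    """Keep an item only if every validation gate holds."""
    visual_necessity = not (
        r.solved_text_only or r.solved_caption_only or r.solved_by_prior
    )
    return visual_necessity and r.passed_rotation_check and r.taxonomy_audited


kept = GateResults(False, False, False, True, True)
leaky = GateResults(False, True, False, True, True)  # a caption is enough
print(passes_gates(kept), passes_gates(leaky))  # True False
```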

This curation produces a benchmark on which caption-only access yields notably lower accuracy than true visual memory access, especially at the finer granularities X₃–X₄. The gap is quantitative evidence that the decisive visual information cannot be replaced by text.

Systematic Evaluation

Thirteen memory architectures are evaluated across four leading VLM backbones (Qwen3-VL-8B, GPT-4.1-nano, GPT-5.4-mini, Gemini-2.5-flash-lite) using standardized retrieval, memory writing, and context curation pipelines. Memory systems include both text-abstraction-based and native visual evidence-based methods, spanning agentic, structured, and retrieval-based paradigms.

Metrics include:

Exact Match (EM) for MCQ, rotation-averaged over four answer positions to limit position bias;
LLM-as-a-Judge for free-form responses, validated against human judgments with high agreement (Cohen's $\kappa = 0.94$).
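Both metrics are simple to state in code. The sketch below assumes that "rotation-averaged" means cycling the option list so the gold answer occupies each position once, and it uses the standard two-rater definition of Cohen's kappa. The `predict` callable is a hypothetical stand-in for a model call.

```python
from collections import Counter
from typing import Callable, Sequence


def rotation_averaged_em(
    options: Sequence[str],
    gold: str,
    predict: Callable[[Sequence[str]], int],
) -> float:
    """Average exact match over all cyclic rotations of the option list."""
    n = len(options)
    hits = 0
    for shift in range(n):
        rotated = [options[(i + shift) % n] for i in range(n)]
        chosen = rotated[predict(rotated)]  # predict returns an option index
        hits += int(chosen == gold)
    return hits / n


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labeling the same items.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


options = ["red", "blue", "green", "white"]
always_first = lambda opts: 0                 # position-biased predictor
knows_answer = lambda opts: opts.index("blue")  # content-aware predictor
print(rotation_averaged_em(options, "blue", always_first))  # 0.25
print(rotation_averaged_em(options, "blue", knows_answer))  # 1.0
```

A predictor that always picks the first option scores 0.25 rather than 0 or 1, which is the point of averaging over rotations: position bias cannot masquerade as knowledge.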

Empirical Insights

Visual Preservation

For coarse-grained evidence (X₁–X₂), text-based memory (e.g., dense captions) approaches the performance of visual memory systems, confirming that standard captions often retain sufficient information for high-level scene and region queries.
For instance- and pixel-level evidence (X₃–X₄), visual memory outperforms caption-based abstraction by wide margins, supporting the assertion that text conversion systematically loses essential detail (Caption-Proof gain $\Delta$ up to +0.29 in oracle settings).
Even with strong, task-aware captioning agents, captions fail to capture identity resolution, fine texture, micro-attributes, and cross-session visual state.
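The summary does not define the Caption-Proof gain $\Delta$ formally. A plausible reading, assumed here, is the accuracy of a visual-memory system minus the accuracy of a caption-only system at each granularity level. The records below are toy placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gain_by_level(
    records: Iterable[Tuple[str, bool, bool]],
) -> Dict[str, float]:
    """Per-level accuracy gap: visual-memory accuracy minus caption-only accuracy.

    Each record is (granularity_level, visual_correct, caption_correct).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # items, visual hits, caption hits
    for level, visual_ok, caption_ok in records:
        entry = totals[level]
        entry[0] += 1
        entry[1] += int(visual_ok)
        entry[2] += int(caption_ok)
    return {
        level: (entry[1] - entry[2]) / entry[0]
        for level, entry in sorted(totals.items())
    }


# Toy records for illustration only.
toy = [
    ("X1", True, True),
    ("X1", True, True),
    ("X4", True, False),
    ("X4", True, True),
]
print(gain_by_level(toy))  # {'X1': 0.0, 'X4': 0.5}
```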

Temporal and Evolving-State Reasoning

Retrieval mechanisms suffice for monotonic association ($Y_2$) but systematically fail in state-evolving scenarios ($Y_3$) where temporal authority must be established over conflicting or outdated evidence.
Failure localization diagnostics demonstrate that retrieving the semantically correct evidence is insufficient; maintaining and prioritizing the valid, temporally current visual state is critical.
The recency re-ranking probe reduces, but does not eliminate, stale evidence selection, confirming that robust memory requires explicit mechanisms for update chains and override detection.
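A recency re-ranking step can be sketched as a weighted blend of retrieval similarity and normalized recency. The scoring rule, the `weight` parameter, and the `Evidence` fields are assumptions for illustration; the summary does not specify how the probe in the paper is implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Evidence:
    evidence_id: str
    similarity: float   # retrieval score, assumed to lie in [0, 1]
    session_index: int  # larger means more recent


def recency_rerank(candidates: List[Evidence], weight: float = 0.5) -> List[Evidence]:
    """Re-rank candidates by a blend of similarity and normalized recency."""
    if not candidates:
        return []
    oldest = min(c.session_index for c in candidates)
    newest = max(c.session_index for c in candidates)
    span = (newest - oldest) or 1  # avoid dividing by zero for a single session

    def score(c: Evidence) -> float:
        recency = (c.session_index - oldest) / span
        return (1 - weight) * c.similarity + weight * recency

    return sorted(candidates, key=score, reverse=True)


stale = Evidence("sofa_before_move", similarity=0.90, session_index=1)
fresh = Evidence("sofa_after_move", similarity=0.80, session_index=5)

print([e.evidence_id for e in recency_rerank([stale, fresh], weight=0.0)])
# ['sofa_before_move', 'sofa_after_move']
print([e.evidence_id for e in recency_rerank([stale, fresh], weight=0.5)])
# ['sofa_after_move', 'sofa_before_move']
```

A blend like this only favors newer items. It has no notion of which item overrides which, so a recent but irrelevant image can still outrank the valid current state, which is consistent with the observation that re-ranking reduces but does not eliminate stale selections.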

Systematic Trade-offs

Text-based memory excels in state organization, supporting better update chains and conflict tracking but fails to answer fine-grained or pixel-level visual queries.
Visual memory preserves decisive visual attributes but is susceptible to error from aggregating visually similar but temporally incorrect evidence, underscoring the need for joint state selection and evidence-gating machinery.
As memory context length and topic diversity scale, retrieval and structured memory become increasingly crucial for information routing and avoiding context window overflows.

Broader Implications and Future Directions

The MemEye results challenge several current research assumptions about multimodal agent memory:

RAG-style retrieval is not sufficient—syntactic or latent similarity must yield to temporally-aware, update-sensitive reasoning in dynamic environments.
Hybrid architectures, combining both image-level and structured (text/temporal/log) memory, as well as advanced evidence selection (e.g., recency, consistency, agentic routines), are essential for practical deployment in complex settings.
Evaluation must incorporate both visual irreplaceability and memory dynamics: only systems scoring reliably on high-$(X, Y)$ cells can claim robust long-term multimodal interaction support.
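Reporting per-cell accuracy makes the last point concrete. The sketch below aggregates results over the $(X, Y)$ grid and reports the weakest of the hard cells. The thresholds `min_x=3` and `min_y=3` are an assumption about what counts as a "high" cell, and the results are toy placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Cell = Tuple[int, int]


def accuracy_by_cell(results: Iterable[Tuple[int, int, bool]]) -> Dict[Cell, float]:
    """Accuracy per (x_level, y_level) cell from (x, y, correct) records."""
    hits: Dict[Cell, int] = defaultdict(int)
    counts: Dict[Cell, int] = defaultdict(int)
    for x, y, correct in results:
        counts[(x, y)] += 1
        hits[(x, y)] += int(correct)
    return {cell: hits[cell] / counts[cell] for cell in counts}


def hard_cell_floor(
    per_cell: Dict[Cell, float], min_x: int = 3, min_y: int = 3
) -> Optional[float]:
    """Lowest accuracy among cells at or above the given levels, if any exist."""
    hard = [acc for (x, y), acc in per_cell.items() if x >= min_x and y >= min_y]
    return min(hard) if hard else None


# Toy results for illustration only.
toy_results = [(1, 1, True), (4, 3, False), (4, 3, True), (3, 3, True)]
per_cell = accuracy_by_cell(toy_results)
print(per_cell)                   # {(1, 1): 1.0, (4, 3): 0.5, (3, 3): 1.0}
print(hard_cell_floor(per_cell))  # 0.5
```

Taking the minimum rather than the mean over hard cells prevents a strong score on easy cells from hiding a weak one where the benchmark is most demanding.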

MemEye, as an open-source benchmark and protocol, enables the community to pinpoint failure modes, perform ablation studies, and track progress on memory architectures as they scale in complexity and scenario fidelity.

Conclusion

MemEye is a diagnostic and minimally confounded framework for isolating the critical failure modes of multimodal agent memory. It establishes that retention and recall of decisive visual evidence, particularly at fine spatial granularity and under state evolution, remain unsolved in current architectures. This underscores the need for hybrid, update-aware memory systems in building robust, trustworthy, and capable multimodal agents.