MemLens: Multimodal Memory Benchmark

Updated 4 July 2026

MemLens is a benchmark that evaluates long-term multimodal memory in multi-session conversations by comparing long-context LVLMs and memory-augmented agents under a unified cross-modal token counting scheme.
It organizes 789 questions across five memory abilities—information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal—spanning context lengths from 32K to 256K tokens.
Empirical findings reveal that neither long-context LVLMs nor memory agents alone achieve robust multimodal memory, underscoring the potential of hybrid architectures that combine direct attention with structured retrieval.

MemLens is a benchmark for memory in multimodal multi-session conversations designed to compare two method families for large vision-language systems under a unified protocol: long-context LVLMs that ingest long interleaved histories directly, and memory-augmented agents that compress, index, and selectively retrieve past content from external memory (Ren et al., 14 May 2026). It comprises 789 questions across five memory abilities—information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal—at four standard context lengths from 32K to 256K tokens under a cross-modal token-counting scheme (Ren et al., 14 May 2026). Its defining premise is that multimodal conversational memory should be evaluated on questions that genuinely require image-grounded evidence rather than permitting text-only shortcuts.

1. Problem framing and intended scope

MemLens was introduced to address a gap in the evaluation of long-term memory for large vision-language models (LVLMs). Real agents powered by LVLMs interact over time, accumulate information across many sessions, incorporate new facts, update or forget old ones, and must remain consistent across long conversational histories. The benchmark is explicitly positioned against two existing method directions: long-context LVLMs, which extend the native context window to ingest entire histories with interleaved images, and memory-augmented agents, which store compressed or indexed representations outside the main context and retrieve them when queried (Ren et al., 14 May 2026).

The benchmark’s stated contribution is comparative as well as diagnostic. Prior long-context multimodal benchmarks focus on documents or video rather than multi-session dialogue, and conversational-memory benchmarks are predominantly text-only or permit text-only answering for ostensibly multimodal items. MemLens is described as the first benchmark to simultaneously enforce image necessity via cross-modal design and ablations, cover five core memory abilities needed by conversational assistants, and compare 27 LVLMs and 7 memory agents at standardized context lengths using a cross-modal token-counting scheme (Ren et al., 14 May 2026).

The benchmark’s central empirical claim is also its main conceptual conclusion: neither long-context LVLMs nor memory agents alone solve multimodal long-term memory. Long-context LVLMs perform strongly at short context lengths because they can directly ground on visual evidence, but their performance degrades as the conversation grows. Memory agents are comparatively length-stable, but they lose visual fidelity because images are compressed into embeddings or captions at storage time. Multi-session reasoning caps most systems below 30% accuracy, which the authors identify as evidence for hybrid architectures combining long-context attention with structured multimodal retrieval (Ren et al., 14 May 2026).

2. Task structure and memory abilities

MemLens organizes its 789 questions into five memory abilities, each intended to isolate a recurring requirement of multimodal assistants. Information Extraction retrieves a specific fact from one session, often through two-hop image-to-text grounding; its IE-Entity subtype enforces visual disambiguation via abstraction, and IE-PrevInfo requires recalling a visual detail from a prior session. Multi-Session Reasoning aggregates facts scattered across 3–8 sessions, including counting, arithmetic, and entity resolution tasks. Temporal Reasoning integrates timestamps, natural-language time expressions, and visual temporal artifacts such as clock or calendar images. Knowledge Update tracks a four-step update chain and requires reporting the current rather than stale value. Answer Refusal removes necessary evidence and tests whether a system abstains rather than hallucinates (Ren et al., 14 May 2026).

The question distribution is explicitly reported: IE 246, MSR 143, TR 194, KU 116, and AR 90. The multimodal dependency profile is likewise quantified: 65.7% of questions are image-essential, 14.7% image-supportive, and 19.6% text-sufficient; 80.4% of questions have evidence that includes images (Ren et al., 14 May 2026). This distribution matters because the benchmark is not simply a long-context dialogue set with occasional images; it is constructed so that the visual channel is operationally necessary for most items.
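The reported counts and percentages can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the figures quoted above; the variable names are illustrative and are not taken from the benchmark's code.

```python
# Question counts per memory ability, as quoted in the text.
counts = {"IE": 246, "MSR": 143, "TR": 194, "KU": 116, "AR": 90}

total = sum(counts.values())
answerable = total - counts["AR"]  # every type except answer refusal

# Image-dependency shares (percent of all questions), as quoted in the text.
shares = {"image_essential": 65.7, "image_supportive": 14.7, "text_sufficient": 19.6}
with_image_evidence = shares["image_essential"] + shares["image_supportive"]

print(total, answerable)               # 789 699
print(round(with_image_evidence, 1))   # 80.4
print(round(sum(shares.values()), 1))  # 100.0
```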

Each benchmark instance compiles user–assistant sessions into a long conversation history with interleaved images. Sessions per question grow from 14 to 93 as context length increases from 32K to 256K tokens. Evidence sessions are mixed into topically related haystack sessions, while text-only filler sessions preserve a fixed text-per-image ratio and mask evidence clustering. The progression is intentionally nontrivial: early sessions establish context and distractors, evidence sessions appear at random timestamps except where knowledge-update order must be preserved, and later sessions add additional distractors. This design means that success requires retrieving the correct sessions and integrating their visual and textual content rather than relying on local lexical cues (Ren et al., 14 May 2026).
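The interleaving step can be illustrated with a minimal sketch. It assumes sessions are opaque items in three lists and models only the random placement of evidence among distractors, with an option to keep evidence order for knowledge-update chains. The function name `assemble_history` and its parameters are hypothetical, and the sketch does not model timestamps or the fixed text-per-image ratio.

```python
import random

def assemble_history(evidence, haystack, filler, keep_evidence_order=True, seed=0):
    """Interleave evidence sessions among distractor sessions at random positions."""
    rng = random.Random(seed)
    evidence = list(evidence)
    if not keep_evidence_order:
        rng.shuffle(evidence)  # order is free unless an update chain must be preserved
    distractors = list(haystack) + list(filler)
    rng.shuffle(distractors)

    total = len(evidence) + len(distractors)
    # Pick random slots for evidence; walking positions in ascending order
    # places evidence sessions in list order.
    slots = set(rng.sample(range(total), len(evidence)))
    evidence_iter, distractor_iter = iter(evidence), iter(distractors)
    return [next(evidence_iter) if i in slots else next(distractor_iter)
            for i in range(total)]

# Hypothetical instance: a four-step update chain hidden among ten distractors.
history = assemble_history(
    ["ku_1", "ku_2", "ku_3", "ku_4"],
    [f"hay_{i}" for i in range(6)],
    [f"fill_{i}" for i in range(4)],
)
print(len(history))
```

With `keep_evidence_order=True`, the update chain appears in its original order no matter where the random slots fall, which is the property a knowledge-update item depends on.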

The benchmark description includes concrete exemplars. In one MSR-Arithmetic instance, prices of related purchases appear across five sessions, with one price visible only on a box image; the system must sum all prices. In MSR-Counting, some object appearances occur only in photographs. In TR-Temporal Grounding, an event’s date is readable only from a calendar or receipt image and must be aligned with session timestamps. These examples indicate that MemLens treats multimodal memory as a joint retrieval-and-grounding problem rather than a purely textual recall problem (Ren et al., 14 May 2026).

3. Cross-modal token counting and dataset construction

A central methodological feature of MemLens is its cross-modal token-counting scheme, which standardizes context length across models with heterogeneous visual encoders. For each evaluation instance, the total budget satisfies

$T_{\text{total}} = T_{\text{text}} + T_{\text{image}},$

with

$T_{\text{image}} = \sum_{j=1}^{N_{\text{img}}} \tau(I_j),$

and

$T_{\text{text}} = L - T_{\text{image}}, \qquad L \in \{32K, 64K, 128K, 256K\}.$

In MemLens, $\tau(I_j)$ is a fixed constant per image, approximately 2,000 tokens per image, independent of model specifics (Ren et al., 14 May 2026). The scheme follows MMLongBench-style cross-modal counting, normalizing away model-specific patching or tiling differences so that the same compiled conversation yields the same effective cross-modal length for all systems.
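The budget split can be sketched in a few lines. This is an illustration under stated assumptions: the per-image cost is taken as the approximate 2,000 tokens quoted above, the budget is written as a plain integer (how the benchmark rounds "K" is not stated here), and the function name `text_token_budget` and the example image count are hypothetical.

```python
def text_token_budget(total_budget, num_images, tokens_per_image=2000):
    """Return (T_image, T_text) for one instance under a fixed per-image cost."""
    image_tokens = num_images * tokens_per_image  # T_image = sum_j tau(I_j)
    text_tokens = total_budget - image_tokens     # T_text = L - T_image
    if text_tokens < 0:
        raise ValueError("image tokens alone exceed the total budget")
    return image_tokens, text_tokens

# Illustrative numbers only: a 64,000-token budget with 12 images.
print(text_token_budget(64_000, 12))  # (24000, 40000)
```

Because the per-image cost is a constant, the same compiled conversation yields the same split for every model, regardless of how each visual encoder actually tiles an image.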

This standardization has dataset-scale consequences. As $L$ grows from 32K to 256K, images per instance increase from 20 to 138, while sessions per instance increase from 14 to 93. Average turns per session are approximately 10, and average images per session are approximately 1.5 (Ren et al., 14 May 2026). The fixed text-per-image ratio is intended to avoid accidental image clustering that might reveal evidence positions.

Dataset construction follows a four-stage pipeline. First, multimodal sessions are simulated by sampling topics from a hierarchical ontology, retrieving candidate images via iCrawler, filtering them with CLIP, SigLIP, and BLIP-2 caption scoring, and generating multi-turn dialogues with GPT-5.1 as user and Gemini-3-Pro as assistant. Second, question construction enforces cross-modal dependency by selecting a salient entity, retrieving a high-relevance image, and replacing the entity with an abstraction such as a generic anaphor referring to the image. Third, evidence sessions are wrapped in realistic dialogue matched to haystack style so that they cannot be trivially retrieved through surface similarity. Fourth, conversations are assembled by interleaving evidence, haystack, and text-only filler sessions under the fixed text-per-image ratio (Ren et al., 14 May 2026).

Quality control is multi-layered. Automated filtering removes text-solvable items and questions answerable from parametric knowledge alone. Human review then audits image necessity, session naturalness and recoverability, and haystack image quality and dialogue flow in three rounds, reducing approximately 20k candidates to 789 final items. An indistinguishability check reports that a text classifier cannot reliably separate evidence sessions from haystack sessions, reaching only approximately 57–58% F1 (Ren et al., 14 May 2026). This suggests that retrieval shortcuts based on stylistic differences were explicitly targeted during curation.

The image layer is also documented at release time. The benchmark uses 4,695 images retrieved from public web search, applies negative-content filtering to reject watermarks, stock logos, and copyright overlays, and releases per-image provenance metadata including URL, timestamp, caption, scores, and pHash. Author-produced artifacts are released under CC-BY-4.0, code under MIT, and third-party images retain their original licenses with a takedown mechanism (Ren et al., 14 May 2026).

4. Evaluation protocol and scoring

The primary evaluation metric in MemLens is LLM-as-Judge accuracy. The judge reads the question, reference answer, and model output, then returns a binary verdict using task-specific criteria. The judge model is Qwen3-VL-235B-Instruct, and its decisions are cross-validated against GPT-5.4-mini with Cohen’s $\kappa = 0.93$ on 800 items and against a three-annotator human consensus with $\kappa = 0.86$ on 484 items. The paper notes a known small leniency bias for short outputs and states that a correction is applied (Ren et al., 14 May 2026).
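For readers unfamiliar with the agreement statistic, the following is a minimal sketch of Cohen's $\kappa$ for two sets of binary verdicts. It is the standard textbook formula, not the benchmark's own evaluation code, and the toy verdict lists are made up for illustration.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of binary verdicts (0 or 1)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n             # rate of "correct" verdicts
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # chance agreement
    if expected == 1.0:
        return 1.0  # convention used here: identical constant verdicts
    return (observed - expected) / (1 - expected)

judge_a = [1, 1, 0, 0]
judge_b = [1, 0, 0, 0]
print(cohen_kappa(judge_a, judge_b))  # 0.5
```

Here the raters agree on three of four items (0.75), chance agreement is 0.5, and so $\kappa = (0.75 - 0.5) / (1 - 0.5) = 0.5$.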

To decompose performance, the benchmark defines coverage, per-answer accuracy, and correct answer refusals. With 699 answerable items out of 789 total, overall judged accuracy is summarized as

$J \approx \frac{\text{Cov} \times \text{PA} \times 699 + AR_{\text{correct}}}{789}.$

Here, $\text{Cov}$ is the fraction of answerable questions attempted, $\text{PA}$ is accuracy on attempted answers only, and $AR_{\text{correct}}$ is the number of correct refusals on the 90 answer-refusal items (Ren et al., 14 May 2026). This decomposition is intended to expose calibration–competence trade-offs, especially where a system’s abstention policy affects apparent accuracy.
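The decomposition can be evaluated directly. The sketch below implements the approximation for $J$ as stated, with the item counts 699 and 789 from the text as defaults; the function name `judged_accuracy` and the example system are hypothetical.

```python
def judged_accuracy(coverage, per_answer_acc, ar_correct,
                    n_answerable=699, n_total=789):
    """Approximate overall accuracy J from the coverage / per-answer decomposition."""
    correct_answerable = coverage * per_answer_acc * n_answerable
    return (correct_answerable + ar_correct) / n_total

# Hypothetical system: attempts 90% of answerable items, gets 50% of attempts
# right, and refuses correctly on 45 of the 90 answer-refusal items.
print(round(judged_accuracy(0.9, 0.5, 45), 4))  # 0.4557
```

Two systems with the same $J$ can differ sharply in coverage and per-answer accuracy, which is why the decomposition is reported alongside the overall score.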

Length sensitivity is part of the protocol rather than a post hoc analysis. All systems are scored at 32K, 64K, and 128K, while memory agents are also evaluated at 256K. The benchmark reports degradation by type across lengths and states that monotonicity was validated, with no systematic “getting better with longer input” artifacts after correction (Ren et al., 14 May 2026).
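A per-length degradation report and a monotonicity check of this kind can be sketched as follows. The function name `length_degradation` and the accuracy values are hypothetical and serve only to show the computation.

```python
def length_degradation(acc_by_length):
    """Return per-length drop from the shortest context and a monotonicity flag."""
    lengths = sorted(acc_by_length)
    base = acc_by_length[lengths[0]]
    drops = {n: base - acc_by_length[n] for n in lengths}
    accs = [acc_by_length[n] for n in lengths]
    # True when accuracy never increases as the context grows.
    monotone = all(later <= earlier for earlier, later in zip(accs, accs[1:]))
    return drops, monotone

# Hypothetical accuracies (percent), keyed by context length in K tokens.
drops, monotone = length_degradation({32: 54.0, 64: 53.0, 128: 52.0})
print(drops, monotone)  # {32: 0.0, 64: 1.0, 128: 2.0} True
```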

Prompting and decoding are standardized where possible. The user prompt instructs systems to “Directly output the answer with no extra output.” Generation budgets are 2,048 tokens for direct models and 16,384 for thinking models. Open-weight models are served via vLLM v0.17–0.18 with FlashAttention-2 on 8×A100-80GB nodes, while API models are evaluated through provider endpoints using 4–8 concurrent threads (Ren et al., 14 May 2026). The benchmark’s code and evaluation harness are released at the project repository, and the dataset is hosted on Hugging Face (Ren et al., 14 May 2026).

5. Empirical findings and error patterns

The image-ablation study is the strongest direct validation of MemLens’s multimodal design. On the subset of image-essential and image-supportive questions, GPT-5.4 drops from 93.13% with images to 1.74% without images, and Gemini-3.1-Pro drops from 89.42% to 1.89%. The per-type drops are likewise severe: for GPT-5.4, IE goes from 94.31 to 0.41, MSR from 100 to 0, TR from 96.91 to 5.15, and KU from 75.86 to 0; for Gemini-3.1-Pro, IE goes from 89.02 to 0, MSR from 90.21 to 0, TR from 96.19 to 6.19, and KU from 82.24 to 0 (Ren et al., 14 May 2026). The authors interpret this as confirmation that MemLens is genuinely multimodal and image-grounded, especially on the 80.4% of questions whose evidence includes images.
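The size of the overall ablation gap follows directly from the quoted figures. The short sketch below only subtracts the reported numbers; the dictionary layout is illustrative.

```python
# Overall accuracy (percent) with and without images, as quoted in the text.
ablation = {
    "GPT-5.4": (93.13, 1.74),
    "Gemini-3.1-Pro": (89.42, 1.89),
}

# Drop in percentage points when images are removed.
drops = {model: with_img - without_img
         for model, (with_img, without_img) in ablation.items()}

for model, drop in drops.items():
    print(f"{model}: {drop:.2f} points lost without images")
```

This gives a loss of 91.39 points for GPT-5.4 and 87.53 points for Gemini-3.1-Pro.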

The model roster includes 27 LVLMs and 7 memory-augmented agents. The LVLM cohort spans closed-source systems such as GPT-5.4, Claude Sonnet 4.5, and Gemini-3.1-Pro, and open-weight families including Kimi-K2.5, Qwen3.5, Qwen3-VL, GLM-4.6V, GLM-4.5V, Gemma3, Phi4-Multimodal, Cosmos-Reason2-8B, and Nemotron-Nano-12B. The agent cohort includes multimodal pipelines such as M3-Agent, M2A, and M3C, and text-only pipelines that replace images with BLIP-2 captions, including Mem0, MemOS, MemAgent-7B, and Memory-T1 (Ren et al., 14 May 2026).

At 32K, strong LVLMs cluster closely: the top eight systems fall within a 6.34% band, Qwen3.5-122B reaches 58.68% overall, Kimi-K2.5 leads MSR at 44.06%, and answer refusal is comparatively easy at short length, reaching up to 97.78% (Ren et al., 14 May 2026). The type ceilings at 32K are explicitly reported as approximately AR 97.78%, TR 60.82%, IE 74.39%, KU 50.86%, and MSR 44.06% (Ren et al., 14 May 2026). These ceilings indicate that even at moderate context lengths, no task category is close to saturation except answer refusal.

At 128K, the asymmetry between the two method families becomes clearer. Open-weight LVLM leaders often lose more than 13% overall, whereas Gemini-3.1-Pro retains 51.99%, corresponding to only a 2.11-point overall decline and making it the most length-robust LVLM in the evaluated cohort (Ren et al., 14 May 2026). Memory agents, by contrast, are reported as length-stable: most remain within a narrow band from 32K to 256K. However, they trail LVLMs on visually grounded tasks such as IE and KU, and on AR. The paper attributes part of the AR gap to post-training: RL- or LoRA-modified agents often fall to 9–22% AR, whereas frozen-backbone frameworks retain higher refusal performance, with Mem0 at 77.27% and MemOS at 68.18% (Ren et al., 14 May 2026).

The error analysis identifies distinct failure modes. For LVLMs, visual retrieval failure increases with length, and hallucination control degrades most sharply on AR. For agents, visual fidelity is lost at storage time because images are converted into embeddings or captions, making fine-grained visual evidence unrecoverable (Ren et al., 14 May 2026). At 128K, approximately 90% of IE and KU errors are classified as Visual, TR errors split between Mixed and Reasoning, and MSR errors are dominated by Reasoning at 73%, although the analysis emphasizes that these reasoning failures are downstream of retrieval misses. As contexts lengthen, near-misses become total misses: “unsupported answer” rises while “grounding failure” and “computation slip” fall (Ren et al., 14 May 2026). This suggests that retrieval erosion, rather than late-stage arithmetic or logical failure, is the first-order bottleneck at longer lengths.

Multi-session reasoning receives special attention because it caps most systems below 30% accuracy. The benchmark reports an oracle diagnostic in which the 3–8 evidence sessions are supplied directly, eliminating the haystack-retrieval problem. Under that condition, MSR reaches 100% for GPT-5.4 and 90.21% for Gemini-3.1-Pro (Ren et al., 14 May 2026). A plausible implication is that the principal difficulty of MSR in MemLens lies in locating all required sessions within long multimodal histories, not in aggregating the retrieved evidence once found.
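The oracle condition amounts to filtering the history down to its evidence sessions before prompting. A minimal sketch, assuming each session is a dictionary with an `"id"` field, is shown below. That schema and the function name `oracle_history` are assumptions for illustration, not the benchmark's data format.

```python
def oracle_history(sessions, evidence_ids):
    """Keep only the evidence sessions, preserving their original order."""
    wanted = set(evidence_ids)
    return [s for s in sessions if s["id"] in wanted]

# Hypothetical six-session history in which sessions 1, 3, and 4 hold evidence.
sessions = [{"id": i, "text": f"session {i}"} for i in range(6)]
oracle = oracle_history(sessions, evidence_ids=[4, 1, 3])
print([s["id"] for s in oracle])  # [1, 3, 4]
```

Comparing accuracy on the full history with accuracy on the oracle history separates retrieval failures from aggregation failures.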

6. Position in the literature and other uses of the name

MemLens is situated against several benchmark traditions. Document- and video-oriented long-context benchmarks such as MMLongBench, MMLongBench-Doc, and Multimodal NIAH measure scaling and retrieval but are not multi-session conversations and rarely compare memory agents alongside LVLMs. Conversational-memory benchmarks such as LongMemEval, PerLTQA, and MemoryAgentBench are text-only. Persona-grounded multimodal dialogue benchmarks such as LoCoMo and Mem-Gallery include images but allow text-only shortcuts on many questions. MemLens differentiates itself by requiring visual evidence, standardizing length via cross-modal token counting from 32K to 256K, comparing both LVLMs and memory agents under the same protocol, and covering five abilities within multimodal multi-session histories (Ren et al., 14 May 2026).

The term “MemLens” or “memory lens” is not unique to this benchmark. In a distinct line of work, “Minerva: A Programmable Memory Test Benchmark for LLMs” presents an automatically generated, interpretable benchmark for textual in-context memory capabilities, focusing on atomic and composite tasks such as search, recall/edit, compare, set/list operations, and state tracking in short contexts (Xia et al., 5 Feb 2025). Another paper, “MemLens: Uncovering Memorization in LLMs with Activation Trajectories,” uses the name for a white-box detector of contamination and memorization in math benchmarks by analyzing layerwise probability trajectories of numeric tokens (He et al., 25 Sep 2025). In agent systems research, “Mesh Memory Protocol” describes a semantic infrastructure for multi-agent cross-session collaboration and explicitly frames a visualization and governance layer atop CAT7, SVAF, lineage, and remix as a “MEMLENS” (Xu, 21 Apr 2026). A separate systems paper on AMP/memorywire likewise presents a build-oriented blueprint for adopting a vendor-neutral memory protocol as the foundation of a “memory lens” over agent memory operations (Munirathinam, 31 May 2026).

These parallel usages do not describe the same object. In the 2026 multimodal benchmark paper, MemLens denotes a benchmark dataset and evaluation protocol for long-term multimodal memory in LVLMs (Ren et al., 14 May 2026). In the other cited works, the same or similar term denotes, respectively, a textual memory benchmark, a memorization detector, or an observability-and-governance layer for agent memory systems (Xia et al., 5 Feb 2025, He et al., 25 Sep 2025, Xu, 21 Apr 2026, Munirathinam, 31 May 2026). The overlap in naming reflects a shared concern with making memory behavior legible, but the underlying technical targets differ substantially.

7. Limitations, practical use, and future directions

The benchmark paper states several limitations. The sessions are synthetic and LLM-generated, even though human review improved naturalness and indistinguishability tests found no exploitable stylistic fingerprints. LLM-as-Judge remains slightly lenient, with corrections applied for short-output formats. Agent evaluation is structurally asymmetric because many pipelines store captions or embeddings rather than pixels, which limits visual fidelity before retrieval even occurs. Finally, MemLens evaluates frozen histories rather than streaming, causality-aware memory (Ren et al., 14 May 2026).

The proposed future directions follow directly from these limitations. The paper highlights hybrid architectures that combine long-context attention with structured multimodal retrieval; memory designs that preserve image-level evidence, whether as pixels or high-fidelity features, and support re-attention at query time; and joint optimization of retrieval accuracy, answer correctness, and calibrated abstention to avoid an RL-induced hallucination tax (Ren et al., 14 May 2026). These suggestions are not presented as solved methods but as architectural priorities implied by the observed failure modes.

The benchmark also includes explicit guidance for use. Recommended evaluation begins with the 32K full set, then extends to 64K and 128K to quantify degradation; agents can additionally be tested at 256K. The benchmark advises reporting per-type scores rather than only overall accuracy and including the coverage/per-answer-accuracy decomposition to separate calibration from competence (Ren et al., 14 May 2026). For memory agents, the guidance is to avoid caption-only storage, retain pixel-level evidence or high-resolution visual tokens, and design retrieval so that the backbone can re-encode or reason over original visual evidence. For LVLMs, the recommended mitigations for length-related degradation include stronger retrieval structure through session markers and indexes, prompts that emphasize timestamps and entity anchors, and sufficient generation budgets to avoid truncation (Ren et al., 14 May 2026).
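Per-type reporting can be sketched as a simple aggregation over judged records. The record format, a pair of ability label and binary verdict, and the function name `per_type_accuracy` are assumptions for illustration. The released evaluation harness may structure its outputs differently.

```python
from collections import defaultdict

def per_type_accuracy(records):
    """Aggregate (ability, is_correct) pairs into per-type and overall accuracy."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ability, is_correct in records:
        totals[ability] += 1
        hits[ability] += int(is_correct)
    if not totals:
        raise ValueError("no records to score")
    report = {ability: hits[ability] / totals[ability] for ability in totals}
    report["overall"] = sum(hits.values()) / sum(totals.values())
    return report

# Hypothetical judged outputs.
records = [("IE", True), ("IE", False), ("MSR", False), ("AR", True)]
report = per_type_accuracy(records)
print(report)  # {'IE': 0.5, 'MSR': 0.0, 'AR': 1.0, 'overall': 0.5}
```

In this toy case the overall score of 0.5 hides a complete failure on MSR, which is the kind of masking that per-type reporting is meant to prevent.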

MemLens is therefore best understood as both a benchmark and a methodological argument. Its empirical pattern—strong short-context visual grounding for long-context LVLMs, length stability but visual weakness for memory agents, and a persistent multi-session reasoning bottleneck—suggests that multimodal long-term memory is not reducible to either larger native context windows or external memory in isolation (Ren et al., 14 May 2026). The benchmark’s design, especially its image-ablation validation and cross-modal length control, makes that claim testable in a standardized form.