MemLens Benchmark for Long-Term Memory
- MemLens Benchmark is a dual framework assessing long-term memory in vision-language and language models via multimodal tests and activation trajectory analysis.
- It evaluates core abilities like information extraction, multi-session reasoning, and temporal reasoning in LVLMs, while detecting memorization in LLMs.
- Key findings highlight that retrieval bottlenecks limit performance, urging the development of hybrid models that integrate extended context attention with external memory.
The name MemLens covers two distinct benchmarks addressing long-term memory in large models: (1) MemLens for evaluating multimodal memory in large vision-language models (LVLMs), and (2) MemLens for probing memorization in LLMs via activation trajectory analysis. These benchmarks diagnose current model limitations and quantify memory-oriented capabilities for both multimodal and textual model evaluation (Ren et al., 14 May 2026, He et al., 25 Sep 2025).
1. Motivation and Objectives
Long-term memory is a critical component in two classes of models: LVLMs operating over multi-session, multimodal interactions, and LLMs subject to potential contamination and memorization. In the multimodal regime, existing benchmarks have notable limitations: they permit text-only shortcuts, lack control over context length, or focus on document-level rather than session-level memory. For LLMs, detectors based on lexical overlap or perplexity fail to generalize under paraphrase or implicit contamination, leaving a key vulnerability in model evaluation.
MemLens for LVLMs was introduced to systematically compare two dominant methods:
- Long-context LVLMs: Models with extended context windows, processing interleaved images and text directly.
- Memory-augmented agents: Models that compress and selectively retrieve from an external memory store.
For LLMs, MemLens operationalizes memorization detection by leveraging activation trajectories, not surface-level token overlap, to distinguish "contaminated" (memorized) from clean samples.
2. Benchmark Design
Multimodal Memory Benchmark
MemLens (LVLMs) evaluates memory across five core abilities:
- Information Extraction (IE):
- Two-hop entity questions disambiguate entities via images, requiring visual and textual reasoning.
- Fine-grained subskills: recognition, counting, spatial reasoning, OCR.
- Multi-Session Reasoning (MSR):
- Tasks such as visually anchored counting, arithmetic with cross-modal operands, and cross-session entity resolution.
- Temporal Reasoning (TR):
- Duration comparison and chronological sorting using both textual and visual temporal cues.
- Knowledge Update (KU):
- State update consistency across multiple user-preference revisions.
- Answer Refusal (AR):
- Calibrated abstention when no supporting evidence is present.
Dataset composition:
| Attribute | Value / Range | Description |
|---|---|---|
| Number of questions | 789 | Balanced across five abilities |
| Context lengths | 32K, 64K, 128K, 256K tokens | Cross-modal token counting |
| Sessions per instance | 14 (32K) to 93 (256K) | Simulated via web image retrieval & LLM-driven dialog |
| Images per instance | 20 to 138 | 4,695 unique source images across all examples |
The data pipeline interleaves evidence-rich and distractor sessions, employing entity abstraction to ensure multimodal, not text-only, question resolution.
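The interleaving step can be pictured with a small sketch. This is not the benchmark's pipeline: the function name `build_context`, the `(session_id, token_count)` tuples, the greedy fill rule, and the seeded placement are all assumptions made for illustration. It keeps every evidence session, adds distractor sessions while they fit a token budget, and scatters the evidence through the timeline without changing its order.

```python
import random

def build_context(evidence, distractors, token_budget, seed=0):
    """Interleave evidence and distractor sessions under a token budget.

    evidence, distractors: lists of (session_id, token_count) tuples
    (a hypothetical representation, not the benchmark's schema).
    Returns (timeline, tokens_used).
    """
    rng = random.Random(seed)
    used = sum(tokens for _, tokens in evidence)
    if used > token_budget:
        raise ValueError("evidence alone exceeds the token budget")

    # Greedily add distractors, skipping any that would overflow the budget.
    chosen = []
    for session in distractors:
        if used + session[1] <= token_budget:
            chosen.append(session)
            used += session[1]

    # Pick which timeline slots hold evidence; both streams keep their order.
    total = len(evidence) + len(chosen)
    evidence_slots = set(rng.sample(range(total), len(evidence)))
    ev_iter, ds_iter = iter(evidence), iter(chosen)
    timeline = [
        next(ev_iter) if i in evidence_slots else next(ds_iter)
        for i in range(total)
    ]
    return timeline, used
```

Calling `build_context` with a larger `token_budget` on the same evidence gives a longer timeline with more distractors, which is one way a single question could be rendered at several context lengths.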
Memorization Detection Benchmark
MemLens (LLMs) analyzes the activation trajectory of numeric tokens across transformer layers for a given input. For each layer $\ell$:
- The residual stream $h_\ell$ is projected to vocabulary logits $z_\ell$.
- A softmax over the logit slice for the digit tokens yields a distribution $p_\ell$.
- Layerwise entropy $H_\ell$ and max-confidence $c_\ell = \max_d p_\ell(d)$ are aggregated.
- The full trajectory across layers is assembled for downstream classification.
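The per-layer steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the readout is taken to be a bare dot product with ten digit-token unembedding rows (any normalization applied before the output head is omitted), and the names `project`, `layer_features`, and `activation_trajectory` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(hidden, digit_unembed):
    """Dot one residual-stream vector with each digit token's unembedding row.

    Assumption: a plain linear readout, one row per digit token.
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in digit_unembed]

def layer_features(digit_logits):
    """Return (entropy, max_confidence) of the digit distribution at one layer."""
    probs = softmax(digit_logits)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy, max(probs)

def activation_trajectory(hidden_states, digit_unembed):
    """Stack per-layer (entropy, max_confidence) pairs, first layer first."""
    return [layer_features(project(h, digit_unembed)) for h in hidden_states]
```

With uniform digit logits the entropy equals `math.log(10)` and the max-confidence is 0.1; a layer that commits to a single digit moves toward entropy 0 and confidence 1.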
3. Evaluation Protocol
LVLM Benchmark
- 27 LVLMs are evaluated zero-shot on all 789 questions at the three shortest context lengths; 7 memory-augmented agents are run on a 195-question canonical subset at all context lengths.
- An image-ablation study confirms that for 80.4% (634/789) of questions, removing all images reduces leading LVLMs (GPT-5.4, Gemini-3.1-Pro) from 89–93% to ≤2% accuracy.
- Answer quality is measured by an LLM-as-Judge protocol (Qwen3-VL-235B-Instruct) and cross-validated against human and model annotators.
- Metrics comprise overall accuracy, coverage, per-ability accuracy, and calibration versus reasoning trade-offs.
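As a rough illustration of the accuracy metrics, the sketch below aggregates judge verdicts into overall and per-ability accuracy. The `(ability, is_correct)` record format and the function name `accuracy_report` are assumptions. Coverage and calibration are left out because their exact definitions are not given here.

```python
from collections import defaultdict

def accuracy_report(records):
    """Aggregate judge verdicts into overall and per-ability accuracy.

    records: iterable of (ability, is_correct) pairs, e.g. ("IE", True).
    Returns (overall_accuracy, {ability: accuracy}).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ability, is_correct in records:
        totals[ability] += 1
        correct[ability] += int(bool(is_correct))

    per_ability = {a: correct[a] / totals[a] for a in totals}
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_ability
```

Reporting per-ability numbers alongside the overall figure matters here because a model can score well on one ability while failing another.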
Memorization Detection Benchmark
- A 1D-CNN classifier is trained on the full activation trajectory of each sample.
- The detector's binary threshold, chosen by maximizing Youden's J, yields a high true-positive rate on contaminated samples and a low false-positive rate on clean data.
- Controlled LoRA fine-tuning injects contamination, enabling direct monitoring of "shortcut" emergence via trajectory features.
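Threshold selection by Youden's J can be sketched as follows. Youden's J is the true-positive rate minus the false-positive rate, and the chosen threshold is the one that maximizes it. The function name `youden_threshold` and the convention of predicting "contaminated" when the score is at or above the threshold are assumptions for illustration.

```python
def youden_threshold(scores, labels):
    """Pick the score threshold that maximizes Youden's J = TPR - FPR.

    scores: detector outputs, higher meaning more likely contaminated.
    labels: 1 for contaminated, 0 for clean.
    A sample is predicted positive when score >= threshold.
    Returns (threshold, J); ties keep the lowest threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one sample of each class")

    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

The scan is quadratic in the number of samples, which is fine for a sketch; a sorted sweep would be the usual choice for large validation sets.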
4. Key Experimental Findings
LVLMs and Memory Agents
- Short-context regime (32K): The eight top LVLMs fall within 6.3 percentage points of one another (maximum 58.7% accuracy); differences between models largely collapse at this scale.
- Long-context regime (128K+): Most open-weight LVLMs degrade by 13–20%; only select proprietary models (Gemini-3.1-Pro, GPT-5.4) maintain ~50% accuracy. Memory agents show length-invariant performance but plateau at 15–32% accuracy.
- Ability-specific ceilings: Highest reported performances are 74.4% (IE), 60.8% (TR), 50.9% (KU), and 44.1% (MSR at 32K, <30% for most).
- Error sources: For IE/KU, ~90% of errors are grounding (retrieval) failures. For MSR, 73% of errors are classified as reasoning mistakes, yet oracle retrieval restores MSR accuracy to 90–100%, which points to retrieval rather than reasoning as the underlying fault.
- Model specialization: Performance varies substantially by ability, e.g., GLM-4.6V excels in TR but fails at KU, while only Gemini-3.1-Pro remains competitive across multiple abilities under extended context.
Activation Trajectory Memorization Detection
- Contaminated vs. clean signature: For memorized samples, "shortcut" trajectories show an early, sharp entropy drop and channel dominance within 10 layers; clean data accumulates confidence gradually.
- Controlled LoRA injection: Increased fine-tuning rank yields monotonically increasing detection probability and mirrors rises in EM and Rouge-L scores.
- Detection robustness: Original and lightly rephrased samples yield strong detection rates (e.g., 95.6% contaminated flagged, <6% false positive for Qwen2.5-7B Original); paraphrased or translated samples are more challenging (<20–60% recall).
- Comparison: MemLens activation-based detection outperforms completion-match and perplexity-based detectors, which fail under paraphrasing or introduce high false positives.
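The contaminated-versus-clean signature described above can be illustrated with a hand-written heuristic over a per-layer entropy sequence. This is not the learned detector, which is a trained classifier over the full trajectory. The functions `first_confident_layer` and `looks_like_shortcut`, the entropy threshold, and the reading of "within 10 layers" as layer index below 10 are all assumptions.

```python
def first_confident_layer(entropies, threshold):
    """Index of the first layer whose entropy is at or below threshold, else None."""
    for layer, entropy in enumerate(entropies):
        if entropy <= threshold:
            return layer
    return None

def looks_like_shortcut(entropies, threshold, max_layer=10):
    """Flag a trajectory whose entropy collapses before `max_layer`.

    Assumption: an early collapse stands in for the "shortcut" signature;
    a gradual decline that crosses the threshold later is treated as clean.
    """
    layer = first_confident_layer(entropies, threshold)
    return layer is not None and layer < max_layer
```

A single-threshold rule like this ignores the shape of the whole curve, which is one reason a classifier over the full trajectory is the more natural design.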
5. Limitations
- LVLM MemLens: MSR accuracy remains below 30% for most systems, and the limit lies in retrieval rather than in inherent reasoning ability. Neither attention-based nor memory-augmented methods suffice on their own, because of trade-offs between evidence fidelity and context retention.
- Activation Trajectory MemLens: Focuses on numeric token trajectories; generalization to arbitrary token classes, code, or language tasks requires extension. Use of per-layer features means detector performance can degrade if only shallow layers are available; some translation settings remain difficult.
6. Implications and Future Directions
MemLens benchmarks demonstrate that neither long-context attention nor memory-augmented retrieval is adequate on its own for robust, faithful long-term multimodal memory. The common bottleneck is retrieval of appropriate evidence, especially under cross-session and multi-turn conditions. A plausible implication is that better retrieval architectures, indexing methods, and higher-fidelity (pixel- or region-level) memory stores are needed.
These findings motivate hybrid architectures that combine extended context-window attention with structured external memory for evidence selection and replay. Proposed extensions include per-ability sub-benchmarks, multimodal retrieval metrics grounded in pixel-level detail, and activation-trajectory detection for copyright auditing and open-domain search.
The MemLens suite—including versioned datasets, images, prompts, and evaluation harnesses—provides standardized infrastructure for reproducible benchmarking at scale (Ren et al., 14 May 2026, He et al., 25 Sep 2025).