
EMemBench: Benchmarking Episodic Memory in VLM

Updated 30 January 2026
  • EMemBench is an interactive, programmatic benchmark that evaluates episodic memory in vision-language agents using reproducible, game-based simulations.
  • It employs automated, trajectory-conditioned query generation with adversarial templates to test diverse skills such as induction, spatial, and temporal reasoning.
  • Empirical findings reveal that even advanced VLM agents struggle with complex memory tasks, highlighting the need for improved memory architectures.

EMemBench is a programmatic and interactive benchmark designed to rigorously evaluate the long-term, visually grounded episodic memory of vision-language model (VLM) agents as they act in simulated game environments. It introduces a procedurally controlled, reproducible framework for probing multiple dimensions of memory-intensive reasoning, spanning both text-based and image-rich (visual) settings. EMemBench distinguishes itself through automated trajectory-conditioned question generation, explicit skill coverage, ground-truth answerability, and adversarial challenge templates, thereby setting a new baseline for memory diagnostics in VLM research (Li et al., 23 Jan 2026).

1. Formal Benchmark Structure and Memory-Skill Taxonomy

EMemBench transforms each agent’s own gameplay trajectory τ into a set of diagnostic queries designed to comprehensively interrogate episodic memory competence. An interaction episode is logged as

$$\tau = \{(o_t, a_t, r_t)\}_{t=1}^{T}$$

where $o_t$ is the observation (image + HUD), $a_t$ is the action, and $r_t$ is the reward. The underlying state $\mathcal{S}$ (e.g., map, inventory) is also recorded.

A generator function

$$\mathcal{G}: (\tau, \mathcal{S}) \mapsto Q = \{(q_i, y_i, m_i)\}_{i=1}^{N}$$

produces queries $q_i$ (with answer $y_i$ and metadata $m_i$, such as skill label and evidence pointers). Each query is controlled by an explicit answerability predicate $P_i(\tau, \mathcal{S})$, allowing for adversarial “not answerable” instances.
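To fix the notation, here is a minimal Python sketch of these records; the field names are illustrative assumptions, not the authors' actual schema:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class Step:
    """One (o_t, a_t, r_t) tuple from an interaction episode."""
    observation: Any      # image frame + HUD, or a text observation
    action: str
    reward: float

@dataclass
class Query:
    """One (q_i, y_i, m_i) item produced by the generator G."""
    question: str
    answer: Optional[str]                         # None when "not answerable"
    metadata: dict = field(default_factory=dict)  # skill label, evidence pointers

@dataclass
class Episode:
    """Trajectory tau plus the underlying state S (map, inventory, ...)."""
    trajectory: list[Step]
    state: dict
```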

The benchmark is designed for balanced and stratified skill coverage, with the following categories:

  • Single-Hop Recall: Direct retrieval from a specific step, e.g., “At step s, what action did you take?”
  • Multi-Hop Recall: Compositional queries chaining retrievals, e.g., “After the first time you found X, what did you do next?”
  • Induction: Aggregation over intervals, e.g., “What is the longest sequence of action α in [L,R]?”
  • Temporal Reasoning: Event ordering/intervals, e.g., “Did event E₁ precede E₂?”
  • Spatial Reasoning: Navigation- and map-based displacement, e.g., “How would you reach the nearest lake from your current location?”
  • Logical Inference: Inventory or combinatorial state, e.g., “Do you have item I at step t?”
  • Adversarial Robustness: Templates with intentionally false premises; the correct answer is “not answerable.”

Notably, spatial questions involve groundable path reasoning via BFS on a dynamic map graph maintained internally for each game world.
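Since spatial ground truth relies on BFS over the maintained map graph, the following is a hedged sketch of such a shortest-path computation, assuming the graph is kept as an adjacency dict (this representation is an assumption, not the paper's code):

```python
from collections import deque
from typing import Optional

def bfs_path(graph: dict, start, goal) -> Optional[list]:
    """Shortest path on the dynamically maintained map graph via breadth-first
    search; returns the node sequence, or None if the goal is unreachable."""
    frontier = deque([start])
    parents = {start: None}
    while frontier:
        node = frontier.popleft()
        if node == goal:
            # Walk parent pointers back to reconstruct the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                frontier.append(neighbor)
    return None
```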

2. Automated Question Generation and Verifiable Ground Truth

EMemBench employs a transparent algorithm for question instantiation, deterministic given a fixed random seed:

  1. Build timeline/event indices from the trajectory and game state.
  2. For each skill template, enumerate all possible candidate instantiations.
  3. Filter candidates by explicit preconditions (e.g., the item must actually appear at some step for "first seen" queries).
  4. Sample a target number per skill per episode for balanced coverage.
  5. For each instantiated question, compute the definitive answer via direct access to game signals (actions, state transitions, map observations).
  6. Generate adversarial (“NA”) questions by purposely violating the template’s preconditions, ensuring the correct answer is “not answerable.”

This approach guarantees that answers are always algorithmically verifiable, sidestepping ambiguities of language or perception that plague many VQA-style benchmarks (Li et al., 23 Jan 2026).
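A minimal sketch of steps 2–6 for a single skill template, assuming a hypothetical `template` object whose `candidates`, `precondition`, `render`, and `answer` methods correspond to the pipeline above (all names here are illustrative):

```python
import random

def generate_questions(template, events, state, n_per_skill, seed=0):
    """Instantiate one skill template against a trajectory's event index."""
    rng = random.Random(seed)  # fixed seed keeps generation deterministic
    # Steps 2-3: enumerate all instantiations, filter by explicit preconditions.
    candidates = [c for c in template.candidates(events, state)
                  if template.precondition(c, events, state)]
    # Step 4: sample a target number per skill for balanced coverage.
    sampled = rng.sample(candidates, min(n_per_skill, len(candidates)))
    # Step 5: compute the definitive answer directly from game signals.
    questions = [(template.render(c), template.answer(c, events, state))
                 for c in sampled]
    # Step 6: adversarial "NA" items come from instantiations whose
    # precondition fails, so the verifiable answer is "not answerable".
    na_pool = [c for c in template.candidates(events, state)
               if not template.precondition(c, events, state)]
    for c in rng.sample(na_pool, min(n_per_skill, len(na_pool))):
        questions.append((template.render(c), "not answerable"))
    return questions
```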

3. Evaluation Metrics, Experimental Design, and Agent Integration

Per-question binary correctness $s_i = \mathbf{1}[\hat{y}_i \equiv y_i]$ forms the basis for overall accuracy, $\text{Acc} = \frac{1}{N}\sum_{i=1}^{N} s_i$. F1, precision, and recall are further computed with explicit support for the “not answerable” class.
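A hedged sketch of these metrics, treating “not answerable” as an ordinary answer class (the paper's exact averaging scheme is not specified, so this is an assumption):

```python
def score(predictions: list, answers: list) -> dict:
    """Overall accuracy, plus precision/recall/F1 for the NA class."""
    NA = "not answerable"
    correct = sum(p == a for p, a in zip(predictions, answers))
    tp = sum(p == NA and a == NA for p, a in zip(predictions, answers))
    fp = sum(p == NA and a != NA for p, a in zip(predictions, answers))
    fn = sum(p != NA and a == NA for p, a in zip(predictions, answers))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(answers),
            "na_precision": precision, "na_recall": recall, "na_f1": f1}
```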

EMemBench is applied to both text-based interactive fiction (15 Jericho games) and image-based visual games (Crafter, with multiple procedural world seeds). Each episode is capped at T = 200 steps, with support for finite-horizon subsampling to probe memory at different timescales. Agents are evaluated in both play (“reason + act”) and QA phases: during play, the agent conditions on the most recent step’s context and, optionally, its persistent memory store; during QA, it receives the entire memory accumulated over the trajectory.

Memory agent architectures are compared:

  • In-context Only: Standard prompting with up to $H$ history turns.
  • Mem0: Flat key–value store with top-k similarity retrieval (a minimal sketch follows this list).
  • LangMem: Episodic events recorded as textual chunks in a memory buffer.
  • A-MEM: Graph-based memory of “notes” and “links,” supporting subgraph retrieval.
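As a hedged illustration of the flat-store idea in the spirit of the Mem0-style baseline, assuming a caller-supplied embedding function (this is not the actual Mem0 API):

```python
import numpy as np

class FlatMemory:
    """Minimal flat key-value memory with top-k cosine-similarity retrieval."""
    def __init__(self, embed):            # embed: str -> np.ndarray
        self.embed = embed
        self.keys: list = []
        self.values: list = []

    def write(self, text: str) -> None:
        self.keys.append(self.embed(text))
        self.values.append(text)

    def retrieve(self, query: str, k: int = 5) -> list:
        if not self.keys:
            return []
        q = self.embed(query)
        mat = np.stack(self.keys)
        sims = mat @ q / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(sims)[::-1][:k]   # indices of the k most similar keys
        return [self.values[i] for i in top]
```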

Prompts for QA answering concatenate the question, retrieved memory, and recent context before being passed to the VLM (e.g., Qwen3-VL-32B-Instruct, InternVL3.5-38B, GPT-5.1 for text and vision).
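One plausible prompt-assembly helper illustrating that concatenation (the section headers below are assumptions, not the paper's exact template):

```python
def build_qa_prompt(question: str, memories: list, recent_context: str) -> str:
    """Assemble the QA prompt: retrieved memory + recent context + question."""
    memory_block = "\n".join(f"- {m}" for m in memories) or "(no memories retrieved)"
    return (
        "You are answering a question about your own past gameplay.\n\n"
        f"Retrieved memories:\n{memory_block}\n\n"
        f"Recent context:\n{recent_context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```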

4. Empirical Results and Diagnostic Insights

EMemBench establishes that VLM agents—even those with advanced backbone architectures—remain far from saturating episodic memory reasoning benchmarks.

  • Accuracy: Top-performing text-only agents reach ≈52% overall accuracy; visual agents (Crafter) plateau at ≈44%.
  • Skill Bottlenecks: Induction and spatial reasoning remain persistently low (text: ~34%/~48%; visual: ~20%/~24%). Single-hop and temporal queries are relatively easier for strong LMs.
  • Effect of Memory Modules: Persistent memory (A-MEM, LangMem) yields substantial gains on text games (+7–10 pp overall); benefits are less pronounced or even negative for visual tasks, indicating current visually grounded memory architectures do not yet robustly support complex episodic queries.
  • Statistical Stability: All top-line results have standard deviations <0.05 across seeds.

Critical challenges persist in binding pixel-level visual observations to event-indexed memories and constructing spatially meaningful, compact representations. Induction over partially observable environments particularly highlights the need for innovations in object-tracking, pattern mining, and implicit map-like memory.

5. Distinctive Features and Coverage Relative to Prior Episodic Memory Benchmarks

EMemBench introduces several features absent in prior benchmarks:

  • Interaction-Dependent QA: All queries are conditioned on the agent’s specific trajectory and environment instance, ruling out solution by dataset biases or static shortcuts.
  • Skill Coverage Engineering: Balanced and explicit coverage of seven memory-related reasoning types, including adversarial negatives, ensures that success requires robust and general episodic memory.
  • Ground Truth from Underlying State: All questions and answers are computed from true environment state, decoupling evaluation from surface-level perception or synthetic annotation error.
  • Open-Domain Visual and Textual Integration: EMemBench is natively compatible with both text-based and visually rich agents, yielding a unified benchmark for cross-modal memory research.
  • Statistical Power: Large-scale sampling, stratified skill bins, and multi-seed environments enable reliable ranking and robust out-of-distribution generalization analysis (Li et al., 23 Jan 2026).

6. Limitations, Open Problems, and Prospects

Current limitations include:

  • Sub-saturating Performance: Even the best agents are far from the upper bound, partly due to inability to construct and reliably query map-like spatial memories from raw pixels.
  • Visual Memory Representation: Visual memory modules often fail to robustly support spatial and induction queries, especially in procedurally generated, partially observed visual settings.
  • Induction and Object Permanence: Persistent memory for identity-level tracking and temporal aggregation of events (e.g., “first time X happened, last time Y happened”) remains a fundamental open task.
  • Scaling and Coverage: Further broadening to more complex 3D environments, multiple object classes, and compositional QA remains future work.

The authors highlight future research trajectories in merging implicit spatial memories (learned map representations, e.g., topological or allocentric embeddings) with symbolic event logs, integrating policy-learned memory control, and expanding to richer task and world ontologies for next-generation VLM memory agents (Li et al., 23 Jan 2026).

References

Li et al. (23 Jan 2026). EMemBench: Benchmarking Episodic Memory in VLM.
