Episodic Memory Benchmark (EpBench)
- EpBench is a comprehensive evaluation suite that tests models' ability to encode, retrieve, and order context-specific event memories across video, text, and dialogue.
- It organizes tasks such as memory type classification, episodic retrieval, event ordering, and answer synthesis, using metrics like precision, recall, and mean Average Precision.
- EpBench spans multi-modal applications, addressing challenges from synthetic event generation to real-world temporal reasoning and multi-hop linking of distributed events.
The Episodic Memory Benchmark (EpBench) is a family of evaluation protocols and datasets designed to rigorously assess machine learning models and large language models (LLMs) on their ability to encode, retrieve, and reason about episodic memory: the recall of events situated in time, space, and context, analogous to human episodic memory. EpBench spans domains from egocentric video and multi-modal event localization to long-context narrative text and dialogue, with the precise structure, taxonomy, and metrics varying by instantiation. EpBench distinguishes itself from earlier "semantic memory" benchmarks (e.g., LAMA, MMLU) by explicitly probing the binding and retrieval of contextually grounded personal or environmental episodes, including sequence ordering and multi-hop linking of temporally or spatially distributed events.
1. Episodic Memory: Definition and Taxonomy
Episodic memory in the EpBench context refers to tasks and models that require explicitly binding "what" happened with "when," "where," and "who," whether in video, text, or dialogue. The core representational unit is typically a structured event tuple or a temporally indexed narrative fragment (a minimal sketch of such a record appears at the end of this section). Taxonomies employed by EpBench variants include:
- Semantic Memory: Decontextualized facts, profiles, and static relationships (e.g., job titles, social ties), not linked to unique episodes.
- Episodic Memory: Context-bound events, typically timestamped and located, including free-form descriptions, observed dialogues, or narrative chapters.
- Hybrid Personal Memory Datasets: As in PerLTQA, “episodic” entries encode events ("In June 2022, Alice traveled to Bali with Bob") and dialogue fragments rooted in recorded episodes, with explicit annotation of anchors, memory types, and ground-truth answers (Du et al., 26 Feb 2024).
In video-based benchmarks, the episode is defined as a temporally bounded segment—either an action in egocentric video or a locally grounded sub-sequence to be retrieved or classified. In text-based EpBench and synthetic “book” benchmarks, the episode generalizes to chapters or event records, each annotated with time, location, main entities, and narrative details (Rajesh et al., 10 Nov 2025, Huet et al., 21 Jan 2025). In dialogue-based instances, the episode often corresponds to a conversational turn or an interaction, temporally annotated and embedded within a longer sequence (Liu et al., 22 Feb 2025).
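As a concrete illustration of the representational unit described above, the following is a minimal sketch of an episode record; the field names and types are assumptions for illustration, not a schema prescribed by any EpBench variant.

```python
# Minimal, illustrative episode record binding "what" with "when", "where",
# and "who"; field names are assumptions, not a published EpBench schema.
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    episode_id: str         # unique identifier for the event record
    timestamp: str          # "when": a date, or a time span for video segments
    location: str           # "where"
    entities: List[str]     # "who": main participants
    description: str        # "what": free-form caption, narrative, or dialogue turn
    modality: str = "text"  # "text", "video", or "dialogue"

# Example in the style of a PerLTQA episodic entry:
example = Episode(
    episode_id="ep-0042",
    timestamp="2022-06",
    location="Bali",
    entities=["Alice", "Bob"],
    description="In June 2022, Alice traveled to Bali with Bob.",
)
```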
2. Task Structure and Benchmark Subtasks
EpBench instantiations are typically organized around three to four principal subtasks. These tasks are customized per modality (video, text, dialogue) but share a common formalism:
- Memory Type Classification (as in PerLTQA): Given a natural language query $q$, determine whether the relevant information lies in semantic or episodic memory. Formally, $\hat{m} = \arg\max_{m \in \{\text{semantic},\, \text{episodic}\}} P(m \mid q)$, with evaluation by precision, recall, F1, and accuracy (Du et al., 26 Feb 2024).
- Episodic Memory Retrieval: Retrieve the set of the most relevant memory fragments (events, chapters, segments) given a query. Retrieval is category-wise (semantic/episodic), re-ranked via classification confidence and raw similarity score, $\tilde{s}(q, x) = \sigma\!\big(f_{\hat{m}}(q)\big) \cdot s(q, x)$, where $s(q, x)$ is the raw retrieval score of fragment $x$, $\hat{m}$ is the predicted memory type (with classification logit $f_{\hat{m}}(q)$), and $\sigma$ is the sigmoid function (Du et al., 26 Feb 2024); a re-ranking sketch follows this list. Metrics include Recall@k and, in video, mean Average Precision (mAP) or Recall@1/5 at specific intersection-over-union (IoU) thresholds (Shao et al., 2023, Feng et al., 22 Jun 2024).
- Event Ordering and Sequence Recall: As in Book-SORT, models are given two narrative segments $(e_1, e_2)$ and must identify which came first in the source document, probing strict temporal ordering as a proxy for episodic context encoding: the model predicts $\hat{y} \in \{1, 2\}$, with accuracy computed as $\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\hat{y}_i = y_i]$ over $N$ segment pairs (Pink et al., 10 Oct 2024).
- Episodic Answer Synthesis: Given retrieved memories, generate or extract the response to a query, ensuring grounding in the correct episodes. Evaluated by Correctness, Coherence (LLM-graded), Memory-anchor MAP, and structured extraction metrics (e.g., F1, Kendall's $\tau$ for order) (Huet et al., 21 Jan 2025, Du et al., 26 Feb 2024).
- Temporal Dialogue Benchmarking (EM-Test): In dialogue, models are challenged to answer questions requiring recall over events, with time-stamped turns and queries ranging in span from “just now” to “several decades” (Liu et al., 22 Feb 2025).
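The category-aware re-ranking described above can be realized in several ways; the following is a minimal sketch assuming a single classifier logit per query and per-fragment memory-type labels, and it is not the PerLTQA reference implementation.

```python
# Sketch of memory-type-aware re-ranking: raw retrieval similarities are scaled
# by the sigmoid of the memory-type classifier's logit, promoting fragments
# whose type matches the prediction. Field names are illustrative assumptions.
import math
from typing import Dict, List

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def rerank(type_logit: float, predicted_type: str,
           candidates: List[Dict], top_k: int = 5) -> List[Dict]:
    """candidates: dicts with 'fragment', 'memory_type' ('semantic' or
    'episodic'), and 'raw_score' (retriever similarity for the query)."""
    confidence = sigmoid(type_logit)  # classifier confidence in predicted_type
    rescored = []
    for c in candidates:
        # Fragments matching the predicted type keep the full confidence
        # weight; mismatching fragments are down-weighted accordingly.
        weight = confidence if c["memory_type"] == predicted_type else 1.0 - confidence
        rescored.append((weight * c["raw_score"], c))
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in rescored[:top_k]]
```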
3. Dataset Design and Construction Principles
EpBench datasets are constructed under strict contamination prevention protocols and controlled diversity:
- Synthetic Generation: Events, entities, dates, and locations are synthesized from static vocabularies to guarantee zero contamination with web corpora and to control for distributional bias (Huet et al., 21 Jan 2025, Rajesh et al., 10 Nov 2025); a generation sketch follows this list.
- Temporospatial Repetition: To probe reasoning, the same date, location, or entity may appear in multiple episodes, causing ambiguous or multi-hop queries (e.g., “List all dates on which X appeared at location Y”) (Rajesh et al., 10 Nov 2025).
- Validation and Filtering: Automated quality checks (parse validity, verbatim constraints, paragraph structure) and LLM-as-judge protocols ensure episodes are coherent and correctly attributed (Huet et al., 21 Jan 2025).
- Multi-Modality: Video-based benchmarks utilize egocentric video (Ego4D) with untrimmed, action-segmented sequences; object- and text-level features are extracted for multi-modal fusion (InternVideo, EgoVLP, CLIP, etc.) (Shao et al., 2023, Feng et al., 22 Jun 2024).
- Dialogue Annotation: Multi-agent simulation with explicit time-stamping generates multi-turn episodic dialogues for training and test (EM-Train and EM-Test) (Liu et al., 22 Feb 2025).
- Open Sourcing: Code, datasets, and evaluation protocols are released under permissive licenses for reproducibility and extension (Huet et al., 21 Jan 2025, Pink et al., 10 Oct 2024, Liu et al., 22 Feb 2025).
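The following is a minimal sketch of the synthetic-generation and temporospatial-repetition principles above, under assumed placeholder vocabularies; the actual benchmarks use larger, curated vocabularies and richer templates.

```python
# Sketch of contamination-free synthetic episode generation with controlled
# temporospatial repetition, which creates cues matching multiple episodes
# and hence multi-hop queries. Vocabularies and counts are placeholders.
import random

ENTITIES = ["Alice", "Bob", "Chen", "Dana"]
LOCATIONS = ["Bali", "Oslo", "Quito", "Kyoto"]
DATES = [f"2022-{m:02d}-15" for m in range(1, 13)]

def generate_episodes(n: int, repeat_prob: float = 0.3, seed: int = 0):
    """Sample n synthetic episodes; with probability `repeat_prob`, reuse a
    (date, location) pair from an earlier episode so that some cues are
    ambiguous and force multi-hop disambiguation."""
    rng = random.Random(seed)
    episodes, reusable = [], []
    for i in range(n):
        if reusable and rng.random() < repeat_prob:
            date, loc = rng.choice(reusable)        # deliberate repetition
        else:
            date, loc = rng.choice(DATES), rng.choice(LOCATIONS)
            reusable.append((date, loc))
        who = rng.sample(ENTITIES, k=2)
        episodes.append({
            "episode_id": f"ep-{i:04d}",
            "date": date,
            "location": loc,
            "entities": who,
            "description": f"On {date}, {who[0]} met {who[1]} in {loc}.",
        })
    return episodes
```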
4. Evaluation Metrics and Protocols
EpBench employs metrics appropriate to episodic recall and temporal ordering, several of which recur across all variants:
- Classification: Accuracy, F1, precision, and recall for memory-type prediction.
- Retrieval: Recall@k for correct episode inclusion; mean Average Precision (mAP) over temporal IoU thresholds for video; and token-recall for text.
- Sequence/Order: Exact match for order recall; Kendall's $\tau$ for sequence concordance; normalized temporal deviation for time estimation (Huet et al., 21 Jan 2025, Pink et al., 10 Oct 2024). A sketch of Recall@k, order accuracy, and Kendall's $\tau$ follows this list.
- Answer Generation: F1, precision, recall over extracted entities/attributes; LLM-based correctness and coherence scoring; Memory Anchor MAP (proportion of answer matching ground-truth text spans).
- Cost-Efficiency: Average tokens consumed per query and estimated monetary cost, crucial for scaling to ∼1M token corpora (Rajesh et al., 10 Nov 2025).
- Failure Characterization: Hallucination rates on zero-event cues; rates of ordering errors and confabulations (Huet et al., 21 Jan 2025).
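For concreteness, the following are generic implementations of three of the metrics named above (Recall@k, exact-match order accuracy, and Kendall's $\tau$ without ties); they illustrate the computations but are not the official scoring scripts of any EpBench variant.

```python
# Generic metric implementations for episodic retrieval and ordering.
from itertools import combinations
from typing import List, Sequence, Set

def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant episode IDs found among the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def order_accuracy(pred: List[int], gold: List[int]) -> float:
    """Exact-match accuracy over 'which came first' pair judgments."""
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / len(gold)

def kendall_tau(pred_order: Sequence[str], gold_order: Sequence[str]) -> float:
    """Kendall's tau between a predicted and gold ordering of the same items
    (assumes no ties)."""
    gold_rank = {item: i for i, item in enumerate(gold_order)}
    concordant = discordant = 0
    for a, b in combinations(pred_order, 2):
        # a precedes b in the prediction; check whether gold agrees.
        if gold_rank[a] < gold_rank[b]:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```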
5. Representative EpBench Variants and Principal Findings
Table 1. Comparison of Key EpBench Variants
| Variant | Domain | Task Structure | Performance Highlights |
|---|---|---|---|
| Ego4D EpBench | Egocentric video | Moment/Action Localization, NLQ | mAP=29.34, R@1=19.79 (top, strong sensitivity) |
| PerLTQA | Synthetic persona/QA | MC, Retrieval, Synthesis | BERT MC: F1=0.957; BM25 retrieval: Recall@1=0.705 |
| Book-SORT | Text/books | Sequence Order Recall (SORT) | Llama 3-8B: 0.93; Humans: 56-64%; RAG fails for order |
| EpBench-2000 | Synthetic book | Multi-hop QA, episodic retrieval | GSW F1=0.773; Embedding RAG=0.675 |
| EM-Test | Dialogue | QA over temporally-anchored dialogue | Echo SimScore=84.0% (easy), 74.5% (hard); GPT-4: 72.3%, 67.7% |
Notable findings include:
- Models without external retrieval or context (vanilla LLMs, parametric memory) perform at random or near-chance on sequence order and multi-hop recall when context exceeds their window (Pink et al., 10 Oct 2024, Rajesh et al., 10 Nov 2025).
- Embedding-based retrieval fragments the latent narrative, sharply reducing recall when multi-cue disambiguation is required (Rajesh et al., 10 Nov 2025).
- Graph/structured retrievals can improve precision but often sacrifice recall of fine-grained or spatiotemporal cues.
- Dedicated episodic-memory models such as Echo (dialogue) or frameworks like Generative Semantic Workspace (GSW) outperform standard RAG and LLM baselines by 7–20 points in F1 and up to 51% in token efficiency (Rajesh et al., 10 Nov 2025, Liu et al., 22 Feb 2025).
- Even state-of-the-art LLMs (e.g. GPT-4o, Claude 3.5) struggle with recall and ordering when episodic cues multiply (6+ episode matches; latest-state, full-set retrieval) (Huet et al., 21 Jan 2025).
6. Design Insights, Failure Modes, and Limitations
EpBench analysis across modalities reveals several consistent insights and open challenges:
- Accurate memory-type classification (MC) enables more effective retrieval and prevents hallucination; BERT-based MC outperforms instruction-tuned LLMs by ~30 points in F1 (Du et al., 26 Feb 2024).
- Episodic memory, not semantic memory, is decisive for grounding QA and avoiding confabulation in personal or temporally extended settings (Du et al., 26 Feb 2024, Huet et al., 21 Jan 2025).
- Scalability to 1M-token-level corpora requires compressive, structured memory (e.g., GSW), as naive context extension yields “lost in the middle” performance collapse.
- Retrieval modules that discard temporal order degrade sequence-recall; order- or drift-preserving retrieval is essential (Pink et al., 10 Oct 2024).
- All models—parametric, in-context, and RAG—struggle with precise ordering, empty-cue refusal, and latest-state entity tracking as cue multiplicity increases (Huet et al., 21 Jan 2025).
- Human performance, while superior to zero-shot LLMs, degrades under long spans and high ambiguity, pointing to intrinsic task difficulty (Pink et al., 10 Oct 2024).
- Dataset synthesis, while eliminating contamination, does not guarantee real-world event plausibility or sim2real transfer; evaluation of real, complex scenarios remains an open research problem.
7. Extensibility and Prospects
EpBench protocols, datasets, and evaluation scripts are open source and designed for extensibility across modalities, domains, and scales. To adapt EpBench to new applications (a skeleton evaluation harness follows these steps):
- Define a temporospatial event schema with synthetic or human-annotated data.
- Construct (or sample) episodes with variable repetition and ambiguity for multi-hop probing.
- Implement or adopt subtasks (classification, retrieval, ordering, answer synthesis) per EpBench pipeline.
- Evaluate using the standardized metrics (Precision, Recall, F1, MAP, order accuracy, cost) and ablate by cue, context length, and retrieval strategy.
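A skeleton of such an adapted evaluation harness is sketched below; the function signatures, field names, and the user-supplied `retrieve` and `answer` callables are hypothetical placeholders, not part of any released EpBench tooling.

```python
# Hypothetical harness skeleton for adapting EpBench-style evaluation to a new
# application; the schema and callables are placeholder assumptions.
from typing import Callable, Dict, List, Tuple

def evaluate_episodic_qa(
    episodes: List[Dict],                                    # event records (see Section 1)
    queries: List[Dict],                                     # each with "question" and "gold_ids"
    retrieve: Callable[[str, List[Dict], int], List[str]],   # query -> top-k episode IDs
    answer: Callable[[str, List[Dict]], str],                # (query, retrieved episodes) -> answer
    k: int = 5,
) -> Tuple[Dict[str, float], List[str]]:
    by_id = {e["episode_id"]: e for e in episodes}
    recalls, generated = [], []
    for q in queries:
        retrieved_ids = retrieve(q["question"], episodes, k)
        gold = set(q["gold_ids"])
        recalls.append(len(set(retrieved_ids[:k]) & gold) / max(len(gold), 1))
        generated.append(answer(q["question"], [by_id[i] for i in retrieved_ids]))
    # Generated answers would next be scored for correctness/coherence
    # (LLM-graded) and attribute-level F1; only retrieval recall is computed here.
    return {"recall_at_k": sum(recalls) / len(recalls)}, generated
```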
Future directions identified include developing architectures with explicit episodic memory modules (memory banks, slot-based attention), devising RAG systems that preserve not only semantic content but also temporal and spatial context in retrieval, and extending beyond text to multi-modal and lifelong memory evaluation (Huet et al., 21 Jan 2025, Pink et al., 10 Oct 2024, Rajesh et al., 10 Nov 2025). A plausible implication is that progress on EpBench and its derivatives will be fundamental for deploying LLM and agentic systems in scenarios requiring continual, situated reasoning over extended temporal horizons.