
WorldMM Episodic Memory Overview

Updated 19 January 2026
  • WorldMM Episodic Memory is a multimodal, time-indexed system that organizes chronologically segmented events for precise recall.
  • It employs hierarchical segmentation, captioning, and embedding to convert video clips into structured factual triplets and knowledge graphs.
  • Empirical results show significant improvements in temporal retrieval accuracy and event-grounded reasoning in long-duration streams.

WorldMM Episodic Memory serves as a dynamic, multimodal, time-indexed storage and retrieval mechanism that enables agents—especially those dealing with real-world, extended, or multimodal data streams—to recall, recombine, and reason over temporally situated events. Its architectural innovations expand the capabilities of traditional neural memory architectures by supporting variable temporal granularity, structural factual retrieval, and longitudinal reasoning across both textual and visual modalities. As instantiated in the WorldMM agent, it enables near-optimal retrieval of relevant past experience from hours or days of video, providing factual event recall for complex temporal queries in embodied, video-based, or interactive environments (Yeo et al., 2 Dec 2025). The following sections analyze the conceptual foundations, technical mechanisms, organization of data, operational benefits, and empirical benchmarks of WorldMM Episodic Memory and its place among other episodic architectures.

1. Conceptual Foundations and Role

The WorldMM Episodic Memory module is designed as a "chronological archive"—a repository that maintains a comprehensive, temporally ordered record of factual events. Unlike semantic memory, which stores high-level patterns or knowledge, and visual memory, which encodes raw or detailed perceptual frames, episodic memory emphasizes the retention of discrete, timestamped event records at multiple timescales. Each record encapsulates a captioned and structured representation (e.g., knowledge triplets) of a temporally bounded segment, supporting precise recall of "what happened when" across highly variable durations.

The core motivation is to facilitate factual recall and evidence grounding for queries that require reference to specific episodes ('What occurred between 1:00–2:00 AM?' or 'Who entered the room immediately prior to the event?') even in environments comprising tens of hours of video or interaction (Yeo et al., 2 Dec 2025).

2. Memory Construction, Segmentation, and Encoding

WorldMM Episodic Memory construction is organized around hierarchical segmentation and embedding.

  • Segmentation: The input stream (e.g., a long video V) is segmented into unit clips at a base timescale t_0 (e.g., 10–30 s), then hierarchically into progressively larger, non-overlapping windows \mathcal{T} = \{t_0, t_1, \dots, t_N\} corresponding to different temporal granularities.
  • Captioning and Triplet Extraction: Each segment V_{t_i}^k is captioned using a video-LLM captioner to generate c_k^{(i)}. An LLM-based extractor maps c_k^{(i)} to one or more factual triplets \{(e_{k,1}, a_{k,1}, e'_{k,1})\}.
  • Embedding: Each segment's summary or triplet structure is embedded into a fixed-length vector v_k^{(i)} = \mathrm{Embed}(c_k^{(i)}) \in \mathbb{R}^d, and further projected to a memory key k_k^{(i)} = W_k v_k^{(i)} + b_k \in \mathbb{R}^{d'}.
  • Index Construction: For each timescale t_i, an index is built storing tuples (k_k^{(i)}, v_k^{(i)}, \tau_k^{(i)}), where \tau_k^{(i)} is a timestamp. At each scale, knowledge graphs G_{t_i} are constructed whose nodes are the extracted triplets.

Formally, the episodic memory contents are \mathcal{M}_e = \{G_{t_0}, G_{t_1}, \dots, G_{t_N}\}. This multi-scale approach allows the agent to access either fine-grained or coarse summary information, as required by the temporal resolution of the query (Yeo et al., 2 Dec 2025).
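The hierarchical windowing behind this construction can be sketched in a few lines of Python. The base clip length, level count, and branching factor below are illustrative assumptions, not values reported in the paper:

```python
def segment_timescales(video_len_s, base_s=10.0, levels=4, factor=4):
    """Return non-overlapping (start, end) windows per timescale.

    Level i covers windows of base_s * factor**i seconds, so a 4-level
    hierarchy over base_s=10 yields 10 s, 40 s, 160 s, and 640 s windows.
    """
    scales = []
    for i in range(levels):
        win = base_s * factor ** i
        bounds, t = [], 0.0
        while t < video_len_s:
            bounds.append((t, min(t + win, video_len_s)))
            t += win
        scales.append(bounds)
    return scales

# One hour of video at four granularities.
scales = segment_timescales(video_len_s=3600, base_s=10, levels=4)
print([len(s) for s in scales])  # [360, 90, 23, 6]
```

Each window at each scale would then be captioned, triplet-extracted, and embedded before being inserted into that scale's index.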

3. Indexing, Retrieval, and Temporal Reasoning

Efficient recall at different temporal resolutions is central to WorldMM Episodic Memory.

  • Key Indexing: All keys \{k_k^{(i)}\} are stored for each timescale in separate approximate nearest-neighbor (ANN) indices. In some versions, graph-based indices with Personalized PageRank (PPR) further augment search to capture connectivity among factual nodes.
  • Query and Scoring: Given a query q (e.g., a linguistic question or multi-modal prompt), it is embedded as q_\mathrm{emb} \in \mathbb{R}^{d'}. Each candidate key k_k^{(i)} is scored by scaled dot-product attention:

\alpha_k^{(i)} = \frac{\exp(q_\mathrm{emb} \cdot k_k^{(i)} / \tau)}{\sum_j \exp(q_\mathrm{emb} \cdot k_j^{(i)} / \tau)}

where \tau is a temperature hyperparameter controlling sharpness.

  • Aggregation: The corresponding values are aggregated at each timescale to a retrieved memory vector:

r^{(i)} = \sum_k \alpha_k^{(i)} v_k^{(i)}

  • Cross-Scale Reranking: To select the most relevant temporal resolution, a cross-scale reranker (an LLM prompt) compares the retrieved vectors \{r^{(0)}, \dots, r^{(N)}\} and selects the best evidence for answering the query.

This approach allows WorldMM to flexibly retrieve episodic content at the most appropriate granularity for the task (Yeo et al., 2 Dec 2025).
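The per-timescale scoring and aggregation steps above can be sketched with NumPy. The `retrieve` helper is a hypothetical illustration of temperature-scaled softmax attention over stored keys, not the system's actual implementation:

```python
import numpy as np

def retrieve(q_emb, keys, values, tau=0.1):
    """keys: (n, d'), values: (n, d), q_emb: (d',) -> (d,) retrieved vector."""
    scores = keys @ q_emb / tau                    # scaled dot products
    scores -= scores.max()                         # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
    return alpha @ values                          # weighted sum of values

# Toy memory: 128 stored segments with d'=32 keys and d=64 values.
rng = np.random.default_rng(0)
keys = rng.normal(size=(128, 32))
values = rng.normal(size=(128, 64))
q = rng.normal(size=32)
r = retrieve(q, keys, values)
print(r.shape)  # (64,)
```

In the full system this step runs once per timescale, producing the set of candidates \{r^{(0)}, \dots, r^{(N)}\} handed to the cross-scale reranker.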

4. Empirical Benefits and Benchmarks

WorldMM Episodic Memory's dynamic multi-scale approach yields measurable advantages in long-horizon tasks. In EgoLifeQA, switching from a single-scale to the full multi-scale graph memory increased QA accuracy from 51.8% to 56.4%. In dynamic temporal scope retrieval experiments, the module achieved a temporal intersection-over-union (tIoU) of 10.09%, outperforming fixed-scale episodic baselines that attain around 4%. On aggregated long-video question-answering benchmarks, these advances result in an average performance gain of +8.4% over the prior state-of-the-art (Yeo et al., 2 Dec 2025). These results directly reflect the advantage of multi-scale event segmentation and flexible hierarchical retrieval for tasks spanning hours or days of real-world experience.
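For reference, the tIoU metric quoted above is the overlap between the predicted and ground-truth time spans divided by their union; a minimal sketch, representing intervals as (start, end) pairs in seconds:

```python
def temporal_iou(pred, gt):
    """tIoU of two (start, end) intervals; 0.0 when they are disjoint."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Prediction covers 1:00-3:00, ground truth 2:00-4:00: 60 s overlap / 180 s union.
print(round(temporal_iou((60, 180), (120, 240)), 4))  # 0.3333
```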

5. Pseudocode and Operational Workflow

A summary of the operational workflow for constructing and retrieving from WorldMM Episodic Memory is as follows:

// Construction
for each timescale t_i in T:
  for each clip V_t_i^k in video V:
    c = Caption(V_t_i^k)
    triplets = ExtractTriplets(c)
    v = Embed(c)
    k = W_k v + b_k
    InsertIntoIndex(scale=t_i, key=k, value=v, time=segment_time)

// Retrieval (for query q)
q_emb = Embed(q)
recovered = []
for each timescale t_i in T:
  cands = RetrieveTopK(scale=t_i, q_emb)
  α = softmax({q_emb · k_cand} / τ)
  r_i = Σ_cand α_cand · v_cand
  recovered.append((t_i, r_i))
// Cross-scale Rerank by LLM
best_scale, best_r = RerankWithLLM(q, recovered)
return best_scale, best_r

No end-to-end loss is imposed on the memory projections; memory is constructed offline using pretrained LLMs for scoring and PPR for graph indexing.
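The PPR graph indexing mentioned above can be approximated by standard power iteration with restart mass on query-relevant seed nodes. The sketch below uses a toy adjacency matrix and illustrates the general technique, not the system's actual index:

```python
import numpy as np

def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power-iteration PPR with restart mass concentrated on seed nodes."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    # Row-normalize adjacency into a transition matrix (zero rows stay zero).
    P = np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)
    restart = np.zeros(n)
    restart[list(seeds)] = 1.0 / len(seeds)
    p = restart.copy()
    for _ in range(iters):
        p = alpha * restart + (1 - alpha) * (P.T @ p)
    return p  # relevance scores over fact nodes

# Toy 4-node fact graph; seeding node 0 mimics the ANN hits for a query.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 0, 1],
                [1, 0, 0, 1],
                [0, 1, 1, 0]], dtype=float)
scores = personalized_pagerank(adj, seeds=[0])
print(scores.argmax())  # seed node ranks highest
```

Propagating relevance this way lets retrieval surface fact nodes connected to the ANN hits even when their own keys score poorly against the query.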

6. Comparison with Prior Episodic Memory Architectures

WorldMM Episodic Memory distinguishes itself from earlier episodic memory systems in several respects:

  • Temporal Granularity: Unlike fixed-batch memory banks in episodic planning networks or single-timescale DNDs in differentiable memory architectures (Ritter et al., 2018, Pickett et al., 2016), WorldMM explicitly supports and reasons over multiple levels of temporal segmentation.
  • Factual Structure: Each segment is converted into a knowledge graph of factual triplets, enabling semantically structured access analogous to the event-graph memories in agents with relational/world-modeling objectives (Yeo et al., 2 Dec 2025).
  • Multimodal Integration: The retrieval agent adaptively selects among visual, semantic, and episodic memories, optimizing for the modality and temporal scale best suited to the question.
  • Cross-Scale Reranking: The use of LLM-based reranking across scales enables nonparametric selection of the optimal evidence span.
  • Offline Construction and Retrieval: In contrast to architectures that integrate episodic learning with gradient propagation (e.g., ESWM (He et al., 19 May 2025) or GRU-based actor-critics with prioritized episodic replay (Zhang et al., 2021)), WorldMM decouples episodic construction from task loss, relying on offline LLMs and ANN/PPR indices for its memory dynamics.

A plausible implication is that future modular world models for RL or embodied agents may augment or replace pure working memory architectures with WorldMM-style episodic subsystems for improved temporal coverage, evidence retrieval, and factual grounding in complex open-ended environments.

7. Limitations, Extensions, and Open Questions

Current WorldMM episodic memory relies on offline construction and does not use gradient-based end-to-end optimization for its retrieval projections or aggregation. As a result, adaptation to fast-changing environments or novel query semantics may be limited compared to memory modules trained in situ. The current fact-extraction pipeline and hierarchy construction are largely governed by LLMs and presegmented temporal windows; it remains an open direction to learn segmentation boundaries and evidence graphs jointly with agent objectives, as well as to integrate contrastive or other self-supervised regularization for retrieval keys (Yeo et al., 2 Dec 2025). Another important question is the extension to settings where episodes are not naturally demarcated (e.g., continuous RL or streaming sensor data), as well as the fusion of imagined/future events with real historical episodes for hybrid planning, as explored in hybrid memory systems (Pan et al., 2024).

In summary, WorldMM Episodic Memory represents an advanced instantiation of temporally structured event recall, supporting hierarchical, fact-grounded, and multi-modal retrieval in long-horizon, complex environments. Its architectural principles and empirical validation position it as a reference model for future episodic memory modules in world modeling and embodied reasoning agents (Yeo et al., 2 Dec 2025).
