Persistent Multimodal & Entity-Centric Memory
- Persistent multimodal and entity-centric memory is a computational framework that encodes, retains, and retrieves multi-signal knowledge about entities across time.
- It employs latent, addressable entity representations, transformer-based layers, and dynamic knowledge graphs for context-aware memory augmentation and efficient retrieval.
- These systems are applied in QA, embodied AI, and robotics, demonstrating significant performance improvements in long-term recall and reasoning precision.
Persistent multimodal and entity-centric memory refers to computational mechanisms, architectures, and datasets by which artificial agents encode, retain, and retrieve knowledge about entities—objects, people, places, concepts—across modalities (text, image, audio, video, physiological signals) and time. These memory systems are designed to emulate aspects of human cognition, particularly the capacity for long-term, context-rich recall, entity consistency, association across experiences, and cross-modal reasoning. Recent developments have yielded frameworks that blend explicit entity memory with multimodal representations, cognitively inspired processing, continual memory augmentation, and efficient retrieval to meet demands in question answering, embodied AI, conversational analysis, assistive technologies, and more.
1. Architectures for Entity-Centric and Multimodal Memory
Recent frameworks implement entity-centric memory as latent, addressable representations within neural architectures. For example, EDMem (Zhang et al., 2022) integrates a large embedding table pre-trained on Wikipedia, storing dense vectors for millions of entities. When encountering the delimiting tokens for entity mentions ([E_s], [E_e]), the model queries the entity memory using its hidden states. A plausible form of this lookup, following Entities-as-Experts-style attention over the embedding table (EDMem's exact formulation may differ in detail), is:
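$$
\alpha_i = \frac{\exp\!\left(h^{\top} e_i\right)}{\sum_{j=1}^{N}\exp\!\left(h^{\top} e_j\right)}, \qquad m = \sum_{i=1}^{N} \alpha_i\, e_i
$$

where $h$ is the hidden state at the [E_s] token, $e_i$ are entries of the entity embedding table, and the retrieved memory $m$ is fused back into the hidden representation (in practice such attention is typically truncated to the top-$K$ scoring entities for efficiency).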
This approach allows direct, context-sensitive retrieval of entity knowledge within a unified encoder-decoder framework. Multi-modal extensions, such as continuous memory architectures (Wu et al., 23 May 2025), compress image-text pairs into a small set of continuous embeddings using VLMs and a Q-Former compressor. A minimal sketch of the compression step, with illustrative (not paper-specified) module names and dimensions, follows:
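```python
import torch
import torch.nn as nn

class QFormerCompressor(nn.Module):
    """Compress a long sequence of VLM features into k continuous memory
    embeddings via learned queries and cross-attention. All dimensions and
    layer counts here are illustrative assumptions."""

    def __init__(self, d_model: int = 768, num_queries: int = 8, n_heads: int = 8):
        super().__init__()
        # k learned query vectors -- these become the compressed memory slots.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, vlm_features: torch.Tensor) -> torch.Tensor:
        # vlm_features: (batch, seq_len, d_model) -- concatenated image+text features.
        b = vlm_features.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attn_out, _ = self.cross_attn(q, vlm_features, vlm_features)
        q = self.norm1(q + attn_out)
        memory = self.norm2(q + self.ffn(q))
        return memory  # (batch, num_queries, d_model): the continuous memory

# Usage: compress 576 patch/token features into 8 memory embeddings (~72x).
feats = torch.randn(1, 576, 768)
mem = QFormerCompressor()(feats)
print(mem.shape)  # torch.Size([1, 8, 768])
```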
Such designs persist multimodal knowledge at high compression rates (exceeding 80x in reported results) and support plug-and-play augmentation of underlying models without increasing input token length.
Biologically inspired models like HippoMM (Lin et al., 14 Apr 2025) and RoboMemory (Lei et al., 2 Aug 2025) adopt hippocampus-like modules for pattern separation, completion, and consolidation, segmenting streams into episodic events and merging overlapping temporal windows to form context-rich memory traces. Dynamic Knowledge Graphs (KGs) provide scalable, entity-centric mapping of objects, locations, and relationships in spatial and semantic memory, supporting fast, consistent updates and parallelized retrieval.
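As a concrete illustration of such an entity-centric store, the following minimal sketch (a hypothetical schema, not drawn from any cited system) supports timestamped attribute upserts and simple relation queries:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class EntityNode:
    """One entity in spatial/semantic memory; attributes are timestamped
    so stale observations can be superseded consistently."""
    entity_id: str
    attributes: dict = field(default_factory=dict)  # name -> (value, timestamp)

class DynamicKG:
    def __init__(self):
        self.nodes: dict[str, EntityNode] = {}
        # (subject_id, relation) -> object_ids, e.g. ("cup_1", "on") -> {"table_2"}
        self.edges: dict[tuple, set] = defaultdict(set)

    def upsert(self, entity_id: str, timestamp: float, **attrs):
        node = self.nodes.setdefault(entity_id, EntityNode(entity_id))
        for key, value in attrs.items():
            old = node.attributes.get(key)
            if old is None or old[1] < timestamp:  # keep only the freshest value
                node.attributes[key] = (value, timestamp)

    def relate(self, subj: str, relation: str, obj: str):
        self.edges[(subj, relation)].add(obj)

    def query(self, subj: str, relation: str) -> set:
        return self.edges[(subj, relation)]

kg = DynamicKG()
kg.upsert("cup_1", timestamp=3.0, color="red", room="kitchen")
kg.relate("cup_1", "on", "table_2")
print(kg.query("cup_1", "on"))  # {'table_2'}
```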
2. Learning Mechanisms and Modalities
Modern systems leverage multimodal signals—images, audio, text, physiological data—to build and enrich memory. Cognitive map models (Stoewer et al., 2023) train multi-branch neural networks on successor representations, which in their standard reinforcement-learning form encode the discounted expected future occupancy of states:
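$$
M(s, s') = \mathbb{E}\!\left[\left.\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{1}\!\left[s_t = s'\right] \,\right|\; s_0 = s\right], \qquad M = I + \gamma\, T\, M
$$

where $T$ is the one-step state-transition matrix and $\gamma \in [0, 1)$ the discount factor. This is the textbook definition; the cited work's training targets may differ in detail.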
Visual and semantic features are fused, supporting inference between modalities (e.g., predicting image embeddings from word embeddings and vice versa), with peak accuracies exceeding 90%. Memento (Ghosh et al., 28 Apr 2025) augments recall by detecting event-related potentials (ERPs) in EEG, together with GSR and PPG signals from wearables, identifying cognitive-attentional episodes that are then transformed into personalized navigation cues. SnapNTell (Qiu et al., 7 Mar 2024) and ECIS-VQG (Phukan et al., 13 Oct 2024) enrich entity reasoning by aligning visual regions, captions, and textual knowledge through adapters and contrastive learning.
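A deliberately simplified, threshold-based sketch of how attentional episodes might be flagged in a physiological stream (the detection rule and all parameters are assumptions; Memento's actual ERP pipeline is more involved):

```python
import numpy as np

def detect_attention_episodes(signal: np.ndarray, fs: float,
                              z_thresh: float = 2.5,
                              min_gap_s: float = 1.0) -> list[tuple[float, float]]:
    """Flag episodes as intervals where the z-scored signal exceeds a
    threshold, merging events closer than min_gap_s seconds."""
    z = (signal - signal.mean()) / (signal.std() + 1e-8)
    above = z > z_thresh
    episodes, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            episodes.append((start / fs, i / fs))
            start = None
    if start is not None:
        episodes.append((start / fs, len(above) / fs))
    merged = []  # merge episodes separated by less than min_gap_s
    for s, e in episodes:
        if merged and s - merged[-1][1] < min_gap_s:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

# 10 s of synthetic EEG-like data at 256 Hz with one injected "event".
fs = 256.0
sig = np.random.default_rng(1).standard_normal(int(10 * fs))
sig[1280:1330] += 5.0
print(detect_attention_episodes(sig, fs))  # ~[(5.0, 5.2)]
```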
3. Retrieval Strategies and Memory Augmentation
Major advances hinge on robust retrieval and augmentation strategies, enabling persistent recall over long-tail entities and complex temporal contexts. SnapNTell augments VQA with GLIP for semantic detection, CLIP for embedding, and Faiss for nearest-neighbor retrieval, and integrates the retrieved external captions into LLM-based answer generation, yielding a 66.5% improvement in BLEURT score on entity-centric VQA.
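A minimal sketch of the embed-and-retrieve step, assuming precomputed CLIP embeddings and Faiss's exact inner-product index (the detection and caption-integration stages are omitted; all sizes are illustrative):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 512  # CLIP ViT-B/32 embedding dimension
rng = np.random.default_rng(0)

# Stand-in for precomputed CLIP embeddings of the entity knowledge base;
# in the real pipeline these come from the CLIP image/text encoders.
kb_embeddings = rng.standard_normal((10_000, d)).astype("float32")
faiss.normalize_L2(kb_embeddings)   # cosine similarity via inner product

index = faiss.IndexFlatIP(d)        # exact inner-product search
index.add(kb_embeddings)

# CLIP embedding of the detected query region (stand-in).
query = rng.standard_normal((1, d)).astype("float32")
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)  # top-5 nearest entities
print(ids[0], scores[0])
```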
Pensieve (Jiang et al., 22 Sep 2025), designed for Memory-QA tasks, augments memory tuples offline with OCR and detailed captions, and computes multi-signal relevance (temporal, location, embedding similarity) during retrieval:
| Signal type | Computation | Notes |
|---|---|---|
| Date match | Time-range alignment | |
| Recency | Exponential time decay via decay constants | Short/med/long |
| Location | BM25-based score | Query-location match |
| Embedding sim. | Multi-modal encoder similarity | |
Scores are fused in re-ranking, and answers are generated via multi-memory QA tuning with noise-injected negatives for robustness.
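A toy sketch of fusing such per-signal scores into a single re-ranking score (the weights, normalization, and decay constant are illustrative assumptions, not Pensieve's actual parameters):

```python
import math

def recency_score(age_days: float, tau_days: float = 30.0) -> float:
    """Exponential time decay; tau is an assumed decay constant."""
    return math.exp(-age_days / tau_days)

def fuse(date_match: float, recency: float, location: float,
         embed_sim: float, weights=(0.3, 0.2, 0.2, 0.3)) -> float:
    """Weighted linear fusion of normalized per-signal scores in [0, 1]."""
    signals = (date_match, recency, location, embed_sim)
    return sum(w * s for w, s in zip(weights, signals))

# Re-rank candidate memories by fused relevance.
candidates = [
    {"id": "m1", "date_match": 1.0, "age_days": 2.0,  "location": 0.1, "embed_sim": 0.72},
    {"id": "m2", "date_match": 0.0, "age_days": 90.0, "location": 0.9, "embed_sim": 0.81},
]
ranked = sorted(
    candidates,
    key=lambda c: fuse(c["date_match"], recency_score(c["age_days"]),
                       c["location"], c["embed_sim"]),
    reverse=True,
)
print([c["id"] for c in ranked])
```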
OmniQuery (Li et al., 12 Sep 2024) structures episodic photo/video memory into atomic (caption, OCR, detected objects), composite (event clusters via overlapping temporal windows), and semantic layers (LLM-inferred declarative knowledge). Queries are decomposed for precise filtering and augmented retrieval; an answer accuracy of 71.5% reflects strong integration of personal multimodal and contextual cues.
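The layered organization can be pictured with a minimal, hypothetical schema (field names are illustrative, not OmniQuery's data model):

```python
from dataclasses import dataclass, field

@dataclass
class AtomicMemory:
    """Per-item facts extracted directly from a photo or video."""
    media_id: str
    caption: str
    ocr_text: str
    detected_objects: list[str]
    timestamp: float

@dataclass
class CompositeMemory:
    """Event cluster grouping atomic memories whose timestamps
    fall within overlapping temporal windows."""
    event_id: str
    member_ids: list[str]
    window: tuple[float, float]

@dataclass
class SemanticMemory:
    """LLM-inferred declarative knowledge distilled from events."""
    statement: str
    supporting_events: list[str] = field(default_factory=list)
```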
4. Reasoning and Memory-Based Task Performance
Persistent multimodal memory systems demonstrably enhance reasoning for open-domain QA, embodied navigation, and long-video understanding. EDMem's entity-constrained generation yields exact-match improvements of +9%/6%/6% over baseline QA models. SnapNTell reduces hallucinations in entity-rich VQA (notably for long-tail instances). ECIS-VQG achieves high BLEU, ROUGE, CIDEr, and METEOR scores with self-complete, entity-centric questions, suggesting effective multi-modal entity anchoring.
Embodied agentic frameworks such as MemoryEQA (Zhai et al., 20 May 2025) and RoboMemory apply hierarchical, multi-modal memory across planner, reasoning, and execution modules, reducing redundant exploration and improving multi-target QA by nearly 20%. M3-Agent (Long et al., 13 Aug 2025) employs entity-centric memory graphs and multi-turn, RL-optimized iterative search, yielding 5–8% higher long-video QA accuracy over strong prompting baselines; ablations show reliance on semantic memory and iterative reasoning.
HippoMM demonstrates robust event segmentation and retrieval—pattern separation/completion and semantic replay—resulting in 78.2% accuracy vs. 64.2% for video RAG, accompanied by a fivefold response time speedup. Memento’s physiological-cue–based recall increases route memory by 20–23% and cuts recall effort nearly in half.
5. Practical Implications and Applications
Persistent multimodal and entity-centric memory methods are actively deployed in personal assistants (OmniQuery), cognitive augmentation tools (Memento), video learning and fact-checking (ECIS-VQG), robotic agents (RoboMemory, MemoryEQA), and conversational modeling (MeMo (Tsfasman et al., 7 Sep 2024)). The MeMo dataset underscores the value of modeling conversational memory retention and its role in social connection, revealing that multimodal verbal and nonverbal cues can predict group-level memorability, with applications in facilitation and summarization systems.
Advanced QA systems leverage explicit entity memory and retrieval augmentation to manage dynamically evolving knowledge sets, improving reliability in fact-based dialogues and multi-hop reasoning. EntityCLIP (Wang et al., 23 Oct 2024) applies gated, multimodal attentive experts and contrastive learning, yielding state-of-the-art performance on news retrieval tasks.
6. Future Directions and Theoretical Considerations
Research trajectories include scalable lifelong memory integration (RoboMemory’s dynamic KG, parallel memory updates), advanced compression (CoMEM’s continuous memory, >80x rate), cross-modal and multilingual knowledge transfer (M3-Agent, CoMEM), and expansion into new modalities (auditory, biometric, real-time AR interfaces).
Challenges remain in context disambiguation, reference resolution, handling vague queries, mitigating hallucinations, and supporting entity persistence over distributed, multi-user or multi-agent systems. Methodological advances in context-aware multi-signal fusion, hierarchical abstraction, reinforcement learning–based memory querying, and dynamic updating will shape robust applications.
A plausible implication is that maximally efficient persistent multimodal and entity-centric memory systems will be modular, plug-and-play, support hierarchical abstraction and entity tracking, and integrate both episodic and semantic layers through dynamic, continual augmentation. This suggests strong utility for embodied agents, extended QA systems, personal archives, lifelong learning robotics, and memory-centric conversational analytics where referential and contextual integrity across time and modalities is paramount.