Spatio-Temporal Embodiment Memory (STEM)

Updated 4 July 2026

STEM is a memory architecture for embodied agents that integrates spatial layouts, temporal events, and embodiment data.
It employs persistent, queryable representations and dynamic scene graphs to support robust decision-making and object permanence.
Empirical evaluations report that STEM-like systems improve long-horizon task success and retrieval efficiency across varied robotic tasks.

Spatio-Temporal-Embodiment Memory (STEM) denotes an integrated memory architecture for embodied agents in which spatial scene structure, temporal event history, and embodiment-specific state are stored in a persistent, queryable representation. In recent robotics and embodied-LLM work, STEM-like systems appear as object-centered 3D memories, episodic spatio-temporal stores, dynamic scene or knowledge graphs, and shared multi-robot world models. Across these formulations, the common objective is to avoid stateless observation-to-action behavior by preserving where entities are, what has happened to them, and how actions or agent capabilities constrain subsequent decisions (Huang et al., 13 Mar 2026, Tan et al., 30 Oct 2025, Lei et al., 14 Feb 2025).

1. Conceptual definition and scope

A compact formalization appears in RoboOS-NeXT, which defines shared memory at time $t$ as

$M(t) = \big(S(t),\,T(t),\,E(t)\big),$

where $S(t)$ is spatial memory, $T(t)$ is temporal memory, and $E(t)$ is embodiment memory (Tan et al., 30 Oct 2025). STMA expresses a closely related decomposition at the belief-state level,

$b_i = (b_i^t, b_i^s),$

with $b_i^t$ summarizing temporal interaction history and $b_i^s$ encoding current spatial configuration in a dynamic knowledge graph (Lei et al., 14 Feb 2025). STAR, in turn, defines long-term memory as a task-agnostic store of spatio-temporal observations,

$m_t = (t, x_t, \text{embed}(o_t), o_t),$

and couples it to a working memory that records action–outcome pairs during execution (Chen et al., 18 Nov 2025).
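As a rough illustration of how these definitions map onto data structures, the Python sketch below stores STAR-style tuples $m_t$ in an append-only list and groups the three RoboOS-NeXT components into one container. The class names, the `toy_embed` function, and the `between` query are hypothetical and come from none of the cited papers.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Vector = Tuple[float, ...]


def toy_embed(observation: str) -> Vector:
    """Hypothetical stand-in for a learned encoder: (length, vowel count)."""
    return (float(len(observation)), float(sum(c in "aeiou" for c in observation)))


@dataclass(frozen=True)
class MemoryEntry:
    """One long-term entry m_t = (t, x_t, embed(o_t), o_t)."""
    t: float
    x: Tuple[float, float]
    embedding: Vector
    observation: Any


@dataclass
class LongTermMemory:
    """Append-only, task-agnostic store of spatio-temporal observations."""
    embed: Callable[[Any], Vector] = toy_embed
    entries: List[MemoryEntry] = field(default_factory=list)

    def write(self, t: float, x: Tuple[float, float], observation: Any) -> None:
        self.entries.append(MemoryEntry(t, x, self.embed(observation), observation))

    def between(self, t0: float, t1: float) -> List[MemoryEntry]:
        """Return entries whose timestamp lies in [t0, t1]."""
        return [m for m in self.entries if t0 <= m.t <= t1]


@dataclass
class SharedMemory:
    """M(t) = (S(t), T(t), E(t)) as three plain containers (assumed layout)."""
    spatial: Dict[str, Tuple[float, float]] = field(default_factory=dict)  # S(t)
    temporal: List[Tuple[float, str]] = field(default_factory=list)        # T(t)
    embodiment: Dict[str, Dict[str, Any]] = field(default_factory=dict)    # E(t)


ltm = LongTermMemory()
ltm.write(0.0, (0.0, 0.0), "mug on table")
ltm.write(5.0, (2.0, 1.0), "mug in sink")
recent = ltm.between(4.0, 6.0)
```

Keeping the raw observation next to its embedding lets a later query either rank by similarity or re-inspect the original input.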

These formulations share three commitments. First, memory is persistent across multiple decision points rather than reconstructed from scratch. Second, spatial grounding is explicit: coordinates, scene graphs, object relations, or room/place indices are first-class memory fields. Third, embodiment is represented either through action traces, robot poses, or explicit capability profiles. RoboStream makes this motivation explicit by arguing that VLM-based planners fail on long-horizon manipulation because they lack “persistent geometric anchoring” and “memory of action-triggered state transitions,” leading to catastrophic forgetting under occlusion and cascading precondition violations (Huang et al., 13 Mar 2026).

The literature also uses related terminology in different ways. STEM-COVID uses “Spatio-Temporal Episodic Memory” for a fusion-ART architecture over agent trajectories and labels, whereas RoboOS-NeXT uses “Spatio-Temporal-Embodiment Memory” for a shared multi-robot world model (Hu et al., 2020, Tan et al., 30 Oct 2025). This suggests that STEM is best understood as an architectural family rather than a single canonical data structure.

2. Representational primitives

STEM systems differ chiefly in the granularity of their memory units. Some are object-centric and geometry-grounded; others are episodic tuples indexed by place and time; others are hybrid graph-plus-context memories.

RoboStream defines an object-level primitive called the Spatio-Temporal Fusion Token (STF-Token): $\tau_i^t = \Big\langle \mathbf{v}_i^t,\; \mathbf{c}_i^t,\; \mathbf{s}_i^t,\; t \Big\rangle,$ where $\mathbf{v}_i^t$ is visual evidence, $\mathbf{c}_i^t$ a 3D centroid, $\mathbf{s}_i^t$ a Gaussian-style shape descriptor, and $t$ a timestamp (Huang et al., 13 Mar 2026). These tokens are organized into a Causal Spatio-Temporal Graph (CSTG) with object nodes, spatial edges, and an event log recording planned displacements, collisions, occlusions, action execution, and subtask completion. The result is a 4D scene memory that binds “what, where, when,” plus causal state transitions.
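One way to picture this object-level memory is a small graph container holding the latest token per object, typed spatial edges, and an append-only event log. The sketch below is illustrative only: the field layout, the relation and event names, and the `history` helper are assumptions rather than RoboStream's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class STFToken:
    visual: Tuple[float, ...]  # visual evidence (assumed to be a feature vector)
    centroid: Vec3             # 3D centroid
    shape: Tuple[float, ...]   # shape descriptor (assumed to be a flat tuple)
    t: float                   # timestamp


@dataclass(frozen=True)
class Event:
    t: float
    kind: str        # e.g. "occlusion", "collision", "action", "subtask_done"
    object_id: str


@dataclass
class CausalGraph:
    nodes: Dict[str, STFToken] = field(default_factory=dict)
    edges: Set[Tuple[str, str, str]] = field(default_factory=set)  # (subject, relation, object)
    events: List[Event] = field(default_factory=list)

    def observe(self, object_id: str, token: STFToken) -> None:
        """Store the most recent token for an object ("what" and "where")."""
        self.nodes[object_id] = token

    def relate(self, subject: str, relation: str, obj: str) -> None:
        self.edges.add((subject, relation, obj))

    def log(self, t: float, kind: str, object_id: str) -> None:
        """Append a state transition ("when" and "why")."""
        self.events.append(Event(t, kind, object_id))

    def history(self, object_id: str) -> List[Event]:
        return [e for e in self.events if e.object_id == object_id]


graph = CausalGraph()
graph.observe("cube", STFToken((0.3, 0.7), (0.1, 0.2, 0.0), (0.02, 0.02, 0.02), t=1.0))
graph.observe("plate", STFToken((0.9, 0.1), (0.4, 0.1, 0.0), (0.1, 0.1, 0.01), t=1.0))
graph.relate("cube", "left_of", "plate")
graph.log(2.0, "action", "cube")
graph.log(3.0, "occlusion", "cube")
```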

RoboOS-NeXT uses a more explicitly hierarchical spatial substrate. Its spatial memory is a scene tree with root, region, and carrier nodes, together with per-carrier object graphs whose nodes store intrinsic properties, dynamic state, and pose, and whose edges encode typed spatial predicates (Tan et al., 30 Oct 2025). Embodiment is represented separately by per-robot profiles capturing location, capabilities, resources, sensor state, and availability.
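To make the role of such profiles concrete, the sketch below filters robots by capability, availability, and a resource threshold before a task is assigned. The field names, the battery threshold, and the `eligible` helper are hypothetical, and sensor state is omitted for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class RobotProfile:
    name: str
    location: str
    capabilities: FrozenSet[str]
    battery: float      # resource level in [0, 1] (assumed representation)
    available: bool


def eligible(profiles: List[RobotProfile], capability: str,
             min_battery: float = 0.2) -> List[str]:
    """Names of robots that can take a task requiring `capability` right now."""
    return [
        p.name
        for p in profiles
        if p.available and capability in p.capabilities and p.battery >= min_battery
    ]


fleet = [
    RobotProfile("arm_1", "kitchen", frozenset({"grasp", "pour"}), 0.9, True),
    RobotProfile("arm_2", "kitchen", frozenset({"grasp"}), 0.1, True),
    RobotProfile("base_1", "hall", frozenset({"navigate", "carry"}), 0.8, False),
]
graspers = eligible(fleet, "grasp")
```

In this toy fleet only `arm_1` qualifies for grasping: `arm_2` is below the battery threshold and `base_1` lacks the capability and is unavailable.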

A more retrieval-oriented formulation appears in ReMEmbR and STAR. ReMEmbR stores segment-level records in a vector database: each entry holds a caption embedding, the caption, a position, and a timestamp, and summarizes a short video segment for later spatial, temporal, or semantic retrieval (Anwar et al., 2024). STAR uses tuples of the form $m_t = (t, x_t, \text{embed}(o_t), o_t)$ in a non-parametric long-term memory, with a separate working memory that accumulates task-focused action–outcome history (Chen et al., 18 Nov 2025).
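A segment-level store of this kind can be approximated with an in-memory list and three retrieval functions, one per query modality. This is a minimal sketch under stated assumptions: a real system would use a vector database and a learned text encoder, and the class and method names here are invented for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SegmentRecord:
    embedding: Tuple[float, ...]   # caption embedding
    caption: str
    position: Tuple[float, float]
    timestamp: float


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SegmentStore:
    def __init__(self) -> None:
        self.records: List[SegmentRecord] = []

    def insert(self, record: SegmentRecord) -> None:
        self.records.append(record)

    def by_text(self, query: Sequence[float], k: int = 1) -> List[SegmentRecord]:
        """Semantic retrieval: highest cosine similarity to the query embedding."""
        return sorted(self.records, key=lambda r: -cosine(query, r.embedding))[:k]

    def by_position(self, xy: Tuple[float, float], k: int = 1) -> List[SegmentRecord]:
        """Spatial retrieval: nearest stored position."""
        return sorted(self.records, key=lambda r: math.dist(xy, r.position))[:k]

    def by_time(self, t: float, k: int = 1) -> List[SegmentRecord]:
        """Temporal retrieval: closest timestamp."""
        return sorted(self.records, key=lambda r: abs(r.timestamp - t))[:k]


store = SegmentStore()
store.insert(SegmentRecord((1.0, 0.0), "a person enters the lobby", (0.0, 0.0), 10.0))
store.insert(SegmentRecord((0.0, 1.0), "a cart is parked by the elevator", (5.0, 2.0), 70.0))
```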

3DLLM-Mem adopts dense 3D token memories. Current observation tokens form working memory, while past observations are stored as episodic memory. A memory projection and temporal embedding yield memory features, and fused memory-enhanced tokens are formed by letting current working-memory queries attend into the episodic memory bank (Hu et al., 28 May 2025).

System	Core memory unit	Stored factors
RoboStream	STF-Token $\tau_i^t$	Visual evidence, 3D geometry, shape, time
RoboOS-NeXT	$M(t) = \big(S(t),\,T(t),\,E(t)\big)$	Scene graph, event log, robot profiles
STAR	$m_t = (t, x_t, \text{embed}(o_t), o_t)$	Time, pose, semantics, raw observation
ReMEmbR	Segment record in vector DB	Caption, embedding, pose, timestamp
3DLLM-Mem	Working/episodic 3D tokens	Dense 3D features, time, fused retrieval context

A plausible implication is that STEM admits both symbolic and sub-symbolic realizations. Object graphs, vector databases, and dense 3D token banks all satisfy the same higher-level requirement: persistent alignment of space, time, and embodied interaction.

3. Memory operations and control loops

STEM systems are defined not only by representation but also by a write–update–read loop tightly coupled to action selection. RoboStream makes this explicit. Given an RGB-D observation, a goal or language instruction, and past memory, it encodes STF-Tokens, updates the causal graph, queries a VLM planner for a semantic directive, and then deterministically instantiates a 6-DoF action. This separates high-level semantic planning from low-level geometric execution and makes memory the central world model (Huang et al., 13 Mar 2026).
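The four stages can be sketched as a single function in which the planner sees only symbolic names and the geometry comes from memory. Everything concrete here is an assumption made for illustration: the `(verb, object, reference)` directive format, the dictionary memory, the fixed clearance, and the zero orientation are not taken from the paper.

```python
from typing import Callable, Dict, Tuple

Vec3 = Tuple[float, float, float]
Directive = Tuple[str, str, str]   # (verb, object, reference): hypothetical format
Pose6 = Tuple[float, float, float, float, float, float]  # x, y, z, roll, pitch, yaw


def control_step(
    observation: Dict[str, Vec3],
    goal: str,
    memory: Dict[str, Vec3],
    encode: Callable[[Dict[str, Vec3]], Dict[str, Vec3]],
    plan: Callable[[str, Dict[str, Vec3]], Directive],
    clearance: float = 0.05,
) -> Pose6:
    tokens = encode(observation)         # write: object id -> centroid
    memory.update(tokens)                # update: persistent memory, changed in place
    verb, obj, ref = plan(goal, memory)  # read: planner returns a semantic directive
    x, y, z = memory[ref]                # ground the reference in stored geometry
    # Deterministic instantiation: hover above the reference with a fixed orientation.
    return (x, y, z + clearance, 0.0, 0.0, 0.0)


# The plate is absent from the current observation but persists in memory.
memory: Dict[str, Vec3] = {"plate": (0.4, 0.1, 0.0)}
action = control_step(
    observation={"cube": (0.1, 0.2, 0.0)},
    goal="put the cube on the plate",
    memory=memory,
    encode=lambda obs: dict(obs),                       # stand-in perception
    plan=lambda goal, mem: ("place_on", "cube", "plate"),  # stand-in VLM planner
)
```

Because the target pose is read from memory rather than from the current frame, the step still succeeds when the reference object is not visible.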

STAR expresses a complementary control formalism. The ideal policy depends on the instruction, long-term memory, and post-task observations, but is approximated using working memory. Crucially, the action space is unified: a step can be either a temporal memory query or a physical action in the environment. Each memory query returns a result from long-term memory, and working memory is updated with the resulting action–outcome pair.

In this formulation, “search in time” and “search in space” are treated symmetrically as tools in a single memory–action loop (Chen et al., 18 Nov 2025).
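A unified action space of this kind can be sketched as a tagged union whose two variants are dispatched by one executor, with every outcome appended to working memory. The type names, the time-window query, and the callable environment are assumptions for illustration, not STAR's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple, Union


@dataclass(frozen=True)
class MemoryQuery:
    t_start: float
    t_end: float


@dataclass(frozen=True)
class PhysicalAction:
    name: str


Step = Union[MemoryQuery, PhysicalAction]


def execute(
    step: Step,
    long_term: List[Tuple[float, Any]],   # (timestamp, observation) pairs
    env: Callable[[str], Any],            # hypothetical environment interface
    working: List[Tuple[Step, Any]],
) -> Any:
    if isinstance(step, MemoryQuery):
        # "Search in time": read observations inside the requested window.
        outcome = [o for (t, o) in long_term if step.t_start <= t <= step.t_end]
    else:
        # "Search in space": act in the environment and observe the result.
        outcome = env(step.name)
    working.append((step, outcome))       # record the action-outcome pair
    return outcome


long_term = [(1.0, "keys on desk"), (8.0, "keys in drawer")]
working: List[Tuple[Step, Any]] = []
execute(MemoryQuery(5.0, 10.0), long_term, lambda a: f"done:{a}", working)
execute(PhysicalAction("open_drawer"), long_term, lambda a: f"done:{a}", working)
```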

ReMEmbR uses a retrieval-augmented variant of the same pattern. It posits that only a subset of the full robot history is needed to answer a question, and approximates answering over the full history by answering over records retrieved from a vector database over caption embeddings, positions, and timestamps. An LLM-agent iteratively calls text, position, and time retrieval tools to construct a relevant support set before producing a structured answer in text, coordinates, time, or duration (Anwar et al., 2024).

STMA implements the write–read–plan loop in a text-only embodied setting. History is summarized into a temporal belief $b_i^t$, spatial triples are extracted into a dynamic knowledge graph, a task-relevant subgraph is retrieved via embedding similarity plus $k$-hop expansion, and an aggregator converts that subgraph into a spatial belief $b_i^s$. A planner generates a subgoal and action sequence, while a critic vetoes unsuitable actions and supplies feedback for replanning (Lei et al., 14 Feb 2025).
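The subgraph retrieval step can be illustrated with a small two-stage routine: pick seed entities by similarity to the query, then grow the selection by a fixed number of hops over the triples. This is a sketch under assumptions: the dot-product scoring, the undirected expansion, and both function names are illustrative choices, not STMA's code.

```python
from typing import Dict, List, Sequence, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def top_seeds(query: Sequence[float], entity_vecs: Dict[str, Sequence[float]],
              n: int = 1) -> Set[str]:
    """Pick the n entities whose embeddings score highest against the query."""
    def score(name: str) -> float:
        return sum(q * e for q, e in zip(query, entity_vecs[name]))
    return set(sorted(entity_vecs, key=score, reverse=True)[:n])


def expand(triples: List[Triple], seeds: Set[str], hops: int) -> List[Triple]:
    """Return triples whose endpoints are both within `hops` of a seed."""
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(hops):
        reached: Set[str] = set()
        for head, _, tail in triples:
            if head in frontier and tail not in visited:
                reached.add(tail)
            if tail in frontier and head not in visited:
                reached.add(head)
        visited |= reached
        frontier = reached
    return [t for t in triples if t[0] in visited and t[2] in visited]


triples = [
    ("key", "in", "drawer"),
    ("drawer", "in", "kitchen"),
    ("kitchen", "next_to", "hall"),
    ("lamp", "on", "desk"),
]
entity_vecs = {"key": (1.0, 0.0), "lamp": (0.0, 1.0), "drawer": (0.5, 0.5)}
seeds = top_seeds((1.0, 0.0), entity_vecs)
one_hop = expand(triples, seeds, hops=1)
two_hop = expand(triples, seeds, hops=2)
```

Increasing the hop count trades a larger, more context-rich subgraph against a longer prompt for the aggregator.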

Taken together, these systems instantiate a common operational principle: STEM is not passive storage. It is an action-conditioned memory loop in which memory writes reflect new observations and consequences of executed actions, while memory reads are selective, task-dependent, and often iterative.

4. Object permanence, uncertainty, and memory management

One major function of STEM is persistence under partial observability. RoboStream treats object permanence as a first-class design target. When an object becomes occluded, its current STF-Token may disappear from direct observation, but its node remains in the CSTG with last known centroid, shape, and history, and an occlusion event is recorded in the event log. This allows planning over hidden-but-persistent objects and supports precondition checking even when the target is not visible (Huang et al., 13 Mar 2026).
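The permanence rule described here amounts to never deleting a node on non-observation. A minimal sketch, assuming a dictionary of nodes keyed by object id and a list-based event log (both hypothetical simplifications of the graph memory):

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def update_nodes(
    nodes: Dict[str, dict],
    observed: Dict[str, Vec3],
    t: float,
    event_log: List[Tuple[float, str, str]],
) -> None:
    """Refresh visible objects; keep unseen ones and log the occlusion once."""
    for object_id, centroid in observed.items():
        nodes[object_id] = {"centroid": centroid, "last_seen": t, "visible": True}
    for object_id, node in nodes.items():
        if object_id not in observed and node["visible"]:
            node["visible"] = False          # last known centroid is retained
            event_log.append((t, "occlusion", object_id))


nodes: Dict[str, dict] = {}
event_log: List[Tuple[float, str, str]] = []
update_nodes(nodes, {"cup": (0.2, 0.0, 0.1)}, 1.0, event_log)  # cup is visible
update_nodes(nodes, {}, 2.0, event_log)                        # cup becomes hidden
update_nodes(nodes, {}, 3.0, event_log)                        # still hidden, no new event
```

A precondition check such as "the cup exists and was last seen at a known location" can then be answered from `nodes` alone while the cup is out of view.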

A second function is reliability under noisy semantics. UQ-DAAAM adds object-level uncertainty quantification to multi-view 4D memory. For each object, caption embeddings are assembled into a weighted evidence matrix, a semantic scatter matrix is formed, and a per-object uncertainty is defined. Objects whose uncertainty meets the paper's criterion are treated as semantically unresolved and actively refined under a fixed budget by selecting additional high-quality views, querying a captioning VLM, and fusing the resulting descriptions. The paper further derives a one-step uncertainty update and proves that, under a first-order stochastic dominance assumption, higher-quality candidate views are more likely to reduce uncertainty (Zhang et al., 6 Jun 2026).

Memory management itself can also be embodiment-aware. The Spatially-Aware Transformer introduces place-centric episodic memory and an Adaptive Memory Allocator that chooses among deletion strategies such as FIFO, LIFO, Most-Visited-First-Out, and Least-Visited-First-Out. Rather than treating memory overflow as a purely temporal problem, it frames retention as a task-aware policy over place-indexed memories. This design improves performance under constrained capacity in place-centric reasoning, generation, and RL settings (Cho et al., 2024).
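The four deletion strategies can be compared on a toy place-indexed buffer. The sketch assumes that the two visit-based policies remove the oldest entry belonging to the most or least visited place, which is one reading of the strategy names. It implements only the strategies themselves; the allocator that chooses among them per task is not shown.

```python
from collections import Counter
from typing import List, Tuple

Entry = Tuple[float, str]  # (timestamp, place), stored oldest first


def evict_index(memory: List[Entry], policy: str) -> int:
    """Index of the entry to delete when the buffer is full."""
    if policy == "FIFO":
        return 0
    if policy == "LIFO":
        return len(memory) - 1
    visits = Counter(place for _, place in memory)
    if policy == "MVFO":
        target = max(visits, key=visits.get)   # most-visited place
    elif policy == "LVFO":
        target = min(visits, key=visits.get)   # least-visited place
    else:
        raise ValueError(f"unknown policy: {policy}")
    # Assumed tie-break within a place: drop its oldest entry.
    return next(i for i, (_, place) in enumerate(memory) if place == target)


memory = [(0, "B"), (1, "A"), (2, "B"), (3, "C"), (4, "A"), (5, "A")]
choices = {p: evict_index(memory, p) for p in ("FIFO", "LIFO", "MVFO", "LVFO")}
```

On this buffer the four policies pick four different entries, which is why the choice of policy can matter for tasks that depend on rarely visited places.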

At the multi-agent level, RoboOS-NeXT addresses persistence and consistency through a shared event-reduced state, with temporal memory implemented as an append-only queue of timestamped scene and embodiment deltas. Robot heartbeat updates maintain near-real-time capability and availability profiles, enabling dynamic reallocation when a robot becomes busy or offline (Tan et al., 30 Oct 2025).
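An event-reduced state can be read as a fold over the delta log: the current world state is whatever results from applying every delta in timestamp order. The sketch below shows that idea together with a heartbeat timeout check. The delta layout, the two-section state, and the timeout value are assumptions, not the system's actual schema.

```python
from functools import reduce
from typing import Any, Dict, List, Tuple

Delta = Tuple[float, str, str, Any]   # (timestamp, "scene" | "robots", key, new value)
State = Dict[str, Dict[str, Any]]


def apply_delta(state: State, delta: Delta) -> State:
    """Return a new state with one delta applied; the input is left untouched."""
    _, section, key, value = delta
    updated = {name: dict(values) for name, values in state.items()}
    updated[section][key] = value
    return updated


def reduce_state(initial: State, log: List[Delta]) -> State:
    """Fold the append-only log into the current shared state."""
    return reduce(apply_delta, sorted(log, key=lambda d: d[0]), initial)


def online(last_heartbeat: Dict[str, float], now: float, timeout: float = 2.0) -> List[str]:
    """Robots whose most recent heartbeat is within the timeout."""
    return sorted(r for r, t in last_heartbeat.items() if now - t <= timeout)


initial: State = {"scene": {"cup": "table"}, "robots": {"arm": "idle"}}
log: List[Delta] = [
    (1.0, "scene", "cup", "tray"),
    (2.0, "robots", "arm", "busy"),
    (3.0, "scene", "cup", "sink"),
]
state = reduce_state(initial, log)
alive = online({"arm": 9.5, "base": 6.0}, now=10.0)
```

Because the log is never rewritten, any past state can be recovered by folding a prefix of it.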

A recurring misconception is that long-context models alone solve long-horizon memory. VL-MemKnG and ReMEmbR argue otherwise in different ways: long-context VLMs remain computationally expensive for repeated querying, while graph-only retrieval can underrepresent broader temporal continuity. Their hybrid designs imply that efficient STEM often requires both structured relational memory and contextual episodic memory rather than a single monolithic context buffer (Lukina et al., 15 Jun 2026, Anwar et al., 2024).

5. Empirical evidence

Across benchmarks, explicit STEM-like memory improves long-horizon embodied performance relative to stateless, memory-light, or retrieval-only baselines.

RoboStream reports 90.5% average success on long-horizon RLBench and 44.4% on real-world block-building tasks; on the latter, SoFar and VoxPoser each score 11.1% (Huang et al., 13 Mar 2026). In hide-and-restore tasks, designed to test object permanence under occlusion, RoboStream-235B reaches 88.9% while SoFar and VoxPoser score 0%. Its ablation study further shows that removing both CSTG and STF yields 12.0% average success, removing CSTG alone yields 14.5%, removing STF alone yields 79.5%, and the full model achieves 90.5%, indicating that causal graph memory is the dominant contributor to long-horizon robustness (Huang et al., 13 Mar 2026).
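Expressed as drops from the full model, in percentage points of average success and using only the figures reported above (the $\Delta$ notation is introduced here for convenience):

$\Delta_{\text{CSTG}} = 90.5 - 14.5 = 76.0, \qquad \Delta_{\text{STF}} = 90.5 - 79.5 = 11.0, \qquad \Delta_{\text{both}} = 90.5 - 12.0 = 78.5.$

Removing the causal graph alone therefore costs roughly seven times as many points as removing the tokens alone ($76.0 / 11.0 \approx 6.9$), and accounts for nearly all of the loss seen when both are removed.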

STMA reports a 31.25% improvement in success rate and a 24.7% increase in average score over the state-of-the-art model on 32 TextWorld tasks. Its ablations show that removing spatio-temporal memory collapses performance, while removing the critic sharply hurts harder levels, supporting the joint importance of memory and closed-loop validation (Lei et al., 14 Feb 2025).

ReMEmbR demonstrates that retrieval-augmented memory scales better than directly prompting long videos. On a 21.5-minute video, its latency is about 25 seconds per question and remains relatively constant with video length, whereas a VLM baseline requires about 90 seconds per question for a shorter 5.5-minute video and cannot handle medium or long videos in the reported setup (Anwar et al., 2024). VL-MemKnG, which combines a spatio-temporal knowledge graph with segment-level contextual memory, improves Top-1 retrieval accuracy from 58% to 67% and Recall@1 from 34.50% to 40.55% on WalkieKnowledgeT+, with especially strong gains on temporal-global and temporally scattered aggregation questions (Lukina et al., 15 Jun 2026).

In embodied 3D planning, 3DLLM-Mem attains 37.6% average success rate in-domain and 32.1% in-the-wild on 3DMem-Bench embodied tasks, outperforming most-recent-memory and RAG baselines. On the most challenging in-the-wild embodied tasks, it exceeds the strongest baseline by 16.5% in success rate (Hu et al., 28 May 2025).

Uncertainty-aware memory also translates into better downstream reasoning. On OC-NaVQA, UQ-DAAAM improves descriptive QA accuracy to 0.761, reduces position error to 37.84 m, and reduces temporal error to 1.589 min, outperforming DAAAM and other uncertainty baselines. Its uncertainty reduction rate reaches 92.3% at the reported refinement budget, compared with 34.2% for quality-based refinement without uncertainty and 5.6% for random refinement (Zhang et al., 6 Jun 2026).

RoboOS-NeXT supplies multi-robot evidence. In its ablation, removing spatial memory yields SR 24.2 with AEST 58.1, removing temporal memory yields SR 38.3, and removing embodiment memory yields SR 0.0, whereas the full STEM achieves SR 89.2 and SS 7.69 in the tested household setting. Under failures such as robot offline, tool failure, and brain hallucination, the full framework preserves substantially higher success than a memory-less baseline (Tan et al., 30 Oct 2025).

6. Variants, misconceptions, and open problems

Current work shows no single dominant STEM architecture. Some systems are graph-centric, such as RoboStream’s CSTG, RoboOS-NeXT’s scene tree plus object graphs, STMA’s dynamic KG, and VL-MemKnG’s spatio-temporal knowledge graph (Huang et al., 13 Mar 2026, Tan et al., 30 Oct 2025, Lei et al., 14 Feb 2025, Lukina et al., 15 Jun 2026). Others are retrieval-centric, such as STAR’s tuple-indexed long-term memory and ReMEmbR’s vector database over captions, positions, and timestamps (Chen et al., 18 Nov 2025, Anwar et al., 2024). Others remain token-centric and dense, such as 3DLLM-Mem’s working and episodic 3D feature banks (Hu et al., 28 May 2025). This suggests that STEM is defined more by function than by implementation.

A second misconception is that embodiment need only mean action history. In some papers embodiment is implicit, as in RoboStream where actions are recorded through causal effects on object nodes, or STAR where embodied actions are treated as memory-access tools (Huang et al., 13 Mar 2026, Chen et al., 18 Nov 2025). In others, embodiment is explicit and persistent: RoboOS-NeXT stores robot capabilities, resources, and availability in its embodiment memory $E(t)$, and its ablations show that removing embodiment memory causes complete failure in the reported setting (Tan et al., 30 Oct 2025).

Open problems are consistent across the literature. RoboStream identifies the absence of fully end-to-end visual-language-action integration and notes the need to internalize spatio-temporal reasoning and causal memory within the action-generation loop (Huang et al., 13 Mar 2026). STAR highlights memory growth, compression, and explicit object identity tracking as unresolved issues, while also relying on prompt-based control rather than learned policies over the memory–action loop (Chen et al., 18 Nov 2025). ReMEmbR points to memory redundancy, lack of richer spatial structures such as scene graphs or semantic maps, and continued dependence on caption quality (Anwar et al., 2024). VL-MemKnG builds its graph offline and remains sensitive to caption errors, implying a need for incremental updating and richer relation schemas (Lukina et al., 15 Jun 2026). UQ-DAAAM shows that uncertainty over multi-view captions is tractable, but also notes that systematic VLM misrecognition can yield low uncertainty and still-wrong semantics, so semantic confidence is not yet equivalent to semantic correctness (Zhang et al., 6 Jun 2026). RoboOS-NeXT leaves distributed consistency, bandwidth constraints, and large-scale decentralized synchronization largely open (Tan et al., 30 Oct 2025).

A plausible synthesis is that mature STEM systems will need four properties simultaneously: persistent object- or scene-level grounding in metric or relational space; explicit temporal indexing of events and state transitions; embodiment-aware control interfaces that expose capabilities, costs, and actions; and mechanisms for selective retrieval, compression, and uncertainty-aware refinement. The recent literature establishes each ingredient separately and, in a growing number of cases, in combination.