LIBERO-Mem: Non-Markovian Memory Benchmark
- LIBERO-Mem is a simulation benchmark for robotic manipulation that tests non-Markovian, object-centric memory in partially observable settings.
- It employs a suite of sequenced subgoal tasks with a bespoke evaluation protocol to measure memory performance and action tracking over extended time frames.
- Its accompanying baseline, Embodied-SlotSSM, combines Slot Attention with a slot-state-space model to sustain robust object identity and temporally extended memory representations.
LIBERO-Mem is a simulated, instruction-conditioned robotic manipulation benchmark designed to stress-test non-Markovian, object-centric memory requirements in partially observable environments. Unlike classical Markovian manipulation tasks where the next action is fully determined by the current scene and instruction, LIBERO-Mem introduces visually indistinguishable object instances whose correct disambiguation and manipulation depend on their hidden interaction histories. Agents must persistently perceive, track, and reason about objects and their episodic relations across hundreds of frames, requiring robust object-identity grounding and temporally extended memory representations. LIBERO-Mem operationalizes this challenge via a suite of sequenced subgoal tasks and a bespoke evaluation protocol, enabling rigorous investigation of memory architectures and temporally grounded visuomotor policies in the face of partial observability and semantic ambiguity (Chung et al., 14 Nov 2025).
1. Formal Structure and Memory State
LIBERO-Mem is formally expressed as a partially observable decision process augmented with object-centric memory. The environment at each timestep comprises:
- A hidden world state $s_t$, including object poses, visibility states, and per-object interaction counters or relational orders for each object in the scene.
- An observation $o_t$, delivered as RGB(D) images, potentially with instance masks/IDs available at training time only.
- A discrete action $a_t$ from the action space $\mathcal{A}$ (e.g., 3D gripper movement, grasp/release).
- A memory state $m_t$ encoding object-specific historical summaries, updated from $(m_{t-1}, o_t, a_{t-1})$.
The agent's decision process follows

$$a_t \sim \pi(\,\cdot \mid o_t, m_t, \ell\,), \qquad o_t = O(s_t),$$

where $\ell$ is the textual instruction and the observation map $O$ is many-to-one, enforcing that $o_t$ alone cannot recover $s_t$ due to visually identical objects. To address object-level ambiguity and maintain identity across occlusion or reordering, LIBERO-Mem employs slot representations via Slot Attention, binding slots to the top-$K$ object regions and rolling forward their hidden states $\{h_t^k\}_{k=1}^K$.
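To make the interaction loop concrete, the following is a minimal sketch of a memory-augmented agent acting in a LIBERO-Mem-style partially observable environment. The API names (`env.reset`, `env.step`, `slot_encoder`, `memory_update`, `policy`) are hypothetical placeholders, not the benchmark's actual interface.

```python
def run_episode(env, slot_encoder, memory_update, policy, instruction, max_steps=700):
    """Minimal sketch of a memory-augmented control loop (hypothetical API names).

    env           -- yields RGB(D) observations o_t; its hidden state s_t is never exposed
    slot_encoder  -- Slot Attention module mapping o_t to K object-centric slot embeddings
    memory_update -- recurrent update of per-slot memory m_t from (m_{t-1}, slots_t, a_{t-1})
    policy        -- action head conditioned on (o_t, m_t, instruction), never on s_t directly
    """
    obs = env.reset()
    memory = None           # m_0: no interaction history yet
    prev_action = None
    subgoal_flags = []      # per-timestep subgoal annotations, used for subgoal-aware evaluation

    for _ in range(max_steps):
        slots = slot_encoder(obs)                           # slot embeddings x_t^k for the K objects
        memory = memory_update(memory, slots, prev_action)  # m_t summarizes each object's history
        action = policy(obs, memory, instruction)           # a_t ~ pi(. | o_t, m_t, l)
        obs, subgoal_flag, done = env.step(action)
        subgoal_flags.append(subgoal_flag)
        prev_action = action
        if done:
            break
    return subgoal_flags
```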
2. Task Suite and Memory Dimensions
LIBERO-Mem consists of ten distinct, instruction-conditioned manipulation tasks organized by memory-relevant dimensions:
| Task | Class | Description |
|---|---|---|
| T1 | Object Motion | Pick up a bowl and place on its plate (short-horizon, ≈200 frames) |
| T2 | Object Motion | Pick up a bottle and place on plate |
| T3 | Object Sequence | Lift and place bowl on plate 3× (subgoal: lift→place ×3) |
| T4 | Object Sequence | Lift and place bottle on plate 3× |
| T5 | Object Sequence | Lift and place bowl 5× |
| T6 | Object Sequence | Lift and place bowl 7× |
| T7 | Object Relations | Swap Bowl 1 and Bowl 2 using empty plate |
| T8 | Object Relations | Cyclically swap 3 bowls using one vacant plate |
| T9 | Occlusion | Put bowl in nearest basket, then move that basket (now occupied) to the center |
| T10 | Occlusion | Put bowl in basket, then move the empty basket to the center |
T1–T2 test short-horizon object recall. T3–T6 escalate temporal length and sequence tracking (up to ≈700 frames). T7–T10 introduce relational dependencies and occlusion, where only object memory—rather than instantaneous vision—can resolve ambiguities. Subgoal flags are provided for every timestep, enabling detailed subgoal-aware evaluation.
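As an illustration of how such sequenced subgoal tasks could be specified, the sketch below defines a hypothetical task record for T3. The `TaskSpec` structure, its field names, and the instruction wording are assumptions for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TaskSpec:
    """Hypothetical task record; field names are illustrative, not the benchmark's actual format."""
    task_id: str
    memory_class: str    # "Object Motion", "Object Sequence", "Object Relations", or "Occlusion"
    instruction: str
    subgoals: List[str]  # ordered subgoals; a completion flag is annotated at every timestep

# T3: lift and place the bowl on its plate three times (subgoal pattern: lift -> place, x3)
t3 = TaskSpec(
    task_id="T3",
    memory_class="Object Sequence",
    instruction="Lift the bowl and place it on the plate three times.",
    subgoals=[
        "lift_bowl_1", "place_bowl_1",
        "lift_bowl_2", "place_bowl_2",
        "lift_bowl_3", "place_bowl_3",
    ],
)
```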
3. Distinguishing Design Principles
LIBERO-Mem diverges from prior benchmarks by:
- Explicitly contravening the Markov property: Past LIBERO benchmarks [NeurIPS 2023] evaluate on Markovian tasks resolvable by frame-level policies; LIBERO-Mem’s visually identical assets and history-dependent subgoals preclude such solutions (a toy illustration follows this list).
- Object-level partial observability: Objects share pixel statistics but differ only by their cumulative interaction histories.
- Subgoal ambiguity: Correct actions hinge on the cumulative count or ordering of prior slot interactions.
- Temporal scaling: Tasks require propagation of identity and action traces across hundreds of steps.
- Subgoal-aware annotation: Framewise subgoal flags allow precise, structured evaluation of temporal and relational dependencies.
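The toy example below illustrates the non-Markovian stressor: two pixel-identical bowls that differ only in their hidden interaction counts. The object names and policy functions are invented for illustration and are not part of the benchmark.

```python
# Toy illustration (not benchmark code): two visually identical bowls whose correct
# handling depends only on how many times each has already been manipulated.

def markovian_policy(observation):
    # A frame-level policy sees two pixel-identical bowls under the same instruction,
    # so it must emit the same action regardless of history -- it cannot satisfy
    # "manipulate the bowl that has already been placed twice".
    return "pick_leftmost_bowl"

def memory_policy(observation, interaction_counts):
    # A history-aware policy disambiguates via per-object interaction counters.
    target = max(interaction_counts, key=interaction_counts.get)
    return f"pick_{target}"

counts = {"bowl_A": 2, "bowl_B": 0}     # hidden, history-dependent state
obs = "two_identical_bowls"             # what the camera actually shows

print(markovian_policy(obs))            # same output no matter the history
print(memory_policy(obs, counts))       # pick_bowl_A -- resolvable only via memory
```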
Comparison with other benchmarks is summarized as follows:
| Benchmark | Non-Markovian Obj. | Long Horizons | Subgoal-aware Eval | Identity Ambiguity | Temporal Scaling |
|---|---|---|---|---|---|
| LIBERO-Mem | ✓ | ✓ | ✓ | ✓ | ✓ |
| MemoryBench (2025) | ✓ | ✗ | ✓ | ✗ | ✗ |
| MIKASA-Robo (2025) | ✗ | ✗ | ✗ | ✗ | ✗ |
| LIBERO (2023) | ✗ | ✓ | ✗ | ✗ | ✗ |
| RLBench (2020) | ✗ | ✓ | ✗ | ✗ | ✗ |
Only LIBERO-Mem provides all of these stressors simultaneously (Chung et al., 14 Nov 2025).
4. Embodied-SlotSSM Architecture
The principal baseline is Embodied-SlotSSM, a scalable vision-language-action (VLA) architecture designed to operate under the temporal and object-centric pressures of LIBERO-Mem. Its design includes:
- Slot-State-Space Model: Per-object slot hidden states $h_t^k$ are updated via a learned state-space recurrence $h_t^k = A\,h_{t-1}^k + B\,x_t^k$, where $x_t^k$ is the Slot Attention-derived slot embedding and $A, B$ are block-diagonal over slots. Slot hidden states persist over time, summarizing short-term dynamics and facilitating long-term identity carry-over (see the sketch after this list).
- Relational Encoder: To fuse memory with current perception, relational tokens $r_t$ are generated by cross-attending between the slot-fused features $\{h_t^k\}$ and the backbone output $z_t$. The policy then predicts actions as $a_t \sim \pi(a_t \mid r_t, \ell)$.
- Temporal Scalability: Only the slot-aggregated memory is propagated across steps, avoiding the intractable token growth faced by vanilla VLA models even for task horizons of several hundred frames.
These mechanisms permit context-aware, temporally grounded action decoding under spatio-temporally ambiguous conditions.
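The following is a minimal, illustrative sketch in PyTorch of how a slot-state-space update and relational cross-attention could be composed into an action head. Module names, dimensions, instruction handling, and the exact recurrence are assumptions for exposition, not the authors' released Embodied-SlotSSM implementation.

```python
import torch
import torch.nn as nn

class SlotSSMPolicy(nn.Module):
    """Illustrative slot-state-space policy head (not the official Embodied-SlotSSM code)."""

    def __init__(self, num_slots=7, slot_dim=64, backbone_dim=256, num_actions=7):
        super().__init__()
        # Per-slot linear state-space parameters; the block-diagonal structure over slots
        # is emulated here by applying shared A, B to each slot independently.
        self.A = nn.Linear(slot_dim, slot_dim, bias=False)
        self.B = nn.Linear(slot_dim, slot_dim, bias=False)
        # Relational encoder: slot-fused memory cross-attends to current backbone features.
        self.cross_attn = nn.MultiheadAttention(embed_dim=slot_dim, num_heads=4, batch_first=True)
        self.backbone_proj = nn.Linear(backbone_dim, slot_dim)
        self.action_head = nn.Linear(slot_dim, num_actions)

    def step(self, h_prev, slot_emb, backbone_feats):
        """
        h_prev:         (B, K, slot_dim)      previous slot hidden states h_{t-1}^k
        slot_emb:       (B, K, slot_dim)      Slot Attention embeddings x_t^k for the current frame
        backbone_feats: (B, N, backbone_dim)  visual backbone tokens z_t
        """
        # Slot-state-space recurrence: h_t^k = A h_{t-1}^k + B x_t^k, applied slot-wise
        h_t = self.A(h_prev) + self.B(slot_emb)

        # Relational tokens r_t: slot memory queries the current backbone features
        z = self.backbone_proj(backbone_feats)
        r_t, _ = self.cross_attn(query=h_t, key=z, value=z)

        # Action logits from pooled relational tokens (instruction conditioning omitted here)
        logits = self.action_head(r_t.mean(dim=1))
        return h_t, logits

# Only h_t (B, K, slot_dim) is carried across timesteps, so memory cost stays constant
# in episode length instead of growing with every past frame's tokens.
policy = SlotSSMPolicy()
h = torch.zeros(1, 7, 64)
slots = torch.randn(1, 7, 64)
feats = torch.randn(1, 196, 256)
h, action_logits = policy.step(h, slots, feats)
```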
5. Evaluation Protocol and Empirical Results
Performance on LIBERO-Mem is assessed using the subgoal completion ratio: the fraction of all annotated subgoals completed, averaged over 20 random seeds. For Markovian LIBERO-Goal tasks, the protocol instead uses overall task success rate.
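As a concrete reading of this metric, the snippet below computes a subgoal completion ratio from per-seed subgoal outcomes. It is a sketch under an assumed data layout; the benchmark's actual evaluation script may aggregate differently.

```python
import numpy as np

def subgoal_completion_ratio(episodes):
    """
    episodes: list of per-seed rollouts, each a list of booleans marking whether each
              annotated subgoal of the task was completed (assumed data layout).
    Returns the fraction of all annotated subgoals completed across seeds.
    """
    completed = sum(sum(flags) for flags in episodes)
    total = sum(len(flags) for flags in episodes)
    return completed / total if total else 0.0

# Example: a task with 6 annotated subgoals, evaluated over 20 random seeds (synthetic outcomes)
rng = np.random.default_rng(0)
episodes = [list(rng.random(6) < 0.3) for _ in range(20)]
print(f"subgoal completion ratio: {subgoal_completion_ratio(episodes):.1%}")
```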
Key results include:
- On LIBERO-Goal tasks, naive Embodied-SlotSSM achieves 80.1% success; SlotVLA achieves 66.5% (8-frame horizon) and 32% (1-frame horizon).
- On LIBERO-Mem, memory-less policies (a memory-free VLA baseline with 256 tokens, and SlotVLA at 1- or 8-frame horizons) achieve only 5.0% mean subgoal coverage; naive Embodied-SlotSSM attains 14.8%, including 50% on T1 and 33.3% on T3.
- Only LIBERO-Mem produces compound failure modes in the absence of history-aware representations: policies frequently repeat or skip actions, and fail to track which object or location is associated with each subgoal.
Empirically, end-to-end VLA policies that ignore memory collapse to Markovian “shortcut” policies that are highly brittle under ambiguous or temporally extended scenarios.
6. Relevance and Context in Current VLA Research
LIBERO-Mem directly addresses critical limitations in current VLA evaluation, as highlighted in recent work on rote memorization (Zhou et al., 4 Oct 2025). Standard LIBERO and related benchmarks allow trivial memorization of action sequences due to minimal variation between train/test splits. LIBERO-Mem’s object-centric, history-dependent design defeats such memorization: action mapping is non-injective in the space of observations, and state/action selection must be conditioned on persistent, identity-resolved interaction traces.
This suggests that progress measured on LIBERO-Mem can more faithfully capture advances in temporally and semantically grounded VLA architectures, providing new axes for diagnosis and ablation of memory representations, relational encoders, and slot-based policy mechanisms. A plausible implication is that architectural improvements validated on LIBERO-Mem are more likely to generalize to real-world scenarios exhibiting visual ambiguity, long-term dependencies, and partial observability.
7. Core Mathematical Formulations
The essential equations governing the slot-state-space and relational encoding procedures in LIBERO-Mem are as follows:
Slot-State-Space Model:

$$h_t^k = A\,h_{t-1}^k + B\,x_t^k,$$

with $x_t^k$ as slot embeddings and $A, B$ block-diagonal per object.
Relational Encoder and Action Distribution:

$$r_t = \mathrm{CrossAttn}\big(\{h_t^k\},\, z_t\big), \qquad a_t \sim \pi(a_t \mid r_t, \ell),$$

where $r_t$ are the relational tokens constructed via cross-attention between the slot features $\{h_t^k\}$ and the visual backbone output $z_t$.
These formulations instantiate slot-centric, temporally robust memory mechanisms scalable to long temporal horizons and object-heavy scenes, enabling LIBERO-Mem to establish a uniquely diagnostic benchmark for memory-driven manipulation research (Chung et al., 14 Nov 2025).