Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation

Published 17 Jun 2026 in cs.CV and cs.RO | (2606.18960v2)

Abstract: Action-conditioned world models have emerged as a promising paradigm for robot learning, offering a scalable alternative to costly real-world experimentation by generating action-consistent video rollouts. However, persistent world modeling remains challenging in manipulation: frequent end-effector occlusions and rapid wrist-camera motion make the current observation insufficient for predicting future views, causing models to forget or hallucinate scene details seen in earlier frames. Existing memory retrieval strategies often fail to identify informative history in dynamic manipulation scenarios. To address this limitation, we propose Mem-World, a memory-augmented multi-view action-conditioned world model. At its core, we present W-VMem, a 4D wrist-view-centered surfel-indexed memory that anchors historical observations to temporally evolving surface elements. By explicitly modeling when and where scene elements are observed, W-VMem enables geometry-aware retrieval of relevant history frames conditioned on future actions. During generation, relevant history frames are selected via surfel-based rendering and scoring, providing informative and non-redundant context for prediction. Extensive experiments show that Mem-World generates persistent rollouts in complex manipulation scenarios, enables more reliable policy evaluation than Ctrl-World, improving the Pearson correlation with real-world performance by 14.5%, and supports effective policy improvement through synthetic data generation, increasing success rates from 58% to 72% on long-horizon tasks.

Authors (10)

Summary

The paper introduces a novel 4D wrist-view surfel-indexed memory system (W-VMem) that improves temporal consistency in robot manipulation tasks.
It leverages geometry-aware retrieval conditioned on future actions to generate realistic synthetic rollouts, outperforming baseline methods in occlusion handling.
Success rates in Mem-World rollouts correlate strongly with real-world policy performance (Pearson r = 0.97), and post-training on synthetic rollouts raises long-horizon task success rates from 58% to 72%.

Introduction

Action-conditioned world models have become central to robot learning and long-horizon policy development. Mem-World addresses two shortcomings of prior approaches, weak temporal persistence and poor occlusion handling, which are most acute in manipulation tasks where wrist-mounted cameras undergo frequent egocentric motion and occlusion. The core innovation is W-VMem, a 4D wrist-view-centered surfel-indexed memory that links historical visual evidence to temporally evolving surface elements. This allows geometry-aware retrieval of relevant context frames conditioned on future actions, which stabilizes predicted scene content across extended, complex trajectories. As a result, Mem-World supports more reliable policy evaluation, improves temporal consistency in synthetic rollouts, and enables policy improvement through generated data.

Figure 1: Pipeline of Mem-World, illustrating wrist-view 4D surfel memory maintenance, retrieval of geometry-aware history frames, action-conditioned video prediction, and iterative memory updates.

Methodology

Mem-World consists of two tightly coupled modules: the W-VMem surfel-indexed memory and an action-conditioned multi-view world model. The system is initialized with a geometric reconstruction from multi-view inputs, produced by a point map estimator, which yields surfels annotated with position, orientation, radius, temporal markers, and task relevance flags. As manipulation unfolds, W-VMem is updated only from wrist-view frames, which keeps the memory focused on manipulation-centric context and preserves temporal associations. Future wrist-camera poses are computed from joint actions via kinematics, and surfels are rendered in the coordinate frame of the reference point cloud.
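To make the surfel record described above concrete, the following is a minimal sketch of a surfel-indexed memory, not the authors' implementation. The names Surfel, SurfelMemory, seen_in, and frame_votes are hypothetical, and storing the temporal markers as a list of frame indices is an assumption made for illustration.

```python
from dataclasses import dataclass, field

Vec3 = tuple[float, float, float]


@dataclass
class Surfel:
    """One surface element with the attributes listed in the text."""
    position: Vec3
    normal: Vec3                  # orientation
    radius: float
    task_relevant: bool = False   # task relevance flag
    # Temporal markers: indices of wrist-view frames that observed this surfel
    # (this representation is an assumption, not taken from the paper).
    seen_in: list[int] = field(default_factory=list)


class SurfelMemory:
    """Hypothetical container that links surfels to the frames that saw them."""

    def __init__(self) -> None:
        self.surfels: list[Surfel] = []

    def add(self, surfel: Surfel) -> int:
        """Store a surfel and return its integer ID."""
        self.surfels.append(surfel)
        return len(self.surfels) - 1

    def observe(self, surfel_id: int, frame_idx: int) -> None:
        """Record that a wrist-view frame observed the given surfel."""
        seen = self.surfels[surfel_id].seen_in
        if frame_idx not in seen:
            seen.append(frame_idx)

    def frame_votes(self, surfel_ids) -> dict[int, int]:
        """Count, per history frame, how many of the given surfels it observed."""
        votes: dict[int, int] = {}
        for sid in surfel_ids:
            for frame_idx in self.surfels[sid].seen_in:
                votes[frame_idx] = votes.get(frame_idx, 0) + 1
        return votes
```

Given the IDs of surfels that fall inside a predicted future wrist view, frame_votes returns how many of them each history frame covers, which is one plausible starting point for ranking candidate frames.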

For action-conditioned rollout, the world model conditions its predictions on top-K retrieved historical frames, scored by geometric visibility (alignment and depth), binary manipulation relevance (object-centric flag), and temporal recency decay. Non-maximum suppression ensures broad coverage and reduces redundancy. This retrieval and conditioning regime enables future prediction to remain grounded in pertinent, non-stale visual evidence, even when the current frame suffers from occlusion or incomplete information.
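The retrieval step can be sketched as follows. The summary names the three scoring terms and the use of non-maximum suppression but not how they are combined, so the multiplicative score, the exponential decay, the overlap measure (intersection over union of visible surfel sets), and all parameter values here are assumptions; Candidate, score, and retrieve_top_k are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    frame_idx: int
    visible_surfels: frozenset   # IDs of this frame's surfels visible in the target view
    visibility: float            # in [0, 1]; stands in for alignment and depth agreement
    relevant: bool               # binary manipulation-relevance flag


def score(c: Candidate, current_idx: int,
          relevance_bonus: float = 0.5, decay: float = 0.05) -> float:
    """Assumed combination: visibility x relevance boost x exponential recency decay."""
    age = current_idx - c.frame_idx
    boost = 1.0 + (relevance_bonus if c.relevant else 0.0)
    return c.visibility * boost * math.exp(-decay * age)


def iou(a: frozenset, b: frozenset) -> float:
    """Intersection over union of two surfel-ID sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def retrieve_top_k(cands, current_idx: int, k: int = 3,
                   iou_threshold: float = 0.5) -> list:
    """Greedy non-maximum suppression: take frames in descending score order,
    skipping any frame whose visible surfels overlap too much with a kept frame."""
    ranked = sorted(cands, key=lambda c: score(c, current_idx), reverse=True)
    selected = []
    for c in ranked:
        if all(iou(c.visible_surfels, s.visible_surfels) <= iou_threshold
               for s in selected):
            selected.append(c)
        if len(selected) == k:
            break
    return [c.frame_idx for c in selected]
```

In this sketch, two recent frames that see nearly the same surfels compete for a single slot, which leaves room for an older frame covering a different part of the scene.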

Experimental Evaluation

Temporal Consistency and Policy Correlation

Mem-World was evaluated on the DROID dataset using replayed trajectories with high occlusion rates and complex wrist-camera motion. Qualitative comparisons, computed image metrics (PSNR, SSIM, LPIPS), and a model-based metric (DINOv2 object feature similarity) all indicate that Mem-World outperforms prior models in multi-view prediction. The gains are largest for wrist-view cameras, where prior models struggled to keep object appearance persistent and tended to hallucinate or forget scene content.
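Of the image metrics listed, PSNR is the simplest to state: it is 10 log10(MAX^2 / MSE), where MSE is the mean squared pixel error and MAX is the largest possible pixel value. The sketch below applies that standard definition to flat lists of 8-bit pixel values; the function name psnr and the flat-list input are illustrative choices, not the paper's evaluation code.

```python
import math


def psnr(pred, target, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf   # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher PSNR means the predicted frame is closer to the ground-truth frame at the pixel level. SSIM and LPIPS capture structural and perceptual similarity respectively and are not reproduced here.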

Figure 2: Mem-World demonstrates persistent scene modeling; occluded objects remain recoverable and reappear consistently after camera motion.

Ablation studies isolating memory retrieval confirm that geometry-aware strategies, specifically surfel-indexed retrieval, substantially outperform short-term and stride-based context sampling. The latter approaches frequently result in object distortion and loss during extended rollouts, as evidenced by lower metrics and qualitative degradation.
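For contrast with surfel-indexed retrieval, the two baseline context strategies can be sketched as plain index selection. The exact definitions used in the ablation are not given in this summary, so the versions below (the K most recent frames, and K frames spaced by a fixed stride) are assumptions, and the function names are hypothetical.

```python
def short_term_context(t: int, k: int) -> list:
    """Indices of the k most recent frames before time step t."""
    return list(range(max(0, t - k), t))


def stride_context(t: int, k: int, stride: int) -> list:
    """Indices of up to k frames before t, spaced `stride` steps apart."""
    idxs = [t - 1 - i * stride for i in range(k)]
    return sorted(i for i in idxs if i >= 0)
```

Both selectors depend only on time indices. If an object was last seen clearly at a frame that neither rule happens to pick, it is absent from the context regardless of how relevant it is to the upcoming view, which is consistent with the object loss reported for these baselines.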

Figure 3: W-VMem geometric retrieval preserves object consistency in future frames, outperforming stride and short-term memory in preventing object disappearance or distortion.

Policy Evaluation and Improvement via Synthetic Rollouts

Correlation analysis shows a strong linear relationship ($r=0.97$) between policy success rates observed in Mem-World simulated rollouts and in real-world executions, outperforming Ctrl-World ($r=0.85$), especially on long-horizon tasks involving complex object-target configurations and loop-like wrist motions. Rollouts in Mem-World reflect realistic policy capabilities and failures, indicating its value as a reliable simulator for policy development and evaluation.
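The Pearson coefficient $r$ reported above measures how linearly two sets of paired values track each other. A minimal sketch of the standard computation follows. The function name pearson_r is illustrative, and the success rates in the usage lines are made-up placeholders, not data from the paper.

```python
import math


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two paired sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-task success rates (placeholders, not results from the paper).
simulated = [0.20, 0.45, 0.60, 0.80]
real = [0.25, 0.40, 0.65, 0.75]
r = pearson_r(simulated, real)
```

A value of $r$ near 1 means that policies ranked higher in simulated rollouts are also ranked higher, by a proportional margin, in real-world trials.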

Figure 4: High quantitative correlation between Mem-World rollouts and real-world policy performance.

Post-training policies on successful synthetic trajectories generated by Mem-World raises real-world success rates from 58% to 72% on challenging long-horizon manipulation tasks, supporting its use as a data engine for policy improvement.

Figure 5: Comparisons of $\pi_{0.5}$ rollouts across real-world, Ctrl-World, and Mem-World, highlighting superior persistence and object recovery in Mem-World.

Implications and Future Directions

Mem-World sets a strong reference point for memory-augmented robot world models, notably in its ability to preserve task-relevant information across severe occlusions, rapid camera changes, and dynamic interaction with objects. Its handling of manipulation-centric wrist views enables more faithful policy simulation and evaluation, which in turn supports safer policy iteration and deployment. The geometry-aware surfel-indexed memory may also generalize to other domains with high temporal dynamics and occlusions, although this is not evaluated here.

However, limitations remain. Mem-World relies heavily on the informativeness of initial wrist-view frames and lacks explicit physics constraints, occasionally yielding physically inconsistent predictions (e.g., impossible grasps). Integrating physics-aware signals or multi-sensory input could further enhance fidelity. Improving multi-view consistency to leverage complementary viewpoints more effectively is another promising direction, potentially mitigating the information loss due to initial occlusions.

Conclusion

Mem-World advances action-conditioned world modeling for persistent, long-horizon robot manipulation by introducing a manipulation-centric memory system rooted in 4D surfel indexing and geometry-aware retrieval. Empirical results demonstrate marked improvements in temporal consistency, policy evaluation reliability, and policy performance enhancement through synthetic data. The approach underscores the necessity and utility of memory augmentation and geometric anchoring in world models for complex robotic manipulation.