Long-Context State-Space Video World Models

Published 26 May 2025 in cs.CV | (2505.20171v1)

Abstract: Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper introduces a novel block-wise scanning state-space architecture to maintain extended temporal video memory in video world models.
It integrates frame-local attention with diffusion forcing techniques to generate high-fidelity frames with consistent long-term memory.
Experimental results on Memory Maze and Minecraft benchmarks demonstrate superior performance in PSNR, SSIM, and LPIPS while maintaining constant inference time.

Long-Term Memory in State-Space Video World Models

Introduction

The paper "Long-Context State-Space Video World Models" addresses the challenge of maintaining long-term memory in video world models, especially in the context of interactive applications where real-time, infinite-length video generation is essential. Traditional video diffusion models encounter limitations due to the high computational costs associated with processing extended sequences and short attention memory spans, leading to inconsistent environments during interactive simulations. This research proposes a novel architecture leveraging state-space models (SSMs) to overcome such limitations while ensuring computational efficiency.

Proposed Methodology

State-Space Models with Block-Wise Scanning

The core innovation in this approach is a novel architecture design exploiting SSMs through a block-wise scanning strategy. By dividing spatial dimensions into blocks of size $(b_h, b_w, T)$ , the model strategically balances temporal memory and spatial coherence. This method contrasts with the existing models where all temporal tokens become distant in a linear scan, affecting long-term dependencies significantly. The block-wise scanning enhances expressivity and ensures each block handles temporal sequences efficiently, preserving more extended memory with constant inference time (Figure 1).

Figure 1: The architecture illustrates usage of SSM scanning in blocks, promoting extended temporal memory with spatial coherence.

Enhanced Attention Mechanisms

The study incorporates frame-local attention alongside SSMs, enabling bidirectional processing within individual frames while maintaining causal relationships across previous frames. This mixed approach helps in preserving high fidelity in frame generation and ensures temporal consistency (Figure 2).

Figure 2: An illustration of the improved training mechanism. Noised frames being trained against clean context frames to leverage long-term dependencies.

Long-Context Training and Inference

The training process addresses typical shortcomings of diffusion models capturing long-term dependencies by mixing standard diffusion forcing with an innovative approach—maintaining a random-length prefix of frames as clean context ensuring exposure to diverse noise levels across frames. This method not only enhances long-context learning but provides efficient inference with consistent memory usage by maintaining a tracking state for previously generated frames.

Experimental Evaluation

Experiments conducted over the Memory Maze and Minecraft datasets demonstrate the proposed method significantly surpasses baseline models like Mamba2 and traditional causal models in terms of maintaining temporal memory and offering high fidelity to ground truth frames. The spatial retrieval task, which demands accurate trajectory backtracking, showed the model's ability to maintain long-term memory efficiently (Figures 5 and 7).

Figure 3: High fidelity to ground truth in retrieval tasks, showcasing our model's superior memory over extended frames.

Figure 4: Consistent PSNR performance over increasing frame distances, indicating the model's robust long-term memory retention compared to limited context models.

Moreover, the proposed method demonstrates efficient task solving in speciated reasoning tasks, where it outperformed limited context causal transformers and maintained high statistical similarity metrics like SSIM, LPIPS, and PSNR even across challenging trajectories.

Performance and Scalability

Evaluation metrics of training and inference costs highlight the computational efficiency of the approach. By leveraging block-wise SSMs, the model achieves a balance between performance and resource consumption (Figure 5), with linear training complexity and constant inference time—the ideal scenario for interactive applications needing infinite rollouts.

Figure 5: Our model maintains consistent memory efficiency and computational time, clearly outperforming baseline methods concerning scaling.

Conclusion

This paper proposes a robust architecture advancing the performance of video world models by effectively managing long-term memory without incurring prohibitive computational costs. By introducing block-wise SSM and enhanced diffusion forcing techniques, it offers a comprehensive solution to issues with current state-of-the-art models. Future research could focus on higher resolution video applications and further optimizations to support interactive real-time frame generation.

Ultimately, "Long-Context State-Space Video World Models" offers a novel direction in addressing the resource and memory constraints of video diffusion models, laying groundwork that could benefit applications across gaming, simulations, and other interactive media technologies.

Markdown Report Issue