GEM: Generating LiDAR World Model via Deformable Mamba

Published 8 May 2026 in cs.CV | (2605.07326v1)

Abstract: World models, which simulate environmental dynamics and generate sensor observations, are gaining increasing attention in autonomous driving. However, progress in LiDAR-based world models has lagged behind those built on camera videos or occupancy data, primarily due to two core challenges: the inherent disorder of LiDAR point clouds and the difficulty of distinguishing dynamic objects from static structures. To address these issues, we propose GEM: a Generative LiDAR world model that leverages a deformable Mamba architecture, significantly improving fidelity and imaginative capability. Specifically, leveraging the structural similarity between sequential laser scanning and Mamba's processing mechanism, we first tokenize LiDAR sweeps into compact representations via a custom LiDAR scene tokenizer. After unsupervised disentanglement of tokenized features via a dynamic-static separator, a tri-path deformable Mamba is introduced to perform selective scanning and adaptive gating fusion over the disentangled features, leading to enhanced spatial-temporal understanding of the world evolution. Optionally, a planner and a BEV layout controller can be integrated to explore the model's capability for autonomous rollout and its potential to generate "what-if" scenarios. Extensive experiments show that GEM achieves state-of-the-art performance across diverse benchmarks and evaluation settings, demonstrating its superiority and effectiveness. Project page: https://github.com/wuyang98/GEM.

Authors (8)

Summary

The paper introduces GEM, a generative LiDAR world model that uses a scene tokenizer and unsupervised dynamic-static disentanglement to advance future scene prediction.
It employs a tri-path deformable Mamba architecture combined with latent diffusion for controllable and efficient sequential generation with markedly improved Chamfer Distance.
Empirical evaluations on nuScenes and KITTI demonstrate significant gains in geometric fidelity and prediction stability, setting a new standard for LiDAR world modeling.

GEM: A Generative LiDAR World Model via Deformable Mamba

Introduction

The paper "GEM: Generating LiDAR World Model via Deformable Mamba" (2605.07326) addresses significant limitations of current LiDAR-based world models, especially compared to their camera-based and occupancy-based counterparts. The core challenges in LiDAR-centric world modeling are the unordered nature of point clouds and the need to separate dynamic entities from the static environment. The paper proposes GEM, a generative model that exploits the structural alignment between LiDAR’s sequential scanning process and the Mamba architecture. It introduces a tokenized and disentangled latent representation that enables state-of-the-art performance in future scene prediction, controllable generation, and autonomous rollout.

Core Methodological Advances

LiDAR Scene Tokenizer with Mamba Alignment

GEM addresses the representation bottleneck in LiDAR modeling with a scene tokenizer built around the Mamba sequence modeling architecture. Raw point clouds are projected onto range maps, which retain the sequential structure of laser scanning and are therefore well suited to Mamba's sequential processing. This stage yields a compact latent representation that reduces computational cost while preserving the geometric fidelity needed for downstream tasks.
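To make the range-map step concrete, here is a minimal sketch of a standard spherical projection from 3D points to a 2D range image. It is not the paper's tokenizer, which is learned. The function name `project_to_range_map`, the grid size, and the field-of-view defaults are all assumptions chosen for illustration.

```python
import math


def project_to_range_map(points, height=32, width=256,
                         fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project (x, y, z) points onto a height x width range image.

    Illustrative only: grid size and vertical field of view are assumed
    values, not settings taken from the paper. Empty cells hold 0.0.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    range_map = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # skip degenerate returns at the sensor origin
        yaw = math.atan2(y, x)                    # azimuth in [-pi, pi]
        pitch = math.atan2(z, math.hypot(x, y))   # elevation angle
        if not fov_down <= pitch <= fov_up:
            continue  # outside the assumed vertical field of view
        col = int(0.5 * (1.0 - yaw / math.pi) * width) % width
        row = min(height - 1, int((fov_up - pitch) / fov * height))
        # When several points fall into one cell, keep the nearest return.
        if range_map[row][col] == 0.0 or r < range_map[row][col]:
            range_map[row][col] = r
    return range_map


def flatten_row_major(range_map):
    """Flatten the range image into a 1D sequence, one row after another."""
    return [value for row in range_map for value in row]
```

Flattening the image row by row gives an ordered 1D sequence, which is the kind of input a sequence model such as Mamba consumes. How GEM actually orders and encodes the tokens is not specified in this summary.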

Unsupervised Dynamic-Static Disentanglement

A primary contribution is the explicit decomposition of latent features into dynamic and static components via an unsupervised mechanism. Rather than relying on annotated semantic cues, GEM utilizes temporal differences and averages at the feature level, exploiting the observation that dynamic entities manifest strong temporal variation, while static structures remain temporally invariant. This approach enhances the model's capacity for scene understanding without annotation overhead.
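The intuition behind temporal differences and averages can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's separator: each frame is a flat list of feature values, the static component is taken as the temporal mean, and the dynamic component as the mean absolute frame-to-frame difference. The function name `split_dynamic_static` is hypothetical.

```python
def split_dynamic_static(frames):
    """Split per-position features into dynamic and static parts.

    frames: list of T feature vectors (equal-length lists of floats).
    Assumed recipe for illustration only:
      static  = temporal mean of each position
      dynamic = mean absolute difference between consecutive frames
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to measure change")
    num_frames = len(frames)
    size = len(frames[0])
    if any(len(frame) != size for frame in frames):
        raise ValueError("all frames must have the same length")

    static = [sum(frame[i] for frame in frames) / num_frames
              for i in range(size)]
    dynamic = [
        sum(abs(frames[t][i] - frames[t - 1][i])
            for t in range(1, num_frames)) / (num_frames - 1)
        for i in range(size)
    ]
    return dynamic, static
```

A position whose feature never changes gets a dynamic score of zero and keeps its value in the static part, while a position that changes between frames gets a large dynamic score. No labels are involved, which matches the unsupervised framing above.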

Tri-Path Deformable Mamba Architecture

Building upon these disentangled features, GEM employs a tri-path deformable Mamba architecture. Each of the three paths—generic, dynamic, and static—is realized through branch-specific scan paths and deformable feature aggregation, tailored via learnable offset predictors. This structure enables localized modeling of dynamic objects, background preservation, and global context integration in parallel. An adaptive gated attention mechanism fuses these representations, yielding a coherent and discriminative latent space for generation.
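The fusion step can be pictured as a per-position weighted sum of the three path outputs. The sketch below assumes softmax-normalized gates and takes the gate logits as an input. In a real model a learned layer would produce them, and the paper's exact gating formulation may differ. The names `softmax` and `gated_fusion` are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(generic, dynamic, static, gate_logits):
    """Fuse three path outputs with per-position softmax gates.

    generic, dynamic, static: equal-length lists of floats, one per path.
    gate_logits: one [generic, dynamic, static] logit triple per position.
    Assumption: gates are softmax weights; here they are simply passed in,
    whereas a trained model would predict them from the features.
    """
    fused = []
    for g, d, s, logits in zip(generic, dynamic, static, gate_logits):
        w_generic, w_dynamic, w_static = softmax(logits)
        fused.append(w_generic * g + w_dynamic * d + w_static * s)
    return fused
```

With equal logits the result is the plain average of the three paths. Raising one logit lets that path dominate at that position, which is how a gate could favor the dynamic path on moving objects and the static path on background.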

Latent Diffusion and Controllable Generation

GEM’s architecture is integrated into a latent diffusion framework for sequential generation, providing robustness and allowing for autoregressive or externally controlled rollouts. A planner module optionally predicts future ego status, and the model supports conditional generation based on control vectors such as BEV layouts, enabling scenario-guided synthesis and counterfactual reasoning within the driving world.

Empirical Results

The model demonstrates state-of-the-art quantitative and qualitative performance on nuScenes and KITTI Odometry across both 1s and 3s prediction horizons. In particular, GEM achieves an 81.1% reduction in Chamfer Distance (CD) on nuScenes 1s forecasts relative to the next best method, and attains the best or near-best scores across multiple error and stability metrics. GEM also provides superior distributional fidelity as measured by Fréchet feature distances (FRID, FSVD, FPVD), JSD, and MMD on multiple datasets, outperforming leading methods on 4 out of 5 metrics on the large-scale KITTI-360 set. Inference is faster than strong baselines, and prediction stability improves through explicit disentangling and tri-path processing.
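For readers unfamiliar with the headline metric, the following is a brute-force sketch of a symmetric Chamfer Distance between two point sets. Conventions vary across papers (squared or unsquared distances, sum or mean of the two directions). This version sums the two mean nearest-neighbor Euclidean distances and is not necessarily the variant used in the paper's evaluation.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two point sets.

    points_a, points_b: non-empty lists of coordinate tuples.
    Convention assumed here: mean nearest-neighbor Euclidean distance
    from A to B plus the same from B to A. O(|A| * |B|) brute force.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def one_way(source, target):
        # Average distance from each source point to its closest target.
        return sum(min(math.dist(p, q) for q in target)
                   for p in source) / len(source)

    return one_way(points_a, points_b) + one_way(points_b, points_a)
```

Identical point sets give a distance of zero, and the value grows as the predicted cloud drifts from the ground truth, so lower is better. Production code would typically use a spatial index or a GPU kernel rather than this quadratic loop.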

Ablation Studies

Ablations validate all core contributions. The tri-path deformable Mamba consistently outperforms UNet, diffusion transformers, and non-deformable Mamba variants on all evaluation metrics. Removal of dynamic-static disentanglement or adaptive fusion mechanisms leads to substantial degradation in geometric accuracy and stability, supporting the necessity of each architectural pillar.

Practical and Theoretical Implications

By bridging the structural gap between LiDAR’s data acquisition mechanism and deep sequence architectures, GEM advances the paradigm of generative world models able to simulate diverse, high-fidelity driving scenarios with precise spatiotemporal resolution. The explicit dynamic-static separation results in improved interpretability and robustness, which are critical for downstream tasks such as trajectory planning, reactive and predictive control, and scenario-based closed-loop evaluation. The controllable generation and rollout capabilities address core limitations of prior works, paving the way for scalable safety validation, counterfactual scenario synthesis, and synthetic data augmentation.

From a theoretical perspective, the tri-path architecture demonstrates that leveraging physically grounded disentanglement within state-space models yields meaningful advances in both accuracy and controllability. The integration of diffusion-based sequential generation with unsupervised scene decomposition sets a template for future research in multimodal and multi-agent world modeling.

Future Directions

Further research can extend GEM to multi-modality by fusing camera and radar features, scaling to longer prediction horizons, and exploring finer control over scenario manipulation. The modular design facilitates integration with embodied AI agents and reinforcement learning pipelines for simulation-based policy optimization. Exploration of online continual learning and explicit uncertainty modeling within this architecture would further benefit deployment in dynamic operational design domains.

Conclusion

GEM establishes a new standard for LiDAR-based world models by combining structurally aligned tokenization, unsupervised dynamic-static disentanglement, and deformable sequential modeling via Mamba. It demonstrates compelling advantages in accuracy, interpretability, and controllable generation, positioning itself as a foundational framework for simulation and planning in autonomous driving and beyond.
