- The paper introduces GEM, a generative LiDAR world model that uses a scene tokenizer and unsupervised dynamic-static disentanglement to advance future scene prediction.
- It employs a tri-path deformable Mamba architecture combined with latent diffusion for controllable and efficient sequential generation with markedly improved Chamfer Distance.
- Empirical evaluations on nuScenes and KITTI demonstrate significant gains in geometric fidelity and prediction stability, setting a new standard for LiDAR world modeling.
Introduction
The paper "GEM: Generating LiDAR World Model via Deformable Mamba" (2605.07326) addresses limitations of current LiDAR-based world models, particularly relative to their camera-based and occupancy-based counterparts. The core challenges in LiDAR-centric world modeling are handling the unstructured nature of point clouds and precisely separating dynamic entities from the static environment. The paper proposes GEM, a generative model that exploits the structural alignment between LiDAR's sequential scanning process and the Mamba architecture. It introduces a tokenized and disentangled latent representation that enables state-of-the-art performance in future scene prediction, controllable generation, and autonomous rollout.
Core Methodological Advances
LiDAR Scene Tokenizer with Mamba Alignment
GEM addresses the representation bottleneck in LiDAR modeling with a scene tokenizer built on the Mamba sequence modeling architecture. Raw point clouds are projected onto range maps, which are well suited to Mamba's sequential processing. This stage yields a compact latent representation that reduces computational cost while preserving the geometric fidelity needed for downstream tasks.
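To make the range-map step concrete, the following is a minimal sketch of a standard spherical projection, written in plain Python for portability. The function name, image size, field-of-view values, and the "nearest point wins" rule are illustrative assumptions, not details taken from the paper.

```python
import math

def project_to_range_map(points, height=4, width=8,
                         fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project (x, y, z) points onto a height x width range image.

    Empty pixels hold 0.0. When several points fall into one pixel,
    the nearest one is kept (an assumption made for this sketch).
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down
    range_map = [[0.0] * width for _ in range(height)]
    for x, y, z in points:
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction
        yaw = math.atan2(y, x)    # azimuth in [-pi, pi]
        pitch = math.asin(z / r)  # elevation angle
        if not (fov_down <= pitch <= fov_up):
            continue  # outside the vertical field of view
        # Azimuth maps to a column, elevation maps to a row (top row = fov_up).
        col = int(0.5 * (1.0 - yaw / math.pi) * width) % width
        row = min(height - 1, int((fov_up - pitch) / fov * height))
        if range_map[row][col] == 0.0 or r < range_map[row][col]:
            range_map[row][col] = r
    return range_map
```

Each row of the resulting image corresponds to a band of elevation angles and each column to a band of azimuth angles, so reading the image in scan order gives a sequence that a sequence model can consume.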
Unsupervised Dynamic-Static Disentanglement
A primary contribution is the explicit decomposition of latent features into dynamic and static components via an unsupervised mechanism. Rather than relying on annotated semantic cues, GEM utilizes temporal differences and averages at the feature level, exploiting the observation that dynamic entities manifest strong temporal variation, while static structures remain temporally invariant. This approach enhances the model's capacity for scene understanding without annotation overhead.
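The idea of using temporal averages and differences can be illustrated with a small sketch. It assumes each frame is a flat feature vector and uses the temporal mean as the static cue and the mean absolute frame-to-frame difference as the dynamic cue. The function name and these exact statistics are assumptions for illustration; the paper's actual formulation may differ.

```python
def split_dynamic_static(frames):
    """Split a sequence of feature vectors into static and dynamic cues.

    frames: list of T equally sized lists of floats, with T >= 2.
    Returns (static, dynamic), where
      static[i]  = temporal average of feature i
      dynamic[i] = mean absolute difference of feature i between
                   consecutive frames
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    t = len(frames)
    n = len(frames[0])
    static = [sum(f[i] for f in frames) / t for i in range(n)]
    dynamic = [
        sum(abs(frames[k + 1][i] - frames[k][i]) for k in range(t - 1)) / (t - 1)
        for i in range(n)
    ]
    return static, dynamic
```

A feature that never changes gets a dynamic score of zero and is fully described by its average, while a feature that changes every frame gets a large dynamic score. No labels are involved at any point.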
Building upon these disentangled features, GEM employs a tri-path deformable Mamba architecture. Each of the three paths (generic, dynamic, and static) is realized through branch-specific scan paths and deformable feature aggregation, tailored via learnable offset predictors. This structure enables localized modeling of dynamic objects, background preservation, and global context integration in parallel. An adaptive gated attention mechanism fuses these representations, yielding a coherent and discriminative latent space for generation.
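Two ingredients of this design, offset-based sampling and gated fusion, can be sketched in a simplified 1D setting. In the sketch below, `deformable_sample` and `gated_fusion` are hypothetical names, the offsets and gate logits are passed in by hand rather than predicted by learned networks, and a plain softmax gate stands in for the paper's adaptive gated attention.

```python
import math

def deformable_sample(sequence, offsets):
    """Sample a 1D feature sequence at positions i + offsets[i].

    Fractional positions use linear interpolation, and positions are
    clamped to the valid index range.
    """
    n = len(sequence)
    out = []
    for i, off in enumerate(offsets):
        pos = min(max(i + off, 0.0), n - 1.0)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1.0 - frac) * sequence[lo] + frac * sequence[hi])
    return out

def gated_fusion(generic, dynamic, static, gate_logits):
    """Fuse three branch feature vectors with softmax gate weights.

    gate_logits: three floats, one per branch (hand-set here; assumed
    to come from a small learned network in a real model).
    """
    m = max(gate_logits)  # subtract the max for numerical stability
    exps = [math.exp(g - m) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    fused = [
        weights[0] * g + weights[1] * d + weights[2] * s
        for g, d, s in zip(generic, dynamic, static)
    ]
    return fused, weights
```

The offsets let each position read from a shifted location instead of a fixed grid neighbor, and the gate weights decide how much each branch contributes to the fused feature.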
Latent Diffusion and Controllable Generation
GEM's architecture is integrated into a latent diffusion framework for sequential generation, providing robustness and allowing for autoregressive or externally controlled rollouts. A planner module optionally predicts future ego status, and the model supports conditional generation based on control vectors such as BEV layouts, enabling scenario-guided synthesis and counterfactual reasoning within the driving world.
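The difference between a free autoregressive rollout and an externally controlled one can be shown with a generic loop. This sketch assumes a hypothetical one-step predictor `predict_next(context, control)` that stands in for the diffusion sampler. The names and the sliding-window context are illustrative and not taken from the paper.

```python
from collections import deque

def autoregressive_rollout(history, predict_next, steps,
                           controls=None, context_len=3):
    """Roll a one-step predictor forward, feeding predictions back in.

    history:      past latents (any objects the predictor understands)
    predict_next: callable(context_list, control) -> next latent
    controls:     optional list with one control input per step; when
                  None, the predictor receives None (free rollout)
    """
    if controls is not None and len(controls) < steps:
        raise ValueError("need one control input per rollout step")
    # Keep only the most recent context_len latents as conditioning.
    context = deque(history[-context_len:], maxlen=context_len)
    outputs = []
    for k in range(steps):
        control = None if controls is None else controls[k]
        nxt = predict_next(list(context), control)
        outputs.append(nxt)
        context.append(nxt)  # the prediction becomes future context
    return outputs
```

Passing `controls=None` corresponds to an unconstrained rollout, while supplying a control sequence (for example, planned ego states or layout conditions) steers each generated step.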
Empirical Results
The model demonstrates state-of-the-art quantitative and qualitative performance on nuScenes and KITTI Odometry across both 1s and 3s prediction horizons. In particular, GEM achieves an 81.1% reduction in Chamfer Distance (CD) on nuScenes 1s forecasts relative to the next best method, and achieves the best or near-best scores across multiple error and stability metrics. GEM also provides superior distributional fidelity as measured by Fréchet feature distances (FRID, FSVD, FPVD), JSD, and MMD on multiple datasets, outperforming leading methods on 4 out of 5 metrics on the large-scale KITTI-360 set. Inference is faster than strong baselines, and prediction stability improves through explicit disentangling and tri-path processing.
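For readers unfamiliar with the headline metric, here is a minimal brute-force sketch of a symmetric Chamfer Distance, together with the arithmetic behind a relative reduction. Chamfer Distance has several variants in the literature (squared or unsquared distances, summed or averaged directions); this sketch assumes unsquared Euclidean distances summed over both directions, which may differ from the benchmark definition. The numbers in the usage note are made up to show the arithmetic and are not results from the paper.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two non-empty point sets.

    a, b: lists of (x, y, z) tuples. For each direction, average the
    distance from every point to its nearest neighbor in the other
    set, then add the two directions.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def relative_reduction(baseline, ours):
    """Fractional improvement of `ours` over `baseline` (lower is better)."""
    return (baseline - ours) / baseline
```

As an example of the arithmetic, a hypothetical baseline CD of 1.0 and a new CD of 0.189 give `relative_reduction(1.0, 0.189)` of about 0.811, which is what an 81.1% reduction means.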
Ablation Studies
Ablations validate all core contributions. The tri-path deformable Mamba consistently outperforms UNet, diffusion transformers, and non-deformable Mamba variants on all evaluation metrics. Removal of dynamic-static disentanglement or adaptive fusion mechanisms leads to substantial degradation in geometric accuracy and stability, supporting the necessity of each architectural pillar.
Practical and Theoretical Implications
By bridging the structural gap between LiDAR's data acquisition mechanism and deep sequence architectures, GEM advances the paradigm of generative world models able to simulate diverse, high-fidelity driving scenarios with precise spatiotemporal resolution. The explicit dynamic-static separation results in improved interpretability and robustness, which are critical for downstream tasks such as trajectory planning, reactive and predictive control, and scenario-based closed-loop evaluation. The controllable generation and rollout capabilities address core limitations of prior works, paving the way for scalable safety validation, counterfactual scenario synthesis, and synthetic data augmentation.
From a theoretical perspective, the tri-path architecture demonstrates that leveraging physically grounded disentanglement within state-space models yields meaningful advances in both accuracy and controllability. The integration of diffusion-based sequential generation with unsupervised scene decomposition sets a template for future research in multimodal and multi-agent world modeling.
Future Directions
Further research can extend GEM to multi-modality by fusing camera and radar features, scaling to longer prediction horizons, and exploring finer control over scenario manipulation. The modular design facilitates integration with embodied AI agents and reinforcement learning pipelines for simulation-based policy optimization. Exploration of online continual learning and explicit uncertainty modeling within this architecture would further benefit deployment in dynamic operational design domains.
Conclusion
GEM establishes a new standard for LiDAR-based world models by combining structurally aligned tokenization, unsupervised dynamic-static disentanglement, and deformable sequential modeling via Mamba. It demonstrates compelling advantages in accuracy, interpretability, and controllable generation, positioning itself as a foundational framework for simulation and planning in autonomous driving and beyond.