4D Occupancy: Dynamic Scene Modeling
- 4D occupancy is a spatiotemporal representation that maps the evolution of occupancy states in 3D space over time with semantic labels.
- It leverages raw sensor data from LiDAR, radar, and cameras to form dynamic voxel grids or continuous functions for efficient scene reconstruction.
- Advanced forecasting techniques, including grid-based, diffusion, and sparse query models, yield improved IoU scores and support real-time applications.
A 4D occupancy representation encodes the time-evolving state of occupancy over a 3D spatial domain, producing a function or tensor that maps spatiotemporal coordinates to occupancy probability or semantic labels. In contemporary research, this paradigm is central to scene understanding, forecasting, planning, and video generation in autonomous systems. The 4D occupancy field unifies space (x, y, z) and time (t), enabling models to reason about dynamic environments, actionable predictions, and consistent cross-modal understanding.
1. Mathematical Formalizations and Core Representations
4D occupancy fields are typically cast as either discrete tensors or continuous functions over $\mathbb{R}^3 \times \mathbb{R}$. The most prevalent instantiation discretizes space-time into a grid, yielding
$$O \in \{0,1\}^{X \times Y \times Z \times T},$$
where $O_{x,y,z,t}$ indicates occupancy at spatial cell $(x,y,z)$ and time $t$; semantically labeled settings extend the codomain to $\{0,1,\dots,C\}$ for $C$ classes (Liu et al., 20 May 2025, Kreutz et al., 2022, Guo et al., 24 Sep 2024).
Continuous approaches model
$$f: \mathbb{R}^3 \times \mathbb{R} \to [0,1], \qquad (x, y, z, t) \mapsto p_{\mathrm{occ}}(x, y, z, t)$$
for probabilistic occupancy (Yang et al., 14 Dec 2025). Many pipelines further encode semantics, flow fields, or instance identifiers, e.g. panoptic occupancy (Chen et al., 11 Mar 2025).
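Both formulations can be made concrete in a few lines. The sketch below uses illustrative grid shapes and a hypothetical moving-sphere density for the continuous case; none of the values come from the cited benchmarks:

```python
import numpy as np

# Dense formulation: a binary occupancy tensor O in {0,1}^(X*Y*Z*T).
X, Y, Z, T = 32, 32, 8, 4
occ = np.zeros((X, Y, Z, T), dtype=np.uint8)
occ[10:14, 10:14, 0:2, :] = 1  # a static box occupied at every timestep

# Semantic variant: codomain {0, 1, ..., C} with 0 reserved for free space.
sem_occ = np.zeros((X, Y, Z, T), dtype=np.uint8)
sem_occ[occ == 1] = 2  # assign class id 2 to the occupied cells

# Continuous formulation: f(x, y, z, t) -> occupancy probability in [0, 1].
# Here a hypothetical soft sphere translating along +x over time.
def occupancy_prob(x, y, z, t, speed=1.0, radius=2.0):
    cx = 5.0 + speed * t  # moving center
    d2 = (x - cx) ** 2 + (y - 5.0) ** 2 + (z - 1.0) ** 2
    return float(np.exp(-d2 / (2 * radius ** 2)))

p0 = occupancy_prob(5.0, 5.0, 1.0, 0.0)  # query at the t=0 center -> 1.0
```

The discrete tensor is what grid-based forecasters predict directly; the continuous function is what neural-field-style models parameterize with a network in place of the closed-form density above.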
Sparse-query methods dispense with fixed grids, instead representing the scene via a set of dynamic queries $\{q_i\}_{i=1}^{N}$, supporting efficient continuous occupancy inference and forecasting (Dang et al., 20 Oct 2025).
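A minimal sketch of the sparse-query idea, with hypothetical query states (position, velocity, feature) and a kernel-based readout standing in for a learned decoder:

```python
import numpy as np

# Hypothetical sparse-query scene state: N queries, each with a 3D position,
# a velocity (for propagation to future timesteps), and a feature vector.
rng = np.random.default_rng(0)
N, F = 64, 16
positions = rng.uniform(0.0, 50.0, size=(N, 3))
velocities = rng.normal(0.0, 1.0, size=(N, 3))
features = rng.normal(size=(N, F))

def propagate(positions, velocities, dt):
    """Advance every query to a future timestep (constant-velocity sketch)."""
    return positions + dt * velocities

def occupancy_at(point, positions, radius=2.0):
    """Continuous occupancy readout: max kernel response over all queries."""
    d2 = np.sum((positions - point) ** 2, axis=1)
    return float(np.exp(-d2 / (2 * radius ** 2)).max())

future = propagate(positions, velocities, dt=0.5)
p = occupancy_at(positions[0], positions)  # exactly at a query -> 1.0
```

The point is that memory scales with the number of queries rather than with grid resolution, and occupancy can be evaluated at arbitrary continuous coordinates.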
2. Construction from Raw Sensor Modalities
Raw point cloud (LiDAR, radar), camera images, or 4D radar tensors are projected or lifted into the occupancy field:
- LiDAR: Point clouds are voxelized into the grid; time series are constructed by updating each voxel's history (Kreutz et al., 2022, Khurana et al., 2023).
- Radar: 4D radar returns are voxelized over , or directly encoded as a 4D tensor (Liu et al., 20 May 2025, Ding et al., 22 May 2024). Doppler and beam-specific descriptors may be used to capture velocity cues.
- Camera: Multi-view images are processed through CNN/FPN backbones, lifted via frustum-based depth or LiDAR supervision, and aggregated with ego-pose alignment to form motion-aware volumes (Ma et al., 2023, Chen et al., 11 Mar 2025, Chen et al., 21 Feb 2025).
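The LiDAR path above reduces to scatter-into-grid plus temporal stacking. A toy sketch, with illustrative extents and a square BEV grid assumed (not parameters from the cited pipelines):

```python
import numpy as np

# Minimal LiDAR-style voxelization: scatter points into a 3D grid, then stack
# per-frame grids along a time axis to obtain a 4D field of shape (T, X, Y, Z).
def voxelize(points, grid_shape=(32, 32, 8), extent=50.0, height=8.0):
    """points: (N, 3) array of x, y, z in meters; assumes a square x-y grid."""
    grid = np.zeros(grid_shape, dtype=np.uint8)
    ij = np.floor(points[:, :2] / extent * grid_shape[0]).astype(int)
    k = np.floor(points[:, 2] / height * grid_shape[2]).astype(int)
    valid = ((ij >= 0).all(1) & (ij < grid_shape[0]).all(1)
             & (k >= 0) & (k < grid_shape[2]))
    grid[ij[valid, 0], ij[valid, 1], k[valid]] = 1
    return grid

# One point moving along +x across four frames -> a tiny 4D occupancy tensor.
frames = [np.array([[1.0 + t, 2.0, 1.0]]) for t in range(4)]
occ4d = np.stack([voxelize(f) for f in frames])
```

Real pipelines add ego-pose alignment between frames (so the stack shares one coordinate frame) and accumulate per-voxel histories rather than raw binaries.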
Downstream models often employ VQ-VAE tokenization (Wang et al., 30 May 2024), tri-plane compression (Xu et al., 10 Mar 2025, Yang et al., 14 Dec 2025), or BEV-centric fusion (Yang et al., 26 Aug 2024) to obtain tractable, informative representations.
3. Model Architectures and Forecasting Methodologies
Occupancy forecasting is broadly approached via:
- Grid-based forecasting: CNN or Transformer-based encoder-decoders directly predict future occupancy grids (Zheng et al., 2023, Kreutz et al., 2022, Guo et al., 24 Sep 2024).
- Diffusion-based models: Spatial-temporal diffusion transformers predict future occupancy tokens or continuous latents, supporting trajectory-conditioning and long-horizon sampling (Wang et al., 30 May 2024, Gu et al., 14 Oct 2024).
- Tri-plane transformers: 3D grids are encoded to tri-plane latents, with temporal prediction performed at the tri-plane level, enabling fine-grained, real-time forecasting (Xu et al., 10 Mar 2025, Yang et al., 14 Dec 2025).
- Sparse query-based methods: Occupancy is represented and predicted via a dynamic set of queries whose locations and features can be adaptively regressed, rather than predicted via grid classification (Dang et al., 20 Oct 2025).
- Scene flow and warping approaches: For efficient temporal modeling, decoupled dynamic flow (voxel flow) warps dynamic objects, while static backgrounds are transformed via ego-motion mapping, sharply reducing the number of predicted variables (Zhang et al., 18 Dec 2024, Guo et al., 24 Sep 2024).
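The decoupled warping idea in the last bullet can be sketched in 2D (BEV): static occupancy is re-indexed by the ego-motion, while dynamic cells are advected by a per-cell flow. Shapes, the integer-shift ego model, and the 2D simplification are all illustrative, not taken from any cited method:

```python
import numpy as np

def warp_static(grid, shift):
    """Apply an ego translation of `shift` cells by rolling the static grid."""
    return np.roll(grid, shift=(-shift[0], -shift[1]), axis=(0, 1))

def warp_dynamic(grid, flow):
    """Advect occupied cells by an integer per-cell flow (dx, dy)."""
    out = np.zeros_like(grid)
    xs, ys = np.nonzero(grid)
    nx = np.clip(xs + flow[xs, ys, 0], 0, grid.shape[0] - 1)
    ny = np.clip(ys + flow[xs, ys, 1], 0, grid.shape[1] - 1)
    out[nx, ny] = 1
    return out

static = np.zeros((16, 16), dtype=np.uint8); static[8, 8] = 1
dynamic = np.zeros((16, 16), dtype=np.uint8); dynamic[2, 2] = 1
flow = np.zeros((16, 16, 2), dtype=int); flow[2, 2] = (3, 0)  # object moves +x

pred = warp_static(static, shift=(1, 0)) | warp_dynamic(dynamic, flow)
```

Only the flow of the (few) dynamic cells must be predicted by a network; the static background comes for free from the known ego-motion, which is the source of the efficiency gain.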
Many modern pipelines integrate self-supervision, multi-stage contrastive or reconstructive objectives, and specialized modules such as motion-conditioned normalization (Yang et al., 26 Aug 2024), attention-based query pooling (Chen et al., 11 Mar 2025, Dang et al., 20 Oct 2025), or image-assisted volume rendering (Zhang et al., 18 Dec 2024).
4. 4D Occupancy for Planning, Tracking, and World Modeling
The space-time occupancy paradigm is foundational for:
- Scene prediction and motion planning: Action-conditional rollouts, occupancy-based cost functions, and explicit path evaluation on predicted occupancy maps yield robust, physics-constrained planners (Yang et al., 26 Aug 2024, Zheng et al., 17 Dec 2025, Yang et al., 14 Dec 2025).
- Tracking and panoptic segmentation: 4D panoptic occupancy assigns semantic labels and temporally consistent instance IDs for every voxel, enabling dense object tracking and temporal association (Chen et al., 11 Mar 2025).
- General world models and video synthesis: Generative diffusion models conditioned on 4D occupancy representations can produce photorealistic, physics-consistent robot or driving videos, with the 4D occupancy providing the geometric and semantic constraints for the video generator (Yang et al., 14 Dec 2025, Yang et al., 3 Jun 2025).
- Risk and safety estimation: 4D Risk Occupancy augments occupancy with a continuous risk variable, enabling the formulation of risk-aware planners and the quantification of safety redundancy (Chen et al., 14 Aug 2024).
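The planning use case in the first bullet reduces to scoring candidate trajectories against the forecast field. A hypothetical occupancy-cost step (toy BEV grid, illustrative cost weighting, no cited planner implied):

```python
import numpy as np

def trajectory_cost(occ4d, traj, collision_weight=10.0):
    """occ4d: (T, X, Y) BEV occupancy probabilities; traj: list of (x, y) cells,
    one per timestep. Cost accumulates predicted occupancy along the path."""
    cost = 0.0
    for t, (x, y) in enumerate(traj):
        cost += collision_weight * float(occ4d[t, x, y])
    return cost

T, X, Y = 4, 16, 16
occ4d = np.zeros((T, X, Y))
occ4d[:, 8, 8] = 1.0  # a persistently occupied cell on one candidate path

through = [(8, 8)] * T  # candidate that drives into the occupied cell
around = [(8, 6)] * T   # candidate that detours past it
best = min([through, around], key=lambda tr: trajectory_cost(occ4d, tr))
```

Real planners add smoothness, progress, and comfort terms, and evaluate continuous trajectories by interpolating the occupancy field rather than indexing discrete cells.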
Notably, proactive forecasting using user-specified future action sequences has emerged as a new evaluation protocol, going beyond mere "what will happen next" to "what would happen if action A is taken" (Zheng et al., 17 Dec 2025).
5. Quantitative Benchmarks and Empirical Impact
State-of-the-art 4D occupancy forecasting models have delivered consistent improvements across benchmarks:
- FSF-Net achieves volumetric IoU gains of +9.56% absolute over OccWorld and BEV mIoU gains of +12.1% on Occ3D 4D forecasting (Guo et al., 24 Sep 2024).
- T³Former attains 36.09% mIoU for 1–3 s prediction (vs. 17.14% for OccWorld-O), a 1.44× real-time speedup, and a mean L2 planning error of 1.0 m (Xu et al., 10 Mar 2025).
- GenieDrive's tri-plane VAE yields 7.2% mIoU improvement and 20.7% reduction in video FVD over predecessor methods, with high-speed (41 FPS) inference (Yang et al., 14 Dec 2025).
- DOME's diffusion transformer offers 36% higher mIoU than OccLLaMA-O in 4D forecasting and maintains temporal coherence over 32-frame rollouts (Gu et al., 14 Oct 2024).
- OccSTeP's tokenizer-free, recurrent world model achieves a proactive semantic mIoU of 23.70% (+6.56 pp), highlighting robustness under perturbations (Zheng et al., 17 Dec 2025).
- SparseWorld delivers a ∼7× speedup over grid-based methods while attaining the highest mIoU/IoU on Occ3D-nuScenes using only a sparse set of queries (Dang et al., 20 Oct 2025).
- For 4D risk occupancy-based planning, safety redundancy improves by 12.5% and average deceleration required in emergencies decreases by 5.41% (Chen et al., 14 Aug 2024).
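The IoU and mIoU figures above are voxel-wise intersection-over-union scores, per class for mIoU. A minimal sketch with toy grids and class ids (class 0 taken as free space, a common but not universal convention):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Voxel-wise IoU per semantic class, skipping class 0 (free space) and
    classes absent from both prediction and ground truth."""
    ious = []
    for c in range(1, num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return ious

pred = np.array([[1, 1, 0], [2, 0, 0]])
gt = np.array([[1, 0, 0], [2, 2, 0]])
miou = float(np.mean(per_class_iou(pred, gt, num_classes=3)))
```

For 4D forecasting benchmarks the same computation is applied per future timestep and averaged over the prediction horizon.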
6. Extensions, Modalities, and Applicative Scope
The 4D occupancy field is modality-agnostic:
- Radar-based 4D occupancy is robust to adverse weather and, via LiDAR-based pseudo-supervision or direct 4DRT modeling, yields near-LiDAR accuracy (Liu et al., 20 May 2025, Ding et al., 22 May 2024).
- Camera-only 4D occupancy, with tailored architectures for multi-camera input, achieves state-of-the-art forecasting accuracy and efficiency, narrowing or surpassing the performance gap with LiDAR pipelines (Chen et al., 21 Feb 2025, Ma et al., 2023).
- Sim-to-real and multi-view transfer is enabled by occupancy-centric generation pipelines, leveraging the modality-invariance and physical faithfulness of 4D occupancy scaffolds (Yang et al., 3 Jun 2025).
Downstream uses include BEV segmentation, 3D instance-level flow, multi-object tracking, and physically plausible multi-view video synthesis. The representation’s persistence and adaptability have driven advances in robustness against frame drops, label corruption, and partial sensor input (Zheng et al., 17 Dec 2025).
7. Design Insights, Limitations, and Trends
Established design principles include:
- Separation of dynamic and static prediction for efficiency and interpretability (Zhang et al., 18 Dec 2024).
- Self-supervised learning from raw sensor streams, minimizing annotation cost (Kreutz et al., 2022, Khurana et al., 2023).
- Sparse, adaptive, and query-based approaches to overcome grid inefficiency and enable flexible range adaptation (Dang et al., 20 Oct 2025).
- Cross-modal conditionality: Action-, trajectory-, or planning-conditional forecasting for controllability and downstream integration (Gu et al., 14 Oct 2024, Xu et al., 10 Mar 2025, Yang et al., 14 Dec 2025).
- Linear-complexity architectures to ensure scalability to high-resolution, real-time settings (Zheng et al., 17 Dec 2025).
Limitations include the tradeoff between resolution and tractability (bottlenecked by grid size in dense approaches), the difficulty of handling rare or dynamically occluded objects, and the heavy compute cost of large diffusion models (Gu et al., 14 Oct 2024, Yang et al., 14 Dec 2025). Work on fully self-supervised, multi-agent, or uncertainty-aware world models remains ongoing.
References
- (Kreutz et al., 2022) Unsupervised 4D LiDAR Moving Object Segmentation in Stationary Settings with Multivariate Occupancy Time Series
- (Chen et al., 11 Mar 2025) TrackOcc: Camera-based 4D Panoptic Occupancy Tracking
- (Liu et al., 20 May 2025) 4D-ROLLS: 4D Radar Occupancy Learning via LiDAR Supervision
- (Wang et al., 30 May 2024) OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving
- (Khurana et al., 2023) Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting
- (Ding et al., 22 May 2024) RadarOcc: Robust 3D Occupancy Prediction with 4D Imaging Radar
- (Chen et al., 21 Feb 2025) OccProphet: Pushing Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with Observer-Forecaster-Refiner Framework
- (Zhang et al., 18 Dec 2024) An Efficient Occupancy World Model via Decoupled Dynamic Flow and Image-assisted Training
- (Yang et al., 26 Aug 2024) Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving
- (Guo et al., 24 Sep 2024) FSF-Net: Enhance 4D Occupancy Forecasting with Coarse BEV Scene Flow for Autonomous Driving
- (Gu et al., 14 Oct 2024) DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model
- (Yang et al., 14 Dec 2025) GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation
- (Zheng et al., 17 Dec 2025) OccSTeP: Benchmarking 4D Occupancy Spatio-Temporal Persistence
- (Ma et al., 2023) Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications
- (Chen et al., 14 Aug 2024) Risk Occupancy: A New and Efficient Paradigm through Vehicle-Road-Cloud Collaboration
- (Zheng et al., 2023) Technical Report for Argoverse Challenges on 4D Occupancy Forecasting
- (Yang et al., 3 Jun 2025) ORV: 4D Occupancy-centric Robot Video Generation
- (Dang et al., 20 Oct 2025) SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries
- (Xu et al., 10 Mar 2025) Temporal Triplane Transformers as Occupancy World Models