Dynamic Occupancy Grid Maps (DOGMs)
- DOGMs are a probabilistic framework representing spatiotemporal states in a 2D grid, incorporating occupancy, velocity, and uncertainty estimates.
- They fuse Bayesian filtering with CNN-based deep learning for sensor data integration, enabling precise dynamic object detection and prediction.
- DOGMs support scene segmentation, motion forecasting, and risk-aware planning while addressing challenges like class imbalance and sensor calibration.
A Dynamic Occupancy Grid Map (DOGM) is a cell-based probabilistic framework for representing the spatiotemporal state of the environment as perceived by autonomous agents, notably in robotics and autonomous driving. DOGMs model not only the static free/occupied composition of the environment but also per-cell dynamic quantities such as velocity and uncertainty, leveraging both Bayesian filtering and deep learning for probabilistic state estimation, perception, detection, and prediction.
1. Mathematical Foundation and State Representation
The DOGM formalism discretizes the vehicle’s local environment into a 2D grid. Each grid cell $c$ at time $t$ is associated with:
- Occupancy Probabilities/Belief Masses: $O(c, t)$ for occupancy, $F(c, t)$ for free-space, and an “unknown” or nonspecific mass $m(\Theta) = 1 - O(c, t) - F(c, t)$, often formalized in Dempster–Shafer theory.
- Velocity Estimate: $\mathbf{v}(c, t) = (v_E(c, t), v_N(c, t))$ for the east/north components.
- Velocity Covariance: $\Sigma_v(c, t) \in \mathbb{R}^{2 \times 2}$ for per-cell velocity uncertainty.
- Dynamic State: Total state vector $\mathbf{x}(c, t) = \big(O(c, t), F(c, t), v_E(c, t), v_N(c, t), \Sigma_v(c, t)\big)$, with per-cell uncertainty from the Dempster–Shafer and covariance formalisms (Hoermann et al., 2018).
Occupancy probability is typically aggregated via the pignistic transform,
$$P_O(c, t) = O(c, t) + \tfrac{1}{2}\big(1 - O(c, t) - F(c, t)\big),$$
and the joint grid state at time $t$ is the collection $X_t = \{\mathbf{x}(c, t)\}_{c \in \mathcal{G}}$. For radar/lidar fusion and uncertainty quantification, more elaborate evidential or random finite set (RFS) representations are employed (e.g., cell-wise belief masses, multi-instance Bernoulli filtering) (Lampe et al., 2020, Nuss et al., 2016).
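As a minimal sketch of the per-cell evidential state and its aggregated occupancy, consider the following (the `CellState` class and its field names are hypothetical illustrations, not drawn from the cited implementations):

```python
from dataclasses import dataclass

@dataclass
class CellState:
    """Per-cell DOGM state: Dempster-Shafer belief masses plus velocity."""
    m_occ: float       # belief mass for "occupied", O(c, t)
    m_free: float      # belief mass for "free", F(c, t)
    v_e: float = 0.0   # east velocity estimate
    v_n: float = 0.0   # north velocity estimate

    @property
    def m_unknown(self) -> float:
        # Nonspecific mass assigned to the whole frame Theta
        return 1.0 - self.m_occ - self.m_free

    @property
    def p_occ(self) -> float:
        # Pignistic occupancy: split the unknown mass evenly
        return self.m_occ + 0.5 * self.m_unknown

cell = CellState(m_occ=0.5, m_free=0.25)
print(cell.m_unknown)  # 0.25
print(cell.p_occ)      # 0.625
```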
The Bayesian filtering loop per cell consists of
- Prediction: Incorporate dynamic models, typically constant velocity:
$$\hat{\mathbf{x}}(c, t) = A\,\mathbf{x}(c, t-1), \qquad \hat{P}(c, t) = A\,P(c, t-1)\,A^\top + Q.$$
- Measurement Update: Integrate sensor measurements $\mathbf{z}(c, t)$ (e.g. lidar/radar). For a linear model:
$$\mathbf{x}(c, t) = \hat{\mathbf{x}}(c, t) + K\big(\mathbf{z}(c, t) - H\,\hat{\mathbf{x}}(c, t)\big), \qquad K = \hat{P}\,H^\top\big(H\,\hat{P}\,H^\top + R\big)^{-1},$$
with Kalman gain $K$, measurement/transition operators $H, A$, and covariances $R, Q$. The occupancy update employs Dempster–Shafer evidence combination in practice (Hoermann et al., 2018, Stumper et al., 2018, Gies et al., 2018).
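The predict/update cycle above can be sketched as a standard linear Kalman filter over the per-cell velocity components (matrices $A$, $Q$, $H$, $R$ here are illustrative; actual DOGM implementations typically use particle or evidential updates for the occupancy part):

```python
import numpy as np

# Per-cell velocity state x = [v_E, v_N]; constant-velocity transition A = I,
# process noise Q, direct velocity-like measurement z = H x + noise.
A = np.eye(2)
Q = 0.05 * np.eye(2)
H = np.eye(2)
R = 0.25 * np.eye(2)

def predict(x, P):
    """Prediction: propagate mean and covariance through the CV model."""
    return A @ x, A @ P @ A.T + Q

def update(x_pred, P_pred, z):
    """Measurement update with Kalman gain K."""
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(2) - K @ H) @ P_pred
    return x_new, P_new

x, P = np.array([1.0, 0.0]), np.eye(2)
x, P = predict(x, P)                       # covariance grows by Q
x, P = update(x, P, z=np.array([1.2, 0.1]))  # covariance shrinks again
```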
2. Perception and Deep Learning—Object Detection from DOGMs
DOGMs serve as structured inputs for deep object detection and scene understanding:
- Input Encoding: For each grid cell, standard features are stacked: belief masses $O, F$, velocities $v_E, v_N$, variances $\sigma^2_{v_E}, \sigma^2_{v_N}$, and covariance $\sigma_{v_E v_N}$; concatenated into an input tensor.
- CNN-Based Detection: U-Net–style encoder–decoder networks with skip connections process the grid to generate bounding box hypotheses (center, width, length, orientation, confidence) per cell using spatial anchors over multiple scales and orientations. Detection heads predict per-anchor IoU, width/length/angle offsets; multi-task losses are balanced for class imbalance—dynamic cells are rare versus static background (Hoermann et al., 2018).
- Loss Functions: The per-output loss includes spatial weighting with a factor $w$ set to the background/foreground class ratio (e.g., $w \approx 400$), and an exponent that re-weights cells within objects; all outputs contribute to the total loss.
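A toy version of such a spatially weighted loss, using the background/foreground factor from the text and a hypothetical within-object exponent `gamma`, might look like:

```python
import numpy as np

def weighted_detection_loss(pred, target, fg_mask, w_bg_fg=400.0, gamma=2.0):
    """Per-cell squared-error loss with foreground up-weighting.

    fg_mask marks the rare dynamic (object) cells; they are weighted by
    w_bg_fg relative to the static background, and gamma (an illustrative
    choice) further re-weights errors within objects.
    """
    err = (pred - target) ** 2
    weights = np.where(fg_mask, w_bg_fg, 1.0)
    # Optional within-object re-weighting by target magnitude
    weights = weights * np.where(fg_mask, np.abs(target) ** gamma + 1.0, 1.0)
    return float(np.sum(weights * err) / np.sum(weights))
```

With this weighting, a unit error on a single foreground cell dominates the same error on a background cell by roughly the class-ratio factor, counteracting the rarity of dynamic cells.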
Advanced versions extend to multi-task architectures to directly regress occupancy, velocity, semantic classes, and drivable area in a single pass, using a combination of regression (MSE) and classification (cross-entropy/focal) objectives, often with explicit temporal modeling via recurrent units (ConvLSTM) (Schreiber et al., 2022, Schreiber et al., 2020).
3. Label Generation and Ground Truth for DOGM-Based Learning
Training deep networks on DOGMs requires high-quality labels. Manual annotation is infeasible at scale, so offline automatic two-pass object extraction is standard:
- Forward Pass: Seed object traces at cells with high occupancy and velocity; grow connected components using velocity similarity and spatial adjacency; fit rectangles to the resulting clusters; predict future positions.
- Backward Pass: After the causal pass, refine trajectories and shapes by tracing backward from last visibility, correcting poses during occlusions/fragmentation.
- Outlier Rejection and Postprocessing: Spline-smoothing of trajectories, removal of spurious objects via context (e.g., OpenStreetMap filtering for mirrored/ghost objects); enforcement of size/kinematic plausibility (Stumper et al., 2018, Hoermann et al., 2018).
This pipeline achieves low false positive rates with ≈5% missed detections, forming the basis for CNN training and benchmarking (Hoermann et al., 2018).
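The forward-pass clustering step can be illustrated as a seed-and-grow over velocity-similar neighboring cells (the `grow_clusters` helper and its thresholds are hypothetical, for illustration only):

```python
from collections import deque

def grow_clusters(cells, v_tol=0.5):
    """Group 4-connected dynamic cells whose velocity components differ
    by less than v_tol (m/s).

    `cells` maps (row, col) -> (v_e, v_n) for cells already above the
    occupancy/speed thresholds; returns one set of cell indices per object.
    """
    unvisited = set(cells)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        comp, queue = {seed}, deque([seed])
        while queue:
            r, c = queue.popleft()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in unvisited:
                    dv_e = abs(cells[nb][0] - cells[r, c][0])
                    dv_n = abs(cells[nb][1] - cells[r, c][1])
                    if dv_e < v_tol and dv_n < v_tol:
                        unvisited.remove(nb)
                        comp.add(nb)
                        queue.append(nb)
        clusters.append(comp)
    return clusters

# Two adjacent cells moving together, one distant cell moving oppositely:
cells = {(0, 0): (1.0, 0.0), (0, 1): (1.1, 0.0), (5, 5): (-1.0, 0.0)}
clusters = grow_clusters(cells)  # two clusters
```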
4. Sensor Modalities and Fusion in DOGMs
DOGMs are sensor-agnostic but have been developed with both lidar and radar inputs:
- Lidar-Driven DOGMs: Classical geometric inverse sensor models (ISMs) ray-trace returns, associate occupancy with measured points, and carve freespace in between, but are brittle across platforms and overestimate unknown/occluded areas. End-to-end deep learning–based ISMs overcome this by learning to map raw BEV lidar tensors to per-cell occupancy using temporal context—yielding superior object shape, coverage, and freespace representation (Schreiber et al., 2022).
- Radar-Driven DOGMs: To leverage high-resolution Doppler and robustness, radar-centric DOGMs modify the ISM, update, and particle assignment, using radar-specific sectors, RCS-based weighting, range-rate for dynamic-state hypothesis, and large angular spread for FOV modeling. Bayesian updates fuse the prior state with measurement-derived per-cell probability vectors. Deep learning–based ISMs trained against lidar-derived labels further improve static/dynamic separation, particularly for slow-moving or low-RCS targets (Ronecker et al., 2024, Diehl et al., 2020, Wei et al., 2023).
- Sensor Fusion and Multi-Vehicle Aggregation: DOGMs are naturally fused cell-wise via Dempster’s rule when fusing multiple vehicles' or sensors' evidential grid maps (cloud-based collective models), allowing reduction of per-cell Shannon entropy and m(Θ), thus expanding the confident free/occupied area and supporting cooperative vehicle scenarios (Lampe et al., 2020).
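Cell-wise combination by Dempster's rule on the two-hypothesis frame {occupied, free} can be sketched as follows; note how the combined nonspecific mass drops below either input's, which is the entropy-reduction effect described above:

```python
def dempster_combine(m1_occ, m1_free, m2_occ, m2_free):
    """Dempster's rule for one cell on the frame {occupied, free}.

    Combines belief masses from two evidential grid maps; the residual
    mass m(Theta) = 1 - m_occ - m_free carries the 'unknown' evidence.
    """
    m1_u = 1.0 - m1_occ - m1_free      # m1(Theta)
    m2_u = 1.0 - m2_occ - m2_free      # m2(Theta)
    conflict = m1_occ * m2_free + m1_free * m2_occ
    norm = 1.0 - conflict              # renormalize away conflicting mass
    occ = (m1_occ * m2_occ + m1_occ * m2_u + m1_u * m2_occ) / norm
    free = (m1_free * m2_free + m1_free * m2_u + m1_u * m2_free) / norm
    return occ, free                   # m(Theta) follows as 1 - occ - free

occ, free = dempster_combine(0.6, 0.1, 0.5, 0.2)
```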
5. Applications: Scene Understanding, Prediction, and Planning
DOGMs provide a unified backbone for downstream tasks:
- Scene Segmentation and Multi-Object Tracking: Clustering cell-level dynamic occupancy by spatial and velocity proximity (often with DBSCAN or graph-based methods) extracts individual object tracks. Fusion with filter-based multi-object trackers in a common hypothesis framework yields superior continuity and lower error (Gies et al., 2018, Rexin et al., 2019).
- Motion Prediction and Multimodal Forecasting: DOGMs serve as inputs to probabilistic forecasting architectures (e.g. ConvLSTM, CVAE, variational autoencoders) that predict the future evolution of grid cell occupancies and velocities, allowing sampling of full scene futures. Joint semantic and flow heads enable warped auxiliary supervision and multi-object retention analysis over long horizons (Asghar et al., 2023, Asghar et al., 2024, Xie et al., 2022).
- Planning and Risk Assessment: Uncertainty-aware DOGMs, especially those outputting per-cell entropy or probabilistic occupancy, are integrated into predictive planning pipelines (e.g., DWA), offering soft cost layers for obstacles and risk-aware navigation under dynamic uncertainty (Xie et al., 2022).
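A simple occupancy-plus-entropy soft cost layer of the kind described for the planning bullet might be sketched as follows (the weighting and function name are illustrative, not from the cited pipelines):

```python
import numpy as np

def entropy_cost(p_occ, occ_weight=10.0, eps=1e-9):
    """Soft planner cost per cell: a hard term scaling with occupancy
    probability, plus a Shannon-entropy term penalizing uncertain cells.
    """
    p = np.clip(p_occ, eps, 1.0 - eps)   # avoid log(0)
    entropy = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return occ_weight * p + entropy

# Confident free, maximally uncertain, confident occupied:
grid = np.array([0.05, 0.5, 0.95])
costs = entropy_cost(grid)
```

The entropy term peaks at $p = 0.5$, so the planner pays extra for traversing cells the DOGM is unsure about even when their expected occupancy is moderate.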
6. Implementation, Quantitative Benchmarks, and Limitations
Typical implementation parameters include 0.15–0.2 m grid resolution, million-scale cell counts, and near real-time update rates (~20–30 Hz). Particle count and bandwidth are tuned for sensor modality and required accuracy (Diehl et al., 2020, Chen et al., 2022).
Quantitative results across studies include (for object detection):
- Average Precision (AP) near 75.9% (IoU-based, lidar, stationary) (Hoermann et al., 2018)
- Grid-level mIoU up to ~93% on static/dynamic masks; velocity EPE ≈ 0.01 m/s for deep RNN-based methods (Schreiber et al., 2020)
- Radar-centric detection: Car AP from 19% (radar-only) up to 27% with deep learning–driven fusion (Ronecker et al., 2024); mIoU improvement on dynamic grid cells from 17.4% to 20.1%
Advantages include principled handling of dynamic/static distinction, uncertainty, sensor fusion, temporally smooth object proposals, and compatibility with both analytic Bayes filtering and deep learning.
Limitations cited are class imbalance (rare dynamic objects versus static background), dependence on sensor modality and calibration, residual susceptibility to ghost targets, moderate precision in radar-only regimes, and the necessity for offline batch processing in high-quality label pipelines. Extensions under exploration include improved cell clustering, direct end-to-end learning of the inverse sensor model, full 3D continuous-space inference, and combined detection–forecasting modules with learned semantics (Ronecker et al., 2024, Chen et al., 2022, Asghar et al., 2024).
7. Future Research Directions
Current work addresses:
- End-to-end differentiable DOGM/state estimation (Ronecker et al., 2024)
- Multi-class and multi-modality (combining radar, lidar, camera) DOGM pipelines (Asghar et al., 2024, Schreiber et al., 2022)
- Stochastic prediction with uncertainty-calibrated forecasts for risk-sensitive planning (Xie et al., 2022)
- Continuous-space DOGM representation and efficient RFS-based Bayesian updates (Chen et al., 2022)
- Instance segmentation and dynamic/semantic fusion on grid maps for robust tracking and prediction (Asghar et al., 2023, Asghar et al., 2024)
- Cloud-based collective DOGM fusion for connected vehicle networks (Lampe et al., 2020)
The DOGM framework thus represents a core methodology for integrated, uncertainty-aware, and scalable environment modeling in the perception and decision stacks of autonomous robotic systems.