Occupancy Flow Prediction in Dynamic Scenes
- Occupancy flow prediction is the task of jointly estimating spatial occupancy and flow fields to capture dynamic changes in scenes, a capability essential to autonomous driving and robotics.
- Key methodologies employ convolutional, recurrent, and transformer-based architectures with hierarchical decoders and differentiable warping to ensure temporal consistency and physical realism.
- This approach underpins applications in smart infrastructure and self-supervised settings, enhancing planning, tracking, and scene reconstruction through accurate multi-horizon forecasting.
Occupancy flow prediction is the problem of jointly estimating both the spatial distribution of dynamic scene occupancy and the instantaneous or future motion (flow) of those occupancies within a temporally evolving scene. In the context of autonomous driving, robotics, and intelligent infrastructure, this task unifies geometric scene understanding with dynamic forecasting, providing a granular and temporally consistent representation of the environment that supports critical downstream planning, tracking, and interaction modeling. Occupancy flow field methods predict both dense occupancy grids (typically in 2D BEV or full 3D voxel space) and per-element flow fields, explicitly modeling the spatial evolution of occupied regions in a physically realistic, temporally coherent manner.
1. Occupancy Flow Prediction: Representations and Problem Formulation
Occupancy flow prediction formalizes the spatiotemporal scene state as a pair of fields: an occupancy map $O_t(\mathbf{x}) \in [0,1]$ that indicates the probability of occupancy (or class) at spatial location $\mathbf{x}$ (grid/voxel/cell) at time $t$, and a flow field $F_t(\mathbf{x})$ that specifies, for each location, the vector displacement of the occupant between adjacent time steps. In the BEV grid formulation prevalent in urban driving, each grid cell holds occupancy $O_t(\mathbf{x})$ and backward flow $F_t(\mathbf{x}) \in \mathbb{R}^2$, characterizing movements from $t$ to $t-1$, while 3D formulations extend this to occupancy $O_t(\mathbf{v})$ and flow $F_t(\mathbf{v}) \in \mathbb{R}^3$ over voxels $\mathbf{v}$ (Wang et al., 31 Mar 2025, Murhij et al., 2024, Chen et al., 2024).
Forecasting is conducted either autoregressively or via direct multi-horizon prediction, yielding the tuple $\{(O_{t+k}, F_{t+k})\}_{k=1}^{K}$ given historical scene context (past occupancy, flow, semantics, maps, images, etc.). Supervision uses binary cross-entropy or focal loss for occupancy and $\ell_1$ or $\ell_2$ regression for flow, together with differentiable warping: the predicted future occupancy is constrained to be consistent with the predicted flow applied to prior occupancy fields (Huang et al., 2022, Liu et al., 2022, Murhij et al., 2024).
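The warping step above can be made concrete with a minimal NumPy sketch of backward warping: each current cell samples the previous occupancy grid at the location its backward flow points to, via bilinear interpolation (the same operation, re-implemented in an autodiff framework, is differentiable). Function and variable names are illustrative, not taken from any of the cited papers.

```python
import numpy as np

def backward_warp(occ_prev, flow):
    """Warp a t-1 occupancy grid into the current frame.

    occ_prev : (H, W) occupancy probabilities at time t-1.
    flow     : (H, W, 2) backward flow; flow[y, x] is the (dy, dx)
               offset from current cell (y, x) to its source in occ_prev.
    Returns the bilinearly sampled (H, W) warped occupancy.
    """
    H, W = occ_prev.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    src_y = np.clip(ys + flow[..., 0], 0, H - 1)
    src_x = np.clip(xs + flow[..., 1], 0, W - 1)
    y0 = np.floor(src_y).astype(int); x0 = np.floor(src_x).astype(int)
    y1 = np.clip(y0 + 1, 0, H - 1);   x1 = np.clip(x0 + 1, 0, W - 1)
    wy = src_y - y0; wx = src_x - x0
    # Bilinear blend of the four neighbouring source cells.
    return ((1 - wy) * (1 - wx) * occ_prev[y0, x0]
            + (1 - wy) * wx * occ_prev[y0, x1]
            + wy * (1 - wx) * occ_prev[y1, x0]
            + wy * wx * occ_prev[y1, x1])
```

A warp-consistency loss then penalizes the discrepancy between this warped grid and the predicted (or ground-truth) occupancy at time $t$.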
2. Methodological Approaches
A wide spectrum of architectures and losses has been explored for occupancy flow prediction.
- Convolutional and ConvLSTM-based architectures (e.g., CCLSTM (Lengyel, 6 Jun 2025), OFMPNet (Murhij et al., 2024), VectorFlow (Huang et al., 2022), STrajNet (Liu et al., 2022)) employ stacked convolutional and recurrent blocks, encoding temporal histories and decoding multi-step BEV occupancy and flow tensors.
- Transformer-based and Hierarchical models (e.g., HOPE (Hu et al., 2022), HGNET (Chen et al., 2024), STCOcc (Liao et al., 28 Apr 2025), STrajNet (Liu et al., 2022)) leverage spatial and temporal attention to capture global context and multi-agent interactions, with hierarchical decoders handling multiscale and multitarget (flow/occupancy) prediction.
- Self-supervised, differentiable-rendering approaches (Let Occ Flow (Liu et al., 2024), SelfOccFlow (Timoneda et al., 27 Feb 2026), OccFlowNet (Boeder et al., 2024)) replace expensive 3D annotation with 2D or photometric supervision, using differentiable volume or SDF rendering and unsupervised optical-flow cues.
- Explicit attention to physical and statistical priors: Methods such as VoxelSplat (Zhu et al., 5 Jun 2025) project 3D Gaussians into 2D for additional camera-space losses, OAAL in ALOcc (Chen et al., 2024) and OA-SCA in STCOcc (Liao et al., 28 Apr 2025) utilize learned or occupancy-weighted attention to improve the handling of occlusions and sparsity.
- Implicit Continuous Representations (e.g., Implicit Occupancy Flow (Agro et al., 2023)) represent occupancy-flow as continuous-space, continuous-time fields, queryable at arbitrary points, via global deformable attention over a latent BEV scene encoding.
The field has also seen developments in graph-based occupancy-flow prediction for facility or network settings (e.g., the GCLSTM approach for building-level OD and flow from WiFi logs in (Badu-Marfo et al., 7 Jul 2025)).
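To make the recurrent family above concrete, here is a schematic ConvLSTM-style state update over a BEV feature map. For brevity the gate "convolutions" are 1x1 (a per-cell linear map over channels); real models such as those cited use spatial kernels (e.g. 3x3). All names and shapes are illustrative assumptions, not any specific paper's architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, W, U, b):
    """One recurrent step over a BEV feature map.

    x, h, c : (H, W, C) input features, hidden state, cell state.
    W, U    : (C, 4C) input-to-gate and hidden-to-gate weights (1x1 convs).
    b       : (4C,) gate bias.
    Returns the updated (h, c).
    """
    C = x.shape[-1]
    gates = x @ W + h @ U + b                 # (H, W, 4C)
    i = sigmoid(gates[..., 0 * C:1 * C])      # input gate
    f = sigmoid(gates[..., 1 * C:2 * C])      # forget gate
    o = sigmoid(gates[..., 2 * C:3 * C])      # output gate
    g = np.tanh(gates[..., 3 * C:4 * C])      # candidate cell update
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

An encoder rolls this update over the observation history; decoder heads then map the final hidden state to occupancy logits and flow vectors for each future step.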
3. Supervisory Signals and Losses
Supervision in occupancy flow prediction is multifaceted, with losses tailored to the nature of the prediction targets:
- Occupancy Losses: Binary or multi-class cross-entropy on voxel/grid occupancy logits, optionally augmented by focal loss (to address imbalance) or Lovász-Softmax loss for better calibration to IoU (Chen et al., 2024, Hu et al., 2022).
- Flow Losses: Grid/voxel-wise regression ($\ell_1$ or $\ell_2$) on 2D or 3D flow vectors, typically masked so that only occupied cells are penalized (Lengyel, 6 Jun 2025, Chen et al., 2024, Wang et al., 31 Mar 2025).
- Warp Consistency (Trace) Losses: Differentiable warping (bilinear or trilinear) of prior occupancy using predicted flow, enforcing that warped occupancy matches the predicted or ground-truth future occupancy (i.e., "flow-grounded" consistency loss) (Hu et al., 2022, Murhij et al., 2024, Lengyel, 6 Jun 2025).
- 2D/3D Consistency and Prototypical Constraints: Projecting 3D predictions into 2D for additional pixel-space supervision (e.g., VoxelSplat (Zhu et al., 5 Jun 2025)), prototype-based semantic consistency across 2D/3D (Chen et al., 2024).
- Self-supervised and Differentiable Rendering Losses: Utilizing projected depths, segmentation, or photometric consistency as weak supervision (Let Occ Flow (Liu et al., 2024), OccFlowNet (Boeder et al., 2024), SelfOccFlow (Timoneda et al., 27 Feb 2026)).
- Latent Variable and Regularization Terms: KL divergence for uncertainty modeling in multi-future prediction (Asghar et al., 8 Feb 2026, Hu et al., 2022), eikonal and Hessian regularizers for SDF smoothness in self-supervised 3D (Timoneda et al., 27 Feb 2026).
Multi-head training objectives balance these terms, often with tunable weights to control their relative influence (Lengyel, 6 Jun 2025, Chen et al., 2024, Liao et al., 28 Apr 2025).
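A minimal sketch of such a weighted multi-head objective, combining binary cross-entropy on occupancy with occupancy-masked $\ell_2$ flow regression (the weights and function names are illustrative; real pipelines add the warp-consistency and regularization terms listed above):

```python
import numpy as np

def occ_flow_loss(occ_logit, occ_gt, flow_pred, flow_gt,
                  w_occ=1.0, w_flow=0.5):
    """Schematic two-term loss for joint occupancy-flow training.

    occ_logit : (H, W) raw occupancy logits; occ_gt : (H, W) in {0, 1}.
    flow_pred, flow_gt : (H, W, 2) flow fields.
    """
    p = 1.0 / (1.0 + np.exp(-occ_logit))
    eps = 1e-7
    # Binary cross-entropy on occupancy probabilities.
    bce = -np.mean(occ_gt * np.log(p + eps)
                   + (1 - occ_gt) * np.log(1 - p + eps))
    # l2 flow regression, masked to ground-truth-occupied cells only.
    mask = occ_gt[..., None]
    n = max(mask.sum(), 1.0)
    flow_l2 = np.sum(mask * (flow_pred - flow_gt) ** 2) / n
    return w_occ * bce + w_flow * flow_l2
```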
4. Empirical Advances and Benchmarks
Large-scale benchmarks (e.g., Waymo Open Dataset, nuScenes, Occ3D/OpenOcc, UniOcc) provide rigorous multi-task occupancy-flow prediction tasks, supporting fine-grained evaluation via metrics such as:
- RayIoU and mIoU: Standard volumetric or BEV-based IoU over predicted vs. reference occupancy, with RayIoU focusing on depth ordering along rays (Wang et al., 31 Mar 2025, Zhu et al., 5 Jun 2025).
- End-Point Error (EPE) and mAVE: Per-grid or per-voxel error between predicted and true flow displacements, mean Absolute Velocity Error for 3D scenes (Lengyel, 6 Jun 2025, Wang et al., 31 Mar 2025).
- Flow-Grounded Metrics: Occupancy IoU or AUC after applying predicted flow to prior step occupancy, reflecting joint correctness of motion and occupancy estimation (Hu et al., 2022, Murhij et al., 2024, Huang et al., 2022).
- Scene/Agent-level Recall and mAP: For dynamic agents or specific semantic categories, and soft/warped recall for occluded or unlabeled entities (Asghar et al., 8 Feb 2026).
- GT-free Metrics: Plausibility of predicted object shapes, temporal consistency, and calibration, particularly where ground-truth is weak or synthesized (Wang et al., 31 Mar 2025).
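Two of the simpler metrics above can be computed directly; the following NumPy sketch (names illustrative, not tied to any one benchmark's reference implementation) shows binary occupancy IoU and per-cell end-point error restricted to occupied cells:

```python
import numpy as np

def occupancy_iou(pred_prob, gt, thresh=0.5):
    """Binary IoU between thresholded predictions and ground truth."""
    p = pred_prob >= thresh
    g = gt >= 0.5
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0  # both empty: perfect agreement by convention
    return np.logical_and(p, g).sum() / union

def end_point_error(flow_pred, flow_gt, occ_gt):
    """Mean end-point error of flow vectors over occupied cells only."""
    mask = occ_gt >= 0.5
    if not mask.any():
        return 0.0
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    return float(err[mask].mean())
```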
Ablation studies consistently show that explicit flow prediction improves all standard metrics—from geometric IoU to temporal consistency—across both real and simulated datasets (Wang et al., 31 Mar 2025). Top methods (e.g., CCLSTM (Lengyel, 6 Jun 2025), OFMPNet (Murhij et al., 2024), HOPE (Hu et al., 2022), ALOcc (Chen et al., 2024), STCOcc (Liao et al., 28 Apr 2025)) set new state-of-the-art results, and self-supervised methods now approach or even match the performance of fully supervised pipelines (Liu et al., 2024, Timoneda et al., 27 Feb 2026, Boeder et al., 2024).
5. Model Design Innovations
Recent research has introduced a suite of architectural and algorithmic strategies uniquely adapted to the occupancy-flow domain:
- Cascade and Multi-scale Decoders: Hierarchical refinement modules for coarse-to-fine occupancy and flow prediction in space and time (Hu et al., 2022, Liao et al., 28 Apr 2025).
- Attention- and Prototype-Aided Lifting: Occlusion-aware adaptive lifting, soft trilinear filling, prototype-based semantic constraints, and cost-volume-based motion heads for improved feature robustness and efficient joint semantic-flow prediction (Chen et al., 2024, Liao et al., 28 Apr 2025).
- Temporal Aggregation and Fusion: Temporal attention modules—such as BEV backward/forward attention (Liu et al., 2024), ConvLSTM/GRU cells (Lengyel, 6 Jun 2025, Hu et al., 2022, Asghar et al., 8 Feb 2026), and memory-based autoregressive decoders (Chen et al., 2024)—to better capture long-range and multi-modal dynamics.
- Deformable and Global Attention: Cross-view deformable attention for high-resolution, multi-view 3D encoding (Let Occ Flow (Liu et al., 2024), STCOcc (Liao et al., 28 Apr 2025)), deformable/global implicit attention in continuous occupancy flow fields (Agro et al., 2023).
- Unified and Efficient Inference: Sparse and instance-conditional models (e.g., query-based implicit field predictors (Agro et al., 2023), explicit sparsification in STCOcc (Liao et al., 28 Apr 2025)) support real-time and planner-integrated inference.
In practical terms, these innovations enable models that both match the dynamic complexity of the real world and operate with efficiency suitable for deployment in embedded perception and decision-making loops.
6. Applications and Extensions
Occupancy flow prediction has directly impacted perception, motion forecasting, planning, and scene reconstruction in several domains:
- Autonomous Driving: Provides a unified scene representation for both static/dynamic geometry and agent motion, supporting robust tracking, intent inference, and interaction modeling (Wang et al., 31 Mar 2025, Liu et al., 2022).
- Robotics and Smart Infrastructure: Used in indoor and campus-scale multi-agent mobility modeling, including origin-destination forecasting via graph-based models (Badu-Marfo et al., 7 Jul 2025).
- Self-supervised Perception and Label Efficiency: Approaches such as OccFlowNet (Boeder et al., 2024), Let Occ Flow (Liu et al., 2024), and SelfOccFlow (Timoneda et al., 27 Feb 2026) provide alternatives to labor-intensive 3D annotation, enabling training on large-scale video/LiDAR/image data using only 2D or self-generated targets.
Extensions include continuous-time occupancy flow forecasting using neural ODEs (StreamingFlow (Shi et al., 2023)), multi-future modeling with variational decoders (Asghar et al., 8 Feb 2026), and future work on instance-level dynamic consistency, improved flow/semantic disentanglement, and further unsupervised or GT-free approaches.
7. Future Directions and Open Challenges
Despite rapid progress, critical challenges remain:
- Occlusion Reasoning and Temporal Extent: Handling persistent and long-range occlusions, and distinguishing visible from speculative occupancy, over long time horizons or under sparse sensing.
- Uncertainty Quantification and Multimodality: Extending single-valued flow paradigms to support meaningful multi-modal future hypotheses, integrating scene semantics and intent.
- Joint Instance and Scene-level Consistency: Ensuring tracking/assignment across overlapping, deformable, or unrecognized entities.
- Scalability and Efficiency: Efficient scaling to large-scale, densely populated environments, with fully real-time, memory- and compute-optimal inference pipelines.
Recent benchmarks (UniOcc (Wang et al., 31 Mar 2025)) and GT-free evaluation paradigms catalyze progress, while self-supervised, query-efficient, and planner-coupled models define the trajectory of future work in occupancy flow prediction.