4D BEV Representation for Autonomous Driving
- 4D BEV representation is a spatio-temporal model that encodes sequential bird’s-eye-view features by integrating sensor data and ego-motion compensation.
- It employs methods such as temporal fusion, spatio-temporal transformers, and occupancy forecasting to achieve robust scene understanding.
- Applications include 3D object detection, semantic segmentation, and forecasting, yielding significant performance improvements over spatial-only methods.
A 4D Bird’s-Eye-View (BEV) representation encodes the spatial structure of the environment (typically as a top-down Cartesian grid over the ground-plane coordinates $x$ and $y$) jointly with temporal evolution (a sequence of grids over time), forming a spatio-temporal tensor or sequence of BEV features. In modern autonomous driving research, 4D BEV representations unify information from multi-modal sensors (cameras, LiDAR, radar), leverage temporal context for improved scene understanding, enable forecasting of dynamic states, and serve as a compact, geometry-grounded latent for perception and planning. This spatio-temporal paradigm extends classical BEV by explicitly incorporating time, typically either as stacked frames or by integrating sequence modeling within the BEV latent itself.
1. Mathematical Foundations and Core Representations
Formally, a 4D BEV representation is structured as $\mathbf{B} \in \mathbb{R}^{T \times C \times H \times W}$, where $T$ denotes the temporal dimension (number of frames in the sequence), $C$ the feature channels, and $H \times W$ the spatial discretization of the BEV plane. Each slice $\mathbf{B}_t \in \mathbb{R}^{C \times H \times W}$ encodes the environment at time $t$.
For occupancy-centric settings such as FSF-Net, the 4D BEV tensor extends to volumetric semantics: $\mathbf{O} \in \{0, 1, \dots, K\}^{T \times H \times W \times Z}$, where $Z$ indexes vertical voxels and each entry is one of $K$ semantic classes or empty (Guo et al., 24 Sep 2024). BEVWorld formulates a compressed 4D space as a sequence of low-dimensional BEV tokens $\{\mathbf{z}_t\}_{t=1}^{T}$, supporting high-fidelity conditional scene rollouts (Zhang et al., 8 Jul 2024).
Multi-frame BEV features, e.g., in BEVDet4D, are temporally aligned across ego-vehicle poses via rigid planar transforms $\mathcal{T}_{t-1 \to t} \in SE(2)$, ensuring that spatial correspondence holds before fusion (Huang et al., 2022). Fused 4D BEV features can be formulated as
$$\mathbf{B}^{\text{fused}}_t = \phi\big(\mathbf{B}_t,\; \mathcal{T}_{t-1 \to t}(\mathbf{B}_{t-1})\big),$$
where $\mathbf{B}_t$ is the current frame's BEV map, $\mathcal{T}_{t-1 \to t}(\mathbf{B}_{t-1})$ the ego-motion-aligned previous map, and $\phi$ a convolutional fusion operator.
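A minimal PyTorch sketch of this fusion pattern follows: the previous BEV map is warped into the current ego frame with an SE(2) transform expressed as an affine sampling grid, then concatenated with the current map and fused by a convolution. The helper names (`se2_to_affine_theta`, `TemporalBEVFusion`), the grid resolution, and the sign/normalization conventions of the transform are illustrative assumptions, not the exact BEVDet4D implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def se2_to_affine_theta(yaw: torch.Tensor, t_xy: torch.Tensor, bev_range_m: float) -> torch.Tensor:
    """Build a (N, 2, 3) affine matrix for F.affine_grid from an SE(2) ego transform.

    yaw: (N,) rotation in radians; t_xy: (N, 2) translation in metres.
    Translation is normalised by half the BEV extent because grid_sample works
    in [-1, 1] coordinates. The (current -> previous) direction convention is an
    assumption of this sketch; a real system must match its own pose convention.
    """
    cos, sin = torch.cos(yaw), torch.sin(yaw)
    theta = torch.zeros(yaw.shape[0], 2, 3)
    theta[:, 0, 0], theta[:, 0, 1] = cos, -sin
    theta[:, 1, 0], theta[:, 1, 1] = sin, cos
    theta[:, :, 2] = t_xy / (bev_range_m / 2.0)
    return theta

class TemporalBEVFusion(nn.Module):
    """Align the previous BEV map to the current ego frame and fuse by convolution."""
    def __init__(self, channels: int = 80):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, bev_t, bev_prev, yaw, t_xy, bev_range_m=102.4):
        theta = se2_to_affine_theta(yaw, t_xy, bev_range_m).to(bev_prev)
        grid = F.affine_grid(theta, bev_prev.shape, align_corners=False)
        bev_prev_aligned = F.grid_sample(bev_prev, grid, align_corners=False)
        return self.fuse(torch.cat([bev_t, bev_prev_aligned], dim=1))

# toy usage: two 80-channel 128x128 BEV maps, small yaw change and forward motion
fusion = TemporalBEVFusion(80)
bev_t, bev_prev = torch.randn(1, 80, 128, 128), torch.randn(1, 80, 128, 128)
fused = fusion(bev_t, bev_prev, yaw=torch.tensor([0.02]), t_xy=torch.tensor([[0.5, 0.0]]))
print(fused.shape)  # torch.Size([1, 80, 128, 128])
```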
2. Methods for Constructing 4D BEV Representations
Contemporary methods for constructing 4D BEV representations fall into several major categories:
- Temporal Fusion in BEV Latent Space: Approaches such as BEVDet4D (Huang et al., 2022) align and concatenate per-frame BEV features, fusing with lightweight convolutions. Ego-motion compensation is critical to prevent misalignment due to vehicle movement.
- Joint Spatio-Temporal Transformers: BEVWorld (Zhang et al., 8 Jul 2024) employs a spatial-temporal transformer with alternating spatial and temporal self-attention, modeling the BEV latent directly as a sequence (a minimal alternating-attention sketch follows this list). Temporal consistency is enforced in both prediction and back-projection (via rendering).
- Dual-Latent Aggregation (Image/BEV): TempBEV (Monninger et al., 17 Apr 2024) integrates image-space temporal fusion (via optical flow or "temporal stereo" CNNs) and BEV-space self-attention. Fused representations are lifted into BEV via cross-attention mechanisms, combining short-term pixel-level motion cues with longer-term spatial context.
- Occupancy Flow/Forecasting: FSF-Net (Guo et al., 24 Sep 2024) constructs a 4D semantic occupancy grid, forecasting future states by warping current occupancy along BEV scene flow fields and predicting future latent codes via a vector-quantized Mamba backbone. Fused outputs are refined by a 3D U-Net.
- Multi-Modal Sensor Fusion: SFGFusion (Li et al., 22 Oct 2025) combines camera, 4D radar, and pseudo-point clouds in BEV using surface fitting, pillarization, and learned depth distributions. Each modal stream is mapped to separate BEV features and fused convolutionally or via concatenation.
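As a concrete illustration of the transformer-based category above, the sketch below shows one alternating spatial/temporal self-attention block over flattened BEV tokens. The class name `SpatioTemporalBlock`, the token layout `(B, T, N, C)`, and the omission of positional encodings and feed-forward layers are simplifying assumptions, not BEVWorld's actual architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One alternating spatial/temporal self-attention block over BEV tokens.

    Tokens have shape (B, T, N, C) with N = H * W flattened BEV cells.
    Spatial attention mixes cells within each frame; temporal attention mixes
    frames at each cell. Layer counts and missing positional encodings are
    simplifications of this sketch.
    """
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, C = x.shape
        # spatial self-attention: fold time into the batch dimension
        xs = x.reshape(B * T, N, C)
        xs = xs + self.spatial_attn(self.norm1(xs), self.norm1(xs), self.norm1(xs))[0]
        x = xs.reshape(B, T, N, C)
        # temporal self-attention: fold space into the batch dimension
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        xt = xt + self.temporal_attn(self.norm2(xt), self.norm2(xt), self.norm2(xt))[0]
        return xt.reshape(B, N, T, C).permute(0, 2, 1, 3)

# toy usage: 4 frames of a 16x16 BEV grid with 64-dim tokens
block = SpatioTemporalBlock(dim=64, heads=4)
tokens = torch.randn(2, 4, 16 * 16, 64)
print(block(tokens).shape)  # torch.Size([2, 4, 256, 64])
```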
3. Multi-Modal and Multi-View Integration
Modern 4D BEV approaches emphasize holistic multi-sensor fusion:
- Camera-Centric: BEVDet4D and TempBEV primarily operate on multi-view camera inputs. Each camera’s features are “lifted and splatted” into the BEV, with cross-view warping ensuring spatial unification (Huang et al., 2022, Monninger et al., 17 Apr 2024).
- LiDAR and Radar Fusion: BEVWorld pillarizes LiDAR returns, processes them via transformer backbones, and fuses with image features through deformable attention (Zhang et al., 8 Jul 2024); a minimal pillarization sketch follows this list. SFGFusion addresses severe 4D radar sparsity by generating dense pseudo-point clouds from surface-fitted image regions, complementing radar-only pillars (Li et al., 22 Oct 2025).
- Tokenization/Fusion Mechanisms: Deformable attention modules, concatenation-convolution operations, and learned gating (as in FSF-Net's U-Net quality fusion) are commonly deployed for spatial and modality fusion (Guo et al., 24 Sep 2024, Li et al., 22 Oct 2025, Zhang et al., 8 Jul 2024).
- Ego-Motion Compensation: Accurate temporal aggregation depends on aligning features across frames given the vehicle’s motion. SE(2) transforms (yaw, translation) are standard for 2D BEV fusion (Huang et al., 2022).
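As referenced above, a common building block for LiDAR and radar streams is pillarization: scattering per-point features into a flat BEV grid. The sketch below is a minimal version using `torch.Tensor.scatter_reduce_` (available in recent PyTorch releases) with per-cell max pooling; the grid size, cell resolution, and reduction choice are illustrative assumptions rather than any specific paper's configuration.

```python
import torch

def pillarize(points: torch.Tensor, feats: torch.Tensor,
              grid: int = 128, cell_m: float = 0.8) -> torch.Tensor:
    """Scatter per-point features into a BEV pillar grid by max pooling.

    points: (N, 2) x/y coordinates in metres around the ego vehicle.
    feats:  (N, C) per-point features (e.g. height, intensity, learned embeddings).
    Returns a (C, grid, grid) BEV map. Grid size, cell resolution and the
    max-pooling reduction are illustrative choices.
    """
    half = grid * cell_m / 2.0
    ix = ((points[:, 0] + half) / cell_m).long().clamp(0, grid - 1)
    iy = ((points[:, 1] + half) / cell_m).long().clamp(0, grid - 1)
    cell = iy * grid + ix                                   # flat cell index per point
    bev = torch.full((grid * grid, feats.shape[1]), -1e9)   # max-reduce buffer
    bev.scatter_reduce_(0, cell.unsqueeze(1).expand(-1, feats.shape[1]),
                        feats, reduce="amax", include_self=True)
    bev[bev == -1e9] = 0.0                                  # empty pillars -> 0
    return bev.view(grid, grid, -1).permute(2, 0, 1)

# toy usage: 10k random LiDAR-like points with 4 features each
pts, ft = torch.rand(10_000, 2) * 100 - 50, torch.randn(10_000, 4)
print(pillarize(pts, ft).shape)  # torch.Size([4, 128, 128])
```

In practice, a learned per-point encoder (PointPillars-style) typically produces the features before scattering, and the resulting BEV map is fed to a 2D backbone.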
4. Temporal Modeling, Forecasting, and Latent Dynamics
Temporal modeling in 4D BEV representations addresses both near-term fusion and long-term forecasting:
- Explicit Stacking and Fusion: Concatenation of temporally-aligned BEV features enables direct exploitation of short-range temporal context (as in BEVDet4D and TempBEV).
- Latent Sequence Models: BEVWorld employs latent diffusion models for multi-step scene prediction, conditioning on historical BEV tokens and action parameters (speed/steer). The reverse (denoising) process is parameterized by a transformer-based generative model (Zhang et al., 8 Jul 2024).
- Occupancy Warping and Flow-Based Prediction: FSF-Net extracts and applies coarse BEV scene flow, computed from the alignment of max-height maps across frames, to predict future 3D occupancies (Guo et al., 24 Sep 2024); a minimal flow-warping sketch follows this list. This approach leverages the observation that vertical dynamics are minimal in driving scenarios.
- Score-Matching, VQ-VAE, and UQF Fusion: Probabilistic and latent-variable-based approaches (e.g., vector quantization in FSF-Net, score-matching in BEVWorld) yield temporally consistent and semantically rich predictions.
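The flow-based forecasting idea can be sketched as a single backward-warping step: a semantic occupancy grid is resampled along a per-cell BEV displacement field, with vertical voxels folded into channels under the assumption of negligible vertical motion. The function name `warp_occupancy_by_bev_flow` and its metric-to-normalized coordinate conversion are illustrative, not FSF-Net's exact operator.

```python
import torch
import torch.nn.functional as F

def warp_occupancy_by_bev_flow(occ: torch.Tensor, flow: torch.Tensor,
                               cell_m: float = 0.4) -> torch.Tensor:
    """Forecast a future semantic occupancy grid by warping along a BEV flow field.

    occ:  (B, Z, H, W) one-hot/soft semantic occupancy, with Z vertical voxels
          folded into channels (vertical motion assumed negligible).
    flow: (B, 2, H, W) per-cell displacement in metres on the BEV plane.
    Backward warping via grid_sample is used for simplicity; this is a sketch
    of flow-based occupancy forecasting, not a specific paper's operator.
    """
    B, _, H, W = occ.shape
    # base sampling grid in normalised [-1, 1] coordinates
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
    # convert metric flow to normalised offsets (x scaled by W, y by H)
    off = torch.stack([flow[:, 0] / (cell_m * W / 2), flow[:, 1] / (cell_m * H / 2)], dim=-1)
    return F.grid_sample(occ, base - off, mode="nearest", align_corners=True)

# toy usage: shift a single occupied cell 2 m along x on a 100x100 grid
occ = torch.zeros(1, 8, 100, 100); occ[:, :, 50, 50] = 1.0
flow = torch.zeros(1, 2, 100, 100); flow[:, 0] = 2.0
print(warp_occupancy_by_bev_flow(occ, flow).nonzero().shape)
```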
5. Applications and Benchmark Results
4D BEV representations are central to a diverse set of autonomous driving tasks:
- 3D Object Detection: BEVDet4D achieves significant NDS improvements over spatial-only baselines (+7.3% NDS; 54.5% on nuScenes) and reduces mean average velocity error by up to 63.4%, approaching the performance of LiDAR/radar-centric architectures (Huang et al., 2022).
- Semantic Segmentation/Occupancy Forecasting: FSF-Net’s 4D BEV grid achieves a +10.87% mIoU and +9.56% IoU boost on Occ3D over OccWorld, with substantial gains in BEV-plane and volumetric segmentation (Guo et al., 24 Sep 2024). UQF fusion and VQ-Mamba latent predictors are essential for these improvements.
- Scene Simulation and World Modeling: BEVWorld generates realistic future sequences with temporally consistent static and dynamic elements, supporting video/LiDAR synthesis (FID 19.0, FVD 154.0), improved 3D detection (+8.4% NDS), and enhanced planning (L2 error 1.030m→0.977m) (Zhang et al., 8 Jul 2024).
- Multi-Modal Detection: SFGFusion demonstrates that explicit surface modeling enables reliable cross-modal BEV fusion, outperforming prior radar-camera fusion on TJ4DRadSet and VoD (Li et al., 22 Oct 2025).
- Ablation Synergy: TempBEV demonstrates additive and synergistic benefits from joint temporal aggregation in image and BEV spaces, with a combined NDS improvement (+1.06 pts) exceeding the sum of the improvements from either space in isolation, confirming the complementary nature of multilevel temporal cues (Monninger et al., 17 Apr 2024).
| Method | 4D BEV Construction | Modalities | Main Benchmarks/Results |
|---|---|---|---|
| BEVDet4D | Concatenated/Aligned BEV | Cameras | NDS 54.5% nuScenes, −63.4% mAVE |
| BEVWorld | Latent Diffusion Seq. | Cameras/LiDAR | +8.4% NDS, FID 19.0 (video synth.) |
| FSF-Net | Flow+VQ Occupancy | LiDAR | +10.87% mIoU Occ3D |
| TempBEV | Dual Latent Fusion | Cameras | +1.06 NDS, +1.44 mAP nuScenes |
| SFGFusion | Surface Fit + Pillars | Radar/Cam | Leading 3D detection (VoD, TJ4DRadSet) |
6. Technical Challenges and Model-Specific Insights
- Temporal Misalignment: Inaccurate ego-motion compensation leads to spatial-temporal drift, degrading detection and forecasting quality. Rigid transform alignment in BEVDet4D directly addresses this (Huang et al., 2022).
- Multi-Modal Sparsity: 4D radar provides velocity but suffers from severe sparsity. SFGFusion counteracts this by surface pseudo-points derived from dense per-pixel depth (Li et al., 22 Oct 2025).
- Latent Space Compression: BEVWorld compresses fused features to as few as four channels per cell, retaining task-relevant semantics while enabling high-throughput temporal modeling (Zhang et al., 8 Jul 2024); a minimal bottleneck sketch follows this list.
- Long-Range Prediction: Diffusion-based forecasting (BEVWorld) and U-Net-fused warping (FSF-Net) enable temporally consistent, multi-frame anticipation without autoregressive error accumulation (Zhang et al., 8 Jul 2024, Guo et al., 24 Sep 2024).
- Occupancy Label Fusion: Gated, class-frequency-weighted fusions (FSF-Net’s UQF) refine coarse ensemble predictions into fine-grained, class-aware volumetric segmentations (Guo et al., 24 Sep 2024).
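As a rough illustration of the latent-compression point above, the sketch below squeezes a fused BEV feature map through a 4-channel-per-cell bottleneck and reconstructs it. The class name `BEVBottleneck`, the layer widths, and the plain convolutional codec are assumptions for illustration, not BEVWorld's tokenizer.

```python
import torch
import torch.nn as nn

class BEVBottleneck(nn.Module):
    """Compress a fused BEV feature map to a few channels per cell and back.

    The 4-channel bottleneck mirrors the idea of a heavily compressed BEV
    latent for temporal modelling; widths and layers are illustrative.
    """
    def __init__(self, in_ch: int = 256, latent_ch: int = 4):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_ch, 1),            # 4 channels per BEV cell
        )
        self.decode = nn.Sequential(
            nn.Conv2d(latent_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, in_ch, 1),
        )

    def forward(self, bev: torch.Tensor):
        z = self.encode(bev)          # compact latent fed to the sequence model
        return z, self.decode(z)      # reconstruction for training the codec

# toy usage: 256-channel fused BEV map compressed to 4 channels per cell
z, recon = BEVBottleneck()(torch.randn(1, 256, 128, 128))
print(z.shape, recon.shape)  # torch.Size([1, 4, 128, 128]) torch.Size([1, 256, 128, 128])
```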
7. Significance and Future Directions
4D BEV representations form the backbone of spatio-temporal perception and prediction in autonomous driving. They consolidate multi-temporal, multi-sensor information into a geometry-consistent latent that supports a wide range of downstream tasks: detection, occupancy segmentation, forecasting, data generation, and planning. Empirical evidence across benchmark suites demonstrates substantial gains over 3D/spatial-only BEV, notably for dynamic object tracking, velocity regression, and future occupancy prediction.
Further exploration will likely focus on deeper compression for latency-critical applications, robust large-scale multi-agent modeling, and tighter integration with planning/control stacks. The synergistic aggregation of temporal cues across multiple sensor spaces—as evidenced by TempBEV—is a persistent theme. The challenges of spatial-temporal misalignment, dynamic object complexity, and sparse/uncertain measurements will continue to drive research into more expressive latent models and novel fusion frameworks (Huang et al., 2022, Zhang et al., 8 Jul 2024, Guo et al., 24 Sep 2024, Monninger et al., 17 Apr 2024, Li et al., 22 Oct 2025).