Multi-Frame Radar/Lidar Fusion

Updated 16 April 2026

Multi-frame radar/lidar fusion is a methodology that integrates temporally stacked sensor data to exploit radar’s velocity cues and lidar’s spatial accuracy.
It leverages multi-temporal radar returns to mitigate noise, compensate for sensor misalignment, and enhance object detection and trajectory forecasting.
Recent advances utilize probabilistic filtering, deep learning architectures, and dynamic-aware attention to address temporal asynchrony and improve real-time performance.

Multi-frame radar/lidar fusion is a family of methodologies for integrating temporally stacked radar and lidar data streams to exploit their complementary physical and statistical characteristics in automotive perception, particularly for object detection, tracking, trajectory forecasting, and scene flow estimation. Radar provides robust, long-range, velocity-resolved, but spatially coarse observations, while lidar delivers high-resolution geometric information often with poor instantaneous velocity estimation. Multi-frame fusion leverages multi-temporal radar returns to recover dense velocity cues and reduce noise via temporal redundancy, while aligning and integrating these with (possibly higher-frequency) lidar sweeps. Recent research advances span probabilistic filtering, deep learning–based early and late fusion architectures, dynamic-aware attention mechanisms, temporal misalignment compensation, and cross-modal calibration and synchronization techniques.

1. Fundamental Principles of Multi-Frame Radar/Lidar Fusion

Fusion of radar and lidar data exploits the respective strengths: lidar’s accurate spatial localization and radar’s direct velocity measurement (Hajri et al., 2018). Both modalities can be temporally buffered (multi-frame input) to mitigate sparsity and exploit temporal coherence. Radar’s temporal aggregation must address inter-frame misalignment caused by independently moving objects and sensor ego-motion (Peng et al., 14 May 2025). Fusion strategies must achieve precise spatial and temporal registration, enable robust object-level association, and be computationally efficient for real-time applications.

2. Probabilistic and Filter-Based Fusion Approaches

High-level object fusion is formalized via state-space models in tracking contexts. Each obstacle is parameterized by the state vector $x_k = [x, y, v_x, v_y]^T$ in the ego-vehicle frame (Hajri et al., 2018). The filter employs a constant-velocity prediction model with explicit process and measurement noise covariances, and assimilates both radar and lidar detections:

Measurement models for both sensors are linear and parameterized by empirically estimated covariance matrices from RTK ground-truth.
Data association between predicted tracks and asynchronous multi-sensor measurements is formulated as a global nearest neighbor (GNN) assignment, optimized with the Hungarian algorithm, using innovation-Mahalanobis distances as the association cost metric.
Fusion modules compensate for ego-motion between frames, aligning all spatial measurements and state estimates prior to association.
Validation gating (based on the $\chi^2$ quantile of Mahalanobis distances) is applied to reject improbable associations.
Real-time implementations demonstrate that fusion yields smoother obstacle trajectories than either sensor alone. For example, the reported velocity error is $MSE_{v_x}=0.21\ \mathrm{(m/s)^2}$ for fused outputs versus $0.18$ (radar) and $0.33$ (lidar), so on this metric the fused estimate clearly improves on lidar while staying close to radar's direct velocity measurement (Hajri et al., 2018).
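The predict, gate, associate, and update cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Hajri et al.: the time step `DT`, the covariances `Q` and `R`, and the position-only measurement matrix `H` are placeholder values, and the globally optimal assignment is found by exhaustive search over permutations as a small-scale stand-in for the Hungarian algorithm.

```python
import itertools
import numpy as np

DT = 0.1      # assumed update interval in seconds
GATE = 9.21   # chi-square 99% quantile for 2 degrees of freedom

# Constant-velocity transition for the state [x, y, vx, vy].
F = np.array([[1, 0, DT, 0],
              [0, 1, 0, DT],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # position-only measurement (assumption)
Q = 0.05 * np.eye(4)  # placeholder process noise covariance
R = 0.25 * np.eye(2)  # placeholder measurement noise covariance

def predict(x, P):
    """Propagate state and covariance one step with the constant-velocity model."""
    return F @ x, F @ P @ F.T + Q

def mahalanobis_sq(x, P, z):
    """Squared Mahalanobis distance of the innovation z - Hx."""
    S = H @ P @ H.T + R
    nu = z - H @ x
    return float(nu @ np.linalg.solve(S, nu))

def update(x, P, z):
    """Standard linear Kalman update with measurement z."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ (z - H @ x), (np.eye(4) - K @ H) @ P

def associate(tracks, measurements):
    """Gated global nearest neighbor by exhaustive search (small problems only)."""
    n, m = len(tracks), len(measurements)
    size = max(n, m)
    cost = np.zeros((size, size))  # zero-cost padding rows/columns act as dummies
    for i, (x, P) in enumerate(tracks):
        for j, z in enumerate(measurements):
            d2 = mahalanobis_sq(x, P, z)
            cost[i, j] = d2 if d2 <= GATE else 1e6  # gated-out pairs are penalized
    best = min(itertools.permutations(range(size)),
               key=lambda p: sum(cost[i, p[i]] for i in range(size)))
    return [(i, best[i]) for i in range(n)
            if best[i] < m and cost[i, best[i]] <= GATE]

# Two predicted tracks and three detections (the last one is clutter).
tracks = [predict(np.array([0.0, 0.0, 1.0, 0.0]), np.eye(4)),
          predict(np.array([10.0, 0.0, 0.0, 0.0]), np.eye(4))]
measurements = [np.array([10.1, 0.1]), np.array([0.2, -0.1]),
                np.array([50.0, 50.0])]
pairs = associate(tracks, measurements)
```

For realistic track counts the exhaustive search would be replaced by a polynomial-time assignment solver; the gating and cost definition stay the same.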

Observed limitations include immediate deletion of unassociated tracks, assumption of constant-velocity dynamics, and lack of explicit false alarm modeling. Proposed extensions include switchable motion models (IMM), track coasting, non-linear measurement integration (EKF/UKF), semantic enhancement, and multi-hypothesis association (Hajri et al., 2018).

3. Deep Learning Architectures for Spatio-Temporal Fusion

End-to-end fusion networks such as MoRAL (Peng et al., 14 May 2025), LiRaNet (Shah et al., 2020), and RaLiFlow (Fu et al., 11 Dec 2025) have advanced beyond probabilistic approaches, addressing the challenges of multi-frame radar/lidar fusion at the representation and feature fusion level.

MoRAL introduces a Motion-aware Radar Encoder (MRE) that processes stacked multi-frame radar point clouds, generates per-point motion masks, and compensates for inter-frame misalignment by shifting moving points according to their predicted velocities. This mitigates motion-induced artifacts in accumulated radar data. A Motion Attention Gated Fusion (MAGF) module injects radar-derived motion cues into LiDAR feature extraction, focusing attention on dynamic regions. The joint BEV feature is further processed for 3D object detection using the RLNet backbone, achieving $mAP=73.30\%$ overall and $AP=69.67\%$ for pedestrians (Peng et al., 14 May 2025).
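The idea of shifting moving radar points before accumulation can be illustrated with a short sketch. It assumes that all frames are already ego-motion compensated into a common frame and that each point carries a 2D velocity estimate; the fixed `speed_threshold` is a hypothetical stand-in for the learned motion mask, and none of the names below come from the MoRAL code.

```python
import numpy as np

def compensate_radar_frames(frames, t_ref, speed_threshold=0.5):
    """Accumulate radar frames at time t_ref.

    frames: list of (timestamp, xy, velocity) with xy and velocity of shape (N, 2).
    Points faster than speed_threshold (m/s) are treated as moving and shifted
    along their velocity by the elapsed time; static points are left in place.
    """
    accumulated = []
    for t, xy, vel in frames:
        moving = np.linalg.norm(vel, axis=1) > speed_threshold
        shifted = xy + moving[:, None] * vel * (t_ref - t)
        accumulated.append(shifted)
    return np.vstack(accumulated)

# One older frame with a moving and a nearly static point, plus the current frame.
old = (0.0, np.array([[0.0, 0.0], [5.0, 5.0]]), np.array([[10.0, 0.0], [0.1, 0.0]]))
new = (0.2, np.array([[2.0, 0.0]]), np.array([[10.0, 0.0]]))
cloud = compensate_radar_frames([old, new], t_ref=0.2)
```

After compensation the moving object's older return lands on top of its current return instead of leaving a trail behind it.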

LiRaNet encodes multi-frame radar using parametric convolution over irregular point sets, temporally aggregates features, and fuses them with multi-sweep lidar and HD maps at the feature level via channel concatenation and 1×1 convolution (Shah et al., 2020). This supports end-to-end joint learning for detection and trajectory prediction. Synchronization is performed by aligning each lidar sweep with its nearest radar sweep in time; further interpolation is not performed (Shah et al., 2020).
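Nearest-in-time pairing without interpolation is simple to express. The sketch below assumes sorted radar timestamps in seconds; the function name is illustrative and not taken from LiRaNet.

```python
from bisect import bisect_left

def nearest_radar_index(radar_times, lidar_time):
    """Return the index of the radar sweep closest in time to a lidar sweep.

    radar_times must be sorted in ascending order. Ties go to the earlier sweep.
    """
    if not radar_times:
        raise ValueError("no radar sweeps available")
    pos = bisect_left(radar_times, lidar_time)
    if pos == 0:
        return 0
    if pos == len(radar_times):
        return len(radar_times) - 1
    before, after = radar_times[pos - 1], radar_times[pos]
    return pos if after - lidar_time < lidar_time - before else pos - 1

radar_times = [0.0, 0.1, 0.2]
matches = [nearest_radar_index(radar_times, t) for t in (0.04, 0.16, 0.5)]
```

Because no interpolation is applied, the residual time gap between the paired sweeps is left for the network to absorb.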

RaLiFlow addresses scene flow estimation by introducing the Dynamic-aware Bidirectional Cross-modal Fusion (DBCF) module. It employs bidirectional local cross-attention between radar and lidar BEV features, with Gaussian dynamic saliency maps derived from radar absolute radial velocity to bias attention in the vicinity of moving objects. The loss functions include pointwise errors, a radar confidence mask, and instance-level dynamic consistency (Fu et al., 11 Dec 2025). RaLiFlow demonstrates improvement over single-modality and prior fusion baselines with a 19.5% reduction in end-point error for LiDAR-side flow estimation (Fu et al., 11 Dec 2025).
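A Gaussian saliency map driven by radial velocity could look roughly like the sketch below. The grid size, cell size, `sigma`, and the `min_speed` cutoff are assumed values chosen for illustration, and taking the per-cell maximum over Gaussians is one possible aggregation rather than the rule used in RaLiFlow.

```python
import numpy as np

def dynamic_saliency(points_xy, radial_speed, grid_size=(64, 64),
                     cell=0.5, sigma=1.0, min_speed=0.3):
    """Build a BEV saliency map in [0, 1] that peaks at moving radar points.

    points_xy: (N, 2) positions in metres, with the grid origin at (0, 0).
    radial_speed: (N,) signed radial velocities in m/s; the absolute value is used.
    """
    h, w = grid_size
    ys, xs = np.mgrid[0:h, 0:w]
    cx = (xs + 0.5) * cell  # x coordinate of each cell centre
    cy = (ys + 0.5) * cell  # y coordinate of each cell centre
    saliency = np.zeros(grid_size)
    for (px, py), speed in zip(points_xy, np.abs(radial_speed)):
        if speed < min_speed:
            continue  # treat slow points as static and ignore them
        bump = np.exp(-((cx - px) ** 2 + (cy - py) ** 2) / (2 * sigma ** 2))
        saliency = np.maximum(saliency, bump)
    return saliency

points = np.array([[10.25, 10.25], [25.25, 25.25]])
speeds = np.array([-2.0, 0.1])  # the second point is effectively static
sal = dynamic_saliency(points, speeds)
```

Such a map can then be used as an additive or multiplicative bias on attention weights, so that cells near moving returns receive more weight than static background.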

4. Temporal Alignment and Asynchrony Handling

A core challenge in multi-frame fusion is temporal misalignment between sensor streams, due to differing frame rates (e.g., lidar at 20–25 Hz and radar at 4–15 Hz) and the inability to synchronize captures precisely (Xie et al., 2023). Timely fusion approaches, exemplified by adaptations of MVDNet, permit fusion at the higher frequency of the lidar:

The most recent radar frame is paired with each incoming lidar frame, yielding temporally unaligned pairs parameterized by a discrete offset $\Delta$.
Training explicitly samples across possible temporal offsets to build robustness to asynchrony in the learned model.
Inference with such augmentation recovers synchronized-frame accuracy across offsets $\Delta = 0, \dots, 5$ (e.g., $\mathrm{AP} \approx 0.876$ at the reported IoU threshold), matching the best possible per-offset models, while enabling object detection at up to 20 Hz (lidar rate) (Xie et al., 2023).
Additional strategies such as historical-frame skipping, multi-branch architectures for per-offset fusion, and data augmentation for temporal asynchrony further enhance performance and deployment efficiency (Xie et al., 2023).
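The pairing rule and the offset sampling used for training can be sketched as follows. The sketch assumes the offset $\Delta$ is counted in lidar frames and that the training set contains a radar frame for every lidar index; the function names and the `ratio` parameter are illustrative rather than taken from Xie et al.

```python
import random

def latest_radar_for(lidar_idx, ratio):
    """At inference, pair a lidar frame with the most recent radar frame.

    ratio: number of lidar frames per radar frame (assumed to be an integer).
    Returns (radar_idx, delta), where delta is the staleness in lidar frames.
    """
    return lidar_idx // ratio, lidar_idx % ratio

def sample_training_pair(num_frames, max_offset, rng):
    """Sample a (lidar_idx, radar_idx, delta) triple with a random offset.

    Sampling delta uniformly from 0..max_offset exposes the model to every
    degree of staleness it may encounter at inference time.
    """
    if num_frames <= max_offset:
        raise ValueError("need more frames than the maximum offset")
    delta = rng.randint(0, max_offset)
    lidar_idx = rng.randint(delta, num_frames - 1)
    return lidar_idx, lidar_idx - delta, delta

rng = random.Random(0)
samples = [sample_training_pair(100, 5, rng) for _ in range(1000)]
```

With enough samples every offset from 0 to `max_offset` appears in training, which is what lets a single model cover all offsets instead of one model per offset.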

A plausible implication is that explicit temporal augmentation during training is more critical than structural network changes for handling cross-modal asynchrony in modern deep fusion pipelines.

5. Motion-Aware and Dynamic-Focused Fusion Techniques

Recent advances emphasize the necessity of exploiting velocity and dynamic cues for robust fusion:

MoRAL’s MRE predicts a binary motion mask for every radar point and applies per-point motion compensation, with static and moving points handled differently when accumulated across multiple frames (Peng et al., 14 May 2025).
In RaLiFlow, a dynamic saliency map produced by aggregating radar points according to their absolute radial velocity steers the local attention window, prioritizing information from dynamic regions and mitigating radar noise in static areas (Fu et al., 11 Dec 2025).
Ablations consistently demonstrate that fusing pointwise radar-derived motion into the lidar branch or attention window yields improved detection and flow estimation, particularly on dynamic objects (e.g., AP improvement for pedestrians and cyclists, or a reduction in radar-stream EPE) (Peng et al., 14 May 2025, Fu et al., 11 Dec 2025).
The design of dynamic consistency losses and confidence-masked flow losses is crucial to ensuring instance-level temporal coherence in both modalities.

6. Experimental Protocols and Quantitative Benchmarks

Datasets such as View-of-Delft (VoD) support frame-synchronized 4D radar and lidar, ego-vehicle odometry, and dynamic annotation, facilitating rigorous protocol definition (Peng et al., 14 May 2025, Fu et al., 11 Dec 2025). Typical experimental protocols include:

Multi-frame stacking (e.g., 5 radar sweeps at 0.1 s intervals, matched to lidar sweep history).
Ground removal, coordinate transformation, and ego-motion compensation.
Evaluation with mean average precision (mAP, AP per class, various IoU thresholds) for detection tasks, or end-point error (EPE, 3-way and 3D) for scene flow.
Comparisons include radar-only, lidar-only, and fusion system variants, as well as ablation for architectural and loss-function components.
Acceleration, latency, and frequency tradeoffs are reported; for example, MoRAL is reported to run at 15.2 FPS on an NVIDIA RTX 4070, with real-time fusion in embedded scenarios (Peng et al., 14 May 2025); MVDNet-based timely fusion supports up to 20 Hz inference with a 1–2 AP drop compared to slower radar-bound operation (Xie et al., 2023).
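For the scene flow metric, the mean end-point error and a relative reduction between two systems can be computed as in the sketch below. The flow vectors and the EPE values passed to `relative_reduction` are made-up numbers used only to exercise the functions.

```python
import math

def end_point_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flows must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred_flow, gt_flow)) / len(pred_flow)

def relative_reduction(baseline_error, fused_error):
    """Fractional error reduction of the fused system relative to a baseline."""
    return (baseline_error - fused_error) / baseline_error

pred = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (0.0, 3.0, 4.0)]
epe = end_point_error(pred, gt)        # (1 + 5) / 2
gain = relative_reduction(1.0, 0.8)    # hypothetical EPE values
```

Reporting the relative reduction alongside the absolute EPE makes results comparable across datasets with different motion magnitudes.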

Quantitative results demonstrate consistent gains for fusion over single-modal baselines on dynamic and long-range objects, with the size of the relative error reduction depending on the regime and metric (Shah et al., 2020, Fu et al., 11 Dec 2025).

7. Current Limitations and Prospective Advancements

Present multi-frame radar/lidar fusion approaches are constrained by:

Non-robustness to severe occlusion or extended periods of missing detections (track deletion on miss, no explicit occlusion logic) (Hajri et al., 2018).
Performance degradation of simple motion models in highly dynamic scenes; lack of maneuver-adaptivity.
Sensitivity to radar noise, false alarms, and the sparsity of radar detections—addressed partially via learned attention and denoising but not fully solved (Fu et al., 11 Dec 2025).
Absence of explicit multi-hypothesis or probabilistic association tracking in deep fusion settings.
Computational cost and synchronization overhead as the number of stacked frames and historical context increases.

Potential research avenues include integration of advanced motion models (IMM), probabilistic multi-hypothesis data association, co-training with semantic scene understanding (e.g., camera features/CNN), and robust fusion under adverse sensing conditions. The incorporation of radar-derived motion cues in cross-modal attention and adaptive gating—especially to enhance detection and flow estimation in challenging, dynamic, or low-visibility scenarios—remains a primary area of activity (Peng et al., 14 May 2025, Fu et al., 11 Dec 2025).