RadarMOSEVE: Radar-Only MOS and EVE
- RadarMOSEVE is a radar-only method that employs spatial-temporal transformers to perform moving object segmentation and ego-velocity estimation using 4D radar point clouds and Doppler velocity.
- It integrates novel self- and cross-attention mechanisms to leverage both spatial and temporal cues, enhancing detection and speed regression accuracy under diverse conditions.
- Empirical results indicate superior performance with 70.2% MOS mIoU and 0.182 m/s EVE MAE, outperforming LiDAR-based and traditional radar processing methods.
RadarMOSEVE is a transformer-based method developed for radar-only moving object segmentation (MOS) and ego-velocity estimation (EVE) in autonomous mobile systems. Addressing the high cost and adverse-weather sensitivity of LiDAR-based approaches, RadarMOSEVE processes millimeter-wave radar (MWR) point clouds and leverages both spatial and temporal cues, including direct use of Doppler velocity. It is the first published radar-only method to achieve state-of-the-art performance simultaneously on both MOS and EVE across diverse real-world datasets (Pang et al., 2024).
1. Problem Formulation and Input Representation
RadarMOSEVE operates on 4D radar point clouds, where the input at time $t$ is represented as
$$\mathcal{P}_t = \{\, p_i = (x_i, y_i, z_i, v_i) \,\}_{i=1}^{N_t},$$
with $(x_i, y_i, z_i)$ encoding the 3D radar detection position and $v_i$ the measured radial (Doppler) velocity.
The framework tackles two tightly coupled tasks:
- Moving Object Segmentation (MOS): Classify each radar point as static or moving.
- Ego-Velocity Estimation (EVE): Regress the sensor platform's forward speed $v_e$.
For truly static points, the measured $v_i$ adheres to
$$v_i = -\, v_e \cos\theta_i,$$
where $\theta_i$ is the angle between the sensor's forward axis and the line of sight to $p_i$, and the negative sign reflects the Doppler convention for points in front of a forward-moving sensor.
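To make the Doppler geometry concrete, the following NumPy sketch (not from the paper; the forward axis is assumed to be $+x$ and the function names are hypothetical) evaluates the static-point model above and recovers a least-squares ego-speed from a purely static, noisy scene.

```python
import numpy as np

def expected_static_doppler(points_xyz, v_ego):
    """Doppler a static point returns for a sensor moving along +x at v_ego:
    v_i = -v_ego * cos(theta_i), with cos(theta_i) = x_i / ||p_i||."""
    ranges = np.linalg.norm(points_xyz, axis=1)
    cos_theta = points_xyz[:, 0] / np.maximum(ranges, 1e-6)
    return -v_ego * cos_theta

def least_squares_ego_speed(points_xyz, doppler):
    """Closed-form ego-speed minimising sum_i (v_i + v_ego * cos(theta_i))^2,
    i.e. assuming every observed point is static."""
    ranges = np.linalg.norm(points_xyz, axis=1)
    cos_theta = points_xyz[:, 0] / np.maximum(ranges, 1e-6)
    return -np.dot(doppler, cos_theta) / np.dot(cos_theta, cos_theta)

# Toy example: a static scene observed while driving forward at 5 m/s.
rng = np.random.default_rng(0)
pts = rng.uniform([-20.0, -10.0, -1.0], [40.0, 10.0, 2.0], size=(128, 3))
dop = expected_static_doppler(pts, v_ego=5.0) + rng.normal(0.0, 0.05, 128)
print(least_squares_ego_speed(pts, dop))  # ≈ 5.0
```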
2. Network Architecture and Attention Mechanisms
RadarMOSEVE employs a two-branch Spatial-Temporal Transformer, with a dedicated backbone for each task—EVE and MOS—while sharing novel radar-adapted self- and cross-attention modules.
2.1 Backbone Structure
The EVE branch comprises four feature extraction stages that successively downsample the point cloud with farthest-point sampling (FPS), so the number of points decreases from stage to stage.
At each stage, two attention mechanisms are deployed:
- Object Attention (OA): Local self-attention within a fixed-radius ball around each query point, computed over a randomly sampled subset of its neighbors.
- Scenario Attention (SA): Self-attention over a sparse subsample of the global scene, obtained with both farthest-point and interval sampling.
The terminal stage fuses temporal information via cross-attention between the present point set and the point set from an earlier frame.
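A minimal PyTorch sketch of one such stage follows, assuming a single-frame, single-batch layout; the attention modules are passed in as callables and the keep ratio is a placeholder, since the summary above does not give the actual FPS rates.

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """Greedy FPS over an (N, 3) point set; returns indices of m kept points."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    idx[0] = torch.randint(n, (1,)).item()
    for k in range(1, m):
        dist = torch.minimum(dist, (xyz - xyz[idx[k - 1]]).pow(2).sum(-1))
        idx[k] = dist.argmax()
    return idx

def eve_backbone_stage(xyz, feats, keep_ratio, object_attn, scenario_attn):
    """One hypothetical EVE stage: FPS downsample, then local Object Attention
    followed by sparse global Scenario Attention on the kept points."""
    m = max(1, int(keep_ratio * xyz.shape[0]))
    keep = farthest_point_sampling(xyz, m)
    xyz, feats = xyz[keep], feats[keep]
    feats = object_attn(xyz, feats)    # local, fixed-radius neighborhoods
    feats = scenario_attn(xyz, feats)  # sparse, scene-level context
    return xyz, feats

# Toy usage with identity stubs standing in for the real attention modules.
xyz, feats = torch.randn(1024, 3), torch.randn(1024, 32)
xyz, feats = eve_backbone_stage(xyz, feats, keep_ratio=0.5,
                                object_attn=lambda p, f: f,
                                scenario_attn=lambda p, f: f)
```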
2.2 Radar Self-Attention
Given the neighbor set $\mathcal{N}(p_i)$ of point $p_i$ with features $f_j$, the update is
$$f_i' = \sum_{p_j \in \mathcal{N}(p_i)} \operatorname{softmax}_{j}\big(\gamma(\varphi(f_i) - \psi(f_j) + \delta_{ij})\big) \odot \big(\alpha(f_j) + \delta_{ij}\big),$$
where $\varphi$, $\psi$, $\alpha$ are shared linear projections; $\gamma$ is an MLP; $\delta_{ij}$ encodes positional information via an MLP on the coordinate offsets ($\delta_{ij} = \eta(p_i - p_j)$).
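A compact PyTorch sketch of this vector-attention update is given below. It follows the structure described above but is not the authors' implementation: neighbors are gathered with kNN for brevity instead of the ball query with random sampling used by Object Attention, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class RadarVectorAttention(nn.Module):
    """Vector attention over k neighbors, mirroring the update above
    (a sketch, not the authors' implementation)."""
    def __init__(self, dim: int, k: int = 16):
        super().__init__()
        self.k = k
        self.phi = nn.Linear(dim, dim)    # query projection (phi)
        self.psi = nn.Linear(dim, dim)    # key projection (psi)
        self.alpha = nn.Linear(dim, dim)  # value projection (alpha)
        self.gamma = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.eta = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, q_xyz, q_feat, kv_xyz=None, kv_feat=None):
        # Self-attention by default; pass a second point set for cross-attention.
        kv_xyz = q_xyz if kv_xyz is None else kv_xyz
        kv_feat = q_feat if kv_feat is None else kv_feat
        # kNN neighbors of each query in the key/value set (a simplification of
        # the ball query with random sampling described above).
        knn = torch.cdist(q_xyz, kv_xyz).topk(self.k, largest=False).indices  # (Nq, k)
        delta = self.eta(q_xyz.unsqueeze(1) - kv_xyz[knn])    # positional encoding on offsets
        q = self.phi(q_feat).unsqueeze(1)                     # (Nq, 1, C)
        k_ = self.psi(kv_feat)[knn]                           # (Nq, k, C)
        v = self.alpha(kv_feat)[knn]                          # (Nq, k, C)
        w = torch.softmax(self.gamma(q - k_ + delta), dim=1)  # attention weights over neighbors
        return (w * (v + delta)).sum(dim=1)                   # (Nq, C)
```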
2.3 Radar Cross-Attention
For a current point $p_i$ with self-attended feature $f_i$ and neighbor points $p_j$ (features $f_j$) in the past frame, features are fused as
$$f_i^{\mathrm{ca}} = \sum_{p_j \in \mathcal{N}'(p_i)} \operatorname{softmax}_{j}\big(\gamma'(\varphi'(f_i) - \psi'(f_j) + \delta'_{ij})\big) \odot \big(\alpha'(f_j) + \delta'_{ij}\big),$$
with stage-specific projections $\varphi'$, $\psi'$, $\alpha'$ and positional encoding $\delta'_{ij}$.
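Under the same assumptions, cross-attention can reuse the module sketched above, with queries taken from the current frame and keys/values from the past frame; in practice the stage-specific projections would be separate module instances, as in this continuation of the previous sketch.

```python
import torch

# Continues the RadarVectorAttention sketch above.
attn_sa = RadarVectorAttention(dim=64)   # radar self-attention (current frame)
attn_ca = RadarVectorAttention(dim=64)   # radar cross-attention (stage-specific weights)

xyz_t,  feat_t  = torch.randn(256, 3), torch.randn(256, 64)   # current frame
xyz_tp, feat_tp = torch.randn(256, 3), torch.randn(256, 64)   # past frame

feat_t = attn_sa(xyz_t, feat_t)                    # self-attention within the current frame
feat_t = attn_ca(xyz_t, feat_t, xyz_tp, feat_tp)   # fuse temporal cues from the past frame
```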
2.4 Temporal Aggregation
The network operates on two frames (current and lagged), achieving temporal reasoning through repeated bidirectional self- and cross-attention.
3. Exploitation of Radial Velocity
RadarMOSEVE incorporates the measured radial velocity directly as an additional input channel, forming the 4D per-point input $(x_i, y_i, z_i, v_i)$. No explicit gating for velocity is used; rather, the attention mechanisms and positional encodings naturally exploit the 4D structure for neighborhood selection and feature aggregation.
Ablation studies confirm the significance of Doppler velocity: omitting the $v_i$ channel reduces MOS mIoU from 65.6% to 61.1% and increases EVE MAE from 0.182 m/s to 0.301 m/s.
4. Training Protocol and Objective Functions
Loss Functions
- EVE Loss: $\mathcal{L}_{\mathrm{EVE}} = \mathcal{L}_{v} + \mathcal{L}_{\mathrm{reg}}$, where $\mathcal{L}_{v}$ penalizes discrepancies between the predicted ego-velocity, projected onto each static point's line of sight, and that point's measured Doppler, and $\mathcal{L}_{\mathrm{reg}}$ enforces overall speed regression accuracy.
- MOS Loss: $\mathcal{L}_{\mathrm{MOS}}$, a per-point static/moving classification loss with class weights compensating for class imbalance.
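A minimal PyTorch sketch with this two-part structure is shown below; the exact formulations and relative weightings are not reproduced in this summary, so the L1 penalties and the weighted cross-entropy are assumptions.

```python
import torch
import torch.nn.functional as F

def eve_loss(pred_v, gt_v, cos_theta, doppler, static_mask):
    """Hypothetical EVE loss with the two terms described above:
    (1) consistency between the predicted ego-velocity, projected onto each
        static point's line of sight, and that point's measured Doppler;
    (2) direct regression of the ego speed against ground truth."""
    proj = -pred_v * cos_theta                          # expected Doppler if static
    doppler_term = F.l1_loss(proj[static_mask], doppler[static_mask])
    reg_term = F.l1_loss(pred_v, gt_v)
    return doppler_term + reg_term                      # equal weighting assumed

def mos_loss(logits, labels, class_weights):
    """Class-weighted per-point static/moving classification loss
    (cross-entropy assumed here) to offset class imbalance."""
    return F.cross_entropy(logits, labels, weight=class_weights)

# Toy shapes: one frame with N points and a scalar ego-velocity.
N = 512
pred_v, gt_v = torch.tensor(4.8), torch.tensor(5.0)
cos_theta, doppler = torch.rand(N), torch.randn(N)
static_mask = torch.rand(N) > 0.3
print(eve_loss(pred_v, gt_v, cos_theta, doppler, static_mask))
print(mos_loss(torch.randn(N, 2), torch.randint(0, 2, (N,)), torch.tensor([0.2, 0.8])))
```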
Training Dynamics
- Train the EVE branch for 60 epochs, freeze its weights, perform Doppler velocity compensation with the predicted ego-velocity, then train the MOS branch for 50 epochs.
- Adam optimizer with weight decay, batch size 4, and an initial learning rate decayed by a factor of 0.5 every 20 epochs (EVE) or every 10 epochs (MOS).
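The schedule can be sketched as follows; the learning-rate and weight-decay values are omitted above, so the numbers in the snippet are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

# Stand-in models; the real networks are the EVE and MOS transformer branches.
eve_model, mos_model = nn.Linear(4, 1), nn.Linear(4, 2)

# Stage 1: train EVE for 60 epochs (batch size 4); LR halved every 20 epochs.
# The concrete lr / weight_decay values below are placeholders.
opt_eve = torch.optim.Adam(eve_model.parameters(), lr=1e-3, weight_decay=1e-4)
sched_eve = torch.optim.lr_scheduler.StepLR(opt_eve, step_size=20, gamma=0.5)

# ... run the EVE training loop here, calling sched_eve.step() once per epoch ...

# Freeze EVE, compensate Doppler velocities with its predictions, then train MOS.
for p in eve_model.parameters():
    p.requires_grad_(False)

# Stage 2: train MOS for 50 epochs; LR halved every 10 epochs.
opt_mos = torch.optim.Adam(mos_model.parameters(), lr=1e-3, weight_decay=1e-4)
sched_mos = torch.optim.lr_scheduler.StepLR(opt_mos, step_size=10, gamma=0.5)
```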
5. Datasets, Annotations, and Evaluation
RadarMOSEVE introduces new benchmark annotations and datasets for radar-based MOSEVE:
- View-of-Delft (VoD):
3+1D radar point clouds; the authors re-annotated 3,000 frames with LiDAR-derived moving-object labels plus manual correction.
- ORCA-UBOAT Radar Dataset:
13,654 frames of 4D radar collected from two platforms (a ground vehicle and a USV) across varied scenarios, annotated via LiDAR/MOS cross-verification and supplied with GNSS/INS ego-speed ground truth.
Table: Performance on the ORCA-UBOAT dataset (EVE Acc. = fraction of frames with ego-velocity error below a fixed m/s threshold)

| Method | MOS mIoU (%) | EVE MAE (m/s) | EVE Acc. (%) |
|---|---|---|---|
| ICP | 25.2 | 0.842 | 25.2 |
| RANSAC | 32.6 | 0.601 | 49.6 |
| Point-Transformer | 54.8 | 0.330 | 76.5 |
| 4DMOS | 60.8 | – | – |
| 4DMOS+V | 61.7 | – | – |
| RadarMOSEVE | 70.2 | 0.182 | 94.3 |
Comparable gains are observed on VoD against RaFlow, CMFlow, and Gaussian-RT. Ablations indicate that object attention (OA), scenario attention (SA), and cross-attention (CA) each contribute roughly 4–8 percentage points of MOS mIoU and 0.02–0.05 m/s of EVE MAE improvement.
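For reference, the reported quantities can be computed as below; these are standard metric definitions rather than code from the paper, and the accuracy threshold is left as a parameter since its value is not specified here.

```python
import numpy as np

def mos_miou(pred, gt):
    """Mean IoU over the static (0) and moving (1) classes from per-point labels."""
    ious = []
    for c in (0, 1):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

def eve_mae(pred_v, gt_v):
    """Mean absolute error of per-frame ego-velocity estimates (m/s)."""
    return float(np.mean(np.abs(np.asarray(pred_v) - np.asarray(gt_v))))

def eve_accuracy(pred_v, gt_v, thresh):
    """Fraction of frames whose ego-velocity error falls below `thresh` (m/s)."""
    err = np.abs(np.asarray(pred_v) - np.asarray(gt_v))
    return float(np.mean(err < thresh))
```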
6. Limitations and Prospects for Future Work
RadarMOSEVE's reliance on static-background returns limits its ability to disentangle ego-motion from moving-object motion in scenes where all observed points are dynamic. Sparse detection returns, especially for small or fast-moving objects, can yield false negatives. Potential enhancements include multi-modal fusion (camera/LiDAR) and extending the temporal context beyond two frames to mitigate ambiguities in motion attribution. These avenues could further improve robustness to real-world complexities.
7. Significance and Impact
RadarMOSEVE demonstrates that radar-only methods, leveraging spatial-temporal transformers and explicit Doppler integration, can achieve state-of-the-art MOS and EVE from sparse 4D radar, attaining 70.2% MOS mIoU and 0.182 m/s EVE MAE on diverse, annotated datasets. This presents a cost-effective, weather-resilient alternative for autonomous navigation and perception, particularly for environments and conditions unfavorable to optical sensors (Pang et al., 2024).