
RadarMOSEVE: Radar-Only MOS and EVE

Updated 31 January 2026
  • RadarMOSEVE is a radar-only method that employs spatial-temporal transformers to perform moving object segmentation and ego-velocity estimation using 4D radar point clouds and Doppler velocity.
  • It integrates novel self- and cross-attention mechanisms to leverage both spatial and temporal cues, enhancing detection and speed regression accuracy under diverse conditions.
  • Empirical results indicate superior performance with 70.2% MOS mIoU and 0.182 m/s EVE MAE, outperforming LiDAR-based and traditional radar processing methods.

RadarMOSEVE is a transformer-based method developed for radar-only moving object segmentation (MOS) and ego-velocity estimation (EVE) in autonomous mobile systems. Addressing the limitations of LiDAR-based approaches—namely expense and adverse weather sensitivity—RadarMOSEVE processes millimeter-wave radar (MWR) point clouds and leverages both spatial and temporal cues, including direct use of Doppler velocity. It constitutes the first published radar-only method to simultaneously achieve state-of-the-art performance for both MOS and EVE on diverse real-world datasets (Pang et al., 2024).

1. Problem Formulation and Input Representation

RadarMOSEVE operates on 4D radar point clouds, where each input at time $t$ is represented as

$$P_t = \{\, p_i \in \mathbb{R}^4 \mid p_i = [x_i,\, y_i,\, z_i,\, v_i]^T,\; i = 1, \ldots, N \,\}$$

with $(x_i, y_i, z_i)$ encoding the 3D radar detection position and $v_i$ the measured radial (Doppler) velocity.

The framework tackles two tightly coupled tasks:

  • Moving Object Segmentation (MOS): Classify each radar point $p_i$ as static or moving.
  • Ego-Velocity Estimation (EVE): Regress the sensor platform’s forward speed $v$.

For truly static points, the measured $v_i$ satisfies

$$\hat v_i = -v \cdot \frac{y_i}{\sqrt{x_i^2 + y_i^2 + z_i^2}}$$

where the negative sign reflects the Doppler convention for points in front of a forward-moving sensor.
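The geometric relation above can be checked numerically. Below is a minimal NumPy sketch (not from the paper; the coordinates, ego speed, and residual threshold are illustrative) that computes the expected static-point Doppler and flags points whose measured $v_i$ disagrees with it:

```python
import numpy as np

def expected_static_doppler(points_xyz: np.ndarray, ego_speed: float) -> np.ndarray:
    """Expected radial (Doppler) velocity of static points for a sensor moving
    forward with `ego_speed`, taking y as the forward axis as in the formula above."""
    x, y, z = points_xyz[:, 0], points_xyz[:, 1], points_xyz[:, 2]
    return -ego_speed * y / np.sqrt(x**2 + y**2 + z**2)

# Toy 4D radar frame: columns are [x, y, z, v], matching the input definition above.
P_t = np.array([
    [0.0, 10.0, 0.0, -5.0],   # static point straight ahead (ego speed 5 m/s)
    [3.0,  4.0, 0.0, -4.0],   # static point off to the side
    [0.0, 20.0, 0.0,  2.0],   # oncoming object: Doppler disagrees with ego motion
])

residual = np.abs(P_t[:, 3] - expected_static_doppler(P_t[:, :3], ego_speed=5.0))
print(residual > 0.5)   # [False False True]: a large residual suggests a moving point
```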

2. Network Architecture and Attention Mechanisms

RadarMOSEVE employs a two-branch Spatial-Temporal Transformer, with a dedicated backbone for each task—EVE and MOS—while sharing novel radar-adapted self- and cross-attention modules.

2.1 Backbone Structure

The EVE branch comprises four feature extraction stages, successively downsampling the point cloud using farthest-point sampling (FPS) at rates $[1, 4, 4, 1]$, yielding point counts of $[N_p, N_p/4, N_p/16, N_p/16]$.
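As an illustration of this downsampling schedule, below is a generic farthest-point sampling sketch in NumPy; it is not the authors' implementation, and the toy cloud size simply follows the counts quoted above:

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int) -> np.ndarray:
    """Greedy FPS: return indices of n_samples points that maximize mutual distance."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)   # start from an arbitrary point (index 0)
    dist = np.full(n, np.inf)
    for k in range(1, n_samples):
        diff = points - points[selected[k - 1]]
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))  # squared dist to selected set
        selected[k] = int(np.argmax(dist))           # farthest remaining point
    return selected

# Stage-wise downsampling at rates [1, 4, 4, 1], i.e. point counts [N_p, N_p/4, N_p/16, N_p/16].
N_p = 1024
pts = np.random.randn(N_p, 3).astype(np.float32)
for n in [N_p, N_p // 4, N_p // 16, N_p // 16]:
    pts = pts[farthest_point_sampling(pts, n)]
    print(pts.shape)   # (1024, 3) -> (256, 3) -> (64, 3) -> (64, 3)
```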

At each stage, two attention mechanisms are deployed:

  • Object Attention (OA): Local self-attention within a ball of radius $r$ around each query point, with $K$ randomly sampled neighbors.
  • Scenario Attention (SA): Self-attention over a subsample of the global scene, obtained via both farthest-point and interval sampling (spacing parameter $g$); a neighbor-selection sketch follows this list.
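The sketch below illustrates how the two neighborhoods could be formed: a ball query with random selection for OA and fixed-interval subsampling for SA. The radius, $K$, and $g$ values are illustrative, and the nearest-point fallback for under-filled balls is a sketch-level choice rather than the paper's:

```python
import numpy as np

def ball_query_random(points: np.ndarray, queries: np.ndarray,
                      radius: float, k: int) -> np.ndarray:
    """Object Attention neighborhood: for each query, randomly pick k points
    within `radius`; fall back to the k nearest points if the ball is under-filled."""
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    idx = np.empty((len(queries), k), dtype=np.int64)
    for i, row in enumerate(d2):
        inside = np.flatnonzero(row <= radius ** 2)
        idx[i] = (np.random.choice(inside, size=k, replace=False)
                  if len(inside) >= k else np.argsort(row)[:k])
    return idx

def interval_sampling(n_points: int, g: int) -> np.ndarray:
    """Scenario Attention subsample: keep every g-th point of the scene."""
    return np.arange(0, n_points, g)

pts = np.random.randn(512, 3).astype(np.float32)
oa_neighbors = ball_query_random(pts, pts, radius=2.0, k=16)   # (512, 16) neighbor indices
sa_subset = interval_sampling(len(pts), g=4)                   # 128 scene-level points
```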

The terminal stage fuses temporal information via cross-attention between the present point set $P_t$ and a past point set $P_{t-a}$ ($a = 10$ frames).

2.2 Radar Self-Attention

Given neighbor set $Q_i$ for a point with feature $x_i \in \mathbb{R}^{D}$, the update is

$$y_i = \sum_{x_j \in Q_i} \mathrm{softmax}_j\!\left[ \delta\big(\alpha(x_i) - \beta(x_j) + \omega_{ij}\big) \right] \odot \big(\gamma(x_j) + \omega_{ij}\big)$$

where $\alpha$, $\beta$, $\gamma$ are shared linear projections, $\delta$ is an MLP, and $\omega_{ij}$ encodes positional information via an MLP applied to the coordinate offsets, $\omega_{ij} = \mathrm{PE}(p_i - p_j)$.
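A compact PyTorch sketch of this vector-attention update is given below. It is a generic re-implementation in the spirit of the formula rather than the authors' code; the feature dimension, MLP depths, and neighbor-index interface are assumptions. The query and key/value sets are kept distinct so the same operator can also serve the cross-attention of Section 2.3:

```python
import torch
import torch.nn as nn

class RadarVectorAttention(nn.Module):
    """Vector attention following the update above: features are compared by
    subtraction, a positional term from coordinate offsets is added to both the
    attention logits and the values, and softmax weights gate the values channel-wise."""

    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Linear(dim, dim)                     # query projection
        self.beta = nn.Linear(dim, dim)                      # key projection
        self.gamma = nn.Linear(dim, dim)                     # value projection
        self.delta = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.pos_enc = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, q_feat, q_pos, kv_feat, kv_pos, knn_idx):
        # q_feat: (N, D), q_pos: (N, 3); kv_feat: (M, D), kv_pos: (M, 3)
        # knn_idx: (N, K) indices of each query's neighbors in the key/value set
        nbr_feat = kv_feat[knn_idx]                                      # (N, K, D)
        omega = self.pos_enc(q_pos[:, None, :] - kv_pos[knn_idx])        # positional term
        logits = self.delta(self.alpha(q_feat)[:, None, :] - self.beta(nbr_feat) + omega)
        weights = torch.softmax(logits, dim=1)                           # normalize over the K neighbors
        return (weights * (self.gamma(nbr_feat) + omega)).sum(dim=1)     # (N, D)
```

Self-attention corresponds to calling the operator with the query set as its own key/value set (q_feat == kv_feat, q_pos == kv_pos).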

2.3 Radar Cross-Attention

For a current point $p_i$ with self-attended feature $y_i$ and $K$ neighbor points $p_j$ in the past frame, features are fused as

$$z_i = \sum_{y_j \in Y_i} \mathrm{softmax}_j\!\left[ \delta'\big(\alpha'(y_i) - \beta'(y_j) + \epsilon_{ij}\big) \right] \odot \big(\gamma'(y_j) + \epsilon_{ij}\big)$$

with stage-specific projections and positional encoding $\epsilon_{ij}$.
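Under the same assumptions, cross-frame fusion can reuse the operator sketched in Section 2.2, with queries taken from the current frame and keys/values from the lagged frame (random tensors stand in for real features here):

```python
import torch

# Assumes the RadarVectorAttention sketch from Section 2.2 is in scope.
N_t, N_past, D, K = 256, 256, 64, 16
feat_t,    pos_t    = torch.randn(N_t, D),    torch.randn(N_t, 3)       # current frame P_t
feat_past, pos_past = torch.randn(N_past, D), torch.randn(N_past, 3)    # past frame P_{t-a}, a = 10
knn_idx = torch.cdist(pos_t, pos_past).topk(K, largest=False).indices   # (N_t, K) past-frame neighbors

cross_attn = RadarVectorAttention(dim=D)
z_t = cross_attn(feat_t, pos_t, feat_past, pos_past, knn_idx)            # (N_t, D) fused features
```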

2.4 Temporal Aggregation

The network operates on two frames (current and lagged), achieving temporal reasoning through repeated bidirectional self- and cross-attention.

3. Exploitation of Radial Velocity

RadarMOSEVE incorporates the measured radial velocity $v_i$ directly as an additional input channel, forming $[x, y, z, v]^\top$. No explicit gating for velocity is used; rather, the attention mechanisms and positional encodings naturally exploit the 4D structure for neighborhood selection and feature aggregation.

Ablation studies confirm the significance of Doppler velocity: omission of $v_i$ reduces MOS mIoU from 65.6% to 61.1%, and increases EVE MAE from 0.182 m/s to 0.301 m/s.

4. Training Protocol and Objective Functions

Loss Functions

  • EVE Loss:

$$L_{EVE} = L_{dop} + L_{mse}$$

where

$$L_{dop} = \frac{1}{N_s} \sum_{i \in \text{static}} \left| \hat v \cdot \frac{y_i}{\sqrt{x_i^2 + y_i^2 + z_i^2}} - v_i \right|$$

penalizes discrepancies between predicted ego-velocity and static point Dopplers, and

$$L_{mse} = \frac{1}{N_b} \sum (v - \hat v)^2$$

enforces overall speed regression accuracy.

  • MOS Loss:

$$L_{mos} = -\sum_{c \in \{\mathrm{static},\, \mathrm{moving}\}} w_c\, l_c \log \hat l_c$$

with class weights $w_c$ compensating for class imbalance. A combined sketch of both loss terms follows below.
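The following is a minimal PyTorch sketch of both loss terms. It is not the authors' code: the static mask, class weights, and per-frame batching are illustrative assumptions, and the Doppler sign follows the static-point model of Section 1:

```python
import torch
import torch.nn.functional as F

def eve_loss(v_pred, points, doppler, static_mask, v_gt):
    """Per-frame L_EVE = L_dop + L_mse. `points` is (N, 3), `doppler` the measured
    v_i, `static_mask` marks points treated as static; averaging over the batch
    of frames (the 1/N_b factor) is left to the caller."""
    ranges = torch.linalg.norm(points, dim=1)
    v_hat_i = -v_pred * points[:, 1] / ranges              # expected static-point Doppler (Sec. 1 sign)
    l_dop = (v_hat_i - doppler)[static_mask].abs().mean()
    l_mse = (v_pred - v_gt) ** 2
    return l_dop + l_mse

def mos_loss(logits, labels, class_weights=(1.0, 10.0)):
    """Weighted cross-entropy over {static, moving}; the weights are illustrative,
    shown only to indicate how class imbalance would be compensated."""
    w = torch.tensor(class_weights, dtype=logits.dtype, device=logits.device)
    return F.cross_entropy(logits, labels, weight=w)
```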

Training Dynamics

  • Train EVE for 60 epochs, freeze weights, conduct velocity compensation, then train MOS for 50 epochs.
  • Adam optimizer with weight decay $1\mathrm{e}{-3}$, batch size 4, initial learning rate $1\mathrm{e}{-3}$, decayed by a factor of 0.5 every 20 (EVE) or 10 (MOS) epochs; a configuration sketch follows below.
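A configuration sketch matching this schedule, assuming standard PyTorch Adam and StepLR components (the branch objects and training loop are placeholders, not the released code):

```python
import torch

def make_optim(model, stage: str):
    """Optimizer and scheduler per the schedule above: Adam with lr = 1e-3,
    weight decay = 1e-3, LR halved every 20 (EVE) or 10 (MOS) epochs."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-3)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20 if stage == "eve" else 10, gamma=0.5)
    return opt, sched

# Stage 1: train the EVE branch for 60 epochs (batch size 4), then freeze it,
# compensate the Doppler channel with the predicted ego-velocity, and
# Stage 2: train the MOS branch for 50 epochs with its own optimizer/scheduler.
```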

5. Datasets, Annotations, and Evaluation

RadarMOSEVE introduces new benchmark annotations and datasets for radar-based MOSEVE:

  • View-of-Delft (VoD):

3+1D radar point clouds; the authors re-annotated roughly 3,000 frames using LiDAR-based moving labels plus manual correction.

  • ORCA-UBOAT Radar Dataset:

13,654 frames of 4D radar from two platforms (ground vehicle, USV) over various scenarios, annotated by LiDAR/MOS cross-verification and with GNSS/INS ego-speed.

Table: Performance metrics (ORCA-UBOAT dataset)

Method              MOS mIoU (%)   EVE MAE (m/s)   EVE Acc. (%)
ICP                 25.2           0.842           25.2
RANSAC              32.6           0.601           49.6
Point-Transformer   54.8           0.330           76.5
4DMOS               60.8           –               –
4DMOS+V             61.7           –               –
RadarMOSEVE         70.2           0.182           94.3

Comparable gains are observed on VoD against RaFlow, CMFlow, and Gaussian-RT. Ablations indicate that object attention (OA), scenario attention (SA), and cross-attention (CA) each contribute roughly 4–8 percentage points of MOS mIoU and 0.02–0.05 m/s of EVE MAE improvement.
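For reference, the table's metrics follow the standard definitions sketched below (a NumPy illustration assuming binary static/moving labels and per-frame ego-speed predictions; it is not tied to the authors' evaluation code):

```python
import numpy as np

def mos_miou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean IoU over the two classes {static: 0, moving: 1}."""
    ious = []
    for c in (0, 1):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

def eve_mae(v_pred: np.ndarray, v_gt: np.ndarray) -> float:
    """Mean absolute ego-velocity error over evaluation frames, in m/s."""
    return float(np.mean(np.abs(v_pred - v_gt)))
```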

6. Limitations and Prospects for Future Work

RadarMOSEVE's reliance on static-background returns limits its ability to disentangle ego-motion from moving-object motion in scenes where all observed points are dynamic. Sparse detection returns, especially for small or fast-moving objects, can yield false negatives. Potential enhancements include multi-modal fusion (camera/LiDAR), and extending the temporal context beyond two frames to mitigate ambiguities in motion attribution. These avenues could further improve robustness to real-world complexities.

7. Significance and Impact

RadarMOSEVE demonstrates that radar-only methods, leveraging spatial-temporal transformers and explicit Doppler integration, can achieve state-of-the-art MOS and EVE from sparse 4D radar, attaining 70.2% MOS mIoU and 0.182 m/s EVE MAE on diverse, annotated datasets. This presents a cost-effective, weather-resilient alternative for autonomous navigation and perception, particularly for environments and conditions unfavorable to optical sensors (Pang et al., 2024).
