RadarMOSEVE: Radar-Only MOS and EVE
- RadarMOSEVE is a radar-only method that employs spatial-temporal transformers to perform moving object segmentation and ego-velocity estimation using 4D radar point clouds and Doppler velocity.
- It integrates novel self- and cross-attention mechanisms to leverage both spatial and temporal cues, enhancing detection and speed regression accuracy under diverse conditions.
- Empirical results indicate superior performance with 70.2% MOS mIoU and 0.182 m/s EVE MAE, outperforming LiDAR-based and traditional radar processing methods.
RadarMOSEVE is a transformer-based method developed for radar-only moving object segmentation (MOS) and ego-velocity estimation (EVE) in autonomous mobile systems. Addressing the high cost and adverse-weather sensitivity of LiDAR-based approaches, RadarMOSEVE processes millimeter-wave radar (MWR) point clouds and leverages both spatial and temporal cues, including direct use of Doppler velocity. It is the first published radar-only method to achieve state-of-the-art performance simultaneously on both MOS and EVE across diverse real-world datasets (Pang et al., 2024).
1. Problem Formulation and Input Representation
RadarMOSEVE operates on 4D radar point clouds, where the input at time $t$ is represented as
$$\mathcal{P}_t = \{\, p_i = (x_i, y_i, z_i, v_i) \,\}_{i=1}^{N_t},$$
with $(x_i, y_i, z_i)$ encoding the 3D radar detection position and $v_i$ the measured radial (Doppler) velocity.
The framework tackles two tightly coupled tasks:
- Moving Object Segmentation (MOS): Classify each radar point as static or moving.
- Ego-Velocity Estimation (EVE): Regress the sensor platform's forward speed $v_e$.
For truly static points, the measured $v_i$ adheres to
$$v_i = -\, v_e \cos\theta_i,$$
where $\theta_i$ is the angle between the sensor's forward axis and the line of sight to $p_i$, and the negative sign reflects the Doppler convention for points in front of a forward-moving sensor.
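To make the Doppler geometry concrete, the following NumPy sketch (not from the paper; the forward axis is assumed to be $+x$ and the function names are hypothetical) evaluates the static-point model above and recovers a least-squares ego-speed from a purely static, noisy scene.

```python
import numpy as np

def expected_static_doppler(points_xyz, v_ego):
    """Doppler a static point returns for a sensor moving along +x at v_ego:
    v_i = -v_ego * cos(theta_i), with cos(theta_i) = x_i / ||p_i||."""
    ranges = np.linalg.norm(points_xyz, axis=1)
    cos_theta = points_xyz[:, 0] / np.maximum(ranges, 1e-6)
    return -v_ego * cos_theta

def least_squares_ego_speed(points_xyz, doppler):
    """Closed-form ego-speed minimising sum_i (v_i + v_ego * cos(theta_i))^2,
    i.e. assuming every observed point is static."""
    ranges = np.linalg.norm(points_xyz, axis=1)
    cos_theta = points_xyz[:, 0] / np.maximum(ranges, 1e-6)
    return -np.dot(doppler, cos_theta) / np.dot(cos_theta, cos_theta)

# Toy example: a static scene observed while driving forward at 5 m/s.
rng = np.random.default_rng(0)
pts = rng.uniform([-20.0, -10.0, -1.0], [40.0, 10.0, 2.0], size=(128, 3))
dop = expected_static_doppler(pts, v_ego=5.0) + rng.normal(0.0, 0.05, 128)
print(least_squares_ego_speed(pts, dop))  # ≈ 5.0
```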
2. Network Architecture and Attention Mechanisms
RadarMOSEVE employs a two-branch Spatial-Temporal Transformer, with a dedicated backbone for each task—EVE and MOS—while sharing novel radar-adapted self- and cross-attention modules.
2.1 Backbone Structure
The EVE branch comprises four feature extraction stages that successively downsample the point cloud with farthest-point sampling (FPS), so the number of points decreases from stage to stage.
At each stage, two attention mechanisms are deployed:
- Object Attention (OA): Local self-attention within a fixed-radius ball around each query point, computed over a randomly sampled subset of its neighbors.
- Scenario Attention (SA): Self-attention over a sparse subsample of the global scene, obtained with both farthest-point and interval sampling.
The terminal stage fuses temporal information via cross-attention between the present point set and the point set from an earlier frame.
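A minimal PyTorch sketch of one such stage follows, assuming a single-frame, single-batch layout; the attention modules are passed in as callables and the keep ratio is a placeholder, since the summary above does not give the actual FPS rates.

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """Greedy FPS over an (N, 3) point set; returns indices of m kept points."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    idx[0] = torch.randint(n, (1,)).item()
    for k in range(1, m):
        dist = torch.minimum(dist, (xyz - xyz[idx[k - 1]]).pow(2).sum(-1))
        idx[k] = dist.argmax()
    return idx

def eve_backbone_stage(xyz, feats, keep_ratio, object_attn, scenario_attn):
    """One hypothetical EVE stage: FPS downsample, then local Object Attention
    followed by sparse global Scenario Attention on the kept points."""
    m = max(1, int(keep_ratio * xyz.shape[0]))
    keep = farthest_point_sampling(xyz, m)
    xyz, feats = xyz[keep], feats[keep]
    feats = object_attn(xyz, feats)    # local, fixed-radius neighborhoods
    feats = scenario_attn(xyz, feats)  # sparse, scene-level context
    return xyz, feats

# Toy usage with identity stubs standing in for the real attention modules.
xyz, feats = torch.randn(1024, 3), torch.randn(1024, 32)
xyz, feats = eve_backbone_stage(xyz, feats, keep_ratio=0.5,
                                object_attn=lambda p, f: f,
                                scenario_attn=lambda p, f: f)
```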
2.2 Radar Self-Attention
Given the neighbor set $\mathcal{N}(p_i)$ of point $p_i$ with features $f_j$, the update is
$$f_i' = \sum_{p_j \in \mathcal{N}(p_i)} \operatorname{softmax}_{j}\big(\gamma(\varphi(f_i) - \psi(f_j) + \delta_{ij})\big) \odot \big(\alpha(f_j) + \delta_{ij}\big),$$
where $\varphi$, $\psi$, $\alpha$ are shared linear projections; $\gamma$ is an MLP; $\delta_{ij}$ encodes positional information via an MLP on the coordinate offsets ($\delta_{ij} = \eta(p_i - p_j)$).
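A compact PyTorch sketch of this vector-attention update is given below. It follows the structure described above but is not the authors' implementation: neighbors are gathered with kNN for brevity instead of the ball query with random sampling used by Object Attention, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class RadarVectorAttention(nn.Module):
    """Vector attention over k neighbors, mirroring the update above
    (a sketch, not the authors' implementation)."""
    def __init__(self, dim: int, k: int = 16):
        super().__init__()
        self.k = k
        self.phi = nn.Linear(dim, dim)    # query projection (phi)
        self.psi = nn.Linear(dim, dim)    # key projection (psi)
        self.alpha = nn.Linear(dim, dim)  # value projection (alpha)
        self.gamma = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.eta = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, q_xyz, q_feat, kv_xyz=None, kv_feat=None):
        # Self-attention by default; pass a second point set for cross-attention.
        kv_xyz = q_xyz if kv_xyz is None else kv_xyz
        kv_feat = q_feat if kv_feat is None else kv_feat
        # kNN neighbors of each query in the key/value set (a simplification of
        # the ball query with random sampling described above).
        knn = torch.cdist(q_xyz, kv_xyz).topk(self.k, largest=False).indices  # (Nq, k)
        delta = self.eta(q_xyz.unsqueeze(1) - kv_xyz[knn])    # positional encoding on offsets
        q = self.phi(q_feat).unsqueeze(1)                     # (Nq, 1, C)
        k_ = self.psi(kv_feat)[knn]                           # (Nq, k, C)
        v = self.alpha(kv_feat)[knn]                          # (Nq, k, C)
        w = torch.softmax(self.gamma(q - k_ + delta), dim=1)  # attention weights over neighbors
        return (w * (v + delta)).sum(dim=1)                   # (Nq, C)
```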
2.3 Radar Cross-Attention
For a current point $p_i$ with self-attended feature $f_i$ and neighbor points $p_j$ (features $f_j$) in the past frame, features are fused as
$$f_i^{\mathrm{ca}} = \sum_{p_j \in \mathcal{N}'(p_i)} \operatorname{softmax}_{j}\big(\gamma'(\varphi'(f_i) - \psi'(f_j) + \delta'_{ij})\big) \odot \big(\alpha'(f_j) + \delta'_{ij}\big),$$
with stage-specific projections $\varphi'$, $\psi'$, $\alpha'$ and positional encoding $\delta'_{ij}$.
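Under the same assumptions, cross-attention can reuse the module sketched above, with queries taken from the current frame and keys/values from the past frame; in practice the stage-specific projections would be separate module instances, as in this continuation of the previous sketch.

```python
import torch

# Continues the RadarVectorAttention sketch above.
attn_sa = RadarVectorAttention(dim=64)   # radar self-attention (current frame)
attn_ca = RadarVectorAttention(dim=64)   # radar cross-attention (stage-specific weights)

xyz_t,  feat_t  = torch.randn(256, 3), torch.randn(256, 64)   # current frame
xyz_tp, feat_tp = torch.randn(256, 3), torch.randn(256, 64)   # past frame

feat_t = attn_sa(xyz_t, feat_t)                    # self-attention within the current frame
feat_t = attn_ca(xyz_t, feat_t, xyz_tp, feat_tp)   # fuse temporal cues from the past frame
```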
2.4 Temporal Aggregation
The network operates on two frames (current and lagged), achieving temporal reasoning through repeated bidirectional self- and cross-attention.
3. Exploitation of Radial Velocity
RadarMOSEVE incorporates the measured radial velocity directly as an additional input channel, forming the 4D per-point input $(x_i, y_i, z_i, v_i)$. No explicit gating for velocity is used; rather, the attention mechanisms and positional encodings naturally exploit the 4D structure for neighborhood selection and feature aggregation.
Ablation studies confirm the significance of Doppler velocity: omitting the $v_i$ channel reduces MOS mIoU from 65.6% to 61.1% and increases EVE MAE from 0.182 m/s to 0.301 m/s.
4. Training Protocol and Objective Functions
Loss Functions
- EVE Loss: $\mathcal{L}_{\mathrm{EVE}} = \mathcal{L}_{v} + \mathcal{L}_{\mathrm{reg}}$, where $\mathcal{L}_{v}$ penalizes discrepancies between the predicted ego-velocity, projected onto each static point's line of sight, and that point's measured Doppler, and $\mathcal{L}_{\mathrm{reg}}$ enforces overall speed regression accuracy.
- MOS Loss: $\mathcal{L}_{\mathrm{MOS}}$, a per-point static/moving classification loss with class weights compensating for class imbalance.
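A minimal PyTorch sketch with this two-part structure is shown below; the exact formulations and relative weightings are not reproduced in this summary, so the L1 penalties and the weighted cross-entropy are assumptions.

```python
import torch
import torch.nn.functional as F

def eve_loss(pred_v, gt_v, cos_theta, doppler, static_mask):
    """Hypothetical EVE loss with the two terms described above:
    (1) consistency between the predicted ego-velocity, projected onto each
        static point's line of sight, and that point's measured Doppler;
    (2) direct regression of the ego speed against ground truth."""
    proj = -pred_v * cos_theta                          # expected Doppler if static
    doppler_term = F.l1_loss(proj[static_mask], doppler[static_mask])
    reg_term = F.l1_loss(pred_v, gt_v)
    return doppler_term + reg_term                      # equal weighting assumed

def mos_loss(logits, labels, class_weights):
    """Class-weighted per-point static/moving classification loss
    (cross-entropy assumed here) to offset class imbalance."""
    return F.cross_entropy(logits, labels, weight=class_weights)

# Toy shapes: one frame with N points and a scalar ego-velocity.
N = 512
pred_v, gt_v = torch.tensor(4.8), torch.tensor(5.0)
cos_theta, doppler = torch.rand(N), torch.randn(N)
static_mask = torch.rand(N) > 0.3
print(eve_loss(pred_v, gt_v, cos_theta, doppler, static_mask))
print(mos_loss(torch.randn(N, 2), torch.randint(0, 2, (N,)), torch.tensor([0.2, 0.8])))
```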
Training Dynamics
- Train the EVE branch for 60 epochs, freeze its weights, perform Doppler velocity compensation with the predicted ego-velocity, then train the MOS branch for 50 epochs.
- Adam optimizer with weight decay, batch size 4, and an initial learning rate decayed by a factor of 0.5 every 20 epochs (EVE) or every 10 epochs (MOS).
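The schedule can be sketched as follows; the learning-rate and weight-decay values are omitted above, so the numbers in the snippet are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

# Stand-in models; the real networks are the EVE and MOS transformer branches.
eve_model, mos_model = nn.Linear(4, 1), nn.Linear(4, 2)

# Stage 1: train EVE for 60 epochs (batch size 4); LR halved every 20 epochs.
# The concrete lr / weight_decay values below are placeholders.
opt_eve = torch.optim.Adam(eve_model.parameters(), lr=1e-3, weight_decay=1e-4)
sched_eve = torch.optim.lr_scheduler.StepLR(opt_eve, step_size=20, gamma=0.5)

# ... run the EVE training loop here, calling sched_eve.step() once per epoch ...

# Freeze EVE, compensate Doppler velocities with its predictions, then train MOS.
for p in eve_model.parameters():
    p.requires_grad_(False)

# Stage 2: train MOS for 50 epochs; LR halved every 10 epochs.
opt_mos = torch.optim.Adam(mos_model.parameters(), lr=1e-3, weight_decay=1e-4)
sched_mos = torch.optim.lr_scheduler.StepLR(opt_mos, step_size=10, gamma=0.5)
```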
5. Datasets, Annotations, and Evaluation
RadarMOSEVE introduces new benchmark annotations and datasets for radar-based MOSEVE:
- View-of-Delft (VoD):
3+1D radar point clouds; the authors re-annotated 3,000 frames with LiDAR-derived moving-object labels plus manual correction.
- ORCA-UBOAT Radar Dataset:
13,654 frames of 4D radar collected from two platforms (a ground vehicle and a USV) across varied scenarios, annotated via LiDAR/MOS cross-verification and supplied with GNSS/INS ego-speed ground truth.
Table: Performance on the ORCA-UBOAT dataset (EVE Acc. = fraction of frames with ego-velocity error below a fixed m/s threshold)

| Method | MOS mIoU (%) | EVE MAE (m/s) | EVE Acc. (%) |
|---|---|---|---|
| ICP | 25.2 | 0.842 | 25.2 |
| RANSAC | 32.6 | 0.601 | 49.6 |
| Point-Transformer | 54.8 | 0.330 | 76.5 |
| 4DMOS | 60.8 | – | – |
| 4DMOS+V | 61.7 | – | – |
| RadarMOSEVE | 70.2 | 0.182 | 94.3 |
Comparable gains are observed on VoD against RaFlow, CMFlow, and Gaussian-RT. Ablations indicate that object attention (OA), scenario attention (SA), and cross-attention (CA) each contribute roughly 4–8 percentage points of MOS mIoU and 0.02–0.05 m/s of EVE MAE improvement.
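For reference, the reported quantities can be computed as below; these are standard metric definitions rather than code from the paper, and the accuracy threshold is left as a parameter since its value is not specified here.

```python
import numpy as np

def mos_miou(pred, gt):
    """Mean IoU over the static (0) and moving (1) classes from per-point labels."""
    ious = []
    for c in (0, 1):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))

def eve_mae(pred_v, gt_v):
    """Mean absolute error of per-frame ego-velocity estimates (m/s)."""
    return float(np.mean(np.abs(np.asarray(pred_v) - np.asarray(gt_v))))

def eve_accuracy(pred_v, gt_v, thresh):
    """Fraction of frames whose ego-velocity error falls below `thresh` (m/s)."""
    err = np.abs(np.asarray(pred_v) - np.asarray(gt_v))
    return float(np.mean(err < thresh))
```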
6. Limitations and Prospects for Future Work
RadarMOSEVE's reliance on static-background returns limits its ability to disentangle ego-motion from moving-object motion in scenes where all observed points are dynamic. Sparse detection returns, especially for small or fast-moving objects, can yield false negatives. Potential enhancements include multi-modal fusion (camera/LiDAR) and extending the temporal context beyond two frames to mitigate ambiguities in motion attribution. These avenues could further improve robustness to real-world complexities.
7. Significance and Impact
RadarMOSEVE demonstrates that radar-only methods, leveraging spatial-temporal transformers and explicit Doppler integration, can achieve state-of-the-art MOS and EVE from sparse 4D radar, attaining 70.2% MOS mIoU and 0.182 m/s EVE MAE on diverse, annotated datasets. This presents a cost-effective, weather-resilient alternative for autonomous navigation and perception, particularly for environments and conditions unfavorable to optical sensors (Pang et al., 2024).