R-U-MAAD Benchmark: Anomaly Detection in Urban Driving
- R-U-MAAD Benchmark is a standard platform for evaluating unsupervised anomaly detection in multi-agent urban driving using realistic trajectory data.
- It implements and compares methods including reconstruction-based auto-encoders, one-class SVMs, and end-to-end Deep SVDD to score abnormal behaviors.
- Results show deep reconstruction methods, particularly STGAE, outperform linear baselines, highlighting the importance of modeling agent interactions in urban scenarios.
The R-U-MAAD (Realistic Urban Multi-Agent Anomaly Detection) benchmark is a standard platform for evaluating unsupervised anomaly detection algorithms in multi-agent urban driving scenarios. Its primary aim is to facilitate apples-to-apples comparison of various methods, particularly for detecting rare or abnormal agent behaviors from trajectories in realistic urban environments, using representations learned exclusively from normal (inlier) data (Wiederer et al., 2022).
1. Formal Problem Specification
Unsupervised anomaly detection in multi-agent trajectories focuses on learning a scoring function that assigns high scores to agents exhibiting abnormal behaviors, given only unlabeled "normal" driving sequences for training. For each agent $i$, the 2D position $\mathbf{x}^i_t \in \mathbb{R}^2$ is observed over a window $t = 1, \dots, T$, yielding the trajectory $\mathbf{X}^i = (\mathbf{x}^i_1, \dots, \mathbf{x}^i_T)$. The full scene input is $\mathcal{X} = \{\mathbf{X}^1, \dots, \mathbf{X}^N\}$ for $N$ agents, optionally augmented with static context such as HD maps.
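As a concrete picture of this input format, a scene can be held in a single array; the shapes and names below are illustrative, not part of the benchmark's API.

```python
import numpy as np

# Illustrative scene: N agents, each observed for T frames at 10 Hz,
# with a 2D (x, y) position per frame in a common world frame.
N, T = 5, 16
scene = np.zeros((N, T, 2))  # scene[i] is the trajectory X^i of agent i

def score_scene(scene, score_fn):
    """Apply a per-agent anomaly scoring function to every trajectory."""
    return np.array([score_fn(traj) for traj in scene])
```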
Three principal unsupervised anomaly scoring mechanisms are implemented; a code sketch follows the list:
- Reconstruction-based scoring (Auto-Encoder):
$$s^i_{\mathrm{rec}} = \frac{1}{T} \sum_{t=1}^{T} \big\lVert \mathbf{x}^i_t - \hat{\mathbf{x}}^i_t \big\rVert_2^2,$$
where $\hat{\mathbf{x}}^i_t$ is the decoded output of the encoder-decoder network.
- One-class SVM:
The one-class SVM seeks to enclose normal trajectory features in a small region, solving
$$\min_{\mathbf{w}, \rho, \boldsymbol{\xi}} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho \quad \text{s.t.} \quad \mathbf{w}^{\top} \phi(\mathbf{z}^i) \geq \rho - \xi_i, \;\; \xi_i \geq 0,$$
with test-time score $s^i_{\mathrm{OC}} = \rho - \mathbf{w}^{\top} \phi(\mathbf{z}^i)$ for latent code $\mathbf{z}^i$.
- Deep SVDD:
Deep SVDD ("Deep Support Vector Data Description") minimizes the squared distance from an embedding $\psi(\mathbf{X}^i)$ to a fixed center $\mathbf{c}$ in latent space:
$$\mathcal{L}_{\mathrm{DSVDD}} = \frac{1}{n} \sum_{i=1}^{n} \big\lVert \psi(\mathbf{X}^i) - \mathbf{c} \big\rVert_2^2,$$
with the combined loss $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda \mathcal{L}_{\mathrm{DSVDD}}$ and anomaly score $s^i_{\mathrm{DSVDD}} = \lVert \psi(\mathbf{X}^i) - \mathbf{c} \rVert_2^2$.
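For concreteness, here is a minimal sketch of the three scores, assuming a trained encoder that maps a trajectory window to a latent code and a decoder that reconstructs it; all names are illustrative, and the OC-SVM uses scikit-learn's `OneClassSVM` rather than the benchmark's own implementation.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# 1) Reconstruction-based score: mean squared error between the observed
#    window x (T, 2) and the auto-encoder output x_hat (T, 2).
def reconstruction_score(x, x_hat):
    return np.mean(np.sum((x - x_hat) ** 2, axis=-1))

# 2) Two-stage one-class SVM: fit on latent codes of normal trajectories
#    only; sklearn's decision_function is higher for inliers, so negate it
#    to obtain an anomaly score (rho - w^T phi(z), up to sign conventions).
def fit_ocsvm(latent_train):
    # latent_train: (n_samples, d) encoder outputs of normal data
    return OneClassSVM(kernel="rbf").fit(latent_train)

def ocsvm_score(model, z):
    return -model.decision_function(z[None, :])[0]

# 3) Deep SVDD score: squared distance of the embedding psi(X) to the
#    fixed center c; at training time this term is combined with the
#    reconstruction loss as L = L_rec + lambda * L_dsvdd.
def dsvdd_score(z, center):
    return float(np.sum((z - center) ** 2))
```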
2. Benchmark Construction and Data Annotation
The benchmark re-purposes the Argoverse Motion Forecasting dataset for unsupervised anomaly detection:
- Training/Validation: Uses 205,942 training and 39,472 validation sequences from Argoverse, clipped to 1.6 s windows (16 frames at 10 Hz), exclusively from normal data and without anomaly annotations.
- Test Set: Comprised of 160 sequences—80 “normal” and 80 “abnormal.” Each test sequence is generated in simulation by:
- Replaying recorded real-world agents in an OpenAI Gym-based simulation.
- Hijacking a single target vehicle per scene (rendered in red) to execute abnormal maneuvers under human control, via a kinematic car model aligned to Argoverse dynamics (a generic sketch follows this list).
- All other agents (“background,” rendered blue) remain as recorded.
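The benchmark's exact vehicle model is not reproduced here; the following is a generic kinematic bicycle step of the kind such a setup typically uses, with all parameters (wheelbase, time step) illustrative.

```python
import numpy as np

def kinematic_bicycle_step(state, accel, steer, wheelbase=2.7, dt=0.1):
    """One simulation step of a generic kinematic bicycle model.

    state: (x, y, heading, speed); accel in m/s^2; steer is the front-wheel
    angle in radians. dt = 0.1 s matches the 10 Hz Argoverse frame rate.
    """
    x, y, heading, v = state
    x += v * np.cos(heading) * dt
    y += v * np.sin(heading) * dt
    heading += v / wheelbase * np.tan(steer) * dt
    v += accel * dt
    return np.array([x, y, heading, v])
```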
Annotation and Abnormality Classes:
Frame-wise human annotations in ELAN designate each frame as one of 9 normal maneuvers, 13 abnormal maneuvers, or “ignore” if inconclusive. Abnormal behaviors are classified as actor-interactive, map-interactive, or both. The label distribution is 1,412 abnormal, 5,695 normal, and 438 ignore time-steps.
Abnormal Maneuver Classes:
| Class | Actor-inter. | Map-inter. | # frames |
|---|---|---|---|
| ghost driver | ✓ | ✓ | 202 |
| leave road | | ✓ | 186 |
| thwarting | ✓ | | 179 |
| cancel turn | ✓ | ✓ | 156 |
| last minute turn | | ✓ | 114 |
| enter wrong lane | | ✓ | 101 |
| staggering | | ✓ | 92 |
| pushing away | ✓ | | 84 |
| swerving (l/r) | | ✓ | 77/26 |
| tailgating | ✓ | | 69 |
| aggressive shearing (l/r) | ✓ | ✓ | 62/64 |
3. Baseline Methods
Eleven baseline models are implemented and grouped as follows:
- Linear Reconstruction:
- CVM (Constant Velocity Model): Estimates the agent's velocity from the first two frames, extrapolates it over the window, and scores the MSE to the observed trajectory (see the sketch at the end of this section).
- LTI (Linear Temporal Interpolation): Linearly interpolates positions between the start and end of the window and scores the deviation.
- Deep Auto-Encoders:
- Seq2Seq: recurrent (LSTM) encoder-decoder over single-agent trajectories.
- STGAE: spatio-temporal graph auto-encoder that models agent interactions.
- LaneGCN-AE: auto-encoder built on LaneGCN, adding HD-map context.
- Two-Stage One-Class Models:
Train the above AEs, then fit a one-class SVM with an RBF kernel on the latent codes.
- Seq2Seq+OC-SVM, STGAE+OC-SVM, LaneGCN-AE+OC-SVM.
- End-to-End Deep SVDD Models:
- Seq2Seq+DSVDD, STGAE+DSVDD, LaneGCN-AE+DSVDD: Jointly optimize AE reconstruction and DSVDD objectives.
AEs are trained for 36 epochs, selecting the model with the best validation loss. Seq2Seq and STGAE use 8-dimensional embeddings and LSTMs with 16 hidden units; LaneGCN-AE uses an actor feature dimension of 16.
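Both linear baselines have closed-form reconstructions and need no training; a minimal sketch (function names illustrative, trajectories as `(T, 2)` arrays):

```python
import numpy as np

def cvm_score(traj):
    """Constant Velocity Model: extrapolate the velocity estimated from the
    first two frames and score the MSE to the observed trajectory."""
    v = traj[1] - traj[0]                  # displacement per frame
    steps = np.arange(len(traj))[:, None]  # 0, 1, ..., T-1
    pred = traj[0] + steps * v
    return np.mean(np.sum((traj - pred) ** 2, axis=-1))

def lti_score(traj):
    """Linear Temporal Interpolation: interpolate between the first and last
    position of the window and score the deviation."""
    alphas = np.linspace(0.0, 1.0, len(traj))[:, None]
    pred = (1.0 - alphas) * traj[0] + alphas * traj[-1]
    return np.mean(np.sum((traj - pred) ** 2, axis=-1))
```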
4. Evaluation Protocol and Metrics
Scoring is conducted on a sliding-window basis with window length 16 and stride 1, ignoring the first 15 frames of each sequence (which lack a complete window). Evaluation is based on threshold-free detection metrics:
- AUROC: Area Under the Receiver Operating Characteristic curve.
- AUPR-Abnormal: Area under the precision-recall curve with “abnormal” as the positive class.
- AUPR-Normal: Area under the precision-recall curve with “normal” as the positive class.
- FPR@95%TPR: False-Positive Rate at 95% True-Positive Rate.
Definitions: with recall $r$ and false-positive rate $f$,
$$\mathrm{AUPR} = \int_0^1 \mathrm{Prec}(r) \, \mathrm{d}r, \qquad \mathrm{AUROC} = \int_0^1 \mathrm{TPR}(f) \, \mathrm{d}f.$$
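A sketch of the four metrics with scikit-learn, assuming per-frame anomaly `scores` and binary `labels` with 1 marking abnormal frames; the helper name is ours, not the benchmark's API.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def detection_metrics(labels, scores):
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    auroc = roc_auc_score(labels, scores)
    aupr_abnormal = average_precision_score(labels, scores)
    # AUPR-Normal: treat "normal" as the positive class by flipping
    # both the labels and the score ordering.
    aupr_normal = average_precision_score(1 - labels, -scores)
    # FPR at the first operating point reaching >= 95% TPR.
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr_at_95tpr = fpr[np.searchsorted(tpr, 0.95)]
    return auroc, aupr_abnormal, aupr_normal, fpr_at_95tpr
```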
5. Quantitative Results
Performance of all baseline models (test set, 160 sequences):
| Category | Method | AUPR-Abnormal ↑ | AUPR-Normal ↑ | AUROC ↑ | FPR@95%TPR ↓ |
|---|---|---|---|---|---|
| Linear Reconstruction | CVM | 47.19 | 86.00 | 72.30 | 81.20 |
| | LTI | 50.45 | 85.71 | 73.14 | 82.22 |
| Deep Auto-Encoders | Seq2Seq | 59.21 | 88.07 | 76.56 | 77.62 |
| | STGAE | 59.65 | 87.85 | 76.75 | 76.48 |
| | LaneGCN-AE | 57.19 | 87.22 | 75.25 | 75.94 |
| Two-Stage One-Class | Seq2Seq+OC-SVM | 34.47 | 70.25 | 50.47 | 98.33 |
| | STGAE+OC-SVM | 33.32 | 77.71 | 59.16 | 91.27 |
| | LaneGCN-AE+OC-SVM | 51.88 | 86.93 | 72.94 | 82.02 |
| End-to-End DSVDD | Seq2Seq+DSVDD | 51.37 | 82.47 | 69.34 | 88.79 |
| | STGAE+DSVDD | 48.09 | 83.59 | 69.65 | 85.44 |
| | LaneGCN-AE+DSVDD | 53.14 | 85.21 | 72.33 | 85.55 |
Key findings:
- Deep auto-encoder reconstruction methods, particularly STGAE, outperform linear and OC-SVM baselines in all major metrics.
- STGAE yields the highest AUPR-Abnormal (59.65%) and AUROC (76.75%).
- End-to-end DSVDD models, especially with the LaneGCN-AE backbone, outperform their two-stage OC-SVM counterparts and narrow the gap to the pure reconstruction baselines.
- OC-SVM applied to AE latent codes underperforms, indicating that anomalies are better separated in output than in latent space.
6. Main Insights and Research Directions
R-U-MAAD standardizes rigorous evaluation of unsupervised anomaly detection for multi-agent urban driving. Deep reconstruction methods are currently the most effective, with explicit modeling of agent interactions (STGAE) offering incremental gains. Joint training with deep SVDD objectives provides additional robustness to anomalies. Linear methods and traditional OC-SVMs applied to latent features do not suffice for identifying complex urban driving anomalies.
Challenges persist in formulating map- and interaction-aware anomaly losses, propagating supervision via semi-supervised or label-efficient methods, and enabling models to adapt online to novel scenes and situations. Continued research into more sophisticated representation learning and detection functions is required to achieve robust, real-world multi-agent anomaly detection (Wiederer et al., 2022).