
R-U-MAAD Benchmark: Anomaly Detection in Urban Driving

Updated 30 November 2025
  • R-U-MAAD Benchmark is a standard platform for evaluating unsupervised anomaly detection in multi-agent urban driving using realistic trajectory data.
  • It implements and compares methods including reconstruction-based auto-encoders, one-class SVMs, and end-to-end Deep SVDD to score abnormal behaviors.
  • Results show deep reconstruction methods, particularly STGAE, outperform linear baselines, highlighting the importance of modeling agent interactions in urban scenarios.

The R-U-MAAD (Realistic Urban Multi-Agent Anomaly Detection) benchmark is a standard platform for evaluating unsupervised anomaly detection algorithms in multi-agent urban driving scenarios. Its primary aim is to facilitate apples-to-apples comparison of various methods, particularly for detecting rare or abnormal agent behaviors from trajectories in realistic urban environments, using representations learned exclusively from normal (inlier) data (Wiederer et al., 2022).

1. Formal Problem Specification

Unsupervised anomaly detection in multi-agent trajectories focuses on learning a scoring function $s(\mathbf{S}_i, \mathbf{S}_{\setminus i}, \mathcal{I})$ that assigns high scores to agents exhibiting abnormal behaviors, given only unlabeled "normal" driving sequences for training. For each agent $i$, the 2D position $\mathbf{s}_i^t = (x_i^t, y_i^t)$ is observed over a window of $T$ time-steps, yielding the trajectory $\mathbf{S}_i = \{\mathbf{s}_i^t \mid t = -T+1, \ldots, 0\}$. The full scene input is $\mathbf{S} = \{\mathbf{S}_i \mid i = 1, \ldots, N\}$, optionally augmented with static context $\mathcal{I}$ such as HD maps.
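For concreteness, a scene can be represented as plain arrays. The sketch below is illustrative only; the names and shapes are assumptions, not the benchmark's actual data format:

```python
import numpy as np

T = 16  # observation window: 1.6 s at 10 Hz
N = 5   # number of agents in the scene

# Trajectory of one agent: T time-steps of 2D positions (x, y),
# indexed t = -T+1, ..., 0 in the paper's notation.
S_i = np.zeros((T, 2))

# Full scene input S: one trajectory per agent.
S = np.zeros((N, T, 2))

# Optional static context I (e.g., HD-map features); the flat
# feature vector used here is purely an assumption.
map_context = np.zeros((128,))
```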

Three principal unsupervised anomaly scoring mechanisms are implemented; a combined sketch of all three scorers follows this list:

  • Reconstruction-based scoring (Auto-Encoder):

$$s_{\rm rec}(\mathbf{S}_i) = \mathcal{L}_r = \frac{1}{T} \sum_{t=0}^{T-1} \|\mathbf{s}_i^t - \hat{\mathbf{s}}_i^t\|_2^2$$

where $\hat{\mathbf{s}}_i^t$ is the decoded output of the encoder-decoder network.

  • One-class SVM:

The one-class SVM seeks to enclose the features $\phi(x_j)$ of normal trajectories in a small region, solving

$$\min_{w,\rho,\xi} \ \frac{1}{2}\|w\|^2 + \frac{1}{\nu N}\sum_{j=1}^{N} \xi_j - \rho \quad \text{s.t.} \ \ w^\top \phi(x_j) \geq \rho - \xi_j, \ \ \xi_j \geq 0$$

with test-time score $s_{\rm OCSVM}(x) = \rho - w^\top \phi(x)$.

  • Deep SVDD:

Deep SVDD ("Deep Support Vector Data Description") minimizes the squared distance from an embedding $z_j = e(x_j)$ to a fixed center $c$ in latent space:

$$\mathcal{L}_a = \frac{1}{N} \sum_{j=1}^{N} \|z_j - c\|_2^2$$

with the combined loss $\mathcal{L} = \mathcal{L}_r + \lambda \mathcal{L}_a$ and anomaly score $s_{\rm DSVDD}(x) = \|e(x) - c\|_2$.
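As referenced above, here is a minimal sketch of the three scorers using NumPy and scikit-learn. The `decode_encode` callable, the latent codes `z`, and the chosen hyperparameters are illustrative placeholders, not the benchmark's actual API:

```python
import numpy as np
from sklearn.svm import OneClassSVM

def s_rec(S_i, decode_encode):
    """Reconstruction score: mean squared error between a (T, 2)
    trajectory and its auto-encoder reconstruction."""
    S_hat = decode_encode(S_i)
    return np.mean(np.sum((S_i - S_hat) ** 2, axis=-1))

def fit_ocsvm(Z_train, nu=0.1, gamma=2 ** -5):
    """Two-stage baseline: fit a one-class SVM on latent codes of
    normal trajectories (hyperparameter grid as in Section 3)."""
    return OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(Z_train)

def s_ocsvm(ocsvm, z):
    """sklearn's decision_function is positive for inliers, so
    negate it to obtain a higher-is-more-anomalous score."""
    return -ocsvm.decision_function(z.reshape(1, -1))[0]

def s_dsvdd(z, c):
    """Deep SVDD score: distance of the embedding z = e(x) to the
    fixed center c in latent space."""
    return np.linalg.norm(z - c)
```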

2. Benchmark Construction and Data Annotation

The benchmark re-purposes the Argoverse Motion Forecasting dataset for unsupervised anomaly detection:

  • Training/Validation: Uses the 205,942 training and 39,472 validation sequences from Argoverse, clipped to 1.6 s windows ($T=16$ frames at 10 Hz; see the windowing sketch after this list), drawn exclusively from normal data and carrying no anomaly annotations.
  • Test Set: Consists of 160 sequences, 80 “normal” and 80 “abnormal.” Each test sequence is generated in simulation by:
    • Replaying recorded real-world agents in an OpenAI Gym environment.
    • Hijacking a single target vehicle per scene (rendered in red) to execute abnormal maneuvers under human control, using a kinematic car model aligned to Argoverse dynamics.
    • Leaving all other agents (“background,” rendered blue) as recorded.
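A minimal sketch of the window clipping, assuming trajectories are stored as (L, 2) position arrays (the function name is illustrative); with stride 1, the same routine reproduces the sliding-window scoring of Section 4:

```python
import numpy as np

def clip_windows(trajectory, T=16, stride=1):
    """Split an (L, 2) trajectory of 2D positions into overlapping
    windows of T frames (1.6 s at 10 Hz)."""
    L = len(trajectory)
    return np.stack([trajectory[i:i + T]
                     for i in range(0, L - T + 1, stride)])
```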

Annotation and Abnormality Classes:

Frame-wise human annotations, produced in ELAN, designate each frame as one of 9 normal maneuvers, 13 abnormal maneuvers, or “ignore” if inconclusive. Abnormal behaviors are classified as actor-interactive, map-interactive, or both. The resulting distribution is 1,412 abnormal, 5,695 normal, and 438 ignore time-steps.

Abnormal Maneuver Classes (each class is tagged as actor-interactive, map-interactive, or both):

Class                       # frames
ghost driver                202
leave road                  186
thwarting                   179
cancel turn                 156
last minute turn            114
enter wrong lane            101
staggering                  92
pushing away                84
swerving (l/r)              77 / 26
tailgating                  69
aggressive shearing (l/r)   62 / 64

3. Baseline Methods

Eleven baseline models are implemented and grouped as follows:

  • Linear Reconstruction (a minimal sketch of both follows this list):
    • CVM (Constant Velocity Model): Fits the agent's velocity from the first two frames, extrapolates it linearly, and scores by MSE.
    • LTI (Linear Temporal Interpolation): Linearly interpolates points between the start and end of the window.
  • Deep Auto-Encoders:
    • Seq2Seq: LSTM encoder-decoder (single agent, no context).
    • STGAE: Spatio-temporal graph AE with neighboring-agent aggregation (GCN+LSTM).
    • LaneGCN-AE: Map- and actor-aware FusionNet (from LaneGCN), repurposed with an AE loss.
  • Two-Stage One-Class Models: Train the above AEs, then fit a one-class SVM (RBF kernel; $\gamma \in \{2^{-10}, \ldots, 2^{-1}\}$, $\nu \in \{0.01, 0.1\}$) on the latent codes.
    • Seq2Seq+OC-SVM, STGAE+OC-SVM, LaneGCN-AE+OC-SVM.
  • End-to-End Deep SVDD Models:
    • Seq2Seq+DSVDD, STGAE+DSVDD, LaneGCN-AE+DSVDD: Jointly optimize the AE reconstruction and DSVDD objectives.
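As referenced in the list above, a minimal sketch of the two linear baselines, assuming (T, 2) trajectory arrays as in Section 1 (an illustration, not the reference implementation):

```python
import numpy as np

def s_cvm(S_i):
    """Constant Velocity Model: estimate the velocity from the first
    two frames, extrapolate linearly, and score by MSE."""
    v = S_i[1] - S_i[0]                        # displacement per frame
    t = np.arange(len(S_i))[:, None]           # (T, 1) frame indices
    S_hat = S_i[0] + t * v                     # linear extrapolation
    return np.mean(np.sum((S_i - S_hat) ** 2, axis=-1))

def s_lti(S_i):
    """Linear Temporal Interpolation: reconstruct each point on the
    segment between the window's first and last positions."""
    T = len(S_i)
    alpha = np.linspace(0.0, 1.0, T)[:, None]  # (T, 1) blend weights
    S_hat = (1 - alpha) * S_i[0] + alpha * S_i[-1]
    return np.mean(np.sum((S_i - S_hat) ** 2, axis=-1))
```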

AEs are trained for 36 epochs, selecting the model with the best validation loss. Seq2Seq and STGAE use 8-dimensional embeddings and LSTMs with 16 cells; LaneGCN-AE uses an actor feature dimension of 16.

4. Evaluation Protocol and Metrics

Scoring is conducted on a sliding-window basis with $T=16$ and stride 1, ignoring the first 15 frames of each sequence (cf. the windowing sketch in Section 2). Evaluation is based on threshold-free detection metrics (a metric-computation sketch follows the definitions below):

  • AUROC: Area Under the Receiver Operating Characteristic curve.
  • AUPR$_{\rm Abn}$: Area under the precision-recall curve with “abnormal” as the positive class.
  • AUPR$_{\rm Norm}$: Area under the precision-recall curve with “normal” as the positive class.
  • FPR@95%TPR: False-Positive Rate at 95% True-Positive Rate.

Definitions:

$$\text{Precision}\ (p) = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}, \qquad \text{Recall}\ (r) = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$

AUPR $= \int_0^1 p(r)\,dr$ and AUROC $= \int_0^1 t(f)\,df$, where $t(f)$ denotes the true-positive rate as a function of the false-positive rate.
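As referenced above, the four metrics can be computed from frame-wise scores and labels with scikit-learn; the FPR@95%TPR helper is a common construction, assumed here rather than taken from the benchmark code:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def evaluate(scores, labels):
    """scores: array, higher = more anomalous; labels: 0/1 array
    with 1 = abnormal frame."""
    auroc = roc_auc_score(labels, scores)
    aupr_abn = average_precision_score(labels, scores)
    # AUPR-Norm: treat "normal" as the positive class by flipping
    # labels and negating scores.
    aupr_norm = average_precision_score(1 - labels, -scores)
    # FPR at the first operating point reaching 95% TPR.
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.searchsorted(tpr, 0.95)
    fpr_at_95tpr = fpr[min(idx, len(fpr) - 1)]
    return auroc, aupr_abn, aupr_norm, fpr_at_95tpr
```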

5. Quantitative Results

Performance of all baseline models (test set, 160 sequences):

Category                 Method               AUPR-Abn ↑   AUPR-Norm ↑   AUROC ↑   FPR@95%TPR ↓
Linear Reconstruction    CVM                  47.19        86.00         72.30     81.20
                         LTI                  50.45        85.71         73.14     82.22
Deep Auto-Encoders       Seq2Seq              59.21        88.07         76.56     77.62
                         STGAE                59.65        87.85         76.75     76.48
                         LaneGCN-AE           57.19        87.22         75.25     75.94
Two-Stage One-Class      Seq2Seq+OC-SVM       34.47        70.25         50.47     98.33
                         STGAE+OC-SVM         33.32        77.71         59.16     91.27
                         LaneGCN-AE+OC-SVM    51.88        86.93         72.94     82.02
End-to-End DSVDD         Seq2Seq+DSVDD        51.37        82.47         69.34     88.79
                         STGAE+DSVDD          48.09        83.59         69.65     85.44
                         LaneGCN-AE+DSVDD     53.14        85.21         72.33     85.55

Key findings:

  • Deep auto-encoder reconstruction methods, particularly STGAE, outperform linear and OC-SVM baselines in all major metrics.
  • STGAE yields the highest AUPR-Abnormal (59.65%) and AUROC (76.75%).
  • End-to-end DSVDD models, especially with the LaneGCN-AE backbone, narrow the gap to the pure reconstruction baselines but do not surpass them.
  • OC-SVM applied to AE latent codes underperforms, indicating that anomalies are better separated in output than in latent space.

6. Main Insights and Research Directions

R-U-MAAD standardizes rigorous evaluation of unsupervised anomaly detection for multi-agent urban driving. Deep reconstruction methods are currently the most effective, with explicit modeling of agent interactions (STGAE) offering incremental gains. Joint training with deep SVDD objectives provides additional robustness to anomalies. Linear methods and traditional OC-SVMs applied to latent features do not suffice for identifying complex urban driving anomalies.

Challenges persist in formulating map- and interaction-aware anomaly losses, propagating supervision via semi-supervised or label-efficient methods, and enabling models to adapt online to novel scenes and situations. Continued research into more sophisticated representation learning and detection functions is required to achieve robust, real-world multi-agent anomaly detection (Wiederer et al., 2022).

References

  1. Wiederer, J., Schmidt, J., Kressel, U., Dietmayer, K., & Belagiannis, V. (2022). A Benchmark for Unsupervised Anomaly Detection in Multi-Agent Trajectories. Proc. IEEE International Conference on Intelligent Transportation Systems (ITSC).