R-U-MAAD Benchmark: Anomaly Detection in Urban Driving
- R-U-MAAD Benchmark is a standard platform for evaluating unsupervised anomaly detection in multi-agent urban driving using realistic trajectory data.
- It implements and compares methods including reconstruction-based auto-encoders, one-class SVMs, and end-to-end Deep SVDD to score abnormal behaviors.
- Results show deep reconstruction methods, particularly STGAE, outperform linear baselines, highlighting the importance of modeling agent interactions in urban scenarios.
The R-U-MAAD (Realistic Urban Multi-Agent Anomaly Detection) benchmark is a standard platform for evaluating unsupervised anomaly detection algorithms in multi-agent urban driving scenarios. Its primary aim is to facilitate apples-to-apples comparison of various methods, particularly for detecting rare or abnormal agent behaviors from trajectories in realistic urban environments, using representations learned exclusively from normal (inlier) data (Wiederer et al., 2022).
1. Formal Problem Specification
Unsupervised anomaly detection in multi-agent trajectories focuses on learning a scoring function that assigns high scores to agents exhibiting abnormal behaviors, given only unlabeled "normal" driving sequences for training. For each agent $i$, the 2D position $\mathbf{x}^i_t \in \mathbb{R}^2$ is observed over a window $t = 1, \dots, T$, yielding the trajectory $\mathbf{X}^i = (\mathbf{x}^i_1, \dots, \mathbf{x}^i_T)$. The full scene input is $\mathcal{X} = \{\mathbf{X}^1, \dots, \mathbf{X}^N\}$ for $N$ agents, optionally augmented with static context such as HD maps.
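As a concrete picture of this input format, a scene can be held in a single array; the shapes and names below are illustrative, not part of the benchmark's API.

```python
import numpy as np

# Illustrative scene: N agents, each observed for T frames at 10 Hz,
# with a 2D (x, y) position per frame in a common world frame.
N, T = 5, 16
scene = np.zeros((N, T, 2))  # scene[i] is the trajectory X^i of agent i

def score_scene(scene, score_fn):
    """Apply a per-agent anomaly scoring function to every trajectory."""
    return np.array([score_fn(traj) for traj in scene])
```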
Three principal unsupervised anomaly scoring mechanisms are implemented; a code sketch follows the list:
- Reconstruction-based scoring (Auto-Encoder):
$$s^i_{\mathrm{rec}} = \frac{1}{T} \sum_{t=1}^{T} \big\lVert \mathbf{x}^i_t - \hat{\mathbf{x}}^i_t \big\rVert_2^2,$$
where $\hat{\mathbf{x}}^i_t$ is the decoded output of the encoder-decoder network.
- One-class SVM:
The one-class SVM seeks to enclose normal trajectory features in a small region, solving
$$\min_{\mathbf{w}, \rho, \boldsymbol{\xi}} \; \frac{1}{2} \lVert \mathbf{w} \rVert^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho \quad \text{s.t.} \quad \mathbf{w}^{\top} \phi(\mathbf{z}^i) \geq \rho - \xi_i, \;\; \xi_i \geq 0,$$
with test-time score $s^i_{\mathrm{OC}} = \rho - \mathbf{w}^{\top} \phi(\mathbf{z}^i)$ for latent code $\mathbf{z}^i$.
- Deep SVDD:
Deep SVDD ("Deep Support Vector Data Description") minimizes the squared distance from an embedding $\psi(\mathbf{X}^i)$ to a fixed center $\mathbf{c}$ in latent space:
$$\mathcal{L}_{\mathrm{DSVDD}} = \frac{1}{n} \sum_{i=1}^{n} \big\lVert \psi(\mathbf{X}^i) - \mathbf{c} \big\rVert_2^2,$$
with the combined loss $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda \mathcal{L}_{\mathrm{DSVDD}}$ and anomaly score $s^i_{\mathrm{DSVDD}} = \lVert \psi(\mathbf{X}^i) - \mathbf{c} \rVert_2^2$.
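For concreteness, here is a minimal sketch of the three scores, assuming a trained encoder that maps a trajectory window to a latent code and a decoder that reconstructs it; all names are illustrative, and the OC-SVM uses scikit-learn's `OneClassSVM` rather than the benchmark's own implementation.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# 1) Reconstruction-based score: mean squared error between the observed
#    window x (T, 2) and the auto-encoder output x_hat (T, 2).
def reconstruction_score(x, x_hat):
    return np.mean(np.sum((x - x_hat) ** 2, axis=-1))

# 2) Two-stage one-class SVM: fit on latent codes of normal trajectories
#    only; sklearn's decision_function is higher for inliers, so negate it
#    to obtain an anomaly score (rho - w^T phi(z), up to sign conventions).
def fit_ocsvm(latent_train):
    # latent_train: (n_samples, d) encoder outputs of normal data
    return OneClassSVM(kernel="rbf").fit(latent_train)

def ocsvm_score(model, z):
    return -model.decision_function(z[None, :])[0]

# 3) Deep SVDD score: squared distance of the embedding psi(X) to the
#    fixed center c; at training time this term is combined with the
#    reconstruction loss as L = L_rec + lambda * L_dsvdd.
def dsvdd_score(z, center):
    return float(np.sum((z - center) ** 2))
```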
2. Benchmark Construction and Data Annotation
The benchmark re-purposes the Argoverse Motion Forecasting dataset for unsupervised anomaly detection:
- Training/Validation: Uses 205,942 training and 39,472 validation sequences from Argoverse, clipped to 1.6 s windows (16 frames at 10 Hz), exclusively from normal data and without anomaly annotations.
- Test Set: Comprised of 160 sequences—80 “normal” and 80 “abnormal.” Each test sequence is generated in simulation by:
- Replaying recorded real-world agents in an OpenAI Gym-based simulation.
- Hijacking a single target vehicle per scene (rendered in red) to execute abnormal maneuvers under human control, via a kinematic car model aligned to Argoverse dynamics (a generic sketch follows this list).
- All other agents (“background,” rendered blue) remain as recorded.
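The benchmark's exact vehicle model is not reproduced here; the following is a generic kinematic bicycle step of the kind such a setup typically uses, with all parameters (wheelbase, time step) illustrative.

```python
import numpy as np

def kinematic_bicycle_step(state, accel, steer, wheelbase=2.7, dt=0.1):
    """One simulation step of a generic kinematic bicycle model.

    state: (x, y, heading, speed); accel in m/s^2; steer is the front-wheel
    angle in radians. dt = 0.1 s matches the 10 Hz Argoverse frame rate.
    """
    x, y, heading, v = state
    x += v * np.cos(heading) * dt
    y += v * np.sin(heading) * dt
    heading += v / wheelbase * np.tan(steer) * dt
    v += accel * dt
    return np.array([x, y, heading, v])
```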
Annotation and Abnormality Classes:
Frame-wise human annotations in ELAN designate each frame as one of 9 normal maneuvers, 13 abnormal maneuvers, or “ignore” if inconclusive. Abnormal behaviors are classified as actor-interactive, map-interactive, or both. The label distribution is 1,412 abnormal, 5,695 normal, and 438 ignore time-steps.
Abnormal Maneuver Classes:
| Class | Actor-inter. | Map-inter. | # frames |
|---|---|---|---|
| ghost driver | ✓ | ✓ | 202 |
| leave road | | ✓ | 186 |
| thwarting | ✓ | | 179 |
| cancel turn | ✓ | ✓ | 156 |
| last minute turn | | ✓ | 114 |
| enter wrong lane | | ✓ | 101 |
| staggering | | ✓ | 92 |
| pushing away | ✓ | | 84 |
| swerving (l/r) | | ✓ | 77/26 |
| tailgating | ✓ | | 69 |
| aggressive shearing (l/r) | ✓ | ✓ | 62/64 |
3. Baseline Methods
Eleven baseline models are implemented and grouped as follows:
- Linear Reconstruction:
- CVM (Constant Velocity Model): Estimates the agent's velocity from the first two frames, extrapolates it over the window, and scores the MSE to the observed trajectory (see the sketch at the end of this section).
- LTI (Linear Temporal Interpolation): Linearly interpolates positions between the start and end of the window and scores the deviation.
- Deep Auto-Encoders:
- Seq2Seq: recurrent (LSTM) encoder-decoder over single-agent trajectories.
- STGAE: spatio-temporal graph auto-encoder that models agent interactions.
- LaneGCN-AE: auto-encoder built on LaneGCN, adding HD-map context.
- Two-Stage One-Class Models:
Train the above AEs, then fit a one-class SVM with an RBF kernel on the latent codes.
- Seq2Seq+OC-SVM, STGAE+OC-SVM, LaneGCN-AE+OC-SVM.
- End-to-End Deep SVDD Models:
- Seq2Seq+DSVDD, STGAE+DSVDD, LaneGCN-AE+DSVDD: Jointly optimize AE reconstruction and DSVDD objectives.
AEs are trained for 36 epochs, selecting the model with the best validation loss. Seq2Seq and STGAE use 8-dimensional embeddings and LSTMs with 16 hidden units; LaneGCN-AE uses an actor feature dimension of 16.
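Both linear baselines have closed-form reconstructions and need no training; a minimal sketch (function names illustrative, trajectories as `(T, 2)` arrays):

```python
import numpy as np

def cvm_score(traj):
    """Constant Velocity Model: extrapolate the velocity estimated from the
    first two frames and score the MSE to the observed trajectory."""
    v = traj[1] - traj[0]                  # displacement per frame
    steps = np.arange(len(traj))[:, None]  # 0, 1, ..., T-1
    pred = traj[0] + steps * v
    return np.mean(np.sum((traj - pred) ** 2, axis=-1))

def lti_score(traj):
    """Linear Temporal Interpolation: interpolate between the first and last
    position of the window and score the deviation."""
    alphas = np.linspace(0.0, 1.0, len(traj))[:, None]
    pred = (1.0 - alphas) * traj[0] + alphas * traj[-1]
    return np.mean(np.sum((traj - pred) ** 2, axis=-1))
```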
4. Evaluation Protocol and Metrics
Scoring is conducted on a sliding-window basis with window length 16 and stride 1, ignoring the first 15 frames of each sequence (which lack a complete window). Evaluation is based on threshold-free detection metrics:
- AUROC: Area Under the Receiver Operating Characteristic curve.
- AUPR-Abnormal: Area under the precision-recall curve with “abnormal” as the positive class.
- AUPR-Normal: Area under the precision-recall curve with “normal” as the positive class.
- FPR@95%TPR: False-Positive Rate at 95% True-Positive Rate.
Definitions: with recall $r$ and false-positive rate $f$,
$$\mathrm{AUPR} = \int_0^1 \mathrm{Prec}(r) \, \mathrm{d}r, \qquad \mathrm{AUROC} = \int_0^1 \mathrm{TPR}(f) \, \mathrm{d}f.$$
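A sketch of the four metrics with scikit-learn, assuming per-frame anomaly `scores` and binary `labels` with 1 marking abnormal frames; the helper name is ours, not the benchmark's API.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def detection_metrics(labels, scores):
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    auroc = roc_auc_score(labels, scores)
    aupr_abnormal = average_precision_score(labels, scores)
    # AUPR-Normal: treat "normal" as the positive class by flipping
    # both the labels and the score ordering.
    aupr_normal = average_precision_score(1 - labels, -scores)
    # FPR at the first operating point reaching >= 95% TPR.
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr_at_95tpr = fpr[np.searchsorted(tpr, 0.95)]
    return auroc, aupr_abnormal, aupr_normal, fpr_at_95tpr
```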
5. Quantitative Results
Performance of all baseline models (test set, 160 sequences):
| Category | Method | AUPR-Abnormal ↑ | AUPR-Normal ↑ | AUROC ↑ | FPR@95%TPR ↓ |
|---|---|---|---|---|---|
| Linear Reconstruction | CVM | 47.19 | 86.00 | 72.30 | 81.20 |
| | LTI | 50.45 | 85.71 | 73.14 | 82.22 |
| Deep Auto-Encoders | Seq2Seq | 59.21 | 88.07 | 76.56 | 77.62 |
| | STGAE | 59.65 | 87.85 | 76.75 | 76.48 |
| | LaneGCN-AE | 57.19 | 87.22 | 75.25 | 75.94 |
| Two-Stage One-Class | Seq2Seq+OC-SVM | 34.47 | 70.25 | 50.47 | 98.33 |
| | STGAE+OC-SVM | 33.32 | 77.71 | 59.16 | 91.27 |
| | LaneGCN-AE+OC-SVM | 51.88 | 86.93 | 72.94 | 82.02 |
| End-to-End DSVDD | Seq2Seq+DSVDD | 51.37 | 82.47 | 69.34 | 88.79 |
| | STGAE+DSVDD | 48.09 | 83.59 | 69.65 | 85.44 |
| | LaneGCN-AE+DSVDD | 53.14 | 85.21 | 72.33 | 85.55 |
Key findings:
- Deep auto-encoder reconstruction methods, particularly STGAE, outperform linear and OC-SVM baselines in all major metrics.
- STGAE yields the highest AUPR-Abnormal (59.65%) and AUROC (76.75%).
- End-to-end DSVDD models, especially with the LaneGCN-AE backbone, outperform their two-stage OC-SVM counterparts and narrow the gap to the pure reconstruction baselines.
- OC-SVM applied to AE latent codes underperforms, indicating that anomalies are better separated in output than in latent space.
6. Main Insights and Research Directions
R-U-MAAD standardizes rigorous evaluation of unsupervised anomaly detection for multi-agent urban driving. Deep reconstruction methods are currently the most effective, with explicit modeling of agent interactions (STGAE) offering incremental gains. Joint training with deep SVDD objectives provides additional robustness to anomalies. Linear methods and traditional OC-SVMs applied to latent features do not suffice for identifying complex urban driving anomalies.
Challenges persist in formulating map- and interaction-aware anomaly losses, propagating supervision via semi-supervised or label-efficient methods, and enabling models to adapt online to novel scenes and situations. Continued research into more sophisticated representation learning and detection functions is required to achieve robust, real-world multi-agent anomaly detection (Wiederer et al., 2022).