STU: Spotting the Unexpected Benchmark
- STU is a benchmark that rigorously defines anomalies and evaluates their detection in autonomous driving and code, providing specialized datasets and precise annotation protocols.
- It employs both point-level and object-level metrics—such as AUROC, FPR@95, AP, and PQ—to quantitatively assess anomaly segmentation in LiDAR data.
- In the code domain, STU uses ThrowBench to measure whether models can predict unexpected runtime errors, grounding labels in empirical execution and scoring exception-type predictions with micro-averaged F₁.
The Spotting the Unexpected (STU) benchmark is a data-driven evaluation framework designed to measure the ability of algorithms and models—particularly in the autonomous driving and code understanding domains—to detect and segment anomalies or unexpected behaviors that do not conform to in-distribution expectations. STU research spans two principal threads: (1) anomaly (out-of-distribution; OOD) segmentation in 3D LiDAR data for autonomous vehicles (Nekrasov et al., 4 May 2025), and (2) runtime behavioral prediction in code via exception-centric benchmarks such as ThrowBench. The STU benchmark unites these axes by providing datasets, metrics, and protocols tailored to rigorously evaluate “unexpectedness” under real-world conditions, offering both granular (point/instance-level) and holistic (system-level) assessment.
1. Dataset Design and Anomaly Definition
In autonomous driving, the core STU dataset (Nekrasov et al., 4 May 2025) comprises 72 annotated driving sequences (2 closed-set training, 2 closed-set validation, 19 validation for anomaly segmentation, 51 test sequences), all captured with a high-density Ouster OS1-128 LiDAR and eight synchronized high-resolution automotive cameras. Anomalies are defined as objects not present in the closed-set training taxonomy—for example, buckets, surfboards, construction cones, and debris—either naturally occurring or manually placed. Anomaly points are densely annotated at both the semantic and instance level; each instance is assigned a sequence-local ID, and spatial statistics are recorded (fewer than 50 points per instance on average, heights below 1 m, and up to 9 anomalies per scan). Evaluation is restricted to objects within 50 m of the ego-vehicle that have at least 5 LiDAR points, yielding a well-defined evaluation set.
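A minimal sketch of this evaluation filter, assuming plain NumPy arrays for points, instance IDs, and labels (the field names and the anomaly label ID are illustrative, not the benchmark's actual data interface):

```python
import numpy as np

# Hypothetical filter mirroring the STU evaluation rule: keep anomaly instances
# within 50 m of the ego-vehicle that have at least 5 LiDAR points.
RANGE_LIMIT_M = 50.0
MIN_POINTS = 5
ANOMALY_LABEL = 1  # assumed label id for anomaly points

def eligible_instances(points, instance_ids, labels):
    """Return the set of anomaly instance ids that enter the evaluation."""
    keep = set()
    anomaly_mask = labels == ANOMALY_LABEL
    for inst in np.unique(instance_ids[anomaly_mask]):
        inst_mask = anomaly_mask & (instance_ids == inst)
        if inst_mask.sum() < MIN_POINTS:
            continue  # too few LiDAR returns
        # Distance measured from the sensor origin (assumed to coincide with the ego-vehicle).
        dists = np.linalg.norm(points[inst_mask, :3], axis=1)
        if dists.min() <= RANGE_LIMIT_M:
            keep.add(int(inst))
    return keep
```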
In software, STU’s code-centric benchmark (ThrowBench) targets unexpected runtime errors. The dataset consists of user-written snippets (Python, Java, C#, Ruby) derived from the RunBugRun subset of IBM CodeNet, deliberately including both buggy and correct submissions. Each program is paired with a concrete test input, and the benchmark asks: "Will this program throw any exception, and if so, which type?" The exception space covers 37 types, and, for robust assessment, 10% of the cases per language are no-exception controls.
2. Annotation Protocols and Data Structure
Annotation in the driving scenario employs the SemanticKITTI labeler, with pseudo-labeling from pretrained segmentation models aiding inlier assignment, followed by multi-annotator cross-validation and explicit void labeling for uncertain regions or objects outside both anomaly and training class taxonomy. Instance masks are clustered using DBSCAN for object-level panoptic metrics. All sensor data (LiDAR scans, camera images, poses, calibration) follow widely adopted binary and text conventions, mirroring the SemanticKITTI format, with anonymization applied to sensitive camera imagery.
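The object-level clustering step can be illustrated with a short sketch; the eps and min_samples values below are placeholders, since the benchmark's own scripts define the parameters actually used:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_anomaly_points(xyz, anomaly_mask, eps=0.5, min_samples=5):
    """Group predicted anomaly points into instances for panoptic metrics (-1 = noise)."""
    instance_ids = np.full(xyz.shape[0], -1, dtype=int)
    pts = xyz[anomaly_mask]
    if len(pts) == 0:
        return instance_ids
    cluster_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(pts)
    instance_ids[anomaly_mask] = cluster_labels
    return instance_ids
```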
In the code domain, ground truth for each ThrowBench sample is established via sandboxed execution on the supplied input—ensuring that labels (exception class or no exception) reflect true runtime behavior and eliminating the potential for corpus leakage. The allowed exception labels are explicitly enumerated in each assessment, and labeling is purely empirical.
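The execution-based labeling idea can be sketched as follows for the Python subset; the harness and sandboxing shown here are illustrative simplifications, not ThrowBench's actual infrastructure:

```python
import subprocess
import sys

# Hypothetical harness: run a snippet on its paired input in a subprocess and
# report the exception class name (or None if the program finishes cleanly).
HARNESS = r"""
import sys, runpy
try:
    runpy.run_path(sys.argv[1], run_name="__main__")
except BaseException as e:
    print(type(e).__name__, file=sys.stderr)
    raise SystemExit(1)
"""

def label_snippet(path, stdin_text, timeout_s=10):
    proc = subprocess.run(
        [sys.executable, "-c", HARNESS, path],
        input=stdin_text, capture_output=True, text=True, timeout=timeout_s,
    )
    if proc.returncode == 0:
        return None  # a no-exception control case
    err = proc.stderr.strip()
    return err.splitlines()[-1] if err else "Unknown"
```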
3. Evaluation Protocols and Metrics
LiDAR OOD segmentation within STU employs both point-level and object-level metrics:
- Point-level: AUROC (area under the ROC curve for anomaly detection), FPR@95 (false-positive rate at 95% true-positive rate), and AP (average precision); a computation sketch follows this list.
- Object-level (panoptic): Segmentation Quality (SQ: mean IoU over matched true-positive instances with IoU ≥ 0.5), Recognition Quality (RQ: an F1-style measure penalizing false positives and false negatives), Unknown Quality (UQ: SQ × recall for the anomaly class, where recall = TP/(TP+FN)), and Panoptic Quality (PQ = SQ × RQ: aggregate anomaly instance quality).
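The point-level metrics admit a straightforward implementation from per-point anomaly scores and binary labels; the sketch below uses standard scikit-learn definitions and may differ from the benchmark's reference scripts in details such as void-point handling:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def point_level_metrics(scores, labels):
    """AUROC, AP, and FPR@95 from per-point anomaly scores (labels: 1 = anomaly)."""
    auroc = roc_auc_score(labels, scores)
    ap = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr_at_95 = float(np.interp(0.95, tpr, fpr))  # false-positive rate at 95% TPR
    return {"AUROC": auroc, "AP": ap, "FPR@95": fpr_at_95}
```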
In ThrowBench, models are prompted with the full snippet, its test input, and the list of allowed labels, and return a prediction. Micro-averaged precision, recall, and F₁ are reported, and optional per-class confusion matrices expose the error structure. The response protocol specifies using the last allowed label token in the model's output as the effective prediction.
| Metric | Definition | Scope |
|---|---|---|
| AUROC | Probability OOD ranked above inlier | Point-level |
| FPR@95 | FP rate at 95% TPR | Point-level |
| AP | Area under precision-recall curve | Point-level |
| SQ, RQ, PQ, UQ | See above | Object-level |
| F₁ (ThrowBench) | Micro-averaged harmonic mean of precision/recall | Code |
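The ThrowBench scoring rule can be sketched as follows; the label parsing and the micro-averaged F₁ computation below are illustrative simplifications of the protocol described above:

```python
def parse_prediction(response, allowed_labels):
    """Take the last allowed label occurring in the model's response as its prediction."""
    best, best_pos = None, -1
    for label in allowed_labels:
        pos = response.rfind(label)
        if pos > best_pos:
            best, best_pos = label, pos
    return best  # None if no allowed label appears

def micro_f1(preds, golds):
    """Micro-averaged F1: each wrong prediction counts as one FP and one FN."""
    tp = sum(p == g for p, g in zip(preds, golds) if p is not None)
    fp = sum(p != g for p, g in zip(preds, golds) if p is not None)
    fn = sum(p is None or p != g for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```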
4. Baselines, Algorithms, and Quantitative Benchmarks
For 3D segmentation, established baselines include Mask4Former-3D (a single-scan panoptic backbone), MaxLogit, MC-Dropout, Deep Ensemble, a void classifier (trained with an additional void class), and RbA. Point-level AUROC for Deep Ensembles reaches 86.7 on the test split, AP peaks at 5.17, and object-level PQ for anomaly segmentation rarely exceeds 9%. Performance degrades beyond 30 m range and for smaller anomaly objects; misclassification frequently occurs when anomaly shapes or confidences resemble familiar classes (e.g., debris mistaken for vehicles).
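Two of the per-point scoring baselines named above can be sketched from class logits; details such as temperature scaling or class exclusions follow the respective papers rather than this minimal version:

```python
import torch

def maxlogit_score(logits):
    """MaxLogit baseline: higher score = more anomalous (negative maximum class logit)."""
    return -logits.max(dim=-1).values

def deep_ensemble_score(logits_per_member):
    """Ensemble baseline (one common variant): 1 minus the max of the averaged softmax."""
    probs = torch.stack([l.softmax(dim=-1) for l in logits_per_member]).mean(dim=0)
    return 1.0 - probs.max(dim=-1).values
```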
In code, six open-weight code LLMs (7B–34B parameters, various quantizations) are benchmarked. Qwen2.5 Coder achieves the highest overall F₁ (38.2%), and further analysis reveals pronounced exception-type sensitivity (e.g., >70% F₁ for zero-division errors but <10% for null-pointer and overflow errors). This spread suggests that models struggle to capture error-specific semantic structure.
In LiDAR OOD, the Relative Energy Learning (REL) framework with Point Raise achieves state-of-the-art results. On STU validation, REL + Point Raise yields AUROC 97.85, FPR@95 9.60, AP 10.68, and object-level PQ 10.62, substantially outperforming prior methods such as Deep Ensemble and MC-Dropout. Ablation reveals the synthetic OOD generation (Point Raise) as a key factor, with optimal cluster compactness parameter γ=2.
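For context, a standard energy-style anomaly score of the kind REL builds on is sketched below; the actual Relative Energy Learning objective and the Point Raise synthetic-OOD generation are defined in the STU paper, and this is only the generic formulation:

```python
import torch

def energy_score(logits, temperature=1.0):
    """Generic energy score: higher = more anomalous (in-distribution points have low energy)."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)
```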
5. Methodological Strengths, Challenges, and Usage
STU’s methodological strengths include execution-based ground truth (precluding label leakage), physically accurate multi-modal sensor data, and densely annotated anomaly instances with rigorous cross-validation. Its scope surpasses prior 2D benchmarks (Lost&Found, Fishyscapes, CODA) by providing surround-view LiDAR and camera coverage; annotation processes are reproducible via public scripts.
However, limitations persist: LiDAR range sparsity and small object size reduce recall and AP. In code, the finite exception set and a single test input per program may oversimplify the detection task and overlook corner cases such as logical errors, memory exhaustion, or other runtime surprises. The ThrowBench prompting protocol eschews chain-of-thought reasoning, potentially suppressing performance in nuanced cases.
In usage, STU supports multi-modal fusion, temporal modeling, and open-world assessment. Download scripts and structured data directories (aligned to SemanticKITTI conventions) facilitate adoption and reproducibility. Extensibility across modalities and research axes is explicit.
6. Extensions, Integrative Directions, and Future Scope
A plausible implication is the expansion of STU well beyond its current boundaries. Recommendations include broadening the code-side anomaly taxonomy (resource leaks, non-terminating loops, silent off-by-one errors, performance anomalies) and strengthening its evaluation protocol (open-set exception labeling, multi-input test cases, statistical robustness via macro-averaged F₁). Cross-language generalization and explainability (model rationales for predictions) are highlighted as promising future directions. Geometry-aware synthetic OOD data generation, as exemplified by Point Raise, may extend to radar or other sensors.
Integrating ThrowBench’s empirically grounded execution protocol with the anomaly segmentation paradigms from LiDAR offers a framework to probe a model’s capacity to anticipate and identify a spectrum of unexpected behaviors, yielding a comprehensive suite for benchmarking robustness in real-world tasks.