CrashEvent Benchmark
- A CrashEvent Benchmark is a rigorously structured dataset that defines reproducible crash events with detailed metadata and standardized evaluation metrics.
- It employs curated data from diverse sources like software bug repositories and real-world crash reports to support apples-to-apples analysis across various domains.
- The benchmark's comprehensive annotation and complexity metrics enable precise evaluation of crash diagnosis techniques and reproducibility in safety-critical applications.
A CrashEvent Benchmark is a rigorously structured dataset supporting the evaluation, comparison, and quantitative measurement of error-triggering events ("crash events") in software, data science workflows, or cyber-physical systems. The concept has been systematically applied in multiple domains, ranging from Automated Crash Reproduction (ACR) for programming languages to root-cause analysis pipelines, machine learning notebook debugging, and safety evaluation of real-world vehicle automation systems. Benchmarks such as CrashJS for Node.js/JavaScript (Oliver et al., 2024), JunoBench for Python ML notebooks (Wang et al., 20 Oct 2025), RCABench for root-cause analysis (Nishimura et al., 2023), and ADS crash-rate benchmarks (Scanlon et al., 26 Aug 2025) illustrate the span of formal, curated approaches for reproducible, apples-to-apples analysis of crash diagnosis techniques. Central to all of these is the definition of a "CrashEvent": a precisely characterized instance of a system or program fault, often with accompanying metadata, inputs, and reproducibility requirements.
1. Benchmark Structure and Formalization
CrashEvent Benchmarks define the atomic event—a crash—in a program or system as a reproducible occurrence rooted in a specific configuration and input. Formally, a CrashEvent is described at varying levels of detail:
- CrashJS structures each event as a directory containing a raw stack trace ".log", a structured TypeScript ".json" descriptor enumerating metadata (issue number, error summary, program and dependency version, input setup), and the harness or test script that triggered the error. The expected output for an ACR tool is replication of the original stack trace (Oliver et al., 2024).
- JunoBench models each CrashEvent as a triple (B, F, D), where B is the minimal buggy notebook, F is the minimally fixed notebook, and D the input datasets required for reproduction. Each event is annotated with crash type, root cause, ML pipeline stage, and library cause (Wang et al., 20 Oct 2025).
- RCABench encapsulates a CrashEvent as a triple (P, i, c): the program under test P, the initial crashing input i, and the crash point c in the execution trace, further parameterized with sets of root-cause line numbers and transformed via crash exploration fuzzers (Nishimura et al., 2023).
- ADS crash-rate benchmarks define events in aggregated terms—number of crashed vehicles, miles traveled—supporting statistical modeling on broader system performance (Scanlon et al., 26 Aug 2025).
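The per-domain formalizations above share a common core: an artifact under test, the crashing input, and reproducibility metadata with an expected-output check. As an illustration only (the field names and `matches` check are hypothetical, not drawn from any of the cited benchmarks), a generic CrashEvent record might be sketched as:

```python
from dataclasses import dataclass, field


@dataclass
class CrashEvent:
    """A generic, hypothetical CrashEvent record combining the common
    elements of the cited benchmarks."""
    event_id: str                 # stable identifier (e.g. an issue number)
    subject: str                  # program, notebook, or system under test
    subject_version: str          # pinned version for reproducibility
    crashing_input: str           # input (or path to it) that triggers the fault
    stack_trace: list[str] = field(default_factory=list)       # expected frames
    annotations: dict[str, str] = field(default_factory=dict)  # crash type, root cause, ...

    def matches(self, observed_trace: list[str]) -> bool:
        """ACR-style success criterion: the reproduced trace must
        replicate the recorded frame sequence exactly."""
        return observed_trace == self.stack_trace
```

In CrashJS terms, `stack_trace` would hold the content of the ".log" file and `annotations` the ".json" descriptor fields; JunoBench and RCABench would populate the same slots with notebooks and PoC inputs respectively.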
2. Data Sources and Curation Methodologies
CrashEvent Benchmarks rely on empirical, multi-source data collection emphasizing real-world, reproducible faults. Approaches include:
- Extraction from public bug repositories (e.g., GitHub issues), structured test failures (e.g., BugsJS for Node.js), targeted vulnerability benchmarks (SecBench.js), and synthetic crash generation (Syntest-JS) in CrashJS (Oliver et al., 2024).
- Large-scale mining and manual curation of crashing Jupyter notebooks from Kaggle, with focus on reproducibility, dataset availability, and verifiable fixes in JunoBench (Wang et al., 20 Oct 2025).
- Aggregation of PoC crash inputs from official advisories and targeted fuzzing, coupled with patch-based root-cause tagging and version control, in RCABench (Nishimura et al., 2023).
- Collection and filtering of police-reported crash files and vehicle miles traveled (VMT) datasets, with classification by road type, vehicle type, crash severity, and local imputation in ADS benchmarks (Scanlon et al., 26 Aug 2025).
All benchmarks impose rigorous deduplication (exact stack trace sequencing, frame matching, erratum merging), version and dependency tagging, and reproducibility standards to ensure methodological integrity and broad applicability.
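The deduplication rules above (exact stack-trace sequencing and frame matching) amount to a keying function over normalized frames. A minimal sketch, assuming an illustrative frame format that is not that of any specific benchmark:

```python
import hashlib
import re


def trace_key(stack_trace: list[str]) -> str:
    """Normalize a stack trace into a deduplication key: strip line/column
    numbers and directory prefixes, then hash the exact frame sequence."""
    normalized = []
    for frame in stack_trace:
        frame = re.sub(r":\d+(:\d+)?", "", frame)   # drop line:col suffixes
        frame = re.sub(r"^.*/", "", frame)          # drop directory prefixes
        normalized.append(frame.strip())
    return hashlib.sha256("\n".join(normalized).encode()).hexdigest()


def deduplicate(events: list[dict]) -> list[dict]:
    """Keep the first event per trace key; later duplicates are merged away."""
    seen: set[str] = set()
    unique = []
    for ev in events:
        key = trace_key(ev["stack_trace"])
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique
```

Keying on the full frame sequence (rather than the top frame alone) is what makes the matching "exact" in the sense used above: two crashes collide only when every frame agrees after normalization.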
3. Taxonomy and Event Annotation
CrashEvent Benchmarks employ rich multidimensional categorization schemes:
- CrashJS: error type taxonomy (TypeError, AssertionError, etc.), stack-trace frame depth (Low/Medium/High), language feature triggers (I/O, YAML parsing, dynamic type misuse), project-source stratification (Oliver et al., 2024).
- JunoBench: crash type, root cause, ML pipeline stage, and primary ML library involved, supporting stratified analysis by model construction, API misuse, and notebook execution ordering errors (Wang et al., 20 Oct 2025).
- RCABench: root-cause sets (manual patch line numbers, accommodating non-unique fixes), input provenance, fuzzing seed, and time budget, supporting controls across statistical, procedural, and semantic axes (Nishimura et al., 2023).
- ADS: crash rates stratified by outcome severity (police-reported, any-injury, airbag deployment, serious injury, fatality), road type (freeway/surface street), vehicle category (passenger cars), and crash typology (rear-end, lateral, head-on, VRU involvement) (Scanlon et al., 26 Aug 2025).
This fine-grained event annotation enables benchmark users to target, compare, and generalize crash-diagnosis methodologies beyond binary pass/fail metrics.
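Given such annotations, stratified analysis reduces to grouping events along any taxonomy dimension. A minimal sketch (the annotation keys are hypothetical):

```python
from collections import Counter


def stratify(events: list[dict], dimension: str) -> Counter:
    """Count CrashEvents per category along one annotation dimension
    (e.g. 'crash_type' or 'pipeline_stage'), enabling targeted
    comparisons beyond binary pass/fail."""
    return Counter(ev["annotations"].get(dimension, "unknown") for ev in events)
```

A tool's results can then be reported per stratum (e.g. reproduction rate per error type in CrashJS, or repair rate per ML pipeline stage in JunoBench) rather than as a single aggregate number.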
4. Complexity Metrics and Evaluation Protocols
Benchmarks operationalize technical complexity using domain-appropriate metrics:
- CrashJS measures crash complexity via stack-trace depth, program complexity via the Cyclomatic Complexity Number (CCN = E − N + 2P, for a control-flow graph with E edges, N nodes, and P connected components), and a composite normalized project complexity (Oliver et al., 2024). Summary statistics and distributions support comparative benchmarking across sources.
- JunoBench focuses on cell-level reproduction, cell-level crash detection accuracy, and supports precision/recall benchmarking for automated repair systems leveraging paired diffs and execution order control (Wang et al., 20 Oct 2025).
- RCABench defines rank-based metrics (rank of the first correct line, top-k accuracy), overlap-based metrics (precision@k, recall@k), time-to-diagnosis, and, for stakeholders, a weighted composite score (Nishimura et al., 2023). Parameter sweeps and statistical analysis (mean/stddev, worst/best case, confidence intervals) are mandatory.
- For ADS evaluation, crash rates are expressed in incidents per million miles (IPMM = crash count / (VMT / 10^6)), percent-relative safety impact is computed as the relative difference between the ADS crash rate and the human benchmark rate, and the mileage required for statistical power is quantified (Scanlon et al., 26 Aug 2025).
Protocols specify time budgets, input and seed control, repetition count (n), and require detailed result aggregation for fair tool and approach comparison.
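The rank-based and overlap-based metrics used by root-cause-analysis benchmarks have direct definitions over a ranked list of suspect lines and a ground-truth root-cause set. A sketch under the usual conventions (not RCABench's exact implementation):

```python
def precision_at_k(ranked_lines: list[int], root_cause: set[int], k: int) -> float:
    """Fraction of the top-k ranked lines that are true root-cause lines."""
    return sum(1 for line in ranked_lines[:k] if line in root_cause) / k


def recall_at_k(ranked_lines: list[int], root_cause: set[int], k: int) -> float:
    """Fraction of root-cause lines recovered within the top-k."""
    top_k = ranked_lines[:k]
    return sum(1 for line in root_cause if line in top_k) / len(root_cause)


def rank_of_first_hit(ranked_lines: list[int], root_cause: set[int]) -> int:
    """1-based rank of the first correct line (the 'rank' metric);
    returns len(ranked_lines) + 1 if no root-cause line appears at all."""
    for i, line in enumerate(ranked_lines, start=1):
        if line in root_cause:
            return i
    return len(ranked_lines) + 1
```

Because benchmarks like RCABench record root causes as *sets* of patch lines (accommodating non-unique fixes), recall@k is normalized by the full set size rather than by a single ground-truth line.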
5. Dataset Composition, Coverage, and Practical Use
CrashEvent Benchmarks document their coverage and encapsulation strategies:
- CrashJS includes 453 deduplicated events: 71 GitHub, 90 BugsJS, 17 SecBench.js, 275 Syntest-JS, spanning 33 projects/versions. Complexity scores per source: BugsJS (0.73), SecBench.js (0.72), GitHub (0.61), Syntest-JS (0.60). Benchmark entries are directory-based and installation-agnostic, supporting direct integration with any ACR pipeline (Oliver et al., 2024).
- JunoBench’s 111 CrashEvents balance ML libraries, crash types, and root causes, with distributions reported through stacked charts and summary tables covering pipeline design, API misuse, and data shape errors, all executed/reproduced within unified Docker environments and validated via CLI tools (Wang et al., 20 Oct 2025).
- RCABench targets seven open-source projects with known security vulnerabilities, shipping PoC inputs, ground-truth patch lines, and supporting plugins for new RCA techniques, with standardized JSON-based output and modular orchestration (Nishimura et al., 2023).
- ADS benchmarks cover five major US urban areas, stratified crash-rate statistics, and provide power analysis curves for benchmarking safety impacts at granular outcome levels (Scanlon et al., 26 Aug 2025).
These datasets enable rigorous, scalable evaluation of crash-detection, reproduction, repair, and safety-performance tools by practitioners and researchers, under reproducible and well-controlled conditions.
6. Experimental Guidance and Best Practices
CrashEvent Benchmarks promulgate usage and integration recommendations:
- Ensure explicit version-tagging, dependency management, and reproducibility (e.g., pinned environments, Docker images, standardized CLI tools).
- Control and report on input seeds, augmentation parameters, and statistical analysis, highlighting both mean and extreme cases.
- Decouple augmentation/fuzzing from analysis for orthogonal evaluation of pipeline stages and new technique combinations.
- Adopt rank-based, overlap-based, and time-based metrics for comprehensive tool assessment.
- Release all adapters, configuration files, and code under open licenses for community-driven extension and validation.
- For cross-domain application (e.g., automated driving), tailor benchmarks by operating domain, geography, exposure level, and ensure stratification by outcome severity and crash typology.
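Several of these recommendations (explicit seed control, repetition count, reporting mean alongside extreme cases) can be combined into a small repetition harness. The runner interface below is an assumption for illustration, not any benchmark's actual API:

```python
import random
import statistics
from typing import Callable


def run_repeated(tool: Callable[[random.Random], float],
                 n: int, base_seed: int = 0) -> dict:
    """Run a (hypothetical) stochastic analysis tool n times with derived,
    reportable seeds, then aggregate mean, stddev, and extreme cases."""
    scores = []
    for i in range(n):
        rng = random.Random(base_seed + i)  # explicit, reproducible per-run seed
        scores.append(tool(rng))
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n > 1 else 0.0,
        "best": max(scores),
        "worst": min(scores),
    }
```

Recording `base_seed` and `n` alongside the aggregates is what makes the whole experiment replayable, in line with the seed-control and reporting requirements above.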
A plausible implication is that the combination of rich, annotated CrashEvent datasets, standardized evaluation metrics, controlled experimental methodology, and reproducible setup forms a robust foundation for comparative research and practical adoption of crash analysis and reproduction technologies in software engineering, machine learning, and safety-critical domains.