
Reproducible CVE Benchmarking

Updated 21 October 2025
  • The paper identifies key benchmarking crimes in CVE evaluation, detailing how selective measures and improper computations undermine reproducibility.
  • It demonstrates a systematic methodology using dual independent assessments and statistical analysis to highlight persistent pitfalls over time.
  • Practical recommendations include comprehensive testing, proper aggregation methods, and full documentation of context to make security benchmarks more reliable.

Reproducible CVE benchmarking refers to the rigorous, auditable, and repeatable evaluation of security systems and vulnerability detection methodologies using Common Vulnerabilities and Exposures (CVE) identifiers. It demands precise experiment design, comprehensive reporting, representative datasets, methodological soundness, and strict avoidance of biases or “benchmarking crimes.” Reproducibility in this context is not only a scientific ideal but a practical necessity for advancing systems security and for the comparative analysis of new defenses, detectors, and mitigations.

1. Benchmarking Crimes and Threats to Validity

The taxonomy of “benchmarking crimes” (Kouwe et al., 2018) identifies 22 distinct pitfalls that fundamentally jeopardize reproducible CVE benchmarking. These crimes are broadly grouped as follows:

  • Selective Benchmarking (A1–A3): Omitting relevant performance dimensions, unjustified selection of benchmark subsets, and hiding deficiencies by controlling input parameters undermine completeness and mask real-world costs.
  • Improper Handling of Results (B1–B5): Microbenchmark reliance, mistaken overhead computation (e.g., equating throughput loss to overhead without full load), creative accounting, missing significance indicators, and incorrect aggregation (arithmetic versus geometric mean) produce misleading outcomes.
  • Wrong Benchmarks (C1–C3): Evaluating simplified systems, inappropriate benchmark selection, and calibrating/validating on overlapping datasets erode external validity.
  • Faulty Comparisons (D1–D3): Lack of proper baselines, self-comparison, and unfair competitive benchmarking obscure relevance and comparability.
  • Benchmarking Omissions (E1–E4): Unverified contributions, omission of accuracy assessment (false positives/negatives), or lack of incremental testing degrade evaluation quality.
  • Missing Information (F1–F4): Poor reporting of platform details, software versions, subbenchmarks, or reliance only on relative numbers impedes reproducibility and sanity-checking.

These crimes are widespread and persistent: virtually all surveyed systems security papers, including those published at tier-1 venues, commit multiple infractions. Failure to avoid these errors results in incomparability, irreproducibility, and invalid scientific claims.
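
To make the taxonomy actionable during paper or artifact review, the crime classes can be encoded as a simple checklist and applied to each evaluation. The sketch below is an illustrative convenience, not part of the original taxonomy: the crime codes and descriptions follow the grouping above, while the audit structure and flags are hypothetical.

```python
# Illustrative checklist for auditing a CVE-benchmarking evaluation against the
# crime taxonomy above. Codes/descriptions follow the grouping in the text; the
# audit structure and flags are hypothetical, not part of the original paper.
BENCHMARKING_CRIMES = {
    "A1": "Not evaluating all relevant performance dimensions",
    "A2": "Benchmark subsetting without justification",
    "A3": "Hiding deficiencies via controlled input parameters",
    "B2": "Mistaken overhead computation",
    "B5": "Incorrect aggregation (arithmetic mean of ratios)",
    "D1": "Missing or improper baseline",
    "E3": "False positives/negatives not tested",
    "F2": "Missing software versions",
}

def audit(flags: dict) -> list:
    """Return the crime codes an evaluation is flagged for.

    `flags` maps crime codes to True (committed) / False (avoided).
    """
    return [code for code, committed in flags.items()
            if committed and code in BENCHMARKING_CRIMES]

if __name__ == "__main__":
    # Hypothetical audit of a single defense paper's evaluation.
    flags = {"A1": True, "A2": False, "B2": True, "D1": True, "E3": False}
    for code in audit(flags):
        print(f"{code}: {BENCHMARKING_CRIMES[code]}")
```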

2. Survey Methodology and Statistical Findings

A systematic evaluation of 50 defense papers (Kouwe et al., 2018) across two time points (2010 and 2015) demonstrates the scale and persistence of benchmarking crimes:

  • Papers committed an average of five crimes each; high-impact crimes (A1, B2, D1) were especially common.
  • Only one paper was entirely free of crimes.
  • About 30% of crime/opportunity pairs were violated or underspecified at both time periods.
  • Statistical analysis (χ² test) found little improvement over time, with only the “not all contributions evaluated” (E1) crime showing significant reduction ($p = 0.001$).

The methodology included dual independent assessments and consensus discussions, with full documentation to ensure reproducibility of classifications.
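
The per-crime significance comparison can be reproduced with standard tooling. Below is a minimal sketch, assuming a 2×2 contingency table of (crime committed vs. avoided) × (2010 vs. 2015); the counts are placeholders for illustration, not the paper's data.

```python
# Sketch of the per-crime significance test described above: a chi-squared test
# on a 2x2 contingency table of (crime committed vs. avoided) x (2010 vs. 2015).
# The counts below are placeholders for illustration, NOT the paper's data.
from scipy.stats import chi2_contingency

#                 2010  2015
committed      = [  20,    8]   # papers committing a given crime (hypothetical)
avoided        = [   5,   17]   # papers avoiding it (hypothetical)

chi2, p_value, dof, expected = chi2_contingency([committed, avoided])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
# A small p-value (the paper reports p = 0.001 for crime E1) indicates that the
# change between the two time points is unlikely to be due to chance.
```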

3. Impact on CVE Benchmarking and Reproducibility

Many benchmarking crimes directly translate to CVE benchmarking:

  • Selective testing (A2, A3): Avoiding high-overhead CVEs or cherry-picking test cases can make a defense appear effective even though it is unfit for general use.
  • Accuracy omissions (E3): Failing to measure false positives and false negatives in vulnerability detection, a recurring issue, leads to unsound claims about practical defense efficacy (a minimal accuracy-reporting sketch follows this list).
  • Lack of system context (F1, F2): Missing platform and software version details yields non-reproducible experimental setups; subtle configuration shifts can change vulnerability manifestation.
  • Baseline errors (D1): Sound comparisons require referencing the genuine baseline, not a partially mitigated system.
  • Relative-only metrics (F4): Without absolute numbers, the assessment is unanchored, precluding external verification.
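
The following is a minimal sketch of the accuracy reporting that crime E3 omits: tallying false positives and false negatives of a detector against ground-truth CVE labels. The detector verdicts and labels are hypothetical stand-ins.

```python
# Minimal sketch of the accuracy reporting that crime E3 omits: tallying false
# positives/negatives of a vulnerability detector against ground-truth CVE
# labels. The verdicts and labels below are hypothetical stand-ins.
def accuracy_report(predicted, ground_truth):
    """Compute TP/FP/FN/TN plus precision and recall for a detector."""
    tp = fp = fn = tn = 0
    for pred, truth in zip(predicted, ground_truth):
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
    }

if __name__ == "__main__":
    # Hypothetical detector verdicts vs. known-vulnerable (True) samples.
    predicted = [True, True, False, True, False, False]
    ground_truth = [True, False, True, True, False, False]
    print(accuracy_report(predicted, ground_truth))
```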

Consequently, reproducibility is deeply compromised, and results cannot reliably guide future research or operational deployment.

4. Recommendations for Reproducible CVE Benchmarking

To increase quality and reproducibility, several recommendations are enumerated (Kouwe et al., 2018):

  • Comprehensive Testing: Evaluate across all dimensions affected by defenses (CPU, memory, I/O, concurrency, detection accuracy, etc.) using both macro- and microbenchmarks.
  • Justified Benchmark/Case Selection: When using subsets (benchmarks or CVEs), justify exclusions clearly.
  • Correct Overhead Computation: Compute $\text{overhead} = \frac{T_1}{T_0} - 1$ rather than $1 - T_0/T_1$; always run under full load so that idle cycles do not mask overhead (a runnable sketch after this list illustrates these computations).
  • Proper Aggregation: Use the geometric mean for per-benchmark overhead ratios $r_i$, not the arithmetic mean:

$$GM = \left(\prod_{i=1}^{n} r_i\right)^{1/n}$$

  • Variance and Significance Reporting: Always present standard deviations, confidence intervals, and significance tests; this enables meaningful comparisons.
  • Full Context Specification: Document platform (hardware and software versions), input suites, configuration, and provide both absolute and relative performance numbers.
  • Fair Baseline and Comparisons: Establish the true baseline (e.g., original unprotected system) and use consistent, unbiased configurations for competitors.
  • Data Splits: Avoid calibration/validation data overlap; report the process used for selecting datasets and CVEs.

By implementing these practices, the incidence of high-impact benchmarking crimes could be reduced substantially.
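
A minimal sketch of the overhead and aggregation recommendations above, using hypothetical timings: per-benchmark overhead computed as $T_1/T_0 - 1$, geometric-mean aggregation of slowdown ratios, and a normal-approximation 95% confidence interval over repeated runs. Benchmark names and timings are invented for illustration.

```python
# Sketch of the overhead and aggregation recommendations above, with
# hypothetical timings: per-benchmark overhead as T1/T0 - 1, geometric-mean
# aggregation of slowdown ratios, and a normal-approximation 95% confidence
# interval over repeated runs.
import math
import statistics

def overhead(t_baseline, t_protected):
    """Correct overhead: (T1 / T0) - 1, not 1 - T0/T1."""
    return t_protected / t_baseline - 1.0

def geometric_mean(ratios):
    """Geometric mean of multiplicative ratios (e.g., slowdown factors)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def confidence_interval_95(samples):
    """Mean +/- ~1.96 standard errors (normal approximation)."""
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - 1.96 * sem, mean + 1.96 * sem

if __name__ == "__main__":
    # Hypothetical (baseline, protected) runtimes in seconds per benchmark.
    runs = {"bench_a": (10.0, 12.5), "bench_b": (8.0, 8.8), "bench_c": (20.0, 26.0)}
    for name, (t0, t1) in runs.items():
        print(f"{name}: overhead = {overhead(t0, t1):.1%}")

    slowdowns = [t1 / t0 for t0, t1 in runs.values()]
    print(f"aggregate slowdown (geometric mean) = {geometric_mean(slowdowns):.3f}x")

    # Report absolute numbers and variance for repeated runs of one benchmark.
    repeated = [12.4, 12.6, 12.5, 12.7, 12.3]
    lo, hi = confidence_interval_95(repeated)
    print(f"bench_a protected runtime: mean = {statistics.mean(repeated):.2f} s, "
          f"95% CI = [{lo:.2f}, {hi:.2f}] s")
```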

5. Rationale, Technical Detail, and Exemplification

Technical rigor in reproducible benchmarking is illustrated by the following correct definitions:

  • Throughput overhead: Always measure under full system load so that idle cycles do not mask real costs.
  • Overhead computation: If the reference runtime is $T_0$ and the measured runtime is $T_1$, the proper formula is

$$\text{Overhead (\%)} = \left(\frac{T_1}{T_0} - 1\right) \times 100$$

not $1 - T_0/T_1$, which understates the actual slowdown; for example, $T_0 = 10$ s and $T_1 = 15$ s gives a 50% overhead, whereas $1 - T_0/T_1$ reports only about 33%.

  • Aggregation: The geometric mean is the appropriate average for multiplicative ratios (e.g., slowdowns across benchmarks), so reporting the arithmetic mean is technically unsound (a short illustration follows).
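
As a brief illustration of why arithmetic means of ratios are unsound (the runtimes below are hypothetical), averaging measurements normalized against either system as the baseline can declare each system slower than the other, while the geometric mean gives a consistent answer:

```python
# Illustration (hypothetical runtimes) of why arithmetic means of ratios are
# unsound: the verdict can flip depending on which system serves as the
# baseline, while the geometric mean stays consistent.
import math

# Runtimes (seconds) of systems X and Y on two benchmarks.
x = [1.0, 10.0]
y = [2.0, 5.0]

ratios_y_over_x = [yi / xi for xi, yi in zip(x, y)]  # Y normalized to X: [2.0, 0.5]
ratios_x_over_y = [xi / yi for xi, yi in zip(x, y)]  # X normalized to Y: [0.5, 2.0]

def am(v):
    return sum(v) / len(v)

def gm(v):
    return math.exp(sum(math.log(r) for r in v) / len(v))

print("arithmetic mean, Y/X:", am(ratios_y_over_x))  # 1.25 -> "Y is 25% slower"
print("arithmetic mean, X/Y:", am(ratios_x_over_y))  # 1.25 -> "X is 25% slower" (contradiction)
print("geometric mean,  Y/X:", gm(ratios_y_over_x))  # 1.0  -> no overall difference
print("geometric mean,  X/Y:", gm(ratios_x_over_y))  # 1.0  -> consistent verdict
```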

These principles, when systematically documented and justified, enable reproducibility and facilitate sanity checking.

6. Scientific and Practical Significance

The scientific process depends on reproducibility and comparability to advance the state of the art. Benchmarking crimes directly threaten this process, resulting in scientific stagnation, misleading conclusions, and wasteful resource allocation (Kouwe et al., 2018). The recommendations set forth are not merely best practices but essential requirements for the integrity of CVE benchmarking—where performance, security, functional correctness, and detection capability must all be thoroughly and transparently assessed.

Researchers are encouraged to meticulously document all aspects of their benchmarking, justify their methodological choices, and adhere to rigorous standards when reporting CVE-based evaluations. This is critical for trustworthy advancement in systems security and for the practical utility of published research.

7. Summary Table: Benchmarking Crimes Most Relevant to CVE Evaluation

| Crime Code | Short Description | Impact on CVE Benchmarking |
|---|---|---|
| A2 | Benchmark subsetting without justification | Masking/undercounting critical vulnerabilities |
| B3 | Creative overhead accounting | Underestimating resource costs of the defense |
| E3 | False positives/negatives not tested | Omission of key effectiveness metrics |
| F2 | Missing software versions | Setup irreproducibility, invalid comparisons |
| D1 | No proper baseline | Artificial improvement claims; unsound comparisons |

Implementing the paper’s full recommendations would address all entries above and foster robust, reproducible CVE benchmarking.


In conclusion, reproducible CVE benchmarking necessitates careful avoidance of benchmarking crimes, full contextual and methodological transparency, correct data handling, and unbiased comparative frameworks. Only then can evaluations of security systems and vulnerability detection tools be reliably compared, audited, and iteratively improved.

References (1)

  1. van der Kouwe, E., Andriesse, D., Bos, H., Giuffrida, C., and Heiser, G. (2018). Benchmarking Crimes: An Emerging Threat in Systems Security. arXiv preprint.
