Penetration Testing Benchmark

Updated 11 December 2025
  • A penetration testing benchmark is a rigorously defined suite of target environments and protocols that systematically measures the effectiveness of diverse security testing approaches.
  • Such benchmarks employ containerized, VM-based, and orchestrated network configurations to simulate multi-stage vulnerability exploitation across varied domains.
  • Evaluation relies on quantitative metrics such as success rate and time-to-exploit, enabling reproducible, statistically robust comparisons between manual and AI-driven agents.

A penetration testing (pentest) benchmark is a rigorously defined suite of target environments, scenarios, and evaluation protocols designed to systematically measure the effectiveness of human and automated actors in discovering and exploiting security vulnerabilities in IT systems. Such benchmarks form the empirical foundation for comparing penetration testing approaches—manual, scripted, and increasingly, AI-driven—and for advancing research in offensive security, autonomous agents, and security tool assessment (Happe et al., 3 May 2024, Happe et al., 14 Apr 2025, Mai et al., 11 Sep 2025).

1. Architectural Foundations and Benchmark Design Patterns

Penetration testing benchmarks exhibit substantial heterogeneity in architecture, varying across dimensions such as deployment method, vulnerability classes, scenario realism, and extensibility.

Deployment Model:

Benchmarks typically instantiate targets via containers (e.g., Docker), virtual machines (VMs), or orchestrated multi-host networks. For example, "Got Root?" provides 14 single-vulnerability Debian-based VMs, each provisioned via Vagrant and Ansible to ensure repeatability and isolation (Happe et al., 3 May 2024). "TermiBench" features 510 Dockerized hosts, with varying ratios of vulnerable and benign services, supporting both noise-free and adversarially challenging environments (Mai et al., 11 Sep 2025).
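
As a minimal illustration of the containerized deployment pattern, the following Python sketch provisions and tears down an isolated target using the Docker SDK for Python. The image name, network name, and labels are hypothetical placeholders for illustration and are not taken from any of the cited benchmarks.

```python
# Minimal sketch of container-based target provisioning (image/network names are hypothetical).
import docker

client = docker.from_env()

def provision_target(image="example/vulnerable-web:latest", network="pentest-lab"):
    """Start one isolated target container on a dedicated, non-routed lab network."""
    try:
        client.networks.get(network)
    except docker.errors.NotFound:
        client.networks.create(network, driver="bridge", internal=True)  # no outbound access
    return client.containers.run(
        image,
        detach=True,
        network=network,
        labels={"benchmark": "demo-scenario"},
    )

def teardown_target(container):
    """Stop and remove the target so every trial starts from a clean state."""
    container.stop()
    container.remove()
```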

Vulnerability Scope and Scenario Typology:

Test suites range from focused, single-exploit VMs targeting privilege escalation alone (Happe et al., 3 May 2024) to multi-category, multi-stage exercises covering network reconnaissance, enumeration, exploitation, and post-exploitation (e.g., AutoPenBench (Gioacchini et al., 4 Oct 2024) and PACEbench (Liu et al., 13 Oct 2025)). Scenario selection may prioritize realism via CVE instantiation, in-vitro teaching modules, multi-stage chains (Liu et al., 13 Oct 2025), or CTF-style puzzles (Muzsai et al., 2 Dec 2024, Abdulzada, 14 Jul 2025).

Extensibility and Reproducibility:

Open-source infrastructure, modular scenario addition (e.g., through Ansible roles or YAML challenge descriptors (Happe et al., 3 May 2024, Abdulzada, 14 Jul 2025)), and containerized orchestration are now standard. Leading benchmarks publish orchestration code, VM/container recipes, and explicit scripts to facilitate precise reproduction and extension (Mai et al., 11 Sep 2025, Shen et al., 7 Nov 2024).
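
As an illustration of modular scenario addition via declarative descriptors, the sketch below serializes a hypothetical challenge definition to YAML with PyYAML. The field names are assumptions chosen for illustration and do not reproduce any specific benchmark's schema.

```python
# Hypothetical challenge descriptor; field names are illustrative, not a real benchmark schema.
from dataclasses import dataclass, asdict, field
import yaml  # PyYAML

@dataclass
class ChallengeDescriptor:
    name: str
    image: str                 # container image implementing the target
    vulnerability_class: str   # e.g., "privilege-escalation", "sql-injection"
    entrypoint: str            # service the agent first interacts with
    success_check: str         # command or probe that verifies compromise
    tags: list = field(default_factory=list)

challenge = ChallengeDescriptor(
    name="suid-binary-escalation",
    image="example/suid-lab:1.0",
    vulnerability_class="privilege-escalation",
    entrypoint="ssh://lowpriv@target",
    success_check="id | grep uid=0",
    tags=["linux", "local"],
)

# Writing the descriptor lets new scenarios be added without touching harness code.
with open("challenges/suid-binary-escalation.yaml", "w") as fh:
    yaml.safe_dump(asdict(challenge), fh, sort_keys=False)
```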

2. Vulnerability Coverage, Task Domains, and Taxonomy

Benchmarks are classified by both technical domain and exploit complexity.

Domain coverage:

Vulnerability injection strategies:

MITRE ATT&CK mapping is frequently used to connect scenarios to established adversarial techniques (Happe et al., 3 May 2024).
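
A lightweight way to express such a mapping is a lookup table from scenario identifiers to ATT&CK technique IDs. The scenario names below are hypothetical; the technique IDs are standard ATT&CK entries.

```python
# Illustrative mapping from (hypothetical) benchmark scenarios to MITRE ATT&CK technique IDs.
ATTACK_MAPPING = {
    "web-public-exploit":     ["T1190"],           # Exploit Public-Facing Application
    "service-enumeration":    ["T1046"],           # Network Service Discovery
    "suid-binary-escalation": ["T1068", "T1548"],  # Privilege Escalation / Abuse Elevation Control
    "credential-harvest":     ["T1003"],           # OS Credential Dumping
}

def techniques_for(scenario: str) -> list[str]:
    """Return the ATT&CK techniques exercised by a scenario (empty if unmapped)."""
    return ATTACK_MAPPING.get(scenario, [])
```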

3. Evaluation Metrics, Score Formulations, and Analysis

Benchmark performance is assessed via a formal suite of quantitative and qualitative metrics, defined per benchmark and scenario.

  • Success Rate (SR): $\mathrm{SR} = \frac{N_{\text{solved}}}{N_{\text{total}}}$ (Muzsai et al., 2 Dec 2024, Liu et al., 13 Oct 2025)
  • Subtask Progression Rate: $P = \frac{\sum_{i=1}^{N_\mathrm{trials}} \sum_{k=1}^{K} s_{i,k}}{N_\mathrm{trials}\,K}$ (Happe et al., 14 Apr 2025)
  • Vulnerability Coverage: $C = \frac{|V_{\mathrm{found}}|}{|V_{\mathrm{total}}|}$ (Liu et al., 13 Oct 2025, Shen et al., 7 Nov 2024)
  • Precision/Recall/F1: $\mathrm{Precision} = \frac{TP}{TP+FP}$, $\mathrm{Recall} = \frac{TP}{TP+FN}$, $F_1 = \frac{2\,\mathrm{Prec}\cdot\mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}$ (Caldwell et al., 4 Aug 2025, Potti et al., 10 Jan 2025)
  • Time-to-Exploit: e.g., $\overline{T} = \frac{1}{N_{\text{solved}}} \sum_{i=1}^{N_{\text{solved}}} T_i$ (Abdulzada, 14 Jul 2025, Shen et al., 7 Nov 2024)
  • Cost (USD/token/task): aggregates API or infrastructure costs per run (Shen et al., 7 Nov 2024, Abdulzada, 14 Jul 2025)
  • Progress Rate (PR): $\mathrm{PR} = |M_C^{\text{completed}}| / |M_C^{\text{total}}|$ over command milestones (Gioacchini et al., 4 Oct 2024)
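
The following sketch computes several of these metrics from a list of per-trial records; the record fields are assumptions for illustration, not a shared benchmark log format.

```python
# Compute benchmark metrics from per-trial records (field names are illustrative assumptions).
from dataclasses import dataclass

@dataclass
class TrialRecord:
    solved: bool
    duration_s: float   # wall-clock time until exploit (only meaningful if solved)
    vulns_found: set    # vulnerability IDs discovered in this trial

def success_rate(trials):
    return sum(t.solved for t in trials) / len(trials)

def vulnerability_coverage(trials, all_vulns):
    found = set().union(*(t.vulns_found for t in trials))
    return len(found & all_vulns) / len(all_vulns)

def mean_time_to_exploit(trials):
    solved = [t.duration_s for t in trials if t.solved]
    return sum(solved) / len(solved) if solved else float("inf")

def precision_recall_f1(tp, fp, fn):
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```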

Qualitative metrics and categorical error analysis further capture failure modes, command misuse, and trace classification (syntax, semantic, environmental failures) (Happe et al., 14 Apr 2025).

Metrics are often aggregated over trials or scenarios with bootstrapped confidence intervals, enabling pairwise significance testing and benchmarking model variants (Happe et al., 14 Apr 2025).
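
A percentile bootstrap over per-trial success indicators is one common way to obtain such intervals; this sketch assumes NumPy and a simple 0/1 outcome vector.

```python
# Percentile-bootstrap confidence interval for a success rate (NumPy only).
import numpy as np

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """outcomes: array of 0/1 trial results. Returns a (low, high) CI for the mean."""
    outcomes = np.asarray(outcomes, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(outcomes), size=(n_boot, len(outcomes)))  # resample with replacement
    boot_means = outcomes[idx].mean(axis=1)
    return tuple(np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Example: 8 successes out of 20 trials for one model/scenario pair.
low, high = bootstrap_ci([1] * 8 + [0] * 12)
```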

4. Experimental Protocols, Baselines, and Agent Evaluation

Comprehensive benchmarking mandates reproducible, statistically sound experiment design, encompassing agent types, step controls, and baseline comparisons.

Canonical protocol elements:

  1. Environment instantiation: Consistent deployment of containers/VMs, full build recipes, and randomized identifiers to prevent overfitting (Happe et al., 14 Apr 2025, Mai et al., 11 Sep 2025).
  2. Trial controls: Run M models × R independent trials; enforce command/time constraints (e.g., 32 steps/trial), as in the harness sketch after this list (Happe et al., 14 Apr 2025).
  3. Baseline selection: Human pentesters (walkthrough logs), automated tooling (Metasploit, ZAP), prior LLM-driven frameworks (e.g., PentestGPT), rule-based/random agents for lower bounds (Deng et al., 2023, Potti et al., 10 Jan 2025, Happe et al., 14 Apr 2025).
  4. Data capture: Full logging of issued commands, I/O, token usage, and system state.
  5. Measurement and significance: Calculation of per-task, per-model metrics, with bootstrapped intervals and standardized reporting (Happe et al., 14 Apr 2025).
  6. Qualitative and error analysis: Review and categorization of agent failure traces, tool misuse, and dead-ends (Muzsai et al., 2 Dec 2024, Happe et al., 14 Apr 2025).
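
A skeleton of such a protocol (trial controls, data capture, and per-task result recording) might look like the following; the agent and environment interfaces are hypothetical placeholders rather than any published framework's API.

```python
# Minimal experiment-harness skeleton; Agent/Environment interfaces are hypothetical placeholders.
import json, time

MAX_STEPS = 32  # per-trial command budget, as enforced by several benchmarks

def run_trial(agent, env, log_path):
    """Run one trial: reset the target, loop until success or step budget, log everything."""
    env.reset()                                  # fresh container/VM state
    trace = []
    for step in range(MAX_STEPS):
        command = agent.next_command(env.observation())
        output = env.execute(command)
        trace.append({"step": step, "command": command, "output": output, "t": time.time()})
        if env.goal_reached():                   # e.g., flag read or root shell obtained
            break
    with open(log_path, "w") as fh:              # full data capture for later error analysis
        json.dump({"solved": env.goal_reached(), "trace": trace}, fh)
    return env.goal_reached(), len(trace)

def run_benchmark(models, scenarios, make_agent, make_env, repetitions=5):
    """M models x R independent trials per scenario; returns raw results for metric aggregation."""
    results = []
    for model in models:
        for scenario in scenarios:
            for r in range(repetitions):
                agent, env = make_agent(model), make_env(scenario)
                solved, steps = run_trial(agent, env, f"logs/{model}_{scenario}_{r}.json")
                results.append({"model": model, "scenario": scenario,
                                "trial": r, "solved": solved, "steps": steps})
    return results
```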

LLM-driven agent scaffolds increasingly adopt modular architectures separating planning, command generation, and result summarization, often leveraging finite state machines (AutoPT (Wu et al., 2 Nov 2024)), multi-agent coordination (ARTEMIS (Lin et al., 10 Dec 2025)), or memory-activated design for context resilience (TermiAgent (Mai et al., 11 Sep 2025)).
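
The separation of planning, command generation, and result summarization can be sketched as a simple loop over three roles; the `llm` call and prompt wording below are illustrative assumptions, not the architecture of AutoPT, ARTEMIS, or TermiAgent.

```python
# Sketch of a modular pentest-agent loop (planner -> command generator -> summarizer).
# The llm() function and prompt wording are illustrative assumptions.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion backend here")

def agent_step(goal: str, memory: list[str]) -> tuple[str, str]:
    """One iteration: plan the next sub-goal, then emit a single shell command for it."""
    plan = llm(f"Goal: {goal}\nFindings so far:\n" + "\n".join(memory) +
               "\nPropose the single next sub-goal.")
    command = llm(f"Sub-goal: {plan}\nReturn exactly one shell command to pursue it.")
    return plan, command

def summarize(command: str, output: str) -> str:
    """Condense raw tool output so long transcripts do not exhaust the context window."""
    return llm(f"Command: {command}\nOutput:\n{output}\nSummarize the security-relevant facts.")
```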

5. Results from Leading Benchmarks and Comparative Insights

Benchmark-driven studies consistently reveal substantial gaps between human experts, rule-based tooling, and current autonomous or LLM-driven agents.

Agent Success and Weaknesses:

  • On real-world end-to-end benchmarks such as "TermiBench," DeepSeek V3-based TermiAgent achieved compromise of 128/230 real-world hosts, versus near-zero for earlier LLM agents (Mai et al., 11 Sep 2025).
  • AutoPT set state-of-the-art for black-box web pentesting (GPT-4o-mini, 41% CR vs 22% for ReAct), halving time and cost compared to prior work (Wu et al., 2 Nov 2024).
  • AutoPenBench’s autonomous agent solved 21% of tasks, whereas human-assisted variants reached 64% (in-vitro: 27% vs 59%) (Gioacchini et al., 4 Oct 2024).
  • In enterprise-scale environments, ARTEMIS-ensemble agents rivaled (and, in one case, outperformed) human experts in absolute vulnerability count and cost-effectiveness, but lagged in submission validity and GUI-based exploit discovery (Lin et al., 10 Dec 2025).
  • PACEbench highlighted that no model could autonomously bypass realistic cyber defense layers (WAF/IDS/IPS), underscoring an unsolved challenge (Liu et al., 13 Oct 2025).

Failure Modes:

6. Best Practices, Pitfalls, and Recommendations

Emerging consensus, crystallized in comprehensive reviews and practical guides (Happe et al., 14 Apr 2025, Abdulzada, 14 Jul 2025, Muzsai et al., 2 Dec 2024), centers on the following recurring themes:

7. Open Research Directions and Future Challenges

Despite significant progress, current penetration testing benchmarks reveal persistent limitations:

  • Long-horizon reasoning: State-of-the-art LLM agents consistently fail on deep privilege escalation, multi-stage chains, and GUI- or browser-based exploits (Isozaki et al., 22 Oct 2024, Lin et al., 10 Dec 2025).
  • Defensive evasion: No LLM-driven agent to date can reliably bypass WAF/IDS/IPS defenses when fronting real-world vulnerabilities (Liu et al., 13 Oct 2025).
  • Cost, latency, and scale: While lightweight LLMs can be viable for smaller-scale tasks, a tradeoff exists between resource tractability and generality (Mai et al., 11 Sep 2025).
  • Autonomous triage and reduction of false positives: False positive rates for AI agents remain elevated compared to experienced humans, necessitating multi-agent triage and ensemble prompt-generation (Lin et al., 10 Dec 2025).
  • Process-level, not just outcome-level, evaluation: Rubric-based, hierarchical judge systems (e.g., PentestJudge) facilitate holistic, process-oriented evaluation and enable differentiated analysis of operational objectives, security, and tradecraft (Caldwell et al., 4 Aug 2025).
  • Dynamic, stepwise reasoning and retrieval augmentation: Integrating context condensation, structured task generation, and retrieval-augmented learning is critical to improving robustness and reproducibility (Isozaki et al., 22 Oct 2024, Pratama et al., 21 Aug 2024).

A plausible implication is that future pentest benchmarks will need to encompass real-world, multi-modality workflows, combine outcome and process evaluation, and emphasize fully transparent, community-maintained scenario corpora to enable continued progress in autonomous and hybrid penetration testing research. The field currently stands at the interface between rigorous testbed engineering, statistical evaluation, and the rapid evolution of autonomous agent reasoning powered by LLMs.
