AutoPenBench: Generative Pen Testing Benchmark
- AutoPenBench is a benchmark framework that evaluates generative AI agents for automated penetration testing using containerized environments and milestone-based metrics.
- It employs both synthetic and real-world tasks with diverse vulnerabilities to measure agent success via granular metrics like Success Rate and Progress Rate.
- The framework supports comparisons between fully autonomous and semi-autonomous architectures and promotes community-driven enhancements for evolving cybersecurity challenges.
AutoPenBench is an open benchmark framework designed to systematically evaluate generative AI agents for automated penetration testing. Developed to address the lack of standardized assessment tools for LLM-driven cybersecurity agents, AutoPenBench incorporates containerized environments, diverse real and synthetic vulnerable systems, granular milestone-based metrics, and the capacity to compare both fully autonomous and human-assisted agent architectures under transparent, reproducible conditions (Gioacchini et al., 4 Oct 2024).
1. Benchmarking Architecture and Infrastructure
AutoPenBench utilizes a Docker-based virtualization strategy to instantiate both agent workstations and vulnerable systems within a shared, isolated network environment. The agent workstation is typically provisioned with Kali Linux and an arsenal of pentesting tools (Nmap, Metasploit, Hydra, etc.). Each benchmark run involves the agent interacting over network protocols (SSH, HTTP, etc.) against targets designed as Capture-the-Flag (CTF) challenges. A milestone-based assessment framework divides each penetration task into:
- Command milestones: discrete actions or tool invocations (e.g., scanning, exploitation)
- Stage milestones: higher-level phases (e.g., infiltration, privilege escalation, flag capture)
Evaluation output consists of a binary Success Rate (SR: flag captured or not) and a Progress Rate (PR: proportion of milestones achieved).
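The sketch below illustrates the kind of containerized setup described above using the Docker Python SDK; it is not AutoPenBench's actual provisioning code, and the target image and network names are hypothetical placeholders.
```python
# Illustrative sketch (not AutoPenBench's actual API): an isolated Docker
# network hosting an agent workstation and one vulnerable target.
# The target image name and network name are hypothetical placeholders.
import docker

client = docker.from_env()

# Isolated bridge network shared by the agent and its targets.
net = client.networks.create("pentest_net", driver="bridge", internal=True)

# Kali-based workstation the agent drives (pentesting tools installed separately).
workstation = client.containers.run(
    "kalilinux/kali-rolling",
    command="sleep infinity",
    name="agent_workstation",
    network="pentest_net",
    detach=True,
)

# One CTF-style vulnerable target (placeholder image name).
target = client.containers.run(
    "autopenbench/vuln-target-example",   # hypothetical image
    name="target_0",
    network="pentest_net",
    detach=True,
)

print(workstation.status, target.status)
```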
2. Task Design and Categories
AutoPenBench comprises 33 tasks partitioned into two major classes:
- In-vitro tasks (22): Synthetically structured scenarios covering foundational cybersecurity domains:
- Access Control (AC): privilege escalation, misconfigured permissions
- Web Security (WS): path traversal, SQL injection, RCE
- Network Security (NS): port scanning, MITM attacks
- Cryptography (CRPT): brute-forcing, cryptographic exploitation
- Real-world tasks (11): Challenges modeled on authentic CVEs (2014–2024; CVSS 7.5–10.0), e.g., exploitation of Heartbleed (CVE-2014-0160), Spring4Shell (CVE-2022-22965), GeoServer RCE (CVE-2024-36401).
Each task stipulates "gold steps" (optimal command path) and annotated milestone counts, enabling reproducibility and fine-grained agent capability measurement.
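A hypothetical task schema (not the repository's actual format) helps make this concrete: each task bundles its category, target metadata, gold steps, and milestone annotations so runs can be scored reproducibly. All field names and values below are illustrative.
```python
# Hypothetical task schema (not the repository's actual format), sketching
# how a benchmark task can bundle category, CVE, gold steps, and milestone
# annotations for reproducible scoring. Requires Python 3.10+.
from dataclasses import dataclass

@dataclass
class PenTask:
    task_id: str
    category: str                  # e.g. "WS" (web security) or "real-world"
    cve: str | None                # CVE identifier for real-world tasks
    flag: str                      # secret proving full compromise
    gold_steps: list[str]          # optimal command path
    command_milestones: list[str]  # discrete actions to credit
    stage_milestones: list[str]    # higher-level phases

heartbleed = PenTask(
    task_id="rw-heartbleed",
    category="real-world",
    cve="CVE-2014-0160",
    flag="FLAG{example}",
    gold_steps=[
        "nmap -p 443 --script ssl-heartbleed 192.168.1.10",  # placeholder IP
        "python3 heartbleed_exploit.py 192.168.1.10",        # placeholder script
    ],
    command_milestones=["port scan", "vulnerability check", "memory leak read"],
    stage_milestones=["discovery", "exploitation", "flag capture"],
)
```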
3. Evaluation Metrics and Analysis
Agents in AutoPenBench are scored by:
- Success Rate (SR): a binary per-task outcome, $\mathrm{SR} \in \{0, 1\}$ (1 if the flag is captured), aggregated as the fraction of solved tasks.
- Progress Rate (PR): the proportion of annotated milestones reached, $\mathrm{PR} = \frac{\text{milestones achieved}}{\text{total milestones}}$.
- Milestone Analysis: PR and SR are stratified by stage to determine bottlenecks—whether agents struggle in discovery, exploitation, or post-exploitation.
Results show fully autonomous agents achieve an average SR of 21% (27% on in-vitro tasks, 9% on real-world tasks), while semi-autonomous (human-in-the-loop) agents reach 64% overall (73% on real-world tasks) by mitigating context drift and keeping the agent focused on short sequential sub-tasks.
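A minimal scoring sketch, following the SR/PR definitions above; the run dictionaries, milestone names, and per-stage grouping are illustrative, not the framework's internal data model.
```python
# Minimal sketch of the milestone-based scoring described above.
# Run structure, milestone names, and stage labels are illustrative.
from collections import defaultdict

def success_rate(runs: list[dict]) -> float:
    """Fraction of runs in which the flag was captured (binary per run)."""
    return sum(r["flag_captured"] for r in runs) / len(runs)

def progress_rate(run: dict) -> float:
    """Proportion of annotated milestones the agent reached in one run."""
    return len(run["milestones_reached"]) / run["total_milestones"]

def progress_by_stage(runs: list[dict]) -> dict[str, float]:
    """Average completion stratified by stage, to locate bottlenecks
    (discovery vs. exploitation vs. post-exploitation)."""
    reached, total = defaultdict(int), defaultdict(int)
    for r in runs:
        for stage, hit in r["stage_outcomes"].items():   # stage -> bool
            reached[stage] += int(hit)
            total[stage] += 1
    return {s: reached[s] / total[s] for s in total}

runs = [
    {"flag_captured": True, "milestones_reached": ["scan", "exploit", "flag"],
     "total_milestones": 3,
     "stage_outcomes": {"discovery": True, "exploitation": True, "post-exploitation": True}},
    {"flag_captured": False, "milestones_reached": ["scan"],
     "total_milestones": 3,
     "stage_outcomes": {"discovery": True, "exploitation": False, "post-exploitation": False}},
]
print(success_rate(runs), [progress_rate(r) for r in runs], progress_by_stage(runs))
```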
4. Agent Architectures: Autonomous vs Assisted
Two agent designs are benchmarked:
- Fully Autonomous Agents: These use a reasoning loop analogous to ReAct, decomposed into summary (context condensation), thought (reasoning), and action (command emission) procedures; a minimal sketch of this loop follows the list below. JSON-based output ensures structured execution traceability. Empirical observations indicate limited contextual persistence and exploitation accuracy, especially in complex tasks.
- Semi-Autonomous Agents: Human assistance segments objectives into sub-tasks, with the agent generating step reports, and the operator steering subsequent actions. This mitigates context accumulation and substantially improves PR and SR. Assisted agents adapt strategy dynamically and reset working memory at each sub-task boundary.
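The following sketch illustrates the autonomous summary/thought/action loop with JSON-structured actions; `llm` and `execute_on_workstation` are hypothetical stand-ins, not AutoPenBench's actual interfaces, and the flag convention is a placeholder.
```python
# Sketch of a ReAct-style autonomous loop with summary / thought / action
# phases and JSON-structured actions. `llm` and `execute_on_workstation`
# are hypothetical stand-ins, not AutoPenBench's actual interfaces.
import json

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM call here")

def execute_on_workstation(command: str) -> str:
    raise NotImplementedError("run the command on the Kali workstation, e.g. over SSH")

def autonomous_episode(task_description: str, max_steps: int = 30) -> bool:
    context = ""
    for _ in range(max_steps):
        # 1. Summary: condense accumulated findings to limit context drift.
        context = llm(f"Summarise the relevant findings so far:\n{context}")
        # 2. Thought: reason about the next move.
        thought = llm(f"Task: {task_description}\nFindings: {context}\nWhat should be done next?")
        # 3. Action: emit a structured JSON command for traceable execution.
        raw = llm(f"Reasoning: {thought}\nRespond as JSON: "
                  '{"action": "run", "command": "<shell command>"}')
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            context += "\n[formatting error: action was not valid JSON]"
            continue
        observation = execute_on_workstation(action["command"])
        context += f"\n$ {action['command']}\n{observation}"
        if "FLAG{" in observation:   # flag convention is a placeholder
            return True
    return False
```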
5. Impact of LLM Choice and Configuration
AutoPenBench exposes pronounced differences among LLMs:
- GPT-4o achieves 100% SR on simple access control, outperforming many others.
- GPT-4-turbo and Gemini Flash reach SRs of 40% and 0%, respectively, in the same setting.
- Models such as o1-preview are heavily constrained by guardrails against offensive-security prompts, delivering minimal PR (e.g., ~12.5%).
- Structured action output (JSON) is essential—models with unreliable formatting (e.g., GPT-4o-mini) are handicapped in evaluation.
This underscores that both model architecture and practical interfacing (formatting, prompt engineering) directly influence agent effectiveness across penetration stages.
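To illustrate why reliable formatting matters, the defensive parser below rejects malformed actions of the JSON shape sketched earlier; the schema and the fence-stripping heuristic are illustrative assumptions, not the framework's parser.
```python
# Why structured output matters: a defensive parser for a JSON action
# format. Models that emit malformed or fenced JSON fail this step and
# lose the turn. The required keys are an illustrative assumption.
import json
import re

REQUIRED_KEYS = {"action", "command"}

def parse_action(raw: str) -> dict | None:
    """Return a well-formed action dict, or None if the model output is unusable."""
    # Tolerate a common failure mode: JSON wrapped in a Markdown code fence.
    fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    try:
        action = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict) or not REQUIRED_KEYS.issubset(action):
        return None
    return action

print(parse_action('{"action": "run", "command": "nmap -sV 192.168.1.10"}'))
print(parse_action("Sure! Here is my plan..."))   # -> None (unusable output)
```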
6. Extensibility, Community Involvement, and Future Directions
AutoPenBench's open-source release (https://github.com/lucagioacchini/auto-pen-bench) is intended as a living platform:
- Expansion planned with additional vulnerable scenarios and new attack vectors.
- Retrieval-augmented generation (RAG) modules are in development, enabling context-aware enhancement via best-practices documentation and cybersecurity manuals (see the sketch after this list).
- The framework encourages new agent architectures, flexible milestone definitions, and community-driven scenario templates.
- Broader LLM inclusion, prompt optimizations, and evaluation strategies are anticipated, promoting longitudinal and generalizable benchmarking.
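Since the RAG module is still in development, the following is only a sketch of the retrieval step under simple assumptions: TF-IDF similarity (via scikit-learn) stands in for whatever retriever the framework adopts, and the document corpus is a placeholder.
```python
# Illustrative retrieval step for a RAG module: select the passages from
# pentesting documentation most relevant to the agent's current query.
# TF-IDF is a stand-in retriever; the corpus below is a placeholder.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Hydra can brute-force SSH credentials with a user list and password list.",
    "Heartbleed (CVE-2014-0160) leaks server memory via malformed TLS heartbeats.",
    "Path traversal allows reading files outside the web root via ../ sequences.",
]

vectorizer = TfidfVectorizer().fit(corpus)
doc_matrix = vectorizer.transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the agent's current query."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

print(retrieve("how to exploit the TLS heartbeat vulnerability"))
```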
A plausible implication is that AutoPenBench will act as a central resource for benchmarking generative agent progress in offensive security, enabling grounded comparison and annual community-led leaderboards.
7. Comparative Landscape and Scientific Significance
AutoPenBench fills the existing gap in pentesting agent evaluation, offering:
- Transparent, reproducible, multi-granular benchmarking across synthetic and real-world vulnerabilities.
- Direct comparison of agent, model, and milestone performance, facilitating robust ablation and interpretability studies.
- A standardized experimental environment that promotes fair advancement of autonomous cybersecurity AI.
The milestone-based evaluation framework, dual agent architecture analysis, and detailed LLM effect studies represent a significant step toward rigorous, domain-grounded assessment of generative penetration testing agents (Gioacchini et al., 4 Oct 2024).