PACEbench: Autonomous Cyber-Exploitation Benchmark
- PACEbench is a benchmark that rigorously evaluates autonomous cyber-exploitation using real-world vulnerabilities, environmental complexity, and active defenses.
- It defines graduated scenario types—ranging from single vulnerability tests (A-CVE) to advanced WAF-protected challenges (D-CVE)—for practical assessment.
- PACEagent, the modular AI framework integrated with PACEbench, enhances performance through LLM-based planning, tool integration, and persistent multi-stage context.
PACEbench is a benchmark and evaluation framework aimed at rigorously assessing the practical autonomous cyber-exploitation capabilities of AI agents, with a particular focus on LLMs and autonomous agent architectures. It is designed to overcome the limitations of traditional cybersecurity benchmarks, which typically employ artificial, toy problems and fail to capture the complexity, uncertainty, and adversarial conditions of real-world penetration testing. By introducing scenarios with graduated difficulty, environmental realism, and defensive countermeasures, PACEbench provides a structured platform for measuring current and future AI models’ aptitude for both targeted and generalized cyber offense.
1. Foundational Principles and Framework Design
PACEbench is built on three foundational pillars: vulnerability difficulty, environmental complexity, and cyber defense.
- Vulnerability Difficulty: Scenario vulnerabilities are mapped to real-world Common Vulnerabilities and Exposures (CVEs) with empirically measured difficulty, such as human penetration tester pass rates ranging from 30% to 86%. These metrics anchor the benchmark to practical, documented exploits rather than artificially constructed flaws.
- Environmental Complexity: Instead of assuming universal target vulnerability (a common flaw in existing benchmarks), PACEbench features realistic multi-host network environments containing both benign and compromised hosts. This configuration forces agents to perform reconnaissance and correctly discriminate targets, introducing noise and increasing challenge.
- Cyber Defense: Active countermeasures, such as state-of-the-art production Web Application Firewalls (WAFs), are deployed to simulate adversarial defense and test the agent's ability to execute adversarial bypass or zero-day logic exploitation.
The framework explicitly moves beyond the “presumption of guilt” (the assumption that every host is vulnerable) and beyond isolated single-vulnerability CTF challenges, introducing scenarios that require multi-stage attacks, chained exploitation, lateral movement, and defense evasion.
2. Scenario Taxonomy and Technical Challenges
PACEbench comprises four scenario types, each incrementally increasing in difficulty and complexity:
| Scenario Type | Key Features | Example Challenge |
|---|---|---|
| A-CVE | Single vulnerability on an isolated host, based on real CVEs | SQLi, RCE with quantified pass rates |
| B-CVE | Blended network: mix of benign and compromised hosts | Distinguishing vulnerable hosts |
| C-CVE | Chained exploitation, allows pivoting for lateral network movement | Multi-stage attacks and chain logic |
| D-CVE | Known vulnerability protected by production-grade WAF (e.g. ModSecurity CRS, Naxsi, Coraza) | WAF bypass or zero-day exploitation |
- A-CVE: The agent must successfully exploit a single, known CVE in an isolated setting, thereby testing baseline exploit capability.
- B-CVE: The agent operates in an environment with multiple hosts, some benign, requiring comprehensive reconnaissance and analysis to filter false positives.
- C-CVE: This scenario emulates full penetration testing, requiring the agent to use an exploited system as an entry point for further attacks—demanding strategic planning for sequential, multi-host exploitation.
- D-CVE: The most difficult, involving defense-protected vulnerabilities, demanding WAF bypass or exploitation of a fundamentally novel flaw.
Environmental configuration parameters (e.g., B1, BK, BN, which vary the mix of host types) further modulate challenge difficulty.
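To make the taxonomy concrete, the sketch below shows one way such a scenario might be described declaratively. The `Host` and `Scenario` dataclasses and every field name are illustrative assumptions, not PACEbench's actual configuration schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Host:
    """One host in the scenario network (hypothetical schema)."""
    address: str
    vulnerable: bool                  # benign hosts add reconnaissance noise
    cve_id: Optional[str] = None      # real-world CVE backing the flaw, if any


@dataclass
class Scenario:
    """Illustrative description of a PACEbench-style scenario."""
    scenario_type: str                # "A-CVE", "B-CVE", "C-CVE", or "D-CVE"
    hosts: List[Host] = field(default_factory=list)
    allow_pivoting: bool = False      # C-CVE: an exploited host becomes an entry point
    waf: Optional[str] = None         # D-CVE: e.g. "ModSecurity CRS", "Naxsi", "Coraza"


# Example: a blended B-CVE environment with one vulnerable and two benign hosts.
b_cve = Scenario(
    scenario_type="B-CVE",
    hosts=[
        Host("10.0.0.2", vulnerable=True, cve_id="CVE-XXXX-XXXXX"),  # placeholder CVE ID
        Host("10.0.0.3", vulnerable=False),
        Host("10.0.0.4", vulnerable=False),
    ],
)
```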
3. Autonomous Agent Architecture: PACEagent
To address the intricacies of these scenarios, the PACEbench team introduces PACEagent, which emulates the practice and workflow of skilled human penetration testers using a modular, multi-phase architecture.
- LLM Core: Responsible for mission interpretation, planning, and high-level reasoning, leveraging LLM capabilities for strategic command generation.
- Tool Module: Employs a tool router with the Model Context Protocol (MCP) for adaptive invocation of command-line utilities and external cybersecurity tools (e.g., Burp Suite). This enables granular control over exploitation, reconnaissance, and analysis tasks.
- Memory Module: Maintains a persistent and structured context of actions, observations, and reasoning. By aggregating intermediate state, it enables robust execution of long-horizon multi-step operations in complex, multi-host scenarios.
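As a rough illustration of how such structured context might be accumulated, the sketch below models the memory module as an append-only log of (phase, action, observation, reasoning) records that can be serialized back into the next prompt. The class and method names are assumptions for illustration, not PACEagent's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryEntry:
    """One step of agent history: what was done, what came back, and why."""
    phase: str          # "reconnaissance", "analysis", or "exploitation"
    action: str         # the command or tool invocation issued
    observation: str    # tool output / environment response
    reasoning: str      # the LLM's rationale for the action


@dataclass
class AgentMemory:
    """Persistent, structured context for long-horizon, multi-host operations."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def record(self, phase: str, action: str, observation: str, reasoning: str) -> None:
        self.entries.append(MemoryEntry(phase, action, observation, reasoning))

    def as_context(self, last_n: int = 20) -> str:
        """Serialize recent history for inclusion in the next LLM prompt."""
        return "\n".join(
            f"[{e.phase}] {e.action} -> {e.observation} ({e.reasoning})"
            for e in self.entries[-last_n:]
        )
```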
PACEagent executes in a loop divided into three stages: reconnaissance, analysis, and exploitation. An agent server orchestrates this loop, managing tasks, logging performance, and enabling automated benchmarking.
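A highly simplified sketch of that loop follows. The `llm`, `tools`, and `memory` objects, and methods such as `plan`, `invoke`, and `judge_success`, are hypothetical interfaces standing in for the LLM core, the MCP-based tool router, and the memory module; the real orchestration performed by the agent server is considerably more elaborate.

```python
def run_episode(llm, tools, memory, mission: str, steps_per_phase: int = 15) -> bool:
    """Drive a reconnaissance -> analysis -> exploitation loop (hypothetical interfaces)."""
    for phase in ("reconnaissance", "analysis", "exploitation"):
        for _ in range(steps_per_phase):
            # LLM core: plan the next action from the mission and the accumulated context.
            decision = llm.plan(mission=mission, phase=phase, context=memory.as_context())
            if decision.done:  # the planner judges this phase complete
                break
            # Tool module: route the chosen command-line utility or external tool call.
            observation = tools.invoke(decision.tool, decision.args)
            # Memory module: persist action, observation, and reasoning for later steps.
            memory.record(phase=phase,
                          action=f"{decision.tool} {decision.args}",
                          observation=observation,
                          reasoning=decision.reasoning)
    # Final check: did the exploitation phase achieve the mission objective?
    return llm.judge_success(mission=mission, context=memory.as_context())
```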
4. Experimental Results and Quantitative Evaluation
Evaluation of seven frontier LLMs (including both proprietary and open-source models) reveals marked deficiencies in current autonomous cyber-exploitation capability, especially as scenario complexity increases.
- A-CVE Results: Leading models such as Claude-3.7-Sonnet demonstrate moderate success at exploiting isolated, known vulnerabilities, with a benchmark score of 0.241 in these simple environments.
- B-CVE & C-CVE Results: Performance degrades noticeably in blended and chained scenarios, where agents must conduct reconnaissance, resolve uncertainty, and execute multi-stage attacks. Model performance often fails to scale to these more realistic tasks.
- D-CVE Results: In scenarios defended by state-of-the-art WAFs, none of the tested models (nor their agent wrappers) succeeded in bypassing the deployed countermeasures, highlighting a current inability to autonomously mount practical cyber offense.
PACEagent achieves substantial performance improvements over competing agent frameworks (e.g., CAI), scoring 65.2% higher overall on the benchmark at roughly a 28% increase in token usage. This suggests that modular agent design, multi-phase operational logic, and persistent memory materially improve autonomous exploitation outcomes.
Aggregate scoring across tasks uses a weighted sum:

$$S = w_A S_A + w_B S_B + w_C S_C + w_D S_D$$

where:
- $S_A$ is the mean score over the 17 A-CVE tasks,
- $S_B$ is the mean score over the 7 B-CVE tasks,
- $S_C$ is the mean score over the 5 C-CVE tasks,
- $S_D$ is the score for the D-CVE scenario, and $w_A$, $w_B$, $w_C$, $w_D$ are the corresponding scenario weights.
This weighted aggregation ensures normalized evaluation across challenge breadth and intrinsic scenario complexity.
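Expressed in code, the aggregation is a simple weighted sum; the function below is a sketch, and the caller must supply the scenario weights, which are not reproduced here.

```python
def pace_score(s_a: float, s_b: float, s_c: float, s_d: float,
               w_a: float, w_b: float, w_c: float, w_d: float) -> float:
    """Weighted aggregate over the four scenario families.

    s_a, s_b, s_c are the mean per-task scores over the 17 A-CVE, 7 B-CVE, and
    5 C-CVE tasks; s_d is the D-CVE score; the w_* weights are supplied by the caller.
    """
    return w_a * s_a + w_b * s_b + w_c * s_c + w_d * s_d
```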
5. Limitations and Implications
Empirical results indicate that present-day LLM-based agents are not yet a generalized threat in cyber offense—they lack the facility to autonomously operate in complex, realistic, multi-host networks, and cannot circumvent modern defensive countermeasures. Key limitations include:
- Insufficient long-horizon planning and strategic chaining of exploits.
- Inability to reliably distinguish benign from vulnerable hosts in noisy environments.
- Ineffectiveness against active defenses such as WAFs, with no success in zero-day or evasive exploitation.
A plausible implication is that substantial improvements in LLM contextual memory, tool-integration logic, and sequential planning will be required before future models approach practical penetration-testing capability.
PACEbench offers the means to monitor these improvements responsibly and to identify emergent risk vectors before wider deployment.
6. Benchmark Impact and Role in AI Cybersecurity Research
PACEbench diverges substantially from traditional cyber challenge paradigms (e.g., CTF), which are criticized for artificial simplicity and unrealistic assumptions. Its commitment to real-world vulnerabilities, dynamic environmental complexity, and adversarial defensive measures contributes to more rigorous, risk-relevant evaluation of autonomous AI models and agent architectures.
The benchmark provides a controlled, empirical substrate for tracking advancement in autonomous cyber-exploitation and for informing trustworthy development in AI for penetration testing, vulnerability management, and cyber-defense research. Given the prevailing performance gaps, its results also offer a degree of reassurance about the immediate cyber risk posed by contemporary LLM deployments.
PACEbench thus serves both as a litmus test for practical AI capabilities in cyber offense and as a safety net for secure AI development and deployment in sensitive domains.