- The paper introduces PACEbench, a benchmark that evaluates AI agents' cyber-exploitation capabilities using real-world scenarios with varying complexity.
- It details PACEagent, a modular framework integrating an LLM core, tool module, and memory module for phased and strategic penetration testing.
- Experimental results show that while closed-source LLMs perform better than open-source ones, none can bypass advanced defenses like production-grade WAFs.
PACEbench: A Comprehensive Benchmark for Evaluating Practical AI Cyber-Exploitation Capabilities
Motivation and Benchmark Design Principles
The increasing autonomy and tool-use capabilities of LLMs have raised concerns about their potential for automating sophisticated cyber offense. Existing evaluation frameworks, primarily based on CTF-style challenges, are limited by their artificial simplicity and lack of real-world complexity. PACEbench addresses these deficiencies by introducing a benchmark grounded in three core principles: vulnerability difficulty, environmental complexity, and the presence of cyber defenses. This design enables a more realistic and granular assessment of LLM-driven agents' cyber-exploitation capabilities.
PACEbench comprises four escalating scenarios (a minimal configuration sketch follows the list):
- A-CVE: Single, real-world CVE exploitation on a single host, with difficulty quantified by human pass rates.
- B-CVE: Blended environments with both vulnerable and benign hosts, requiring reconnaissance and target discrimination.
- C-CVE: Chained exploitation across multiple hosts, necessitating lateral movement and privilege escalation.
- D-CVE: Exploitation in the presence of production-grade WAFs, demanding defense evasion or novel bypass techniques.
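To make the escalation concrete, here is a minimal sketch of how the four scenario tiers could be represented as configuration records. All field names, CVE identifiers, and values below are illustrative assumptions for exposition, not PACEbench's actual schema or contents.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """Illustrative PACEbench scenario record (all fields are assumed)."""
    tier: str                     # "A-CVE" | "B-CVE" | "C-CVE" | "D-CVE"
    cves: tuple[str, ...]         # real-world CVE identifiers in scope
    hosts: int = 1                # total hosts in the environment
    benign_hosts: int = 0         # decoys to discriminate against (B-CVE)
    chained: bool = False         # lateral movement / pivoting needed (C-CVE)
    waf: bool = False             # production-grade WAF in front (D-CVE)
    human_pass_rate: float = 1.0  # difficulty proxy used for A-CVE tasks

# Hypothetical instances, one per tier (CVE IDs are placeholders):
scenarios = [
    Scenario("A-CVE", ("CVE-2021-41773",), human_pass_rate=0.6),
    Scenario("B-CVE", ("CVE-2021-41773",), hosts=4, benign_hosts=3),
    Scenario("C-CVE", ("CVE-2021-41773", "CVE-2019-15107"), hosts=3, chained=True),
    Scenario("D-CVE", ("CVE-2021-41773",), waf=True),
]
```

The monotone growth in hosts, chaining, and defenses mirrors the benchmark's design: each tier adds exactly one axis of realism on top of the previous one.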
Figure 1: An overview of PACEbench. An agent's score is a function of both task-specific difficulty and scenario complexity, which scales from isolated vulnerabilities to full multi-host environments.
Compared to prior benchmarks, PACEbench introduces blended and defended scenarios, simulating the uncertainty and layered defenses of real-world networks.
Figure 2: Comparison of cybersecurity benchmarks. PACEbench (center) incorporates complex elements like a WAF and multiple hosts, offering a more realistic simulation than traditional CTFs (right).
PACEagent: Architecture and Workflow
To effectively tackle the challenges posed by PACEbench, the authors introduce PACEagent, a modular agent framework that emulates the operational workflow of human penetration testers. The architecture consists of:
- LLM Core: Responsible for high-level reasoning, strategic planning, and phase management (reconnaissance, analysis, exploitation).
- Tool Module: Orchestrates both local and external cybersecurity tools via a tool router and the Model Context Protocol (MCP).
- Memory Module: Maintains a summarized history of interactions, enabling long-horizon reasoning and efficient context management.
The agent operates in a loop, iteratively analyzing the environment, planning actions, executing tools, and updating memory until objectives are met or a step limit is reached.
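The following is a minimal Python sketch of that loop, assuming a generic `llm` object that exposes `plan`, `summarize`, and `next_phase` methods; all names here are invented for illustration, and the real framework routes external tools through MCP rather than a local dictionary.

```python
# Minimal sketch of the PACEagent loop (all names are illustrative; the
# real framework dispatches external tools via the Model Context
# Protocol, abstracted here as a plain name -> callable map).

PHASES = ["reconnaissance", "analysis", "exploitation"]

class PACEAgentSketch:
    def __init__(self, llm, tools, max_steps=50):
        self.llm = llm          # LLM core: reasoning and phase management
        self.tools = tools      # tool router: maps tool name -> callable
        self.memory = []        # summarized interaction history
        self.max_steps = max_steps
        self.phase = PHASES[0]

    def _remember(self, observation):
        # Memory module: compress raw tool output so that long-horizon
        # context stays within the model's window.
        self.memory.append(self.llm.summarize(observation))

    def run(self, objective):
        for _ in range(self.max_steps):
            # LLM core plans the next action from objective, phase, memory.
            action = self.llm.plan(objective, self.phase, self.memory)
            if action.done:
                return action.result
            # Tool router executes the chosen local or MCP-backed tool.
            observation = self.tools[action.tool](**action.args)
            self._remember(observation)
            # Phase manager may advance recon -> analysis -> exploitation.
            self.phase = self.llm.next_phase(self.phase, self.memory)
        return None  # step budget exhausted without meeting the objective
```

Summarizing observations before storing them, rather than keeping raw transcripts, is what lets the loop run for many steps without exhausting the context window.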
Figure 3: The architecture of the PACEagent framework, highlighting the phase manager, tool router, and memory module as key enhancements for cybersecurity operations.
This structured, multi-phase approach enables more robust exploration and exploitation in complex, multi-stage environments.
Experimental Evaluation
Setup
Seven LLMs (four proprietary: Claude-3.7-Sonnet, Gemini-2.5-Flash, GPT-5-mini, o4-mini; three open-source: Deepseek-v3, Deepseek-r1, Qwen3-32B) were evaluated using both PACEagent and the CAI agent framework. The primary metric is the PACEbench score, a weighted sum of success rates across all four scenario types, using a Pass@5 criterion.
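This summary does not give the exact weighting, so the sketch below only illustrates the shape of the metric: a per-tier Pass@5 rate (a task counts as solved if any of five attempts succeeds), combined across tiers by a weighted sum. The weights and result numbers are placeholders, not the paper's.

```python
def pass_at_5(successes_per_task: list[int]) -> float:
    """Fraction of tasks solved at least once within 5 attempts."""
    return sum(1 for s in successes_per_task if s > 0) / len(successes_per_task)

def pacebench_score(results: dict[str, list[int]],
                    weights: dict[str, float]) -> float:
    """Weighted sum of Pass@5 rates across the four scenario tiers.

    results[k] holds per-task success counts (out of 5 attempts) for
    tier k; the weights are illustrative placeholders, not the paper's.
    """
    return sum(weights[k] * pass_at_5(v) for k, v in results.items())

# Hypothetical numbers, for shape only: all-zero D-CVE counts mirror the
# reported failure of every model to bypass a WAF.
results = {
    "A-CVE": [3, 0, 1, 5, 0], "B-CVE": [1, 0, 0, 0],
    "C-CVE": [0, 0, 1],       "D-CVE": [0, 0, 0, 0],
}
weights = {"A-CVE": 0.25, "B-CVE": 0.25, "C-CVE": 0.25, "D-CVE": 0.25}
print(f"PACEbench score (illustrative): {pacebench_score(results, weights):.3f}")
```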
Results and Analysis
- Overall Performance: The highest PACEbench score was 0.241, achieved by Claude-3.7-Sonnet. All models failed to bypass any WAF in the D-CVE scenarios.
- Vulnerability Difficulty: Success rates decline as CVE difficulty increases (measured by human pass rate). Notably, some vulnerabilities difficult for humans were solved by agents, likely due to LLMs' ability to rapidly generate and test payloads.
Figure 4: Number of models that successfully exploited each CVE, grouped by difficulty level as measured by human pass rate.
- Environmental Complexity: Introduction of benign hosts (B-CVE) and chained attack paths (C-CVE) significantly degraded agent performance. Agents often failed at reconnaissance, lateral movement, or privilege escalation steps.
- Cyber Defenses: No model succeeded in bypassing any WAF-protected scenario, indicating a current inability to autonomously defeat standard cyber defenses.
Figure 5: In the C-CVE setup, the agent must pivot from the front network to the internal network, simulating realistic lateral movement constraints.
- Closed vs. Open-Source Models: Closed-source models outperformed their open-source counterparts, owing primarily to larger context windows and generally stronger reasoning; the open-source models were bottlenecked by context length and failed multi-stage tasks as a result.
Figure 6: Performance of PACEagent across challenges in PACEbench. Green: Pass@5; Orange: partial success; Red: failure.
- Agent Architecture Comparison: PACEagent outperformed CAI by 65.2% in total PACEbench score, at the cost of 28% higher token usage. The structured, multi-phase workflow and MCP integration were critical for improved performance.
Implications and Future Directions
The results demonstrate that current LLMs do not pose a generalized autonomous cyber offense threat. Even the best-performing models are limited to isolated, simple vulnerabilities and are unable to handle realistic, multi-stage attacks or bypass modern defenses. This provides a clear baseline for tracking future advances and highlights the need for continued monitoring as LLM capabilities evolve.
From a practical perspective, PACEbench offers a robust methodology for pre-deployment risk assessment of LLM-driven agents in cybersecurity contexts. The modular design of PACEagent, particularly its phase management and memory mechanisms, provides a blueprint for developing more capable and auditable autonomous agents.
Theoretically, the findings suggest that scaling LLMs alone is insufficient for mastering complex cyber exploitation; advances in long-horizon planning, tool integration, and context management are required. The dual-use dilemma is underscored: while current models are not yet a major threat, future improvements could rapidly change the risk landscape, necessitating proactive governance and ethical research focus.
Future work should expand PACEbench to include binary exploitation, increase the diversity and scale of vulnerabilities, and further investigate the integration of advanced safety mechanisms in LLMs.
Conclusion
PACEbench establishes a new standard for evaluating the practical cyber-exploitation capabilities of AI agents, grounded in real-world complexity and defense scenarios. The empirical results highlight the current limitations of LLM-driven agents and provide a rigorous framework for tracking progress and ensuring the safe deployment of future AI systems in cybersecurity domains.