TermiBench: Real-world Autonomous Pentesting
- TermiBench is a real-world, agent-centric benchmark that shifts from flag capture to full system compromise using interactive shells.
- It simulates diverse Linux-based environments with 510 hosts featuring 30 CVEs across five noise tiers to mimic operational complexity.
- Its evaluation protocols use metrics like ESR, RSR, time-to-shell, and cost, demonstrating TermiAgent’s significant performance gains over traditional methods.
TermiBench denotes the first real-world, agent-centric benchmark for autonomous penetration testing, developed to address the limitations of prior capture-the-flag (CTF)–oriented evaluations in producing realistic performance estimates of AI-based penetration testing agents. Unlike legacy CTF benchmarks that focus primarily on finding “flags” within highly simplified environments, TermiBench adopts the more rigorous end-goal of actual system compromise, requiring agents to autonomously achieve interactive shells on target hosts that more accurately represent operational complexity in practice (Mai et al., 11 Sep 2025).
1. Design Principles and Scope
TermiBench fundamentally shifts the evaluation criterion from flag discovery to full system control, specifically by rewarding agents that obtain interactive shells, with additional merit given for achieving root access. The benchmark encompasses:
- Hosts and CVEs: 510 Linux-based hosts, each running a distinct instance of one of 30 real-world CVEs. These vulnerabilities span public disclosures from 2015–2025 and affect 25 heterogeneous services including Elasticsearch, ProFTPD, Joomla, ActiveMQ, phpMyAdmin, and more.
- Service and Environmental Complexity: Each host is augmented with a variable number of “benign” noise services (drawn from a pool of 14 common applications, such as sshd, vsftpd, mysql) to simulate operationally noisy environments. Complexity is parameterized in five tiers, as illustrated in the sketch after this list:
- Tier 0: 1 vulnerable + 0 benign (30 hosts)
- Tier 1: 1 vulnerable + 1 benign (120 hosts)
- Tier 2: 1 vulnerable + 3 benign (120 hosts)
- Tier 3: 1 vulnerable + 5 benign (120 hosts)
- Tier 4: 1 vulnerable + 7 benign (120 hosts)
- Target Topology: All environments utilize Linux containerization (Debian/Ubuntu variants) and operate over simulated flat /24 subnets. Agents are provided with only the subnet as input; no privileged knowledge is supplied. Windows services and proprietary hypervisor features are excluded, focusing the evaluation on open-source, cross-service interaction challenges.
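To make the tier parameterization concrete, the following minimal Python sketch shows how a single host definition could be composed as one vulnerable CVE instance plus a tier-dependent number of benign services. The service pools, CVE-to-service pairings, and function names here are illustrative assumptions, not TermiBench's actual identifiers or tooling.

```python
# Hypothetical sketch of per-host composition for each noise tier.
import random

# Illustrative subset; the real inventory covers 30 CVEs across 25 services.
VULN_CVES = [
    "CVE-2015-1427 (Elasticsearch)",
    "CVE-2015-3306 (ProFTPD)",
    "CVE-2023-23752 (Joomla)",
]

# Illustrative subset of the 14 benign "noise" services.
BENIGN_POOL = ["sshd", "vsftpd", "mysql", "postgresql", "redis", "nginx", "memcached"]

# Tier index -> number of benign services co-located with the vulnerable one.
TIER_BENIGN_COUNT = {0: 0, 1: 1, 2: 3, 3: 5, 4: 7}

def compose_host(tier: int, rng: random.Random) -> dict:
    """Return one host definition: a single vulnerable service plus noise."""
    return {
        "vulnerable": rng.choice(VULN_CVES),
        "benign": rng.sample(BENIGN_POOL, TIER_BENIGN_COUNT[tier]),
    }

if __name__ == "__main__":
    rng = random.Random(0)
    print(compose_host(tier=3, rng=rng))  # 1 vulnerable + 5 benign services
```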
2. Environment Instantiation and Agent Workflow
Hosts are instantiated via Docker orchestration (Docker Compose or Python scripts), which configure the requisite vulnerable and benign services for each scenario. Automated routines ensure live status of all network endpoints prior to agent interaction.
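A minimal sketch of such a liveness routine, assuming plain TCP reachability as the readiness criterion; the endpoints, retry counts, and function names are illustrative rather than the benchmark's actual verification scripts.

```python
# Poll every expected (host, port) pair until it accepts TCP connections,
# so the agent only starts once all services in the scenario are live.
import socket
import time

def port_is_live(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_until_live(endpoints, retries: int = 30, delay: float = 5.0) -> bool:
    """Retry until every endpoint is reachable or the retry budget is spent."""
    for _ in range(retries):
        pending = [(h, p) for h, p in endpoints if not port_is_live(h, p)]
        if not pending:
            return True
        time.sleep(delay)
    return False

if __name__ == "__main__":
    # Example: one vulnerable web service plus two benign noise services.
    targets = [("10.10.0.5", 8080), ("10.10.0.5", 22), ("10.10.0.5", 3306)]
    print("all endpoints live:", wait_until_live(targets))
```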
The agent workflow mirrors professional penetration testing procedures and is mandated to operate with full autonomy:
- Reconnaissance: Network scanning (e.g., nmap) identifies open ports and infers service versions.
- Service Enumeration: Services are fingerprinted using HTTP banners or protocol-specific probes, categorized as benign or potentially vulnerable.
- Vulnerability Correlation: Discovered versions are mapped against a curated CVE inventory. Exploit modules are retrieved from Metasploit or from UED-packaged Docker Proof-of-Concepts (PoCs).
- Exploitation: The agent deploys relevant exploits, aiming to establish interactive shells (reverse or bind shells). Post-exploitation privilege escalation is attempted for root access.
Agents receive no entry-point hints or privileged information beyond the network subnet and must run without human intervention, constrained only by a liberal timeout, resource budgets suitable for laptop-scale GPUs, and monetary tracking of LLM-based inference costs.
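The reconnaissance and correlation steps can be pictured with a short Python sketch that shells out to nmap and matches version banners against a toy CVE inventory. The inventory entries, output parsing, and function names are simplified assumptions for illustration and do not reflect TermiAgent's actual implementation.

```python
# Scan the subnet, then map '<port>/tcp open <service> <banner>' lines from
# nmap's service-version output onto a small version-to-CVE lookup table.
import re
import subprocess

# Hypothetical mapping from (service keyword, version prefix) to a candidate CVE.
CVE_INVENTORY = {
    ("proftpd", "1.3.5"): "CVE-2015-3306",
    ("elasticsearch", "1.4"): "CVE-2015-1427",
}

def scan_subnet(subnet: str) -> str:
    """Run an nmap service-version scan and return its plain-text output."""
    result = subprocess.run(
        ["nmap", "-sV", "-p-", "--open", subnet],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def correlate(nmap_output: str) -> list:
    """Collect ports whose banners match a known-vulnerable service version."""
    findings = []
    for line in nmap_output.splitlines():
        m = re.match(r"(\d+)/tcp\s+open\s+(\S+)\s+(.*)", line)
        if not m:
            continue
        port, service, banner = m.groups()
        for (keyword, version), cve in CVE_INVENTORY.items():
            if keyword in banner.lower() and version in banner:
                findings.append({"port": int(port), "service": service,
                                 "banner": banner, "candidate_cve": cve})
    return findings

if __name__ == "__main__":
    output = scan_subnet("10.10.0.0/24")  # the agent only knows the subnet
    print(correlate(output))
```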
3. Evaluation Protocols and Metrics
TermiBench introduces a rigorous evaluation suite with several quantitative and composite measures:
- Exploit Success Rate (ESR): the fraction of target hosts on which the agent obtains an interactive shell, $\mathrm{ESR} = N_{\text{shell}} / N_{\text{hosts}}$.
- Root Shell Rate (RSR): the fraction of target hosts on which the agent obtains a root shell, $\mathrm{RSR} = N_{\text{root}} / N_{\text{hosts}}$.
- Time-to-Shell ($T_{\text{shell}}$): the wall-clock time from agent launch to the first interactive shell on a host, reported as the average over successful hosts, $\bar{T}_{\text{shell}} = \frac{1}{N_{\text{shell}}} \sum_{i} T_{\text{shell}}^{(i)}$.
- Financial Cost ($C$): the LLM inference cost incurred per host, reported as the average over all attempts, $\bar{C} = \frac{1}{N_{\text{hosts}}} \sum_{i} C^{(i)}$.
- Composite Score: a single aggregate that combines the success, time, and cost measures above.
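A small Python sketch of how these metrics could be computed from per-host run records; the record layout (shell, root, minutes, cost_usd fields) is an assumed format used here for illustration, not a TermiBench artifact.

```python
# Compute ESR, RSR, average time-to-shell, and average cost from run records.
from statistics import mean

def summarize(runs: list) -> dict:
    shells = [r for r in runs if r["shell"]]
    roots = [r for r in runs if r["root"]]
    return {
        "ESR": len(shells) / len(runs),    # interactive shells / total hosts
        "RSR": len(roots) / len(runs),     # root shells / total hosts
        "avg_time_min": mean(r["minutes"] for r in shells) if shells else None,
        "avg_cost_usd": mean(r["cost_usd"] for r in runs),
    }

if __name__ == "__main__":
    runs = [
        {"shell": True,  "root": True,  "minutes": 9.5,  "cost_usd": 0.006},
        {"shell": True,  "root": False, "minutes": 14.0, "cost_usd": 0.009},
        {"shell": False, "root": False, "minutes": None, "cost_usd": 0.011},
    ]
    print(summarize(runs))
```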
Performance is benchmarked by comparing several baseline systems:
| Agent | Shells Obtained (Total) | Root Shells (Total) | Average Time (min) | Average Cost (USD) |
|---|---|---|---|---|
| TermiAgent | 118 | 56 | 11.79 (σ ≈ 2.3) | 0.0074 |
| VulnBot | 5 | 0 | 63.14 (σ ≈ 10.7) | 0.0996 |
| PentestGPT | 0 | 0 | – | – |
TermiAgent demonstrates an ESR and RSR two orders of magnitude higher than CTF-style counterparts under real-world, multi-service noise, with a substantial reduction in time-to-shell and cost.
4. Distribution and Reproducibility
The benchmark and supporting artifacts are publicly available via Zenodo (https://doi.org/10.5281/zenodo.16962513). Repository structure includes:
- hosts/: Dockerfiles and configurations for each CVE
- benign/: Dockerfiles for benign services
- compose/: Docker Compose templates for each environment tier
- verify/: Shell-detection scripts and Ansible playbooks
- docs/: Usage instructions, CVE maps, and network setup guides
Deployment involves standard container orchestration tools: users clone the repository, configure tier counts in compose/config.yml, and bring up an environment via docker-compose up --build; agents then commence pentesting against the advertised subnet. YAML templates and a unified make bench-up script reproduce isolated networks and ensure precise versioning of all targets.
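As a rough illustration of this flow, the following Python wrapper reads a tier's host count and brings up the corresponding Compose environment. The compose/config.yml path and the docker-compose invocation come from the description above; the key names inside the file and the per-tier template naming are assumptions made for the sketch.

```python
# Hypothetical deployment helper: read the configured host count for a tier
# and start the matching Docker Compose environment.
import subprocess
import yaml  # requires PyYAML

def bring_up(tier: int, config_path: str = "compose/config.yml") -> None:
    """Start all hosts configured for one noise tier."""
    with open(config_path) as fh:
        config = yaml.safe_load(fh)
    count = config.get(f"tier_{tier}", 0)        # assumed key naming
    compose_file = f"compose/tier{tier}.yml"     # assumed template naming
    print(f"starting {count} hosts for tier {tier} from {compose_file}")
    subprocess.run(
        ["docker-compose", "-f", compose_file, "up", "--build", "-d"],
        check=True,
    )

if __name__ == "__main__":
    bring_up(tier=2)
```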
5. Empirical Results and Observed Challenges
CTF-based agents such as VulnBot and PentestGPT substantially underperform on TermiBench. Shell acquisition rates fall below 10% for these agents despite their strong performance on CTF-style benchmarks. Notably, success for VulnBot drops by 66–71% upon removal of entry-point or exploit-path hints, revealing a fundamental reliance on externally supplied guidance.
Baseline agent performance degrades by more than 50% from the lowest- to highest-noise tiers due to severe confusion during service enumeration. The lack of “plug-and-play” exploits in real-world CVEs is pronounced; the inclusion of 1,378 UED-packaged Docker PoCs expands exploit coverage by a factor of 1.8 over Metasploit alone.
High-noise tiers (five and seven benign services) present the most substantial challenges, particularly when targeting services with non-standard exploit chains such as GeoServer, HugeGraph, and OFBiz, which demand structured UED-based manuals; naive retrieval approaches are inadequate in these cases.
Even with lightweight LLMs (e.g., Qwen3-4B), advanced agents such as TermiAgent maintain high practical utility, compromising 137 of 230 hosts under evaluation, compared to minimal success by VulnBot.
6. Broader Implications and Adoption
TermiBench provides a new standard for realism and rigor in the evaluation of autonomous penetration testing agents. By requiring shell compromise rather than flag acquisition, it closely models industry-relevant outcomes. The strong performance of memory-activated agents like TermiAgent—capable of over 50% shell success even on consumer hardware—suggests a significant advance in real-world AI-driven pentesting capability, scalable to laptop-grade infrastructure.
A plausible implication is the accelerated development of more robust, context- and memory-aware pentesting agents leveraging TermiBench’s reproducible, open-source evaluation environment. The benchmark enables direct, apples-to-apples comparison across research efforts under strictly controlled, real-world aligned conditions, thereby enhancing confidence in reported advances in automated security assessment methodologies (Mai et al., 11 Sep 2025).