
TermiBench: Autonomous Pentest Benchmark

Updated 15 September 2025
  • TermiBench is an open-source, agent-oriented benchmark that simulates realistic network conditions for autonomous penetration testing.
  • It comprises 510 target hosts with heterogeneous mixtures of vulnerable and benign services, enforcing authentic reconnaissance and exploitation cycles.
  • The integrated TermiAgent framework mitigates challenges like long-context forgetting and exploit variability, achieving significant efficiency and robustness gains.

TermiBench is an open-source, agent-oriented benchmark specifically designed for real-world automated penetration testing. It provides a fine-grained evaluation platform where the objective is full system compromise—achieving an interactive system shell—rather than traditional CTF-style “flag finding.” TermiBench is architected to accurately reflect operational conditions by removing prior knowledge, introducing heterogeneous benign services, and requiring agents to perform authentic reconnaissance, service discrimination, and robust exploit execution in uncontrolled environments.

1. Benchmark Construction and Organization

TermiBench consists of 510 distinct target hosts, each configured to reflect real-world network conditions. Hosts are built with 30 unique CVEs spanning 25 software services over a ten-year vulnerability window (2015–2025). Each host includes a mix of vulnerable and benign background services, selected from a set of 14 mainstream applications such as sshd, vsftpd, mysql, and nginx. The benchmark is stratified into five tiers based on the number of benign services present:

Tier  Vulnerable Services  Benign Services  Hosts per Tier
0     1                    0                [exact count in paper]
1     1                    1                ...
2     1                    3                ...
3     1                    5                ...
4     1                    7                ...

This layered design compels agents to perform accurate attack surface identification under realistic service noise and forces a full reconnaissance/exploitation cycle in each scenario.
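As an illustration, a tier-stratified host specification could be composed like this. The benign-service pool shown is limited to the four examples named above (the full benchmark draws from 14 applications), and the composition logic is an assumption, not the benchmark's build system:

```python
import random

# Benign services named in the article; the full pool of 14
# mainstream applications is not enumerated here.
BENIGN_POOL = ["sshd", "vsftpd", "mysql", "nginx"]

# Benign-service counts per tier, as given in the tier table.
TIER_BENIGN_COUNT = {0: 0, 1: 1, 2: 3, 3: 5, 4: 7}

def build_host(tier, vulnerable_cve, rng=random):
    """Compose a hypothetical host spec: one vulnerable service plus
    tier-dependent benign background services."""
    n_benign = TIER_BENIGN_COUNT[tier]
    # Sample with replacement, since the example pool is smaller than 7.
    benign = [rng.choice(BENIGN_POOL) for _ in range(n_benign)]
    return {"tier": tier, "vulnerable": vulnerable_cve, "benign": benign}

host = build_host(2, "CVE-placeholder")
```

Each tier keeps exactly one vulnerable service while scaling the background noise, which is what forces the attack-surface identification the paragraph above describes.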

2. Assessment Paradigm and Evaluation Metrics

Existing pentesting agent evaluations typically use oversimplified CTF setups in which solutions are guided by prior knowledge such as credential leaks, explicit service versions, or fixed exploit paths. TermiBench eliminates such confounding factors. Agents launched on TermiBench are given only minimal starting information (usually a subnet), requiring them to autonomously:

  1. Enumerate all services on each host,
  2. Discriminate benign from exploitable services,
  3. Choose and execute suitable exploits, and
  4. Demonstrate system ownership by establishing a remote shell.
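These four stages can be sketched as a minimal control loop. Every function here is a hypothetical stand-in for real tooling (scanners, fingerprinters, exploit runners), not the benchmark's API:

```python
def enumerate_services(host):
    # Stand-in for a port/service scan (e.g. nmap-style enumeration).
    return [{"port": 21, "name": "vsftpd"}, {"port": 80, "name": "nginx"}]

def is_exploitable(service):
    # Stand-in for service discrimination: version fingerprinting,
    # CVE matching, etc. Here one vulnerable service is hard-coded.
    return service["name"] == "vsftpd"

def run_exploit(host, service):
    # Stand-in for exploit selection and execution; returns a shell
    # handle on success, None on failure.
    return {"host": host, "shell": True}

def pentest(host):
    """Minimal sketch of the enumerate/discriminate/exploit/own cycle."""
    for service in enumerate_services(host):       # 1. enumerate
        if not is_exploitable(service):            # 2. discriminate
            continue
        shell = run_exploit(host, service)         # 3. exploit
        if shell and shell["shell"]:               # 4. demonstrate ownership
            return shell
    return None
```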

Performance is quantified by:

  • Number of successful penetrations (shell and root shell acquisition),
  • Execution time (wall-clock and LLM token-based),
  • Financial cost (in LLM API usage).
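These metric families can be tracked with a simple per-run record and summed across a batch of evaluation runs (field names are illustrative, not the benchmark's reporting format):

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """Hypothetical per-run record of the three metric families."""
    shells: int          # successful penetrations (user shells)
    root_shells: int     # root-level shells
    wall_clock_s: float  # elapsed wall-clock time in seconds
    tokens: int          # LLM tokens consumed
    cost_usd: float      # LLM API spend

def aggregate(runs):
    """Sum counts and costs across a batch of evaluation runs."""
    return RunMetrics(
        shells=sum(r.shells for r in runs),
        root_shells=sum(r.root_shells for r in runs),
        wall_clock_s=sum(r.wall_clock_s for r in runs),
        tokens=sum(r.tokens for r in runs),
        cost_usd=sum(r.cost_usd for r in runs),
    )
```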

The paper presents direct comparisons demonstrating that agents given prior hints dramatically outperform those starting blind; TermiBench thereby exposes the gap in genuine autonomous capability.

3. Unique Challenges in Automated Real-World Penetration Testing

Penetration testing agents in realistic environments are subject to two principal challenges:

Long-Context Forgetting: As agents conduct multi-step reconnaissance and exploitation over extended timeframes, standard LLMs lose context necessary for effective decision-making. TermiBench systematically reveals this by requiring attack sequences that exceed the short-term memory of traditional agent designs.

Exploit Arsenal Reliability: Exploits sourced from public repositories are structurally heterogeneous, inconsistently documented, and non-uniform in dependencies. Simple retrieval-based strategies are unreliable, often failing in real deployment due to incompatible environments or missing operational details.

TermiBench directly targets these issues by forcing agents to act without enumerated paths and requiring effective memory management and structured exploit integration.
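The memory-management mitigation can be illustrated with a toy phase-keyed store: rather than replaying the full interaction transcript, only observations from phases relevant to the current step are reactivated. This is a sketch of the general idea, not TermiAgent's actual mechanism:

```python
class PhaseMemory:
    """Toy phase-keyed memory: observations are stored under the
    attack phase that produced them, and only the phases relevant to
    the current step are re-injected into the model's context."""

    def __init__(self):
        self.store = {}  # phase name -> list of observations

    def record(self, phase, observation):
        self.store.setdefault(phase, []).append(observation)

    def activate(self, relevant_phases):
        """Return only observations from the listed phases, keeping
        the prompt small instead of replaying the full history."""
        return [obs for p in relevant_phases for obs in self.store.get(p, [])]

mem = PhaseMemory()
mem.record("recon", "port 21 open: vsftpd 2.3.4")
mem.record("recon", "port 80 open: nginx 1.18")
mem.record("exploit", "vsftpd backdoor triggered")
context = mem.activate(["exploit"])
```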

4. The TermiAgent Framework

TermiAgent is a multi-agent system engineered to meet TermiBench’s stringent requirements. It introduces several key modules:

  • Reasoner Module: High-level planning, detecting successful system compromise and setting phased operational goals.
  • Assistant Module: Converts phased goals to actionable, stepwise instructions; interacts with the memory system to verify ongoing attack progress.
  • Executor Module: Runs instructions and relays raw execution output.
  • Memory Module: Implements a Penetration Memory Tree (hierarchical context organization), with a Located Memory Activation mechanism that selectively reactivates previously collected evidence relevant to the current phase, thereby mitigating long-context forgetting.
  • Arsenal Module: Standardizes public exploits using the Unified Exploit Descriptor (UED), abstracting over ten semantic and operational dimensions (language, version, dependencies, scripts, setup and exploit steps, usage examples) to support robust, reproducible engagement.

This modular architecture enables fine-grained task decomposition, active context filtering, and reliable exploit execution—all validated under diverse operational conditions.
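A minimal sketch of how control could flow between the Reasoner, Assistant, Executor, and Memory modules follows; the interfaces and stub logic are assumptions for illustration, not the published TermiAgent implementation:

```python
class Reasoner:
    """Sets phased goals and decides when the system is compromised."""
    def next_goal(self, memory):
        return "done" if "shell acquired" in memory else "get shell"

class Assistant:
    """Turns a phased goal into a concrete, executable instruction."""
    def instruction_for(self, goal):
        return {"get shell": "run exploit against vsftpd"}.get(goal)

class Executor:
    """Runs an instruction and returns raw output (stubbed here)."""
    def run(self, instruction):
        return "shell acquired" if "exploit" in instruction else "no result"

def agent_loop(max_steps=5):
    memory = []  # stand-in for the Penetration Memory Tree
    reasoner, assistant, executor = Reasoner(), Assistant(), Executor()
    for _ in range(max_steps):
        goal = reasoner.next_goal(memory)
        if goal == "done":
            return memory
        output = executor.run(assistant.instruction_for(goal))
        memory.append(output)  # the Memory module records each observation
    return memory
```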

5. Technical Specifications: Penetration Memory and Exploit Integration

The Penetration Memory Tree structures all collected information hierarchically by attack phase. Each agent interaction involves backward traversals of this tree, reactivating only those observations directly relevant to the current step. This reduces irrelevant context load and supports decisions in high-churn attack environments.

The UED artifact ensures that exploits, irrespective of their source, are transformed into agent-ready modules by encoding:

  • Programming language and version,
  • Base image and system dependencies,
  • Entry scripts and parameters,
  • Setup and execution instructions.

This descriptor format allows dynamic exploit containerization and operationalization, overcoming the brittleness of public exploit scripts.
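As an illustration, a UED-style descriptor might be modeled as a small record that can be rendered into a container build recipe; the field names and rendering logic here are assumptions, not the paper's exact schema:

```python
from dataclasses import dataclass, field

@dataclass
class UnifiedExploitDescriptor:
    """Hypothetical UED-style schema covering the dimensions listed
    above; field names are illustrative."""
    language: str
    language_version: str
    base_image: str
    dependencies: list = field(default_factory=list)
    entry_script: str = ""
    setup_steps: list = field(default_factory=list)

def to_dockerfile(ued):
    """Render a descriptor into a container build recipe, showing how
    a descriptor could drive dynamic exploit containerization."""
    lines = [f"FROM {ued.base_image}"]
    for dep in ued.dependencies:
        lines.append(f"RUN pip install {dep}")
    for step in ued.setup_steps:
        lines.append(f"RUN {step}")
    lines.append(f'ENTRYPOINT ["{ued.language}", "{ued.entry_script}"]')
    return "\n".join(lines)

ued = UnifiedExploitDescriptor(
    language="python", language_version="3.10",
    base_image="python:3.10-slim",
    dependencies=["requests"],
    entry_script="exploit.py",
)
```

Rendering descriptors into containers in this way isolates each exploit's runtime and dependencies, which is what lets heterogeneous public scripts be executed uniformly.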

6. Experimental Outcomes and Comparisons

In direct evaluation, TermiAgent outperforms contemporary agents, including PentestGPT, VulnBot, and in-benchmark baselines. On CTF-style tasks with hints, TermiAgent achieves roughly 1.7× greater shell acquisition. In strict real-world scenarios without extra hints, it obtains more than 8× as many system shells as VulnBot, while reducing execution time to less than one-fifth and financial cost to one-tenth of VulnBot's. TermiAgent maintains robustness across LLM configurations, often matching performance even when deployed on laptop-class hardware.

7. Practical Applications and Implications

TermiBench and TermiAgent establish a benchmark and agent suite closely aligned to practical enterprise needs. Key implications include:

  • Feasibility of fully autonomous penetration testing with only network location inputs,
  • Substantial reduction in manual engagement, operational expense, and elapsed testing times,
  • Compatibility with consumer hardware (laptops, smartphones) by virtue of lightweight LLM requirements,
  • Framework extensibility for dynamic exploit arsenal updates and real-world scenario simulation.

This confirms that TermiBench and TermiAgent address enduring obstacles in pentesting automation, delivering objective evaluation for scalable AI-driven security assessment.

8. Context within Agent-Oriented Security Research

TermiBench’s contribution lies in its departure from traditional flag-centric benchmarks toward an ownership-centric approach. By integrating a complex mixture of services and vulnerabilities and enforcing realistic operational constraints, it provides a more rigorous and representative platform for penetration agent research. The agent framework supplies methodological innovations in context management and exploit execution not found in prior approaches. This positions TermiBench as an objective standard for evaluating and developing next-generation AI penetration tools in both research and industrial settings (Mai et al., 11 Sep 2025).
