TermiAgent Framework: Autonomous Pentesting
- TermiAgent Framework is a multi-agent autonomous penetration testing system designed to overcome CTF benchmark limitations by targeting full system control.
- It employs a five-module pipeline with advanced memory management and a structured exploit arsenal to facilitate blind reconnaissance and adaptive exploit execution.
- Large-scale evaluations on TermiBench show TermiAgent achieves significantly higher success rates, reduced time, and lower costs compared to similar agents.
TermiAgent is a multi-agent autonomous penetration testing framework designed to address the deficiencies of AI-based pentesting agents evaluated under unrealistic capture-the-flag (CTF) benchmarks. TermiAgent, in conjunction with the TermiBench benchmark, targets the acquisition of full system control (‘shell or nothing’) in real-world attack surfaces, introducing methodological advances in agent memory management and exploit arsenal construction to match the demands of practical, end-to-end penetration testing (Mai et al., 11 Sep 2025).
1. Motivation and Problem Definition
Recent AI-based penetration testing agents (e.g., PentestGPT, VulnBot) have been primarily evaluated on CTF-style environments. These environments are intrinsically biased: they embed prior knowledge (e.g., leaked credentials, fixed vulnerability locations, single-vulnerable-service VMs) and lack environmental ‘noise’, thus understating true operational complexity. This results in agents that do not generalize to blind, real-world settings where service enumeration, autonomous reconnaissance, and robust exploit execution are mandatory. The TermiAgent framework is created to address:
- Blind-start reconnaissance, where agents have initial access only to an IP or subnet.
- Multi-service noise, where up to eight services per host obfuscate the presence of a single vulnerable target.
- End-to-end full system control as the success criterion, rather than simple ‘flag finding’.
- Long attack chains that strain LLM context and the orchestration of complex exploit logic.
2. The TermiBench Benchmark
TermiBench provides an evaluation environment to properly assess automated agents under these constraints.
| Tier | Benign Services | Vulnerable Services | Hosts per Tier |
|---|---|---|---|
| 0 | 0 | 1 | 30 |
| 1 | 1 | 1 | 120 |
| 2 | 3 | 1 | 120 |
| 3 | 5 | 1 | 120 |
| 4 | 7 | 1 | 120 |
- Hosts are drawn from a pool of 510 VMs, spanning 30 real-world remote code execution (RCE) CVEs (2015–2025) and 25 distinct services. Background services are representative of the 14 most prevalent Internet applications.
- Success is defined as obtaining an interactive shell, with an optional metric for root shells.
- Measured metrics include (i) success rate, (ii) time per successful exploit, and (iii) financial cost, computed as LLM API token consumption multiplied by per-token pricing.
TermiBench thus enforces multi-stage, system-level reasoning and resilience against context dilution.
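The two quantitative metrics above are straightforward to sketch in Python; the token counts and per-token prices in the usage example are illustrative placeholders, not figures from the paper:

```python
def run_cost(prompt_tokens: int, completion_tokens: int,
             price_in: float, price_out: float) -> float:
    """Financial cost of one run: token consumption times per-token pricing."""
    return prompt_tokens * price_in + completion_tokens * price_out

def success_rate(shells: int, hosts: int) -> float:
    """Fraction of evaluated hosts on which an interactive shell was obtained."""
    return shells / hosts

# Illustrative numbers only (not from the paper):
cost = run_cost(1_200_000, 150_000, price_in=0.27e-6, price_out=1.10e-6)
rate = success_rate(128, 230)
```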
3. TermiAgent Framework Architecture
TermiAgent is structured as a five-module multi-agent pipeline organized in a perception–action loop:
- Reasoner Module: High-level planner. Ingests global goals and abstract memory, emits phased sub-goals (e.g., port scans, version fingerprinting, exploit selection).
- Assistant Module: Ground-level executor. Converts phased goals and relevant exploit manuals into concrete CLI commands.
- Executor Module: Executes commands on a Kali VM, delivering stdout/stderr to memory.
- Memory Module: Implements a hierarchical Penetration Memory Tree (PMT), tracking the state as host→service→exploit. Applies Located Memory Activation (LMA) to prevent LLM forgetting.
- Arsenal Module: Hosts a curated exploit arsenal, aggregating Dockerized in-the-wild GitHub PoCs and Metasploit modules with programmatically generated manuals and container descriptors.
Feedback between modules proceeds via hierarchical memory updates and dynamically activated context, aligning agent attention with the current attack stage.
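The perception–action loop described above can be sketched as follows; module internals are reduced to stubs, and all class and method names are illustrative assumptions, not the framework's actual API:

```python
class Reasoner:
    def next_goal(self, abstract_memory):
        # High-level planning stub: emit the next phased sub-goal.
        return {"phase": "port_scan", "target": "10.0.0.5"}

class Assistant:
    def to_command(self, goal, manual=None):
        # Ground-level execution stub: phased goal -> concrete CLI command.
        return f"nmap -sV {goal['target']}"

class Executor:
    def run(self, command):
        # Would execute the command on a Kali VM; stubbed here.
        return {"stdout": f"(output of: {command})", "stderr": ""}

class Memory:
    def __init__(self):
        self.entries = []

    def record(self, goal, command, result):
        self.entries.append((goal, command, result))

    def abstract_view(self):
        # Compressed, decision-relevant context handed back to the Reasoner.
        return [goal["phase"] for goal, _, _ in self.entries]

def step(reasoner, assistant, executor, memory):
    """One iteration of the perception-action loop."""
    goal = reasoner.next_goal(memory.abstract_view())
    command = assistant.to_command(goal)
    result = executor.run(command)
    memory.record(goal, command, result)
    return result
```

Repeated calls to `step` accumulate results in memory, which the Reasoner consumes in compressed form on the next iteration.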
4. Located Memory Activation Mechanism
A principal challenge in LLM-driven pentesting is long-context forgetting, wherein the LLM discards earlier, crucial outputs (e.g., fingerprinting scans) as the context window fills.
TermiAgent models agent state as a Penetration Memory Tree $\mathcal{T}$, with each node corresponding to a point on the current (HOST, SERVICE, EXPLOIT) trajectory. Memory entries $m_n$ for nodes $n$ along the path are stored, compressed to one of three granularities (fine, coarse, abstract):

$$\tilde{m}_n = C_{\ell}^{r}(m_n),$$

where $C_{\ell}^{r}$ is the compression operator at level $\ell \in \{\text{fine}, \text{coarse}, \text{abstract}\}$ for agent role $r$ (Reasoner/Assistant). Only decision-relevant context along the active tree path is injected into LLM prompts, substantially mitigating context drop-off.
As each command executes, its outcome is indexed at the relevant leaf of the PMT, and subsequent reasoning activates only the needed ancestry for the next decision. Successes or failures dynamically expand the tree, supporting structured branch enumeration and recovery.
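A minimal sketch of the tree and its activation rule follows; the class names and the simple truncation-based compression are illustrative assumptions, not the paper's implementation:

```python
class PMTNode:
    """One node of a host -> service -> exploit Penetration Memory Tree."""
    def __init__(self, label):
        self.label = label
        self.entries = []      # raw command outputs indexed at this node
        self.children = {}

    def child(self, label):
        # Create-or-fetch, so successes/failures can expand the tree.
        return self.children.setdefault(label, PMTNode(label))

def compress(text: str, level: str) -> str:
    # Stand-in for the multi-granularity compression operator:
    # coarser levels keep progressively less text.
    budget = {"fine": len(text), "coarse": 80, "abstract": 20}[level]
    return text[:budget]

def activate(root: PMTNode, path: list, level: str = "coarse") -> list:
    """Located Memory Activation: collect only entries along the active path."""
    context, node = [], root
    for label in path:
        node = node.children[label]
        context += [compress(entry, level) for entry in node.entries]
    return context
```

Entries attached to hosts or services off the active path are never injected into the prompt, which is what keeps the context window focused as attack chains grow.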
5. Structured Exploit Arsenal Construction
Traditional, naive retrieval of public exploits frequently fails due to environment misconfiguration, missing dependencies, or poor documentation. TermiAgent’s Arsenal Module reconceptualizes exploit acquisition as a code-understanding and software packaging problem:
- Candidate repositories per CVE are mined from GitHub.
- Repos are parsed into Abstract Syntax Trees (ASTs) or manifests.
- Each repo is transformed into a Unified Exploit Descriptor (UED) capturing more than ten key dimensions: language, version, Docker base image, dependencies, main entry path, argument files, setup steps, invocation, etc.
- For each descriptor:
  - A Dockerfile environment is synthesized, e.g.:

```dockerfile
FROM python:3.9-slim
RUN apt-get update && apt-get install -y {system_deps}
COPY . /exploit
WORKDIR /exploit
RUN pip install {code_deps}
ENTRYPOINT ["python3", "main_script.py"]
```

  - A concise manual specifying parameters and a sample invocation is generated.
This structured approach yields high exploit reliability and reproducibility, and enables plug-and-play integration with the Assistant Module.
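The descriptor-to-container step can be illustrated with a small generator; the UED fields and template below follow the dimensions listed above but are a simplified assumption, not the framework's actual schema:

```python
def render_dockerfile(ued: dict) -> str:
    """Render a Dockerfile from a (simplified) Unified Exploit Descriptor."""
    lines = [f"FROM {ued['docker_base']}"]
    if ued.get("system_deps"):
        lines.append("RUN apt-get update && apt-get install -y "
                     + " ".join(ued["system_deps"]))
    lines += ["COPY . /exploit", "WORKDIR /exploit"]
    if ued.get("code_deps"):
        lines.append("RUN pip install " + " ".join(ued["code_deps"]))
    # Emit the exec-form ENTRYPOINT with double-quoted arguments.
    lines.append("ENTRYPOINT " + str(ued["invocation"]).replace("'", '"'))
    return "\n".join(lines)

# Hypothetical descriptor for a Python-based PoC:
ued = {
    "language": "python",
    "docker_base": "python:3.9-slim",
    "system_deps": ["gcc"],
    "code_deps": ["requests"],
    "main_path": "main_script.py",
    "invocation": ["python3", "main_script.py"],
}
print(render_dockerfile(ued))
```

Because every exploit ships with the same descriptor shape, the Assistant Module can invoke any container through a uniform interface instead of reverse-engineering each PoC's setup.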
6. Evaluation and Performance Characteristics
Key findings from large-scale evaluation:
| Agent | LLM Backend | CTF pass@5 (33 tasks) | Real-world shells / root shells (230 hosts) |
|---|---|---|---|
| TermiAgent | DeepSeek-V3 | 15 / 33 | 128 / 46 |
| VulnBot | DeepSeek-V3 | 9 / 33 | 15 / 9 |
| PentestGPT | DeepSeek-V3 | 4 / 33 | 0 / 0 |
- TermiAgent secures 1.7× more CTF flags and >8× more real shells than VulnBot when backends are matched.
- On real-world, multi-service hosts, VulnBot achieves <10% success, whereas TermiAgent exceeds 50% on all evaluated LLM backends.
- Time and cost efficiency: On shared successful CTFs, TermiAgent averages 19.6 min and $0.055 per exploit (VulnBot: 17.6 min, $0.058); in real-world settings, it uses only 18.7% of VulnBot's time and 7.4% of its financial cost per shell.
- Ablation analysis:
- Removing the Arsenal Module reduces success by ~29.7%.
- Eliminating Located Memory Activation reduces success by ~67% (most pronounced with maximal service ‘noise’).
- Omitting crucial UED dimensions (e.g., Docker base images) drastically lowers output success rate (e.g., 89.1% → 25.1%).
7. Practical Constraints, Deployment, and Broader Implications
TermiAgent operates effectively using both closed and open LLMs (GPT-5, DeepSeek-V3, Qwen3-30B/8B/4B/1.7B), with no need for LLM gradient fine-tuning—prompt engineering suffices. Deployment is viable on laptop-scale hardware by leveraging lighter backends and containerized exploit execution. The Arsenal Module comprises 1,378 Dockerized RCE CVE exploits plus 1,077 Metasploit modules.
Limitations include:
- Incomplete support for complex web applications (HTML/JS flows, authentication, file upload features).
- Failure on PoCs with fragmented or interactive codebases not amenable to automation.
- Ethical risks arise since TermiAgent bypasses LLM 'safety' features to perform real exploitation; evaluation is performed in isolated labs and the tool is not openly distributed without safeguards.
Integration into real penetration testing workflows is facilitated by its hardware-efficient design. It can pre-scan large subnets and triage host risks as a force-multiplier for human red teams.
TermiAgent, in conjunction with TermiBench, represents a paradigm shift for autonomous pentesting evaluation and methodology: mandating system-level goals, introducing environmentally realistic ambiguity, and solving intrinsic LLM context limitations by structured memory activation. Evaluation underscores substantial improvements in both attack success rates and operational efficiency compared to earlier agent frameworks. Further research opportunities include support for richer web application interaction, post-exploit lateral movement, and defensive counter-automation.