
xOffense: An AI-driven autonomous penetration testing framework with offensive knowledge-enhanced LLMs and multi agent systems (2509.13021v1)

Published 16 Sep 2025 in cs.CR and cs.AI

Abstract: This work introduces xOffense, an AI-driven, multi-agent penetration testing framework that shifts the process from labor-intensive, expert-driven manual efforts to fully automated, machine-executable workflows capable of scaling seamlessly with computational infrastructure. At its core, xOffense leverages a fine-tuned, mid-scale open-source LLM (Qwen3-32B) to drive reasoning and decision-making in penetration testing. The framework assigns specialized agents to reconnaissance, vulnerability scanning, and exploitation, with an orchestration layer ensuring seamless coordination across phases. Fine-tuning on Chain-of-Thought penetration testing data further enables the model to generate precise tool commands and perform consistent multi-step reasoning. We evaluate xOffense on two rigorous benchmarks: AutoPenBench and AI-Pentest-Benchmark. The results demonstrate that xOffense consistently outperforms contemporary methods, achieving a sub-task completion rate of 79.17%, decisively surpassing leading systems such as VulnBot and PentestGPT. These findings highlight the potential of domain-adapted mid-scale LLMs, when embedded within structured multi-agent orchestration, to deliver superior, cost-efficient, and reproducible solutions for autonomous penetration testing.

Summary

  • The paper introduces xOffense, a novel automated penetration testing framework that leverages LLMs and multi-agent systems for improved efficiency and accuracy.
  • The framework decomposes tasks into phases such as reconnaissance, scanning, and exploitation using a Task Coordination Graph that dynamically manages dependencies.
  • Evaluation shows a sub-task completion rate of 79.17% and competitive performance improvements with Retrieval-Augmented Generation, outperforming contemporary systems.

xOffense: An AI-driven Autonomous Penetration Testing Framework

This essay discusses "xOffense," a novel framework for automated penetration testing that integrates LLMs with multi-agent systems (MAS) to improve efficiency and accuracy in vulnerability assessments. The framework aims to address limitations in current automated pentesting approaches, namely high computational costs, scalability issues, and weak reasoning capabilities. The core of xOffense is a fine-tuned, mid-scale open-source LLM, Qwen3-32B, which leverages offensive knowledge through a multi-agent architecture that decomposes tasks into distinct phases: reconnaissance, scanning, and exploitation.

Framework Architecture

The overall architecture of xOffense is designed to mimic the collaborative dynamics of human security teams by assigning specialized roles to each agent. Agents in xOffense perform targeted functions across the pentest lifecycle:

  • Reconnaissance Agent: Gathers intelligence on network configurations, open ports, and services using tools like Nmap and Dirb.
  • Vulnerability Analysis Agent: Identifies vulnerabilities using tools like Nikto and WPScan.
  • Exploitation Agent: Executes exploitation scripts and tests payloads using Metasploit.
  • Reporting Agent: Summarizes results and attack paths for analysis.
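The role specialization above can be sketched as a simple dispatch layer. This is an illustrative sketch only, not the paper's implementation: the `Agent` class, `AGENTS` registry, and `dispatch` function are hypothetical names, and a real agent would prompt the LLM and invoke the listed tools rather than return a string.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of xOffense-style role specialization: each agent
# wraps a set of tools and handles one phase of the pentest lifecycle.
@dataclass
class Agent:
    name: str
    phase: str
    tools: list[str] = field(default_factory=list)

    def handle(self, task: str) -> str:
        # A real agent would reason with the LLM and run its tools;
        # here we only record where the task was routed.
        return f"{self.name} handling '{task}' with {self.tools}"

AGENTS = {
    "recon": Agent("ReconnaissanceAgent", "recon", ["nmap", "dirb"]),
    "scan": Agent("VulnerabilityAnalysisAgent", "scan", ["nikto", "wpscan"]),
    "exploit": Agent("ExploitationAgent", "exploit", ["metasploit"]),
    "report": Agent("ReportingAgent", "report", []),
}

def dispatch(phase: str, task: str) -> str:
    # Route each task to the agent responsible for its phase.
    return AGENTS[phase].handle(task)

print(dispatch("recon", "enumerate open ports on 10.0.0.5"))
```

Keeping tool access per-agent, as the paper's role assignment suggests, means each phase only sees the capabilities it needs.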

The architecture includes several key components: Task Orchestrator, Knowledge Repository, Command Synthesizer, Action Executor, and Information Aggregator. These components work together to ensure efficient task progression, precise execution, and robust information management.

Figure 1: The Overall Architecture of the xOffense Framework.

Task Coordination and Execution

xOffense employs a Task Coordination Graph (TCG) to manage dependencies and execute tasks systematically. The TCG is structured as a directed acyclic graph in which nodes represent tasks and edges denote dependencies. Algorithms for task execution include mechanisms for feedback and dynamic plan updates.
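Dependency-ordered execution over such a graph can be sketched with Kahn's topological sort. This is a minimal sketch under the stated DAG structure, not the paper's exact algorithm; the `run_tcg` function and the example task names are assumptions for illustration, and the paper's version additionally supports feedback and dynamic plan updates.

```python
from collections import defaultdict, deque

# Sketch of a Task Coordination Graph executed in dependency order
# via Kahn's topological sort (illustrative, not the paper's code).
def run_tcg(tasks, deps, execute):
    """tasks: task ids; deps: dict mapping task -> prerequisite tasks."""
    indegree = {t: 0 for t in tasks}
    children = defaultdict(list)
    for task, prereqs in deps.items():
        for p in prereqs:
            children[p].append(task)
            indegree[task] += 1
    ready = deque(t for t in tasks if indegree[t] == 0)
    order = []
    while ready:
        task = ready.popleft()
        execute(task)
        order.append(task)
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(indegree):
        raise RuntimeError("cycle detected: TCG must be acyclic")
    return order

order = run_tcg(
    tasks=["recon", "scan_web", "scan_ssh", "exploit"],
    deps={"scan_web": ["recon"], "scan_ssh": ["recon"],
          "exploit": ["scan_web", "scan_ssh"]},
    execute=lambda t: None,
)
print(order)  # recon first, exploit last
```

The acyclicity check falls out for free: if the topological order covers fewer tasks than the graph contains, a dependency cycle exists.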

The Check and Reflection Mechanism is integral to xOffense. This feature allows agents to self-evaluate and optimize plans based on actual execution outcomes. This continuous feedback loop helps the system adapt and correct errors autonomously.
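A feedback loop of this shape can be sketched as follows. All names here (`run_with_reflection`, `check`, `revise`) are illustrative assumptions, not the paper's API; the sketch only shows the control flow of executing, self-evaluating, and folding feedback back into the plan.

```python
# Sketch of a check-and-reflection loop: after each attempt the agent
# evaluates the outcome and revises its plan (names are illustrative).
def run_with_reflection(plan, execute, check, revise, max_rounds=3):
    for _ in range(max_rounds):
        result = execute(plan)
        ok, feedback = check(result)
        if ok:
            return result
        # Self-evaluation failed: fold the feedback back into the plan.
        plan = revise(plan, feedback)
    raise RuntimeError("plan did not converge within max_rounds")

# Toy usage: the check demands service detection, the revision adds it.
def execute(plan):
    return plan  # stand-in for actually running the command

def check(result):
    return ("-sV" in result, "add service version detection")

def revise(plan, feedback):
    return plan + " -sV"

print(run_with_reflection("nmap 10.0.0.5", execute, check, revise))
# → nmap 10.0.0.5 -sV
```

Bounding the loop with `max_rounds` is one simple way to keep a self-correcting agent from retrying indefinitely.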

Generative Behavior and Execution

The Command Synthesizer component translates task directives into tool-specific instructions suitable for the target environment; this precise command generation is crucial for accurate task execution. The Action Executor runs these commands in a simulated penetration testing environment, while the underlying model is adapted with lightweight LoRA fine-tuning for improved context handling and memory efficiency.
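The directive-to-command translation can be sketched as template filling. The templates and the `synthesize` function below are hypothetical illustrations (the paper's synthesizer is LLM-driven, not a static lookup), but they show the contract: a structured task directive in, a concrete tool invocation out.

```python
# Illustrative command templates (not from the paper): the synthesizer
# fills a tool-specific template with parameters from the task directive.
TEMPLATES = {
    "port_scan": "nmap -sV -p{ports} {target}",
    "dir_enum": "dirb http://{target}/",
    "wp_scan": "wpscan --url http://{target}/ --enumerate u",
}

def synthesize(task_type: str, **params) -> str:
    # Reject unknown task types or missing parameters explicitly rather
    # than emitting a malformed command.
    try:
        return TEMPLATES[task_type].format(**params)
    except KeyError as exc:
        raise ValueError(f"unknown task type or missing parameter: {exc}")

print(synthesize("port_scan", target="10.0.0.5", ports="1-1024"))
# → nmap -sV -p1-1024 10.0.0.5
```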

Evaluation and Results

xOffense was tested using two benchmarks: AutoPenBench and AI-Pentest-Benchmark. These benchmarks evaluate penetration testing performance across different categories and complexity levels. Results showed xOffense achieving a sub-task completion rate of 79.17%, outperforming contemporary systems like VulnBot and PentestGPT.

Figure 2: Task Coordination Graph (TCG) illustrating task dependencies and execution status. Completed tasks are shown in dark, the current task in orange, and pending tasks in light blue.

In real-world exploitation scenarios without Retrieval-Augmented Generation (RAG), xOffense achieved competitive performance, often surpassing other models in sub-task completion across various machines. Incorporating RAG further improved results, demonstrating the framework's ability to leverage contextually relevant external knowledge effectively.

Figure 3: Comparison of subtask completion rates across six real-world vulnerable machines in a No-RAG setting.

Figure 4: Comparison of subtask completion rates across six real-world vulnerable machines with RAG setting.
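The RAG augmentation described above can be sketched with a toy retriever. This is an assumption-laden illustration: the keyword-overlap scoring, the corpus snippets, and the `build_prompt` function are stand-ins, and a real system would use embedding similarity over an offensive-knowledge corpus.

```python
# Minimal RAG sketch: retrieve the most relevant knowledge snippets for
# a task and prepend them to the model prompt. Toy keyword-overlap
# scoring stands in for embedding-based retrieval.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    query_words = set(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(task: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(task, corpus))
    return f"Context:\n{context}\n\nTask: {task}"

corpus = [
    "wpscan enumerates WordPress users and plugins",
    "nmap -sV performs service version detection",
    "metasploit exploit modules require a target and payload",
]
print(build_prompt("detect service versions with nmap", corpus))
```

The point of the sketch is the data flow, not the scoring: grounding the prompt in retrieved domain knowledge is what lifts the completion rates in the RAG setting.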

Conclusion

xOffense represents a significant advancement in automated penetration testing, combining sophisticated AI-driven reasoning with multi-agent orchestration. Its performance demonstrates the potential of domain-specific, mid-scale LLMs when embedded within structured frameworks. Future improvements will focus on optimizing command generation, enhancing long-running process handling, and extending capabilities to support advanced interactions, which may further enhance its applicability in complex cybersecurity scenarios. The findings underscore the importance of specialized model training and intelligent task orchestration in achieving efficient and effective autonomous security assessments.
