
PentestGPT: An LLM-empowered Automatic Penetration Testing Tool (2308.06782v2)

Published 13 Aug 2023 in cs.SE and cs.CR

Abstract: Penetration testing, a crucial industrial practice for ensuring system security, has traditionally resisted automation due to the extensive expertise required by human professionals. LLMs have shown significant advancements in various domains, and their emergent abilities suggest their potential to revolutionize industries. In this research, we evaluate the performance of LLMs on real-world penetration testing tasks using a robust benchmark created from test machines with platforms. Our findings reveal that while LLMs demonstrate proficiency in specific sub-tasks within the penetration testing process, such as using testing tools, interpreting outputs, and proposing subsequent actions, they also encounter difficulties maintaining an integrated understanding of the overall testing scenario. In response to these insights, we introduce PentestGPT, an LLM-empowered automatic penetration testing tool that leverages the abundant domain knowledge inherent in LLMs. PentestGPT is meticulously designed with three self-interacting modules, each addressing individual sub-tasks of penetration testing, to mitigate the challenges related to context loss. Our evaluation shows that PentestGPT not only outperforms LLMs with a task-completion increase of 228.6% compared to the GPT-3.5 model among the benchmark targets but also proves effective in tackling real-world penetration testing challenges. Having been open-sourced on GitHub, PentestGPT has garnered over 4,700 stars and fostered active community engagement, attesting to its value and impact in both the academic and industrial spheres.

Authors (10)
  1. Gelei Deng (35 papers)
  2. Yi Liu (543 papers)
  3. Víctor Mayoral-Vilches (14 papers)
  4. Peng Liu (372 papers)
  5. Yuekang Li (34 papers)
  6. Yuan Xu (122 papers)
  7. Tianwei Zhang (199 papers)
  8. Yang Liu (2253 papers)
  9. Martin Pinzger (15 papers)
  10. Stefan Rass (35 papers)
Citations (50)

Summary

The paper introduces PentestGPT, an automated penetration testing tool leveraging LLMs to enhance system security. It addresses the limitations of traditional manual penetration testing, which requires extensive expertise and is labor-intensive. The authors observe that while LLMs demonstrate proficiency in sub-tasks such as using testing tools and interpreting outputs, they struggle with maintaining a comprehensive understanding of the overall testing scenario.

The authors make the following contributions:

  • A comprehensive penetration testing benchmark
  • An empirical evaluation of LLMs for penetration testing tasks
  • An innovative LLM-powered penetration testing system

The paper details the development of PentestGPT, designed with three self-interacting modules—Reasoning, Generation, and Parsing—to mitigate context loss challenges. The system's effectiveness is evaluated through benchmark tests and real-world penetration testing scenarios.

Background and Related Work

The paper reviews penetration testing methodologies, highlighting their importance in identifying and mitigating security vulnerabilities. It notes the shift towards offensive security strategies, where security teams attempt to breach defenses to uncover vulnerabilities, offering advantages over traditional defensive mechanisms. The discussion covers the limitations of existing automated penetration testing pipelines due to the need for comprehensive knowledge and strategic planning. It also examines recent advancements in LLMs and their potential applications in cybersecurity, including code analysis and vulnerability repair.

Penetration Testing Benchmark

The paper identifies limitations in existing penetration testing benchmarks, such as restricted scope and failure to assess progressive accomplishments. To address these limitations, the authors construct a new benchmark with diverse tasks, varying difficulty levels, and progress tracking. The benchmark includes test machines from HackTheBox and VulnHub, encompassing 13 targets with 182 sub-tasks covering the OWASP (Open Worldwide Application Security Project) top 10 vulnerabilities.
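
As an illustration of the benchmark's structure, each target can be modeled with a difficulty rating and progressively tracked sub-tasks. This is a minimal sketch; the field names are assumptions for illustration, not the authors' actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SubTask:
    """One progressive step toward compromising a target (e.g. port scanning)."""
    description: str
    completed: bool = False

@dataclass
class BenchmarkTarget:
    """A single test machine with its ordered sub-tasks and difficulty rating."""
    name: str                      # e.g. a HackTheBox or VulnHub machine
    difficulty: str                # "easy" | "medium" | "hard"
    sub_tasks: list[SubTask] = field(default_factory=list)

    def progress(self) -> float:
        """Fraction of sub-tasks completed, enabling progress tracking."""
        done = sum(t.completed for t in self.sub_tasks)
        return done / len(self.sub_tasks) if self.sub_tasks else 0.0
```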

Exploratory Study

The authors conduct an exploratory study to evaluate the capabilities of LLMs on real-world penetration testing tasks, assessing GPT-3.5, GPT-4, and Bard against the developed benchmark. The testing strategy is an interactive loop: the LLM is given a penetration testing goal and asked for the appropriate operation to execute; the operation is carried out in the testing environment, and the outputs are fed back to the LLM for next-step reasoning.
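
A minimal sketch of this loop, with `query_llm` and `run_in_env` as hypothetical stand-ins for the chat-model API and the (human-mediated) testing environment:

```python
from typing import Callable

def interactive_pentest_loop(
    goal: str,
    query_llm: Callable[[str], str],   # hypothetical wrapper around the LLM API
    run_in_env: Callable[[str], str],  # hypothetical executor in the test environment
    max_steps: int = 50,
) -> list[str]:
    """Repeatedly ask the LLM for the next operation, execute it, and feed
    the output back for next-step reasoning, as in the exploratory study."""
    history = [f"Penetration testing goal: {goal}"]
    for _ in range(max_steps):
        prompt = "\n".join(history) + "\nWhat operation should be executed next?"
        operation = query_llm(prompt)
        if "goal reached" in operation.lower():  # naive stop condition for the sketch
            break
        output = run_in_env(operation)
        history.append(f"Operation: {operation}\nOutput: {output}")
    return history
```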

The study reveals that LLMs are proficient in specific sub-tasks, such as utilizing testing tools, interpreting outputs, and suggesting subsequent actions. However, they struggle to maintain a coherent grasp of the overall testing scenario, often losing sight of earlier discoveries. LLMs are also found to overemphasize recent tasks, neglecting potential attack surfaces exposed in prior tests.

Methodology: PentestGPT Design

To address the limitations identified in the exploratory study, the authors present PentestGPT, an interactive system designed to enhance the application of LLMs in penetration testing. PentestGPT is inspired by collaborative dynamics in real-world human penetration testing teams and is tailored to manage large and intricate projects.

PentestGPT features a tripartite architecture comprising Reasoning, Generation, and Parsing Modules. The Reasoning Module maintains a high-level overview of the penetration testing status using a novel representation called the Pentesting Task Tree (PTT), which encodes the testing process's ongoing status and steers subsequent actions. The PTT is based on the cybersecurity attack tree. The Generation Module constructs detailed procedures for specific sub-tasks, translating them into exact testing operations. The Parsing Module condenses and emphasizes text data encountered during penetration testing, extracting essential information.
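
To make the division of labor concrete, here is a minimal sketch of how one testing step might flow through the three modules. The class and method names are illustrative assumptions, not PentestGPT's actual API:

```python
class PentestOrchestrator:
    """Illustrative flow through the tripartite architecture: Parsing condenses
    raw tool output, Reasoning updates the task tree and selects the next
    sub-task, and Generation expands it into concrete testing operations."""

    def __init__(self, reasoning, generation, parsing):
        self.reasoning = reasoning    # maintains the Pentesting Task Tree (PTT)
        self.generation = generation  # translates sub-tasks into exact operations
        self.parsing = parsing        # condenses verbose tool output

    def step(self, raw_tool_output: str) -> list[str]:
        summary = self.parsing.condense(raw_tool_output)      # extract the essentials
        sub_task = self.reasoning.update_and_select(summary)  # revise the PTT, pick next task
        return self.generation.expand(sub_task)               # detailed operations to run
```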

The PTT is defined mathematically as an attributed tree $G=(V,E,\lambda,\mu)$, where:

  • $V$ is a set of nodes (or vertices)
  • $E$ is a set of directed edges
  • $\lambda: E \to \Sigma$ is an edge labeling function that assigns a label from the alphabet $\Sigma$ to each edge
  • $\mu: (V \cup E) \times K \to S$ is a function that assigns key-value property pairs (keys from $K$, values from $S$) to the nodes and edges.

The PTT is also defined as a pair $T = (N, A)$ (see the sketch after this list), where:

  • $N$ is a set of nodes organized in a tree structure, with each node having a unique identifier and a root node with no parent.
  • $A$ is a function that assigns to each node $n \in N$ a set of attributes $A(n)$, where each attribute is a pair $(a, v)$ with $a$ the attribute name and $v$ the attribute value.
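
This pair definition maps directly onto a recursive data structure. The following rendering is a minimal sketch; the node identifiers and attribute names are illustrative, not the authors' schema:

```python
from dataclasses import dataclass, field

@dataclass
class PTTNode:
    """A node of the Pentesting Task Tree: a unique identifier plus the
    attribute set A(n) as (name, value) pairs, and child sub-tasks."""
    identifier: str
    attributes: dict[str, str] = field(default_factory=dict)  # A(n)
    children: list["PTTNode"] = field(default_factory=list)

# Hypothetical example: a root task with one completed reconnaissance sub-task.
root = PTTNode("1. Test 10.0.0.5", {"status": "in progress"})
root.children.append(
    PTTNode("1.1 Port scanning", {"status": "completed",
                                  "finding": "ports 22 and 80 open"})
)
```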

Evaluation and Results

The authors evaluate PentestGPT using the developed benchmark, demonstrating significant performance gains. PentestGPT achieves a 228.6% increase in sub-task completion compared to the direct usage of GPT-3.5 and a 58.6% increase compared to GPT-4. Additionally, PentestGPT is applied to the HackTheBox active penetration testing machines challenge, completing 4 out of 10 selected targets at a total OpenAI Application Programming Interface (API) cost of $131.5, ranking among the top 1% of players in a community of over 670,000 members.

The evaluation underscores PentestGPT's practical value in enhancing the efficiency and precision of penetration testing tasks. The solution is made publicly available on GitHub, receiving widespread acclaim and fostering active community engagement.

Evaluation Research Questions

The authors set out to answer the following research questions in their evaluation (RQ1 and RQ2, which concern the raw capabilities and limitations of LLMs, are addressed by the earlier exploratory study):

  • RQ3 (Performance): How does the performance of PentestGPT compare with that of native LLM models and human experts?
  • RQ4 (Strategy): Does PentestGPT employ different problem-solving strategies compared to those utilized by LLMs or human experts?
  • RQ5 (Ablation): How does each module within PentestGPT contribute to the overall penetration testing performance?
  • RQ6 (Practicality): Is PentestGPT practical and effective in real-world penetration testing tasks?

Results

The results of the evaluation are:

  • PentestGPT-GPT-4 surpasses the other three solutions, successfully solving 6 out of 7 easy difficulty targets and 2 out of 4 medium difficulty targets.
  • Both PentestGPT-GPT-3.5 and PentestGPT-GPT-4 perform better than the standard utilization of LLMs.
  • PentestGPT decomposes the penetration testing task in a manner akin to human experts.
  • PentestGPT can pinpoint potential sub-tasks likely to lead to successful outcomes.
  • PentestGPT still attempts brute-force attacks before vulnerability scanning, an ordering that human experts typically avoid.
  • PentestGPT struggles to interpret images and cannot grasp certain social engineering tricks and subtle cues.
  • PentestGPT demonstrates superiority over the three ablation baselines regarding overall target and sub-task completion.
  • In the HackTheBox active machine challenges, PentestGPT completes three easy and five medium targets. The total expenditure for this exercise amounts to $131.5, averaging $21.92 per target.

Conclusion

The paper concludes by highlighting the potential of LLMs in penetration testing and introducing PentestGPT as a specialized tool that simulates human-like behavior. The tool's design, inspired by real-world penetration testing teams, enables a divide-and-conquer approach to problem-solving. The paper's contributions serve as a valuable resource and offer a promising direction for continued research and development in cybersecurity.
