Can LLMs Hack Enterprise Networks? Autonomous Assumed Breach Penetration-Testing Active Directory Networks (2502.04227v3)

Published 6 Feb 2025 in cs.CR

Abstract: Enterprise penetration-testing is often limited by high operational costs and the scarcity of human expertise. This paper investigates the feasibility and effectiveness of using LLM-driven autonomous systems to address these challenges in real-world Active Directory (AD) enterprise networks. We introduce a novel prototype designed to employ LLMs to autonomously perform Assumed Breach penetration-testing against enterprise networks. Our system represents the first demonstration of a fully autonomous, LLM-driven framework capable of compromising accounts within a real-life Microsoft Active Directory testbed, GOAD. We perform our empirical evaluation using five LLMs, comparing reasoning to non-reasoning models as well as including open-weight models. Through quantitative and qualitative analysis, incorporating insights from cybersecurity experts, we demonstrate that autonomous LLMs can effectively conduct Assumed Breach simulations. Key findings highlight their ability to dynamically adapt attack strategies, perform inter-context attacks (e.g., web-app audits, social engineering, and unstructured data analysis for credentials), and generate scenario-specific attack parameters like realistic password candidates. The prototype exhibits robust self-correction mechanisms, installing missing tools and rectifying invalid command generations. We find that the associated costs are competitive with, and often significantly lower than, those incurred by professional human pen-testers, suggesting a path toward democratizing access to essential security testing for organizations with budgetary constraints. However, our research also illuminates existing limitations, including instances of LLM "going down rabbit holes", challenges in comprehensive information transfer between planning and execution modules, and critical safety concerns that necessitate human oversight.

Summary

  • The paper demonstrates that LLM-powered systems can autonomously perform penetration testing on Active Directory networks with competitive cost efficiency.
  • The methodology deploys multiple LLM configurations within the GOAD testbed to simulate realistic enterprise network interactions and dynamic attack strategies.
  • Results reveal that reasoning LLMs consistently outperform non-reasoning models in executing effective network penetration, highlighting promising automation benefits.

Can LLMs Hack Enterprise Networks? Autonomous Assumed Breach Penetration-Testing Active Directory Networks

This paper investigates the feasibility and effectiveness of using LLM-driven systems to autonomously conduct penetration-testing on Active Directory enterprise networks, specifically focusing on Assumed Breach scenarios. In this context, the authors introduce "cochise," an LLM-based prototype designed to perform autonomous penetration-testing tasks within a realistic Microsoft Active Directory testbed. This investigation seeks to mitigate challenges associated with traditional penetration-testing, namely high operational costs and reliance on scarce human expertise.

Autonomous Penetration-Testing Framework

The framework employs several LLMs to execute penetration-testing tasks, and the authors benchmark the effectiveness of reasoning versus non-reasoning models within this domain. The prototype is evaluated in the "A Game of Active Directory" (GOAD) testbed, which simulates a genuine enterprise network environment complete with intricate interactions and potential operational risks. The testbed emulates regular network activity by multiple users, providing a realistic backdrop against which the prototype's capabilities are assessed (Figure 1).

Figure 1: Simplified system diagram of the "A Game of Active Directory" (GOAD) testbed, highlighting attack paths and vulnerabilities observed during prototype runs.
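
To make the architecture concrete, the sketch below illustrates a planner-executor agent loop of the kind the paper describes: one LLM call proposes the next high-level step, a second call turns it into a shell command, and the command's output is fed back so later planning calls can self-correct. This is a minimal, hypothetical reconstruction, not the authors' implementation; the `chat()` helper, the prompts, and the `run_command` wrapper are assumptions for illustration.

```python
import subprocess

def chat(system: str, user: str) -> str:
    """Hypothetical wrapper around any chat-completion API
    (e.g., an OpenAI-compatible endpoint). Returns the model's text."""
    raise NotImplementedError  # plug in your LLM client here

def run_command(cmd: str, timeout: int = 120) -> str:
    """Execute a shell command on the attacker machine and capture output."""
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return (result.stdout + result.stderr)[-4000:]  # truncate for context limits

def assumed_breach_loop(scenario: str, max_steps: int = 50) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        # Planning module: decide the next high-level action.
        plan = chat(
            system="You are planning an Assumed Breach pentest of an AD network.",
            user=f"Scenario: {scenario}\nFindings so far:\n"
                 + "\n".join(history) + "\nPropose the single next step.")
        # Execution module: translate the plan into one concrete command.
        cmd = chat(
            system="Emit exactly one Linux shell command, no commentary.",
            user=plan)
        output = run_command(cmd)
        # Feed the result back; invalid commands or missing tools surface
        # here, giving the next planning call a chance to self-correct
        # (e.g., by proposing an install step for a missing tool).
        history.append(f"PLAN: {plan}\nCMD: {cmd}\nOUTPUT: {output}")
```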

Evaluation and Results

The study compares five LLM configurations, assessing their ability to adapt attack strategies dynamically, execute inter-context attacks, and generate scenario-specific attack parameters. Results indicate that reasoning LLMs demonstrate higher consistency in attack execution and greater efficacy in penetrating networks than their non-reasoning counterparts. Notably, these configurations achieve competitive cost efficiency compared to professional human penetration testers, paving the way for democratized access to essential security testing (Figure 2).

Figure 2: Attack Vectors pursued by the different LLMs. For each attack vector, we detail the percentage of runs in which the respective attack vector was included.
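
As a concrete example of how a metric like Figure 2's can be derived, the following sketch aggregates per-model attack-vector inclusion rates from run records. The record format, model names, and vector names are hypothetical; the paper does not publish its analysis code.

```python
from collections import defaultdict

# Hypothetical run records: (model, set of attack vectors seen in that run).
runs = [
    ("reasoning-model",  {"password_spray", "kerberoasting"}),
    ("reasoning-model",  {"kerberoasting"}),
    ("open-weight-model", {"password_spray"}),
]

totals: dict[str, int] = defaultdict(int)
hits: dict[tuple[str, str], int] = defaultdict(int)
for model, vectors in runs:
    totals[model] += 1
    for v in vectors:
        hits[(model, v)] += 1

# Percentage of runs per model in which each vector was included.
for (model, vector), n in sorted(hits.items()):
    print(f"{model:18s} {vector:16s} {100 * n / totals[model]:5.1f}% of runs")
```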

Key Challenges and Implications

The research delineates several challenges intrinsic to LLM-driven security testing: LLMs "going down rabbit holes," where a model hyper-focuses on one attack path while neglecting others, and incomplete information transfer between the planning and execution modules, where findings gathered during execution fail to reach subsequent planning steps. These point to concrete areas for improvement in LLM-based security automation; a sketch of one possible mitigation follows.
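
One plausible mitigation, shown below as a hedged sketch rather than the authors' design, is to pass a typed findings record between the planner and the executor instead of free-form transcripts, so that credentials and host knowledge cannot silently drop out of context. All field names here are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Findings:
    """Structured state handed from the execution module back to the
    planner, so discovered facts survive across planning iterations."""
    credentials: dict[str, str] = field(default_factory=dict)  # user -> secret
    hosts: set[str] = field(default_factory=set)
    notes: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the state compactly for the next planning prompt."""
        return (f"Known hosts: {sorted(self.hosts)}\n"
                f"Known credentials: {sorted(self.credentials)}\n"
                f"Notes:\n- " + "\n- ".join(self.notes[-10:]))

# Usage: the executor updates the record after each command ...
state = Findings()
state.hosts.add("192.168.56.11")
state.credentials["GOAD\\samwell.tarly"] = "<redacted>"
state.notes.append("SMB signing disabled on DC01")
# ... and the planner receives state.to_prompt() instead of raw transcripts.
print(state.to_prompt())
```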

The authors also emphasize ethical considerations inherent in security tool applications, advocating for transparency and open-source dissemination to foster collective cybersecurity improvements.

Future Research Opportunities

Future research should focus on enhancing information-transfer mechanisms within LLM architectures and mitigating the hyper-focus phenomenon observed during test runs. In addition, subsequent implementations could incorporate stronger safety guardrails against LLM misuse, such as the command filter sketched after Figure 3.

Figure 3: Our sampling runs were time-capped at two hours, making the LLMs' efficient use of time highly important. The graphs show where the different LLMs spent their time and how long individual LLM queries took.
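
To illustrate what such a guardrail could look like, the sketch below wraps command execution in a deny-list check and an overall wall-clock budget, escalating potentially destructive commands to a human operator. This is an illustrative assumption, not the prototype's actual safety layer; the patterns and the two-hour budget only loosely mirror the paper's setup.

```python
import re
import time

# Hypothetical deny-list: commands matching these patterns need human sign-off.
DANGEROUS = [r"\brm\s+-rf\b", r"\bshutdown\b", r"\bmkfs\b", r"\bdd\s+if="]

class Guardrail:
    def __init__(self, budget_seconds: int = 2 * 60 * 60):
        # Overall wall-clock budget for the run (e.g., a two-hour cap).
        self.deadline = time.monotonic() + budget_seconds

    def approve(self, cmd: str) -> bool:
        if time.monotonic() > self.deadline:
            return False  # time budget exhausted: stop the run
        if any(re.search(p, cmd) for p in DANGEROUS):
            # Escalate to a human instead of executing autonomously.
            answer = input(f"Allow potentially destructive command?\n  {cmd}\n[y/N] ")
            return answer.strip().lower() == "y"
        return True

guard = Guardrail()
for cmd in ["nmap -sV 192.168.56.0/24", "rm -rf /tmp/loot"]:
    print(cmd, "->", "run" if guard.approve(cmd) else "blocked")
```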

Conclusion

This work establishes a foundational framework for LLM-driven cybersecurity automation in Active Directory networks, demonstrating both potential and current limitations. The research underscores the promise held by domain-agnostic LLM-driven architectures in improving autonomous system capabilities across broader sectors of software engineering. The findings contribute to a burgeoning body of knowledge that aims to balance technological innovation with ethical responsibility.
