
Autonomous Penetration Testing using Reinforcement Learning (1905.05965v1)

Published 15 May 2019 in cs.CR, cs.AI, and cs.LG

Abstract: Penetration testing (pentesting) involves performing a controlled attack on a computer system in order to assess its security. Although an effective method for testing security, pentesting requires highly skilled practitioners and currently there is a growing shortage of skilled cyber security professionals. One avenue for alleviating this problem is to automate the pentesting process using artificial intelligence techniques. Current approaches to automated pentesting have relied on model-based planning; however, the cyber security landscape is rapidly changing, making it a challenge to maintain up-to-date models of exploits. This project investigated the application of model-free Reinforcement Learning (RL) to automated pentesting. Model-free RL has the key advantage over model-based planning of not requiring a model of the environment, instead learning the best policy through interaction with the environment. We first designed and built a fast, low compute simulator for training and testing autonomous pentesting agents. We did this by framing pentesting as a Markov Decision Process with the known configuration of the network as states, the available scans and exploits as actions, and the reward determined by the value of machines on the network. We then used this simulator to investigate the application of model-free RL to pentesting. We tested the standard Q-learning algorithm using both tabular and neural network based implementations. We found that within the simulated environment both tabular and neural network implementations were able to find optimal attack paths for a range of different network topologies and sizes without having a model of action behaviour. However, the implemented algorithms were only practical for smaller networks and numbers of actions. Further work is needed in developing scalable RL algorithms and testing these algorithms in larger and higher fidelity environments.

Citations (83)

Summary

  • The paper introduces a fast, open-source Network Attack Simulator that models realistic network topologies and MDP-based penetration testing.
  • The research applies model-free reinforcement learning methods, including tabular Q-learning and deep Q-learning, to determine optimal attack strategies.
  • Experimental results show that RL agents balance action costs against rewards and achieve high solved proportions on multi-site topologies, although performance degrades as network size and action space grow.

The paper presents a detailed investigation into the application of reinforcement learning (RL) to automating penetration testing. The work is structured in two major parts: the development of a fast, open-source Network Attack Simulator (NAS) and the application of various RL algorithms to determine optimal attack policies within simulated network environments. The following summary outlines the key technical contributions and findings.


1. Network Attack Simulator (NAS)

The NAS is designed as a light-weight yet versatile simulation environment for network penetration testing. Its architecture is composed of a network model and an environment modeled as a Markov Decision Process (MDP). Key aspects include:

  • Network Model:
    • Topology and Subnet Representation: The network is structured as a graph where subnets serve as vertices and inter-subnet connections (subject to firewall constraints) act as edges.
    • Machine and Service Configurations: Each machine is represented by a tuple (subnet, machine ID) with associated properties such as compromise status, reachability, and a set of services. Each service is defined by its identifier, exploit success probability, and cost.
    • Firewall Rules: Firewalls are modeled by specifying permitted service traffic between subnets. This abstraction allows simulation of realistic constraints such as port-based access restrictions.
  • MDP Formulation (see the sketch after this list):
    • State Space: The state is defined as the aggregate of information over machines (e.g., compromised flag, reachability, and service status—present, absent, or unknown). The state space grows exponentially with the number of machines and exploitable services.
    • Action Space: Actions consist of scans (deterministic) and exploits (non-deterministic, succeeding with a per-service probability) that can be directed at any machine on the network; each action has an associated cost.
    • Reward Function: Rewards are computed as the value of newly compromised machines minus the cost of executed actions, thus encouraging not only the breaching of sensitive nodes but also cost efficiency.
    • Transition Dynamics: The simulator embeds uncertainty in the outcomes of exploit actions while incorporating connectivity constraints and firewall policies.
  • Performance Benchmarks:
    • The simulator demonstrates significant speed improvements over virtual machine–based testing. For instance, load times scale linearly with the number of machines and services—with a worst-case scenario of approximately 3.5 seconds for 1000 machines and 1000 services—while action execution rates reach up to 17,000 actions per second in favorable settings.
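
To make the MDP structure above concrete, the following is a minimal sketch of an NAS-style environment. The names (Machine, Exploit, NetworkEnv) are hypothetical rather than the actual simulator API, and firewall and connectivity constraints are omitted for brevity; exploit outcomes are sampled from a per-service success probability.

```python
import random
from dataclasses import dataclass

# Illustrative sketch of the scan/exploit MDP described above; not the real NAS API.

@dataclass
class Machine:
    address: tuple                 # (subnet, machine ID)
    value: float                   # reward for compromising this machine
    services: dict                 # service name -> running (True/False)
    compromised: bool = False
    reachable: bool = False

@dataclass
class Exploit:
    service: str
    prob: float                    # success probability when the service is running
    cost: float

class NetworkEnv:
    """Minimal scan/exploit environment over a set of machines."""

    def __init__(self, machines, scan_cost=1.0):
        self.machines = {m.address: m for m in machines}
        self.scan_cost = scan_cost

    def step(self, action):
        """action = ('scan', address) or ('exploit', address, Exploit)."""
        kind, address = action[0], action[1]
        machine = self.machines[address]
        if kind == 'scan':
            # Scans are deterministic: they reveal which services are running.
            return dict(machine.services), -self.scan_cost
        exploit = action[2]
        reward = -exploit.cost
        # Exploits are non-deterministic: they succeed with probability `prob`
        # when the target is reachable, runs the service, and is not yet owned.
        if (machine.reachable and machine.services.get(exploit.service)
                and not machine.compromised and random.random() < exploit.prob):
            machine.compromised = True
            reward += machine.value    # value of the newly compromised machine
        return {'compromised': machine.compromised}, reward
```

The reward of a step is thus exactly the value of any newly compromised machine minus the cost of the executed action, matching the reward function described above.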

2. Reinforcement Learning for Automated Pentesting

The paper models automated pentesting as an MDP with unknown transition dynamics, thereby motivating model-free RL approaches. Three distinct RL methods are explored:

  • Tabular Q-Learning with Epsilon-Greedy Action Selection:
    • The algorithm uses a hashmap to store state-action values and updates these according to the standard Q-learning update rule.
    • An epsilon (ϵ) decay schedule is used to balance exploration and exploitation (see the sketch after this list).
  • Tabular Q-Learning with Upper Confidence Bound (UCB) Action Selection:
    • In addition to Q-value estimation, visit counts for each state-action pair are maintained and incorporated into the selection process to favor less frequently explored actions.
  • Deep Q-Learning (DQL):
    • A single hidden-layer neural network is employed to approximate the Q-function, with experience replay and a separate target network to enhance training stability.
    • The state is encoded as a one-dimensional vector comprising information (compromise status, reachability, and known service status) for each machine (see the second sketch after this list).
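
As a rough illustration of the tabular agents, the sketch below stores Q-values in a dictionary keyed by (state, action) pairs, selects actions ϵ-greedily with a decaying ϵ, and applies the standard Q-learning update; the class and hyperparameter values are assumptions, not the paper's implementation.

```python
import random
from collections import defaultdict

class TabularQAgent:
    """Sketch of tabular Q-learning with epsilon-greedy action selection."""

    def __init__(self, actions, alpha=0.1, gamma=0.99,
                 epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.999):
        self.q = defaultdict(float)          # (state, action) -> Q-value
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        self.epsilon, self.epsilon_min = epsilon, epsilon_min
        self.epsilon_decay = epsilon_decay

    def choose_action(self, state):
        # Explore with probability epsilon, otherwise act greedily.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, done):
        # Q-learning target: r + gamma * max_a' Q(s', a').
        best_next = 0.0 if done else max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
        # Decay epsilon toward its floor after every update.
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
```

The UCB variant replaces the ϵ-greedy choice with an upper-confidence-bound rule: it additionally tracks visit counts for each (state, action) pair and adds an exploration bonus that shrinks as a pair is visited more often.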

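A corresponding deep Q-learning sketch, with a single hidden layer, an experience replay buffer, and a periodically synchronised target network, is shown below; PyTorch, the layer width, and all hyperparameters are assumptions for illustration rather than the paper's configuration.

```python
import random
from collections import deque

import torch
import torch.nn as nn

class DQNAgent:
    """Sketch of deep Q-learning with experience replay and a target network."""

    def __init__(self, state_dim, n_actions, hidden=64, gamma=0.99, lr=1e-3,
                 buffer_size=10_000, batch_size=32, target_update=100):
        def make_net():
            # Single hidden layer mapping the flat state vector to one Q-value per action.
            return nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))
        self.q_net, self.target_net = make_net(), make_net()
        self.target_net.load_state_dict(self.q_net.state_dict())
        self.optimizer = torch.optim.Adam(self.q_net.parameters(), lr=lr)
        self.replay = deque(maxlen=buffer_size)
        self.gamma, self.batch_size = gamma, batch_size
        self.target_update, self.steps = target_update, 0
        self.n_actions = n_actions

    def act(self, state, epsilon):
        # `state` is the flat per-machine feature vector described above.
        if random.random() < epsilon:
            return random.randrange(self.n_actions)
        with torch.no_grad():
            return int(self.q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

    def learn(self, transition):
        self.replay.append(transition)       # (state, action, reward, next_state, done)
        if len(self.replay) < self.batch_size:
            return
        batch = random.sample(self.replay, self.batch_size)
        s, a, r, s2, done = (torch.as_tensor(x, dtype=torch.float32) for x in zip(*batch))
        q = self.q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = r + self.gamma * self.target_net(s2).max(1).values * (1 - done)
        loss = nn.functional.mse_loss(q, target)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        # Periodically copy the online network into the target network.
        self.steps += 1
        if self.steps % self.target_update == 0:
            self.target_net.load_state_dict(self.q_net.state_dict())
```
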
Experimental Evaluation:

The paper evaluates the RL agents on several network configurations, including a standard network topology, single-site, and multi-site network scenarios. Key experimental observations include:

  • Convergence and Optimality:
    • All RL variants are capable of finding attack policies that compromise sensitive machines while minimizing action costs.
    • Convergence is measured in terms of mean episodic reward approaching the theoretical maximum, computed as the total value of sensitive machines minus the minimum cumulative action cost (see the sketch after this list).
    • On certain topologies (e.g., multi-site networks), the DQL algorithm achieves faster convergence compared to tabular methods in terms of episodes, although tabular approaches complete many more episodes per unit time due to reduced computational overhead.
  • Scaling with Network Size and Action Space:
    • When scaling the number of machines, performance drops rapidly due to the exponential state space growth. Tabular methods exhibit a steep decline past network sizes of approximately 40 machines, while DQL shows a more gradual performance degradation provided sufficient training time.
    • Increasing the number of exploitable services affects DQL more adversely than tabular methods, likely due to the increased output dimension required by the neural network and a corresponding slowdown in the exploration rate.
  • Comparative Metrics:
    • Metrics such as the proportion of evaluation runs that successfully exploited all sensitive machines (solved proportion), maximum rewards achieved, and training episodes per fixed time period were recorded.
    • The DQL algorithm achieved 100% solved proportion in some scenarios, with consistent maximum rewards, albeit with slower per-episode throughput compared to tabular methods.
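
The solved proportion and the theoretical maximum reward referenced above reduce to simple arithmetic over evaluation episodes; the snippet below is an illustrative reconstruction of how they can be computed, not the paper's evaluation code.

```python
def theoretical_max_reward(sensitive_machine_values, min_total_action_cost):
    """Upper bound on episodic reward: total sensitive-machine value minus minimum cost."""
    return sum(sensitive_machine_values) - min_total_action_cost

def solved_proportion(run_solved_flags):
    """Fraction of evaluation runs in which every sensitive machine was compromised.

    `run_solved_flags` is a list of booleans, one per evaluation run.
    """
    return sum(run_solved_flags) / len(run_solved_flags)

# Example with hypothetical numbers: two sensitive machines worth 100 each,
# with a cheapest attack path costing 12 in total action cost.
# theoretical_max_reward([100, 100], 12) -> 188
```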

3. Discussion and Future Work

The experimental results substantiate that RL can autonomously learn effective penetration testing strategies in a simulated environment, even without prior knowledge of the exploit outcome model. However, limitations are discussed:

  • Scalability:
    • The exponential growth of the state-action space in large networks and with many exploits presents significant challenges, particularly for tabular approaches.
    • The authors highlight the potential for hierarchical RL or problem decomposition (separating network-level targeting from machine-level exploitation) as promising avenues for scalability improvement.
  • Fidelity versus Versatility Trade-off:
    • While the NAS offers rapid simulation for algorithm development, its abstract representation does not capture all nuances of real-world network dynamics.
    • Future research is proposed to integrate higher-fidelity environments (e.g., virtualized machine networks) once the RL algorithms demonstrate sufficient robustness.
  • Extension of RL Algorithms:
    • Further development of scalable, efficient function-approximation techniques (potentially beyond single-layer neural networks) is advocated to handle the complexity of real-world exploit databases and network sizes.

Conclusion

The paper demonstrates that model-free RL methodologies, including both tabular and deep Q-learning approaches, are capable of autonomously learning optimal attack paths in simulated penetration testing environments. The integration of a custom-designed NAS provides a controllable benchmark platform, facilitating rapid experimentation and performance evaluation. Although scalability remains a critical hurdle, the research lays a solid foundation for future work aimed at extending RL-based automated pentesting to larger and more realistic network settings.