
MCAS: Multi Cyber Agent Simulator

Updated 27 December 2025
  • MCAS is a multi-agent simulation framework enabling cyber attacker–defender interactions via a decentralized, partially observable decision process.
  • It represents networks as symbolic property sets and employs guarded actions with clear pre- and post-conditions to mimic real-world cyber tactics.
  • Experimental evaluations in MCAS utilize diverse agent strategies and reinforcement learning to assess coordination effectiveness and cyber battle outcomes.

The Multi Cyber Agent Simulator (MCAS) is a turn-based, multi-agent simulation environment for modeling battles between autonomous cyber-attacker and cyber-defender agents on virtualized networked infrastructures. Developed by Soulé et al., MCAS is formulated as a decentralized, partially observable Markov decision process (Dec-POMDP), with agents operating under partial observability, sequentially executing guarded actions over host node property spaces, and receiving observation- and metric-derived scalar rewards specific to their roles. MCAS supports scenario-agnostic experimental studies in realistic cyber-attack and defense coordination, featuring extensible symbolic property systems and agent behaviors, and is implemented atop the PettingZoo multi-agent reinforcement learning (RL) framework (Soulé et al., 5 Jun 2025).

1. System Architecture and Representation

MCAS abstracts the network as a collection of nodes, each represented by a "bag" of symbolic properties $P_{n_j} = \{ p_l = (\mathrm{id}_l, v_l) \}$, where each property consists of an identifier from a global namespace and a value in the set $V$. Typical node properties include operating system characteristics, privilege levels, and file presence.
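
The property-bag formulation is direct to express in code. Below is a minimal sketch, assuming properties are hashable (identifier, value) tuples; the node name and property choices are illustrative, not drawn from the paper:

```python
# Minimal sketch: a node as a "bag" of symbolic (identifier, value) properties.
# Property names and values here are illustrative assumptions.
node_web_server = {
    ("os", "linux"),
    ("service", "http"),
    ("privilege", "user"),
    ("file_present", "customer_records"),
}

def global_state(nodes):
    """Union of all node property bags, i.e., the overall state s within P."""
    return set().union(*nodes.values())

state = global_state({"web_server": node_web_server})
```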

Agents, partitioned into attacker and defender sets ($Ag = \{ ag_1, \ldots, ag_m \}$), are allocated visibility over specific nodes. Each agent possesses its own local observation space $\Omega_i \subset P$ and an action set $A_i$ defining possible interventions. An agent's behavior policy $\pi_i$ maps local histories (observation and reward sequences) to action choices.
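
These three ingredients can be pictured as a simple record over the sets just defined. The following dataclass is a hypothetical rendering, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class Agent:
    """Hypothetical agent record mirroring the definitions above."""
    name: str
    role: str                      # "attacker" or "defender"
    knowledge: FrozenSet[str]      # observable property identifiers (Omega_i)
    capabilities: FrozenSet[str]   # names of permitted actions (A_i)
    policy: Callable               # pi_i: (obs history, reward history) -> action

def observe(agent: Agent, state: set) -> set:
    """Local observation o_i: properties of s whose identifiers the agent knows."""
    return {p for p in state if p[0] in agent.knowledge}
```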

MCAS advances the global state through an agent–environment cycle: agents act in a fixed turn order, each receiving its latest observation and reward and selecting an action, which the environment then applies to update the global network state, recalculate metrics, and pass control to the next agent.

2. Decentralized Partially Observable Markov Decision Process (Dec-POMDP) Formalism

Formally, MCAS instantiates a Dec-POMDP $(S, \{A_i\}, T, R, \{\Omega_i\}, O)$ defined as follows:

  • State space $S$: The overall system state is the union of all properties across nodes, $s \subset P$.
  • Actions: Agents act sequentially. Each action $a$ is a guarded property update, defined by a pre-condition set $P_{pre}^a$ and a post-condition set $P_{post}^a$. The transition kernel $T(s, a, s')$ enforces that $P_{pre}^a \subset s$; if so, properties matching $P_{post}^a$ identifiers are updated, and $P_{post}^a$ is added to the state (see the sketch after this list). Non-deterministic transitions are supported by assigning probabilities $< 1$ to action outcomes.
  • Observations: After each transition, agent $ag_i$ receives an observation $o_i \subset \Omega_i$ deterministically if $o_i \subset s'$, modeling local, property-based partial observability.
  • Rewards: After all agent actions in a round, domain metrics $\mathrm{Metrics}(s, a) \in \mathbb{R}^n$ (e.g., number of compromised nodes, spyware presence, patch coverage) are computed. A linear or nonlinear evaluation yields separate attacker ($r^a$) and defender ($r^d$) scalar rewards: $R(s, a) = (r^d, r^a) = \mathrm{Eval}(\mathrm{Metrics}(s, a))$.
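
As referenced in the Actions bullet, the guard-then-update semantics of $T$ and a metric-based $\mathrm{Eval}$ can be sketched as follows. Function names and the linear evaluation are assumptions for illustration:

```python
import random

def apply_action(state: set, pre: set, post: set, success_prob: float = 1.0) -> set:
    """Guarded transition T(s, a, s'): fire only if P_pre is contained in s;
    properties sharing an identifier with P_post are overwritten."""
    if not pre <= state:                    # guard fails: state unchanged
        return state
    if random.random() > success_prob:      # non-deterministic outcome (< 1)
        return state
    post_ids = {pid for pid, _ in post}
    return {p for p in state if p[0] not in post_ids} | post

def evaluate(metrics: list, w_def: list, w_att: list) -> tuple:
    """Toy linear Eval: a metrics vector yields (r_d, r_a)."""
    r_d = sum(w * m for w, m in zip(w_def, metrics))
    r_a = sum(w * m for w, m in zip(w_att, metrics))
    return r_d, r_a
```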

3. Agent Capabilities, Action Modeling, and Knowledge

Each agent in MCAS is characterized by its:

  • Knowledge: Defines which property identifiers from the global set $ID$ are observable by the agent ($\Omega_i$).
  • Capabilities: Restricts which actions (from the global action catalog) the agent can attempt ($A_i \subset \mathrm{Action}$).

Actions correspond to pre–post property transformations, manually constructed and annotated with logical pre-conditions (e.g., "requires root privilege, requires knowledge of host IP") and post-conditions (e.g., addition of "host compromised" or removal of "vulnerable_service"). The repertoire, inspired by the MITRE ATT&CK framework, comprises approximately 30 attacker atomic actions (e.g., port scan, lateral movement, exfiltrate DB data, install spyware) and 30 defender atomic actions (e.g., detect process, apply patch, quarantine host).
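
For concreteness, two guarded actions in this style might be encoded as below. The names, conditions, and the convention of modeling removal as a value update are hypothetical, not reproduced from the MCAS catalog:

```python
# Hypothetical guarded actions; identifiers and values are illustrative only.
LATERAL_MOVEMENT = {
    "actor": "attacker",
    "pre":  {("privilege", "root"), ("knows_ip", "host_b")},  # guard P_pre
    "post": {("compromised", "host_b")},                      # effect P_post
}
APPLY_PATCH = {
    "actor": "defender",
    "pre":  {("vulnerable_service", "http")},
    # "removal" modeled as overwriting the property's value:
    "post": {("vulnerable_service", "none")},
}
```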

4. Agent–Environment Cycle and Sequential Turn Order

MCAS operationalizes the joint Dec-POMDP with a synchronous sequential turn mechanism, sketched in code after the list below:

  1. With the environment in state $s_t$, each agent in turn receives its most recent observation $o_t^{(i)}$ and reward $r_t^{(i)}$.
  2. Agent $ag_i$ selects an action $a_i = \pi_i(o_{0:t}^{(i)}, r_{0:t}^{(i)})$.
  3. The environment applies $a_i$ and transitions to $s_{t+1}$ according to $T$.
  4. The agent receives its updated observation $o_{t+1}^{(i)}$.
  5. This process repeats for all agents within a round.
  6. At round end, new metrics $\mathrm{Metrics}(s_{t+1}, (a_1, \ldots, a_m))$ are computed, and the evaluation function produces rewards.
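
Because MCAS is built atop PettingZoo, this round structure maps directly onto PettingZoo's standard agent-iteration (AEC) loop. The loop below uses PettingZoo's actual calling convention, but `make_mcas_env` and `policies` are hypothetical placeholders, as the paper's API is not documented in this summary:

```python
# Sequential turn cycle via PettingZoo's AEC interface.
# `make_mcas_env` and `policies` are assumed placeholders, not MCAS's real API.
env = make_mcas_env(scenario="small_company.json")
env.reset()

for agent in env.agent_iter():                     # fixed turn order
    obs, reward, terminated, truncated, info = env.last()
    if terminated or truncated:
        action = None                              # agent is done this episode
    else:
        action = policies[agent](obs, reward)      # a_i = pi_i(o, r)
    env.step(action)                               # env applies T, hands off turn
```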

Although concurrency is theoretically possible, the authors explicitly selected a sequential cycle for implementation simplicity in PettingZoo. This design means all agent coupling occurs indirectly via state manipulation and resource contention in the shared environment (Soulé et al., 5 Jun 2025).

5. Experimental Scenario and Agent Behavior Evaluation

For empirical demonstration, MCAS was instantiated on a network topology modeling a small company, comprising five subnets (Outside, DMZ, ACC, MAR, SRV) and host nodes comprising two attacker machines, four DMZ servers, three workstations, a printer, an API server, a database, and a domain controller. Attackers are initialized on external nodes; defenders are initially present on the web server and database host.

Attacker objectives are twofold: (i) exfiltrate confidential records from the database, and (ii) install spyware on the printer server. The attack–defense interaction graph is derived from an attack–defense tree built upon MITRE ATT&CK tactics (e.g., initial access, credentials, lateral movement, exfiltration) and mapped to a set of atomic actions and dependencies.

Three behavioral baselines were assessed over 1000 episodes:

  • Random agents: Uniform action selection results in negligible attack success.
  • Decision-tree agents: Handcrafted deterministic policies enable attackers to achieve objectives in fixed 16-step traces in the absence of defenders; defenders enforcing rule-based countermeasures reliably prevent success.
  • Multi-agent Q-learning (MARL) with curriculum: Attackers first train in isolation to a near-optimal policy, followed by defender introduction (a schematic of this curriculum is sketched below). Attacker rewards decrease as defenders learn effective mitigation. Metrics tracked include per-episode average reward, empirical attainment of objectives, and solution path lengths. Notably, the attacker reward curve shows marked growth in the attacker-only phase, plateauing as defenders are introduced (Soulé et al., 5 Jun 2025).
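
The two-phase curriculum might be structured as follows. The tabular Q-update and epsilon-greedy selection are standard assumptions; the paper's exact training configuration is not specified in this summary:

```python
import collections
import random

# Hypothetical sketch of the curriculum: phase 1 trains attackers alone,
# phase 2 activates defenders. Observations must be hashable (e.g., frozensets).
Q = collections.defaultdict(float)        # (agent, obs, action) -> value
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def q_update(agent, obs, act, reward, next_obs, legal_actions):
    """One-step tabular Q-learning backup for a single agent."""
    best_next = max(Q[(agent, next_obs, a)] for a in legal_actions)
    Q[(agent, obs, act)] += ALPHA * (reward + GAMMA * best_next - Q[(agent, obs, act)])

def choose(agent, obs, legal_actions):
    """Epsilon-greedy selection over the agent's own Q-values."""
    if random.random() < EPS:
        return random.choice(list(legal_actions))
    return max(legal_actions, key=lambda a: Q[(agent, obs, a)])

# Curriculum phases: which roles learn in each training stage.
PHASES = [("attacker_only", {"attacker"}), ("joint", {"attacker", "defender"})]
```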

6. Extensibility, Implementation, and Limitations

MCAS is engineered for scenario-agnostic deployment. Network topologies, node property sets, agent definitions, and guarded actions are loaded modularly via JSON; a hypothetical layout is sketched after the list below. This extensibility supports:

  • Automated scenario generation from external vulnerability/mitigation corpora (e.g., directly mining MITRE ATT&CK to populate nodes, actions, and metrics).
  • Substitution of advanced agent policies (e.g., policy gradient, actor–critic, multi-agent planning, or explicit collaboration protocols) to study coordination.
  • Extension to richer stochastic transition kernels with explicit costs or partial action success.
  • Bridging with real/emulated testbeds through policy transfer or replay.
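
The JSON schema itself is not documented in this summary, so the scenario layout below (written as a Python dict mirroring the JSON) is an assumption about its general shape:

```python
import json

# Hypothetical scenario layout; all field names are assumptions, not the
# actual MCAS schema.
scenario = {
    "nodes": {
        "web_server": [["os", "linux"], ["service", "http"]],
        "database":   [["os", "linux"], ["file_present", "records"]],
    },
    "agents": [
        {"name": "att1", "role": "attacker",
         "knowledge": ["os", "service"], "capabilities": ["port_scan"]},
    ],
    "actions": {
        "port_scan": {"pre": [["reachable", "true"]],
                      "post": [["service_known", "true"]]},
    },
}
print(json.dumps(scenario, indent=2))  # serializable as a JSON scenario file
```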

Limitations identified include the labor-intensive nature of manual scenario encoding, potential under-representation of genuinely concurrent cyber events due to sequential turn execution, lack of explicit resource and cost modeling (all properties are atomic), and the current absence of explicit inter-agent communication or negotiation.

7. Summary and Significance

MCAS provides a modular, Dec-POMDP-based simulation infrastructure for detailed study of attacker–defender interactions in symbolic networks. Its design supports plug-and-play agent and scenario specification, empirical evaluation of strategic behaviors, and domain-agnostic experimentation. By operationalizing the key abstractions of state as symbolic property sets, action as guarded transformations, observation as localized property views, and reward as scenario-grounded metric evaluation, MCAS advances the empirical and methodological toolkit for multi-agent cyber research (Soulé et al., 5 Jun 2025).

References

  • Soulé et al., "MCAS: Multi Cyber Agent Simulator," 5 Jun 2025.
