
MCAS: Multi Cyber Agent Simulator

Updated 27 December 2025
  • MCAS is a multi-agent simulation framework enabling cyber attacker–defender interactions via a decentralized, partially observable decision process.
  • It represents networks as symbolic property sets and employs guarded actions with clear pre- and post-conditions to mimic real-world cyber tactics.
  • Experimental evaluations in MCAS utilize diverse agent strategies and reinforcement learning to assess coordination effectiveness and cyber battle outcomes.

The Multi Cyber Agent Simulator (MCAS) is a turn-based, multi-agent simulation environment for modeling battles between autonomous cyber-attacker and cyber-defender agents on virtualized networked infrastructures. Developed by Soulé et al., MCAS is formulated as a decentralized, partially observable Markov decision process (Dec-POMDP), with agents operating under partial observability, sequentially executing guarded actions over host node property spaces, and receiving observation- and metric-derived scalar rewards specific to their roles. MCAS supports scenario-agnostic experimental studies in realistic cyber-attack and defense coordination, featuring extensible symbolic property systems and agent behaviors, and is implemented atop the PettingZoo multi-agent reinforcement learning (RL) framework (Soulé et al., 5 Jun 2025).

1. System Architecture and Representation

MCAS abstracts the network as a collection of nodes, each represented by a "bag" of symbolic properties $P_{n_j} = \{ p_l = (\mathrm{id}_l, v_l) \}$, where each property consists of an identifier from a global namespace and a value in the set $V$. Typical node properties include operating system characteristics, privilege levels, and file presence.
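
The property-bag formulation is direct to express in code. Below is a minimal sketch, assuming properties are hashable (identifier, value) tuples; the node name and property choices are illustrative, not drawn from the paper:

```python
# Minimal sketch: a node as a "bag" of symbolic (identifier, value) properties.
# Property names and values here are illustrative assumptions.
node_web_server = {
    ("os", "linux"),
    ("service", "http"),
    ("privilege", "user"),
    ("file_present", "customer_records"),
}

def global_state(nodes):
    """Union of all node property bags, i.e., the overall state s within P."""
    return set().union(*nodes.values())

state = global_state({"web_server": node_web_server})
```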

Agents, partitioned into attacker and defender sets ($Ag = \{ ag_1, \ldots, ag_m \}$), are allocated visibility over specific nodes. Each agent possesses its own local observation space $\Omega_i \subset P$ and an action set $A_i$ defining possible interventions. An agent's behavior policy $\pi_i$ maps local histories (observation and reward sequences) to action choices.
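
These three ingredients can be pictured as a simple record over the sets just defined. The following dataclass is a hypothetical rendering, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class Agent:
    """Hypothetical agent record mirroring the definitions above."""
    name: str
    role: str                      # "attacker" or "defender"
    knowledge: FrozenSet[str]      # observable property identifiers (Omega_i)
    capabilities: FrozenSet[str]   # names of permitted actions (A_i)
    policy: Callable               # pi_i: (obs history, reward history) -> action

def observe(agent: Agent, state: set) -> set:
    """Local observation o_i: properties of s whose identifiers the agent knows."""
    return {p for p in state if p[0] in agent.knowledge}
```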

MCAS advances the global state through an agent–environment cycle: agents act in a fixed turn order, each receiving its latest observation and reward and selecting an action, which the environment then applies to update the global network state, recalculate metrics, and pass control to the next agent.

2. Decentralized Partially Observable Markov Decision Process (Dec-POMDP) Formalism

Formally, MCAS instantiates a Dec-POMDP $(S, \{A_i\}, T, R, \{\Omega_i\}, O)$ defined as follows:

  • State space $S$: The overall system state is the union of all properties across nodes, $s \subset P$.
  • Actions: Agents act sequentially. Each action $a$ is a guarded property update, defined by a pre-condition set $P_{pre}^a$ and a post-condition set $P_{post}^a$. The transition kernel $T(s, a, s')$ enforces that $P_{pre}^a \subset s$; if so, properties matching $P_{post}^a$ identifiers are updated, and $P_{post}^a$ is added to the state (see the sketch after this list). Non-deterministic transitions are supported by assigning probabilities $< 1$ to action outcomes.
  • Observations: After each transition, agent $ag_i$ receives an observation $o_i \subset \Omega_i$ deterministically if $o_i \subset s'$, modeling local, property-based partial observability.
  • Rewards: After all agent actions in a round, domain metrics $\mathrm{Metrics}(s, a) \in \mathbb{R}^n$ (e.g., number of compromised nodes, spyware presence, patch coverage) are computed. A linear or nonlinear evaluation yields separate attacker ($r^a$) and defender ($r^d$) scalar rewards: $R(s, a) = (r^d, r^a) = \mathrm{Eval}(\mathrm{Metrics}(s, a))$.
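
As referenced in the Actions bullet, the guard-then-update semantics of $T$ and a metric-based $\mathrm{Eval}$ can be sketched as follows. Function names and the linear evaluation are assumptions for illustration:

```python
import random

def apply_action(state: set, pre: set, post: set, success_prob: float = 1.0) -> set:
    """Guarded transition T(s, a, s'): fire only if P_pre is contained in s;
    properties sharing an identifier with P_post are overwritten."""
    if not pre <= state:                    # guard fails: state unchanged
        return state
    if random.random() > success_prob:      # non-deterministic outcome (< 1)
        return state
    post_ids = {pid for pid, _ in post}
    return {p for p in state if p[0] not in post_ids} | post

def evaluate(metrics: list, w_def: list, w_att: list) -> tuple:
    """Toy linear Eval: a metrics vector yields (r_d, r_a)."""
    r_d = sum(w * m for w, m in zip(w_def, metrics))
    r_a = sum(w * m for w, m in zip(w_att, metrics))
    return r_d, r_a
```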

3. Agent Capabilities, Action Modeling, and Knowledge

Each agent in MCAS is characterized by its:

  • Knowledge: Defines which property identifiers from the global set $ID$ are observable by the agent ($\Omega_i$).
  • Capabilities: Restricts which actions (from the global action catalog) the agent can attempt ($A_i \subset \mathrm{Action}$).

Actions correspond to pre–post property transformations, manually constructed and annotated with logical pre-conditions (e.g., "requires root privilege, requires knowledge of host IP") and post-conditions (e.g., addition of "host compromised" or removal of "vulnerable_service"). The repertoire, inspired by the MITRE ATT&CK framework, comprises approximately 30 attacker atomic actions (e.g., port scan, lateral movement, exfiltrate DB data, install spyware) and 30 defender atomic actions (e.g., detect process, apply patch, quarantine host).
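
For concreteness, two guarded actions in this style might be encoded as below. The names, conditions, and the convention of modeling removal as a value update are hypothetical, not reproduced from the MCAS catalog:

```python
# Hypothetical guarded actions; identifiers and values are illustrative only.
LATERAL_MOVEMENT = {
    "actor": "attacker",
    "pre":  {("privilege", "root"), ("knows_ip", "host_b")},  # guard P_pre
    "post": {("compromised", "host_b")},                      # effect P_post
}
APPLY_PATCH = {
    "actor": "defender",
    "pre":  {("vulnerable_service", "http")},
    # "removal" modeled as overwriting the property's value:
    "post": {("vulnerable_service", "none")},
}
```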

4. Agent–Environment Cycle and Sequential Turn Order

MCAS operationalizes the joint Dec-POMDP with a synchronous sequential turn mechanism, sketched in code after the list below:

  1. With the environment in state $s_t$, each agent in turn receives its most recent observation $o_t^{(i)}$ and reward $r_t^{(i)}$.
  2. Agent $ag_i$ selects an action $a_i = \pi_i(o_{0:t}^{(i)}, r_{0:t}^{(i)})$.
  3. The environment applies $a_i$ and transitions to $s_{t+1}$ according to $T$.
  4. The agent receives its updated observation $o_{t+1}^{(i)}$.
  5. This process repeats for all agents within a round.
  6. At round end, new metrics $\mathrm{Metrics}(s_{t+1}, (a_1, \ldots, a_m))$ are computed, and the evaluation function produces rewards.
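
Because MCAS is built atop PettingZoo, this round structure maps directly onto PettingZoo's standard agent-iteration (AEC) loop. The loop below uses PettingZoo's actual calling convention, but `make_mcas_env` and `policies` are hypothetical placeholders, as the paper's API is not documented in this summary:

```python
# Sequential turn cycle via PettingZoo's AEC interface.
# `make_mcas_env` and `policies` are assumed placeholders, not MCAS's real API.
env = make_mcas_env(scenario="small_company.json")
env.reset()

for agent in env.agent_iter():                     # fixed turn order
    obs, reward, terminated, truncated, info = env.last()
    if terminated or truncated:
        action = None                              # agent is done this episode
    else:
        action = policies[agent](obs, reward)      # a_i = pi_i(o, r)
    env.step(action)                               # env applies T, hands off turn
```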

Although concurrency is theoretically possible, the authors explicitly selected a sequential cycle for implementation simplicity in PettingZoo. This design means all agent coupling occurs indirectly via state manipulation and resource contention in the shared environment (Soulé et al., 5 Jun 2025).

5. Experimental Scenario and Agent Behavior Evaluation

For empirical demonstration, MCAS was instantiated on a network topology modeling a small company, comprising five subnets (Outside, DMZ, ACC, MAR, SRV) and host nodes comprising two attacker machines, four DMZ servers, three workstations, a printer, an API server, a database, and a domain controller. Attackers are initialized on external nodes; defenders are initially present on the web server and database host.

Attacker objectives are twofold: (i) exfiltrate confidential records from the database, and (ii) install spyware on the printer server. The attack–defense interaction graph is derived from an attack–defense tree built upon MITRE ATT&CK tactics (e.g., initial access, credentials, lateral movement, exfiltration) and mapped to a set of atomic actions and dependencies.

Three behavioral baselines were assessed over 1000 episodes:

  • Random agents: Uniform action selection results in negligible attack success.
  • Decision-tree agents: Handcrafted deterministic policies enable attackers to achieve objectives in fixed 16-step traces in the absence of defenders; defenders enforcing rule-based countermeasures reliably prevent success.
  • Multi-agent Q-learning (MARL) with curriculum: Attackers first train in isolation to a near-optimal policy, followed by defender introduction (a schematic of this curriculum is sketched below). Attacker rewards decrease as defenders learn effective mitigation. Metrics tracked include per-episode average reward, empirical attainment of objectives, and solution path lengths. Notably, the attacker reward curve shows marked growth in the attacker-only phase, plateauing as defenders are introduced (Soulé et al., 5 Jun 2025).
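
The two-phase curriculum might be structured as follows. The tabular Q-update and epsilon-greedy selection are standard assumptions; the paper's exact training configuration is not specified in this summary:

```python
import collections
import random

# Hypothetical sketch of the curriculum: phase 1 trains attackers alone,
# phase 2 activates defenders. Observations must be hashable (e.g., frozensets).
Q = collections.defaultdict(float)        # (agent, obs, action) -> value
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

def q_update(agent, obs, act, reward, next_obs, legal_actions):
    """One-step tabular Q-learning backup for a single agent."""
    best_next = max(Q[(agent, next_obs, a)] for a in legal_actions)
    Q[(agent, obs, act)] += ALPHA * (reward + GAMMA * best_next - Q[(agent, obs, act)])

def choose(agent, obs, legal_actions):
    """Epsilon-greedy selection over the agent's own Q-values."""
    if random.random() < EPS:
        return random.choice(list(legal_actions))
    return max(legal_actions, key=lambda a: Q[(agent, obs, a)])

# Curriculum phases: which roles learn in each training stage.
PHASES = [("attacker_only", {"attacker"}), ("joint", {"attacker", "defender"})]
```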

6. Extensibility, Implementation, and Limitations

MCAS is engineered for scenario-agnostic deployment. Network topologies, node property sets, agent definitions, and guarded actions are loaded modularly via JSON; a hypothetical layout is sketched after the list below. This extensibility supports:

  • Automated scenario generation from external vulnerability/mitigation corpora (e.g., directly mining MITRE ATT&CK to populate nodes, actions, and metrics).
  • Substitution of advanced agent policies (e.g., policy gradient, actor–critic, multi-agent planning, or explicit collaboration protocols) to study coordination.
  • Extension to richer stochastic transition kernels with explicit costs or partial action success.
  • Bridging with real/emulated testbeds through policy transfer or replay.
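
The JSON schema itself is not documented in this summary, so the scenario layout below (written as a Python dict mirroring the JSON) is an assumption about its general shape:

```python
import json

# Hypothetical scenario layout; all field names are assumptions, not the
# actual MCAS schema.
scenario = {
    "nodes": {
        "web_server": [["os", "linux"], ["service", "http"]],
        "database":   [["os", "linux"], ["file_present", "records"]],
    },
    "agents": [
        {"name": "att1", "role": "attacker",
         "knowledge": ["os", "service"], "capabilities": ["port_scan"]},
    ],
    "actions": {
        "port_scan": {"pre": [["reachable", "true"]],
                      "post": [["service_known", "true"]]},
    },
}
print(json.dumps(scenario, indent=2))  # serializable as a JSON scenario file
```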

Limitations identified include the labor-intensive nature of manual scenario encoding, potential under-representation of genuinely concurrent cyber events due to sequential turn execution, lack of explicit resource and cost modeling (all properties are atomic), and the current absence of explicit inter-agent communication or negotiation.

7. Summary and Significance

MCAS provides a modular, Dec-POMDP-based simulation infrastructure for detailed study of attacker–defender interactions in symbolic networks. Its design supports plug-and-play agent and scenario specification, empirical evaluation of strategic behaviors, and domain-agnostic experimentation. By operationalizing the key abstractions of state as symbolic property sets, action as guarded transformations, observation as localized property views, and reward as scenario-grounded metric evaluation, MCAS advances the empirical and methodological toolkit for multi-agent cyber research (Soulé et al., 5 Jun 2025).

References

  • Soulé et al., "MCAS: Multi Cyber Agent Simulator," 5 Jun 2025.
