
CAGE-4: Automated Cyber Defense Benchmark

Updated 12 January 2026
  • CAGE Challenge 4 is a benchmark framing automated cyber defense as a decentralized, graph-based partially observable Markov decision process.
  • A recent solution embeds variable network observations with hierarchical graph neural networks, enabling scalable and sample-efficient planning.
  • By integrating Monte Carlo Tree Search with policy distillation, this approach (ACDZero) outperforms traditional baselines in dynamic, adversarial environments.

CAGE Challenge 4 (CAGE-4 / CC4) is a benchmark for automated cyber defense (ACD) in dynamic, adversarial network environments. It frames cyber defense as a context-based partially observable Markov decision process (POMDP) in which decentralized blue-agent defenders respond to network intrusions in real time. Solutions to CC4 must address sample efficiency, scalability, partial observability, and complex network topologies in environments featuring variable host, service, and topology configurations.

1. Formalization of CAGE-4 as a Decentralized Graph-Based POMDP

CAGE-4 is modeled as a decentralized, context-based POMDP, defined by

$$\mathcal{M} = \bigl\langle \mathcal{N},\mathcal{S},\mathcal{A},\mathcal{O},\mathcal{T},\mathcal{Z},\mathcal{R},\gamma \bigr\rangle$$

where

  • $\mathcal{N}$: set of blue-agent defenders (five in CC4), each controlling a local subnet;
  • $\mathcal{S}$: global state space, encoding network topologies, host statuses, active services, access control lists (ACLs), and message logs;
  • $\mathcal{A} = \times_{i\in\mathcal{N}}\mathcal{A}^{(i)}$: joint action space, including actions like AnalyzeHost, RestoreService, DeployDecoy, AllowTraffic, BlockTraffic;
  • $\mathcal{O} = \times_{i\in\mathcal{N}}\mathcal{O}^{(i)}$: joint observation space, with each agent $i$ receiving a partial observation $o_t^{(i)}$ (local scan results, alerts, and 8-bit inter-agent messages);
  • $\mathcal{T}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$: transition kernel (unknown);
  • $\mathcal{Z}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{O})$: observation model;
  • $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: reward function, a weighted penalty

$$R(s,a) = -\,\mathit{CompromisedHosts} - \alpha\,\mathit{ServiceDowntime}$$

    which penalizes compromised hosts and service downtime, with additional penalties for over-restrictive actions;
  • $\gamma$: discount factor ($0.99$ in experiments).

CAGE-4 environments are stochastic, with 5–15 hosts per subnet and 1–5 services per host. This variable-dimension, context-driven formulation necessitates permutation-invariant network representations and scalable planning policies (Li et al., 5 Jan 2026).
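
To make the decentralized formulation concrete, the following Python sketch shows a per-agent observation container, the weighted-penalty reward, and a decentralized episode loop. It is an illustrative outline under assumed names: the `env` interface, the `LocalObservation` fields, and the action strings are placeholders, not the official CC4/CybORG API.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical action labels mirroring the per-agent action space A^(i) listed above.
BLUE_ACTIONS = ["AnalyzeHost", "RestoreService", "DeployDecoy", "AllowTraffic", "BlockTraffic"]

@dataclass
class LocalObservation:
    """Partial observation o_t^(i): local scan results, alerts, and 8-bit inter-agent messages."""
    scan_results: Dict[str, str]   # host -> scan outcome on the agent's subnet
    alerts: List[str]              # intrusion alerts raised locally
    messages: List[int]            # 8-bit messages received from the other blue agents

def reward(compromised_hosts: int, service_downtime: float, alpha: float = 1.0) -> float:
    """Weighted penalty R(s, a) = -CompromisedHosts - alpha * ServiceDowntime."""
    return -compromised_hosts - alpha * service_downtime

def run_episode(env, agents, horizon: int = 500, gamma: float = 0.99) -> float:
    """Decentralized control loop: each blue agent acts on its own partial observation.

    `env` is a stand-in for the CC4 simulator with reset()/step() returning per-agent
    observations, a shared scalar reward, and a done flag.
    """
    observations = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        joint_action = {name: agent.act(observations[name]) for name, agent in agents.items()}
        observations, r, done = env.step(joint_action)
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret
```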

2. Graph Neural Network Embedding for Network Observations

Each agent’s observation at time $t$ is represented as an attributed graph

$$\mathcal{G}_t = (\mathcal{V}_t, \mathcal{E}_t, \mathbf{X}_t)$$

where:

  • $\mathcal{V}_t$: nodes for Hosts ($v_\mathrm{host}$), Subnets ($v_\mathrm{sub}$), Ports/Services ($v_\mathrm{port}$), Files ($v_\mathrm{file}$);
  • $\mathcal{E}_t$: edges encoding host–subnet links, host–port attachments, and inter-agent (subnet–subnet) communication;
  • $\mathbf{X}_t \in \mathbb{R}^{|\mathcal{V}_t| \times d}$: node features (one-hot OS and role indicators; port/service details; file signature status).

A hierarchical GNN with two aggregation stages produces a fixed-dimensional embedding $s_{t,0}$:

$$\mathbf{h}_v^{(0)} = \mathbf{x}_v, \qquad \mathbf{m}_{u \to v}^{(l)} = \phi^{(l)}_\mathrm{msg}\bigl(\mathbf{h}_u^{(l)}, \mathbf{h}_v^{(l)}, e_{u,v}\bigr), \qquad \mathbf{h}_v^{(l+1)} = \phi^{(l)}_\mathrm{upd}\Bigl(\mathbf{h}_v^{(l)}, \textstyle\sum_{u \in \mathcal{N}(v)} \mathbf{m}_{u\to v}^{(l)}\Bigr)$$

for layers $l = 0, \dots, L-1$. The final graph embedding incorporates both structural and contextual (temporal phase, inter-agent messages) information:

$$s_{t,0} = h_\theta(o_{\leq t}) = \mathrm{Readout}\bigl(\{\mathbf{h}_v^{(L)}\}_{v \in \mathcal{V}_t}\bigr) \,\Vert\, \mathbf{g}_t$$

This GNN approach enables generalization across subnets of variable sizes and topologies (Li et al., 5 Jan 2026).
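
The embedding can be sketched in plain PyTorch as below. This is a minimal illustration of the message-passing update and permutation-invariant readout, not the authors' implementation: edge features $e_{u,v}$ and the two-stage host/subnet hierarchy are collapsed into a single stack of layers, and all layer widths are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One layer: m_{u->v} = phi_msg(h_u, h_v); h_v' = phi_upd(h_v, sum_u m_{u->v})."""
    def __init__(self, dim: int):
        super().__init__()
        self.phi_msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.phi_upd = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, h: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        src, dst = edges                                    # edge list of shape (2, |E|)
        msgs = self.phi_msg(torch.cat([h[src], h[dst]], dim=-1))
        agg = torch.zeros_like(h).index_add_(0, dst, msgs)  # sum incoming messages per node
        return self.phi_upd(torch.cat([h, agg], dim=-1))

class GraphEncoder(nn.Module):
    """Embeds a variable-size attributed graph into a fixed-size latent state s_{t,0}."""
    def __init__(self, in_dim: int, dim: int = 64, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)                 # h_v^(0) from node features X_t
        self.layers = nn.ModuleList([MessagePassingLayer(dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor, edges: torch.Tensor, g_t: torch.Tensor) -> torch.Tensor:
        h = self.embed(x)
        for layer in self.layers:
            h = layer(h, edges)
        readout = h.mean(dim=0)              # permutation-invariant readout over V_t
        return torch.cat([readout, g_t])     # append context vector g_t (phase, messages)

# Toy usage: 4 nodes with 10-dim features, 3 directed edges, 8-dim context vector.
x = torch.randn(4, 10)
edges = torch.tensor([[0, 1, 2], [1, 2, 3]])
s_t0 = GraphEncoder(in_dim=10)(x, edges, torch.zeros(8))   # shape (72,)
```

Because the readout averages over nodes, the same encoder handles subnets with different numbers of hosts and services, which is the property the benchmark's variable topologies require.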

3. Monte Carlo Tree Search in Latent State Space

Action planning uses a version of Monte Carlo Tree Search (MCTS), inspired by the MuZero algorithm, operating in the latent graph-embedded state space:

  • Representation $h_\theta$ maps observation histories to a root state;
  • Dynamics $g_\theta$ predicts future latent states and rewards for hypothetical actions;
  • Prediction $f_\theta$ provides prior distributions over actions and value estimates.

During each decision step, $M$ simulations are conducted:

  • Selection: Actions are chosen via a pUCT criterion maximizing
    $$Q(s,a) + P(s,a)\,\frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}\,c_1$$
    where $Q(s,a)$ is the mean value, $P(s,a)$ the action prior, $N(s,a)$ the visit count, and $c_1$ a scheduled exploration constant.
  • Expansion & Evaluation: The learned dynamics $g_\theta$ and prediction $f_\theta$ generate hypothetical child states, rewards, and action priors.
  • Backup: $n$-step bootstrapped returns are computed and distributed back up the tree.

The post-search policy is extracted as

$$\pi_\mathrm{mcts}(a \mid s_{t,0}) \propto N(s_{t,0}, a)^{1/\tau}$$

This approach balances exploration and exploitation in a high-dimensional, partially observable, and stochastic environment (Li et al., 5 Jan 2026).
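
The selection rule and post-search policy extraction reduce to a few lines of bookkeeping. The sketch below is a hedged illustration with an assumed minimal node structure (prior, visit count, value sum); it omits the expansion and backup steps and the Dirichlet root noise mentioned in the ablations.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    prior: float                    # P(s, a) from the prediction network f_theta
    visit_count: int = 0            # N(s, a)
    value_sum: float = 0.0          # accumulates backed-up returns for Q(s, a)
    children: Dict[int, "Node"] = field(default_factory=dict)

    def q(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_action(node: Node, c1: float) -> int:
    """pUCT: argmax_a  Q(s,a) + P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)) * c1."""
    total_visits = sum(child.visit_count for child in node.children.values())
    def score(item):
        _, child = item
        return child.q() + child.prior * math.sqrt(total_visits) / (1 + child.visit_count) * c1
    return max(node.children.items(), key=score)[0]

def mcts_policy(root: Node, temperature: float = 1.0) -> List[float]:
    """Post-search policy pi_mcts(a | s) proportional to N(s, a)^(1 / tau)."""
    counts = [child.visit_count ** (1.0 / temperature) for child in root.children.values()]
    total = sum(counts) or 1.0
    return [c / total for c in counts]
```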

4. Policy Distillation and Model-Free Actor Optimization

To enable real-time deployment, the computationally intensive MCTS policy is distilled into a lightweight, GNN-based actor via multi-task learning:

$$\mathcal{L}_\mathrm{total} = \mathcal{L}_\mathrm{PPO} + \lambda_\pi \mathcal{L}_\mathrm{distill} + \lambda_v \mathcal{L}_\mathrm{value}$$

where

  • $\mathcal{L}_\mathrm{distill} = \mathbb{E}_t\bigl[ D_\mathrm{KL}\bigl(\pi_\mathrm{mcts}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid o_t)\bigr) \bigr]$ distills the search policy into the reactive actor $\pi_\theta$;
  • $\mathcal{L}_\mathrm{value}$ is a supervised regression loss for reward and value prediction over unrolled model trajectories.

This combination transfers high-fidelity, search-generated policies to a reactive policy network, retaining both model-free and planning strengths while enabling rapid inference.
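
A condensed PyTorch sketch of the distillation and value terms is given below; the PPO surrogate $\mathcal{L}_\mathrm{PPO}$ is computed separately in the usual way and added to this total. The loss weights and the squared-error value target are assumptions for illustration, not the authors' exact choices.

```python
import torch
import torch.nn.functional as F

def distillation_losses(actor_logits: torch.Tensor,
                        mcts_policy: torch.Tensor,
                        value_pred: torch.Tensor,
                        value_target: torch.Tensor,
                        lambda_pi: float = 1.0,
                        lambda_v: float = 0.5) -> torch.Tensor:
    """Returns lambda_pi * KL(pi_mcts || pi_theta) + lambda_v * value regression.

    actor_logits: (batch, |A|) raw outputs of the lightweight GNN actor pi_theta
    mcts_policy:  (batch, |A|) visit-count distribution pi_mcts produced by the search
    value_pred / value_target: (batch,) predicted values and bootstrapped return targets
    """
    log_pi_theta = F.log_softmax(actor_logits, dim=-1)
    # KL(pi_mcts || pi_theta) = sum_a pi_mcts(a) * (log pi_mcts(a) - log pi_theta(a))
    kl = (mcts_policy * (torch.log(mcts_policy + 1e-8) - log_pi_theta)).sum(dim=-1).mean()
    value_loss = F.mse_loss(value_pred, value_target)
    return lambda_pi * kl + lambda_v * value_loss
```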

5. Empirical Evaluation and Baseline Comparisons

Empirical assessment employs the official FiniteStateRedAgent adversary on 100 episodes of 500 timesteps each. Table 1 summarizes quantitative results.

| Method | Reward ($\mu\pm\sigma$) | Clean Hosts (%) | Mean TTR (timesteps) |
|---|---|---|---|
| DQN (Tabular) | $-606.20 \pm 43.22$ | 19 | 142.3 |
| PPO (Tabular) | $-597.28 \pm 41.98$ | 21 | 138.6 |
| GCN | $-193.68 \pm 21.07$ | 74 | 58.7 |
| ACDZero | $\mathbf{-150.03 \pm 19.85}$ | 82 | 46.2 |

Key findings:

  • ACDZero improves mean reward by 29.2% relative to the GCN baseline (–193.68 → –150.03).
  • The clean-host ratio increases (74% → 82%), and mean time-to-recovery drops from 58.7 to 46.2 timesteps.
  • The reward standard deviation is 5.8% lower (21.07 → 19.85), indicating more robust and consistent performance.

Ablative analysis indicates that

  • Removing MCTS degrades ACDZero to GCN levels;
  • Omitting policy distillation yields intermediate performance (–175.23 mean reward);
  • Disabling Dirichlet noise or dynamic $c_1$ scheduling each incurs an additional performance drop.

Convergence is reported at ~30k episodes for ACDZero, 25% faster than GCN at 40k, with tabular methods failing to progress (Li et al., 5 Jan 2026).

6. Significance and Implications for Automated Cyber Defense

CC4 demonstrates the viability of planning-centric, graph-embedded reinforcement learning systems for ACD in highly variable, stochastic, and adversarial network settings. The integration of GNNs for state and observation modeling, MCTS for action planning in latent spaces, and distillation for scalable policy deployment establishes a template for future research in sample-efficient, robust cyber defense agents. The systematic improvement over both tabular and prior graph-RL baselines suggests the importance of structured representations and planning under partial observability. A plausible implication is the extensibility of these architectures to other domains involving graph-structured decision problems under adversarial uncertainty (Li et al., 5 Jan 2026).
