CAGE-4: Automated Cyber Defense Benchmark
- CAGE Challenge 4 is a benchmark framing automated cyber defense as a decentralized, graph-based partially observable Markov decision process.
- The featured approach embeds variable network observations with hierarchical graph neural networks, enabling scalable and sample-efficient planning.
- Integrating Monte Carlo Tree Search with policy distillation, the resulting agent (ACDZero) outperforms traditional baselines in dynamic, adversarial environments.
CAGE Challenge 4 (CAGE-4 / CC4) is a benchmark for automated cyber defense (ACD) in dynamic, adversarial network environments. It frames cyber defense as a context-based, partially observable Markov decision process (POMDP) with decentralized blue-agent defenders responding to real-time network intrusions. Solutions to CC4 must address sample efficiency, scalability, partial observability, and complex network topologies in environments featuring variable host, service, and topology configurations.
1. Formalization of CAGE-4 as a Decentralized Graph-Based POMDP
CAGE-4 is modeled as a decentralized, context-based POMDP, defined by the tuple

$$\langle \mathcal{N}, \mathcal{S}, \mathcal{A}, \Omega, T, O, R, \gamma \rangle,$$

where
- $\mathcal{N}$: set of blue-agent defenders (five in CC4), each controlling a local subnet;
- $\mathcal{S}$: global state space, encoding network topologies, host statuses, active services, access control lists (ACLs), and message logs;
- $\mathcal{A}$: joint action space, including actions like AnalyzeHost, RestoreService, DeployDecoy, AllowTraffic, BlockTraffic;
- $\Omega$: joint observation space, with each agent $i$ receiving a partial observation $o^i_t$ (local scan results, alerts, and 8-bit inter-agent messages);
- $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$: transition kernel, unknown to the agents;
- $O$: observation model;
- $R$: reward function, assigning negative penalties for compromised hosts and for over-restrictive defensive actions;
- $\gamma$: discount factor ($0.99$ in experiments).
CAGE-4 environments are stochastic, with 5–15 hosts per subnet and 1–5 services per host. This variable-dimension, context-driven formulation necessitates permutation-invariant network representations and scalable planning policies (Li et al., 5 Jan 2026).
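For concreteness, the decentralized interaction loop can be sketched as below. This is a minimal illustration only: the environment stub, observation fields, and random placeholder policies are assumptions, not the CC4 (CybORG) API.

```python
# Minimal sketch of a CC4-style decentralized loop.
# StubEnv, LocalObservation, and the reward shaping are hypothetical stand-ins,
# not the actual CybORG/CC4 interface.
import random
from dataclasses import dataclass, field

ACTIONS = ["AnalyzeHost", "RestoreService", "DeployDecoy", "AllowTraffic", "BlockTraffic"]

@dataclass
class LocalObservation:
    scan_results: dict                            # per-host scan summaries in the agent's subnet
    alerts: list                                  # intrusion alerts raised this step
    messages: list = field(default_factory=list)  # 8-bit messages from the other agents

class StubEnv:
    """Stand-in environment: five blue agents, one subnet each."""
    def reset(self):
        return {i: LocalObservation({}, []) for i in range(5)}

    def step(self, joint_action):
        obs = {i: LocalObservation({}, []) for i in range(5)}
        # Toy penalty mimicking the "over-restrictive actions" term in R.
        reward = -float(sum(a == "BlockTraffic" for a in joint_action.values()))
        return obs, reward, False

env = StubEnv()
observations = env.reset()
for t in range(500):                              # CC4 episodes run 500 timesteps
    joint_action = {i: random.choice(ACTIONS) for i in observations}  # placeholder policies
    observations, reward, done = env.step(joint_action)
    if done:
        break
```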
2. Graph Neural Network Embedding for Network Observations
Each agent’s observation at time $t$ is represented as an attributed graph

$$G_t = (V_t, E_t, X_t),$$

where:
- $V_t$: nodes for Hosts ($V^{H}$), Subnets ($V^{S}$), Ports/Services ($V^{P}$), Files ($V^{F}$);
- $E_t$: edges encoding host–subnet links, host–port attachments, inter-agent (subnet–subnet) communication;
- $X_t$: node features (one-hot OS, role indicators; port/service details; file signature status).
A hierarchical GNN with two aggregation stages produces a fixed-dimensional embedding. Node states are updated by message passing,

$$h_v^{(\ell+1)} = \phi\Big(h_v^{(\ell)},\, \bigoplus_{u \in \mathcal{N}(v)} \psi\big(h_u^{(\ell)}\big)\Big), \qquad \ell = 0, \dots, L-1.$$

The final graph embedding pools node states and incorporates both structural and contextual (temporal phase, inter-agent messages) information:

$$z_t = \rho\Big(\bigoplus_{v \in V_t} h_v^{(L)},\; c_t\Big),$$

where $c_t$ collects the contextual features. This GNN approach enables generalization across subnets of variable sizes and topologies (Li et al., 5 Jan 2026).
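A toy PyTorch sketch of this two-stage embedding is given below; the dimensions, mean-neighbor aggregator, and context fusion are illustrative assumptions rather than the paper's exact architecture.

```python
# Toy two-stage GNN embedding of an attributed observation graph (PyTorch).
# Dimensions, the mean-neighbor aggregator, and the context fusion are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class HierarchicalGNN(nn.Module):
    def __init__(self, feat_dim, hidden_dim, ctx_dim, num_layers=2):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, hidden_dim)
        # Stage 1: node-level message passing (mean aggregation over neighbors).
        self.layers = nn.ModuleList(
            nn.Linear(2 * hidden_dim, hidden_dim) for _ in range(num_layers)
        )
        # Stage 2: permutation-invariant readout fused with contextual features
        # (temporal phase, inter-agent messages).
        self.readout = nn.Linear(hidden_dim + ctx_dim, hidden_dim)

    def forward(self, x, adj, ctx):
        # x: [N, feat_dim] node features; adj: [N, N] adjacency; ctx: [ctx_dim]
        h = torch.relu(self.input_proj(x))
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        for layer in self.layers:
            neigh = (adj @ h) / deg                    # mean over neighbors
            h = torch.relu(layer(torch.cat([h, neigh], dim=-1)))
        g = h.mean(dim=0)                              # pool to a fixed-size vector
        return torch.relu(self.readout(torch.cat([g, ctx], dim=-1)))

# Toy graph: two hosts, one subnet node, one service node.
x = torch.randn(4, 6)                                  # 6-dim node features
adj = torch.tensor([[0, 0, 1, 1],                      # host1 -- subnet, service
                    [0, 0, 1, 0],                      # host2 -- subnet
                    [1, 1, 0, 0],                      # subnet -- hosts
                    [1, 0, 0, 0]], dtype=torch.float)  # service -- host1
ctx = torch.randn(3)                                   # e.g., phase + message bits
z = HierarchicalGNN(feat_dim=6, hidden_dim=16, ctx_dim=3)(x, adj, ctx)
print(z.shape)                                         # torch.Size([16]), any graph size
```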
3. Monte Carlo Tree Search in Latent State Space
Action planning uses a version of Monte Carlo Tree Search (MCTS), inspired by the MuZero algorithm, operating in the latent graph-embedded state space:
- Representation $h_\theta$ maps observation histories to a root latent state $s^0 = h_\theta(o_{\le t})$;
- Dynamics $g_\theta$ predicts future latent states and rewards, $(s^{k+1}, r^{k+1}) = g_\theta(s^k, a^k)$, for hypothetical actions;
- Prediction $f_\theta$ provides prior distributions over actions and value estimates, $(p^k, v^k) = f_\theta(s^k)$.
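These components admit a minimal sketch; the GRU encoder, linear dynamics and prediction heads, and all dimensions below are assumptions for illustration, not the published architecture.

```python
# Sketch of the three learned MuZero-style functions; the GRU encoder, linear
# heads, and all dimensions are assumptions for illustration.
import torch
import torch.nn.functional as F

LATENT, N_ACTIONS = 16, 5

representation = torch.nn.GRU(input_size=16, hidden_size=LATENT, batch_first=True)
dynamics = torch.nn.Linear(LATENT + N_ACTIONS, LATENT + 1)  # (s_k, a_k) -> (s_{k+1}, r_{k+1})
prediction = torch.nn.Linear(LATENT, N_ACTIONS + 1)         # s_k -> (policy logits, value)

obs_history = torch.randn(1, 8, 16)                  # 8 past graph embeddings
_, s0 = representation(obs_history)                  # root latent state from history
s0 = s0.squeeze(0)                                   # [1, LATENT]
a = F.one_hot(torch.tensor([2]), N_ACTIONS).float()  # hypothetical action
out = dynamics(torch.cat([s0, a], dim=-1))
s1, r1 = out[..., :LATENT], out[..., LATENT]         # next latent state, reward
pi_logits, v = prediction(s1).split([N_ACTIONS, 1], dim=-1)  # priors and value
```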
During each decision step, simulations are conducted:
- Selection: Actions are chosen via a pUCT criterion maximizing

  $$Q(s,a) + c \cdot P(s,a)\,\frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)},$$

  where $Q(s,a)$ is the mean value, $P(s,a)$ is the action prior, $N(s,a)$ is the visit count, and $c$ is a scheduled exploration constant.
- Expansion & Evaluation: The learned dynamics and prediction functions generate hypothetical child states, rewards, and action priors.
- Backup: $n$-step bootstrapped returns are computed and distributed back up the tree.

The post-search policy is extracted from root visit counts as

$$\pi(a \mid s) = \frac{N(s,a)^{1/\tau}}{\sum_b N(s,b)^{1/\tau}},$$

with temperature $\tau$. This approach balances exploration and exploitation in a high-dimensional, partially observable, and stochastic environment (Li et al., 5 Jan 2026).
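The selection rule, the $n$-step backup, and the visit-count policy extraction admit a compact sketch; the exploration constant, temperature, and toy statistics below are assumptions.

```python
# Minimal pUCT selection, n-step backup, and visit-count policy extraction.
# Exploration constant, temperature, and the toy statistics are assumptions.
import numpy as np

def puct_select(Q, P, N, c_explore):
    """Pick the action maximizing Q + c * P * sqrt(sum_b N_b) / (1 + N_a)."""
    score = Q + c_explore * P * np.sqrt(N.sum()) / (1.0 + N)
    return int(np.argmax(score))

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """n-step bootstrapped return backed up along a search path."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def post_search_policy(N, temperature=1.0):
    """pi(a|s) proportional to N(s,a)^(1/tau)."""
    weights = N ** (1.0 / temperature)
    return weights / weights.sum()

# Toy root node over the 5 defensive actions after some simulations.
Q = np.array([-0.20, -0.10, -0.50, -0.30, -0.15])   # mean simulated returns
P = np.array([0.30, 0.25, 0.15, 0.10, 0.20])        # priors from the prediction head
N = np.array([10.0, 14.0, 3.0, 5.0, 8.0])           # visit counts
print(puct_select(Q, P, N, c_explore=1.25))          # next action to simulate
print(post_search_policy(N, temperature=1.0))        # post-search policy
print(n_step_return([-0.1, -0.2, 0.0], bootstrap_value=-1.0))
```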
4. Policy Distillation and Model-Free Actor Optimization
To enable real-time deployment, the computationally intensive MCTS policy is distilled into a lightweight, GNN-based actor via multi-task learning over three loss terms:
- $\mathcal{L}_{\mathrm{RL}}$: standard PPO-style actor-critic loss for sample-efficient reinforcement learning;
- $\mathcal{L}_{\mathrm{distill}}$: Kullback–Leibler divergence between the MCTS policy and the actor output;
- $\mathcal{L}_{\mathrm{model}}$: supervised regression for reward and value prediction over unrolled model trajectories.
This combination transfers high-fidelity, search-generated policies to a reactive policy network, retaining both model-free and planning strengths while enabling rapid inference.
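One way to realize this objective is sketched below in PyTorch; the loss weights and the exact form of each term are assumptions, with the PPO term passed in precomputed.

```python
# Sketch of the multi-task distillation objective. Loss weights and term
# definitions are illustrative assumptions, not the paper's exact objective.
import torch
import torch.nn.functional as F

def total_loss(actor_logits, mcts_policy, ppo_loss,
               pred_values, target_values, pred_rewards, target_rewards,
               w_distill=1.0, w_model=0.5):
    # L_distill = KL(pi_MCTS || pi_actor): transfers the search policy to the actor.
    log_pi_actor = F.log_softmax(actor_logits, dim=-1)
    distill = F.kl_div(log_pi_actor, mcts_policy, reduction="batchmean")
    # L_model: regress value and reward along unrolled model trajectories.
    model = F.mse_loss(pred_values, target_values) + F.mse_loss(pred_rewards, target_rewards)
    return ppo_loss + w_distill * distill + w_model * model

# Toy batch of 4 states over the 5 defensive actions.
logits = torch.randn(4, 5)                           # actor outputs
pi_mcts = torch.softmax(torch.randn(4, 5), dim=-1)   # search policies as targets
loss = total_loss(logits, pi_mcts, ppo_loss=torch.tensor(0.3),
                  pred_values=torch.randn(4), target_values=torch.randn(4),
                  pred_rewards=torch.randn(4), target_rewards=torch.randn(4))
print(loss.item())
```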
5. Empirical Evaluation and Baseline Comparisons
Empirical assessment employs the official FiniteStateRedAgent adversary on 100 episodes of 500 timesteps each. Table 1 summarizes quantitative results.
| Method | Mean Reward | Clean Hosts (%) | Mean TTR (timesteps) |
|---|---|---|---|
| DQN (Tabular) | n/a | 19 | 142.3 |
| PPO (Tabular) | n/a | 21 | 138.6 |
| GCN | –193.68 | 74 | 58.7 |
| ACDZero | –150.03 | 82 | 46.2 |
Key findings:
- ACDZero improves mean reward by 29.2% relative to the GCN baseline (–193.68 → –150.03).
- Clean-host ratio increases (74% → 82%), and mean time-to-recovery is reduced.
- Reward variance is lower (5.8% drop), indicating more robust and consistent performance.
Ablative analysis indicates that
- Removing MCTS degrades ACDZero to GCN levels;
- Omitting policy distillation yields intermediate performance (–175.23 mean reward);
- Disabling Dirichlet noise or dynamic scheduling incurs additional performance drops.
Convergence is reported at ~30k episodes for ACDZero, 25% faster than GCN at 40k, with tabular methods failing to progress (Li et al., 5 Jan 2026).
6. Significance and Implications for Automated Cyber Defense
CC4 demonstrates the viability of planning-centric, graph-embedded reinforcement learning systems for ACD in highly variable, stochastic, and adversarial network settings. The integration of GNNs for state and observation modeling, MCTS for action planning in latent spaces, and distillation for scalable policy deployment establishes a template for future research in sample-efficient, robust cyber defense agents. The systematic improvement over both tabular and prior graph-RL baselines suggests the importance of structured representations and planning under partial observability. A plausible implication is the extensibility of these architectures to other domains involving graph-structured decision problems under adversarial uncertainty (Li et al., 5 Jan 2026).