Autonomous Multi-Agent Defense

Updated 13 December 2025
  • Autonomous multi-agent defense is a paradigm leveraging distributed agents to autonomously detect, mitigate, and counter adversarial threats in digital and physical domains.
  • It employs advanced techniques such as multi-agent reinforcement learning, Dec-POMDPs, and hierarchical policy decomposition to coordinate and optimize defense actions.
  • Key applications include securing enterprise networks, UAV swarms, and OT systems through adaptive strategies and LLM-guided reward designs for robust resilience.

Autonomous multi-agent defense refers to the deployment of distributed, autonomous computational agents—software or cyber-physical—charged with detecting, mitigating, and responding to adversarial threats in complex environments. These agents cooperate (or, in adversarial training, compete) under formal frameworks such as Markov games or partially observable Markov decision processes, employing online learning, optimal control, and coordination mechanisms to achieve system-level resilience. This paradigm spans cyber defense (e.g., enterprise networks, cloud, OT/ICS, LLM infrastructures) and cyber-physical domains (UAV swarms, satellite constellations, sensor fusion for AVs).

1. Formal Models and Core Methodologies

Autonomous multi-agent defense is grounded in mathematical models such as decentralized (partially observable) Markov decision processes (Dec-POMDPs), stochastic games, and optimal control with bounded-rational adversaries. The system state $s \in S$ comprises the network topology, agent health, asset integrity, and adversarial context, while the joint action space $A = A_1 \times A_2 \times \cdots \times A_N$ covers both atomic and composite actions: mitigation, deception, recovery, quarantine, collaborative alerting, or moving-target defense. Observation models $O$ encode partial and incomplete information due to noisy sensors or limited agent visibility.
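
As an illustration, the Dec-POMDP ingredients above can be organized as a simple container; this is a minimal sketch, and the field names and the toy action list are illustrative assumptions, not taken from any cited framework.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Illustrative atomic defense actions per agent (assumed names, not from a specific paper).
DEFENSE_ACTIONS = ["monitor", "quarantine_host", "deploy_decoy", "restore_host", "alert_peers"]

@dataclass
class DecPOMDP:
    """Minimal container for the Dec-POMDP tuple (S, {A_i}, T, {O_i}, Z, {r_i}, gamma)."""
    states: List[str]                                                  # S: network/asset/adversary state labels
    joint_actions: List[Tuple[str, ...]]                               # A = A_1 x ... x A_N
    transition: Callable[[str, Tuple[str, ...]], Dict[str, float]]     # T(s' | s, a)
    observation: Callable[[str, int], Dict[str, float]]                # Z(o_i | s): partial, noisy per-agent view
    rewards: Callable[[str, Tuple[str, ...]], List[float]]             # r_i(s, a) per agent
    gamma: float = 0.99                                                # discount factor
```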

Policy synthesis is realized via multi-agent reinforcement learning (MARL), where agents optimize the return $J_i = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_i(s_t, a_t)\right]$, in either a zero-sum (defender vs. attacker) or cooperative (defender team) setting. Core MARL algorithms include independent Q-learning, value decomposition (QMIX), policy gradient methods (MAAC, MAPPO), and centralized training with decentralized execution (CTDE), which stabilizes distributed learning by sharing global critics during training while deploying policies on local observations (Landolt et al., 26 May 2025, Wang et al., 11 Oct 2024, Wiebe et al., 2023).
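
A minimal sketch of the CTDE pattern referenced above: a centralized critic conditions on the global state and joint actions during training, while each actor sees only its local observation at execution time. The layer sizes and class names are assumptions for illustration, not any paper's architecture.

```python
import torch
import torch.nn as nn

class LocalActor(nn.Module):
    """Decentralized policy: maps a local observation to action logits over defense primitives."""
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, local_obs: torch.Tensor) -> torch.Tensor:
        return self.net(local_obs)  # logits; sample or argmax at execution time

class CentralCritic(nn.Module):
    """Centralized critic: sees the global state and all agents' actions, used only during training."""
    def __init__(self, state_dim: int, n_agents: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, global_state: torch.Tensor, joint_actions_onehot: torch.Tensor) -> torch.Tensor:
        # Estimates Q(s, a_1, ..., a_N); discarded at deployment, when only LocalActor runs.
        return self.net(torch.cat([global_state, joint_actions_onehot], dim=-1))
```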

Incorporating safety constraints, formal defense frameworks often add control barrier functions for collision avoidance and resource safety (e.g., ensuring that actions $a$ satisfy $(s, a) \in C$), as well as observer/corrector modules for resilience against false-data injection (Wang et al., 1 Jan 2025).

2. Reward Design and Learning Procedures

Reward design is central, especially since conventional reward-function engineering is difficult in high-dimensional, adversarial environments. LLMs now assist by ingesting environmental context and expert-provided policy/goal information and outputting reward tables or YAML-formatted rule sets. Rewards may incorporate immediate and recurring incentives/penalties for both defenders and attackers: at time $t$,

$$R_t = r_{b,t} + r_{r,t} + \sum_{k=0}^{t} \left[ r_{b,k}^{\mathrm{recur}} + r_{r,k}^{\mathrm{recur}} \right]$$

Defenders adapt value estimates $V(s_t)$ toward returns $G_t = \sum_{k=t}^{T} \gamma^{k-t} R_k$ and optimize using actor-critic methods such as PPO, with neural networks outputting action logits over defense primitives. LLM-guided reward schemes and prompt-based persona tuning yield heterogeneous (proactive, stealthy, aggressive) behavioral archetypes, enabling ensemble policy deployment and meta-controller-driven strategy selection (Mukherjee et al., 20 Nov 2025).
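
The following is a hedged sketch of the return computation and the clipped PPO surrogate loss mentioned above; the hyperparameter values are illustrative defaults, not those of any cited system.

```python
import torch

def discounted_returns(rewards: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Compute G_t = sum_{k=t}^{T} gamma^(k-t) * R_k for one episode of rewards (shape [T])."""
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective used to update the defender's actor network."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```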

Table: LLM-Based Reward Design in DRL-Driven Cyber Defense

Phase                        | Mechanism               | Notable Detail
-----------------------------|-------------------------|-------------------------------------------------------
Context/prompt ingestion     | LLM (Claude Sonnet 4)   | Encodes topology, agent actions, SME-provided rewards
Reward table synthesis       | LLM outputs YAML        | Immediate and recurring rewards for actions
Policy training              | PPO (MLP policy/value)  | Actions: do nothing / place decoy / remove decoy
Persona ensemble & selection | Meta-policy / lookup    | Switch policy by real-time estimation of the red agent
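
To make the pipeline concrete, here is a sketch of how an LLM-emitted YAML reward table could be parsed into a lookup used during training; the YAML keys, action names, and values are invented for illustration and do not reproduce the cited paper's schema.

```python
import yaml  # PyYAML

# Hypothetical LLM-emitted reward table (schema and numbers are illustrative assumptions).
REWARD_YAML = """
defender:
  place_decoy:     {immediate: 2.0,  recurring: 0.1}
  remove_decoy:    {immediate: -0.5, recurring: 0.0}
  do_nothing:      {immediate: 0.0,  recurring: 0.0}
attacker:
  compromise_host: {immediate: -5.0, recurring: -0.5}
"""

def build_reward_lookup(yaml_text: str):
    table = yaml.safe_load(yaml_text)
    def reward(role: str, action: str, steps_active: int) -> float:
        entry = table[role][action]
        # Immediate reward plus a recurring term accumulated while the action's effect persists.
        return entry["immediate"] + entry["recurring"] * steps_active
    return reward

reward_fn = build_reward_lookup(REWARD_YAML)
print(reward_fn("defender", "place_decoy", steps_active=3))  # 2.0 + 0.1 * 3
```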

3. Communication, Coordination, and Hierarchical Decomposition

In partially observable environments, emergent and explicit communication becomes pivotal. Agents employ differentiable communication protocols such as DIAL with strategic action unmasking (SAU), which minimize communication cost via learned, minimalist (often single-bit) messaging and treat communication itself as a learned action, coordinating selectively on critical decisions (e.g., "unmask analyze" upon a cross-subnet threat) (Contractor et al., 19 Jul 2025).
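
A minimal sketch of the differentiable, single-bit messaging idea (not of SAU itself): during training the message is a continuous value passed through a noisy discretize/regularize unit so gradients can flow from the receiver back to the sender; at execution it is hard-thresholded to one bit. Layer sizes and the noise scale are assumptions.

```python
import torch
import torch.nn as nn

class OneBitMessenger(nn.Module):
    """DIAL-style discretize/regularize unit: continuous during training, binary at execution."""
    def __init__(self, obs_dim: int, noise_std: float = 0.5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))
        self.noise_std = noise_std

    def forward(self, obs: torch.Tensor, training: bool = True) -> torch.Tensor:
        logit = self.encoder(obs)
        if training:
            # Noisy, squashed message: gradients from the receiver's loss reach the sender.
            return torch.sigmoid(logit + self.noise_std * torch.randn_like(logit))
        # At execution: a hard single-bit message (e.g., "cross-subnet threat observed").
        return (logit > 0).float()
```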

Hierarchical policy architectures decompose monolithic action spaces into interpretable skills or sub-tasks (e.g., host investigation, recovery, traffic control), with a master policy orchestrating sub-policy selection. PPO-based hierarchical learning provides scalability (via a reduced effective action space), enhanced interpretability, and rapid transferability of sub-policies to new adversarial environments (Singh et al., 22 Oct 2024). Meta-controllers or rule-based switching between policy incarnations, based on runtime estimates of the adversary type, achieves robust worst-case defense (Mukherjee et al., 20 Nov 2025).
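
A hedged sketch of the hierarchical decomposition: a master policy picks a skill from a small set, and the selected sub-policy emits the low-level defense action. The skill names and shapes are illustrative assumptions, and real systems would typically commit to a skill for several timesteps rather than re-selecting each step.

```python
import torch
import torch.nn as nn

SKILLS = ["investigate_host", "recover_host", "control_traffic"]  # illustrative sub-tasks

class HierarchicalDefender(nn.Module):
    def __init__(self, obs_dim: int, n_low_level_actions: int):
        super().__init__()
        self.master = nn.Linear(obs_dim, len(SKILLS))  # picks which skill to run
        self.sub_policies = nn.ModuleList(
            [nn.Linear(obs_dim, n_low_level_actions) for _ in SKILLS]
        )                                              # one low-level head per skill

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        skill = torch.argmax(self.master(obs), dim=-1)                   # greedy skill selection per sample
        logits = torch.stack([p(obs) for p in self.sub_policies], dim=1)  # [batch, n_skills, n_actions]
        return logits[torch.arange(obs.shape[0]), skill]                 # low-level action logits
```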

4. Defense Against Propagation/Collusion and Malicious Agents

Propagation threats arise when malicious agents exploit inter-agent communication to spread adversarial influence (e.g., infectious LLM prompts or semantic poisoning). Unsupervised anomaly detectors such as BlindGuard train hierarchical encoders on normal behavior, simulating anomalies via embedding corruption and contrastive learning (InfoNCE loss), enabling efficient pruning of malicious agents across arbitrary multi-agent graphs without labeled attack data (Miao et al., 11 Aug 2025).
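
As a rough illustration of the unsupervised training signal, the sketch below pulls two views of an agent's normal-behavior embedding together while pushing corruption-simulated anomalous embeddings away, using an InfoNCE-style loss. This is a simplified assumption about the setup, not BlindGuard's exact objective.

```python
import torch
import torch.nn.functional as F

def infonce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                 negatives: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """anchor, positive: [N, D] views of normal behavior; negatives: [N, K, D] corrupted embeddings."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negatives, dim=-1)
    pos_logit = (a * p).sum(-1, keepdim=True) / temperature        # [N, 1] similarity to the positive
    neg_logits = torch.einsum("nd,nkd->nk", a, n) / temperature    # [N, K] similarities to corrupted samples
    logits = torch.cat([pos_logit, neg_logits], dim=1)
    labels = torch.zeros(a.shape[0], dtype=torch.long)             # the positive sits at index 0
    return F.cross_entropy(logits, labels)
```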

Consensus-based defense wrappers, such as CP-Guard, use probability-agnostic sample consensus algorithms (PASAC) with task-specific collaborative consistency loss (CCLoss) to identify malicious collaborators in robotic or sensor fusion systems, adapting thresholds online to maintain bounded error/failure rates even under unknown attacker counts. These techniques provably guarantee correct agent filtration given controllable reliability parameters $(\alpha, \beta)$, with real-world efficacy demonstrated against adversarial feature injection (Hu et al., 28 Jun 2025).
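
The sketch below gives a simplified, generic sample-consensus loop in the spirit described above: randomly sample subsets of collaborators, fuse their features, and keep collaborators consistent with the best consensus estimate. The thresholds and the plain L2 consistency metric are assumptions, not CP-Guard's actual PASAC procedure or CCLoss.

```python
import random
import torch

def consensus_filter(ego_feat: torch.Tensor, collab_feats: list, n_trials: int = 20,
                     subset_size: int = 3, threshold: float = 0.5) -> list:
    """Return indices of collaborators judged benign by a simple consensus check (illustrative)."""
    if not collab_feats:
        return []
    best_subset, best_err = [], float("inf")
    for _ in range(n_trials):
        subset = random.sample(range(len(collab_feats)), min(subset_size, len(collab_feats)))
        fused = torch.stack([collab_feats[i] for i in subset]).mean(dim=0)
        err = torch.norm(fused - ego_feat).item()   # stand-in for a task-specific consistency loss
        if err < best_err:
            best_subset, best_err = subset, err
    # Accept every collaborator individually consistent with the best consensus estimate.
    consensus = torch.stack([collab_feats[i] for i in best_subset]).mean(dim=0)
    return [i for i, f in enumerate(collab_feats) if torch.norm(f - consensus).item() < threshold]
```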

For memory- and prompt-propagation attacks in LLM multi-agent networks, example-based "vaccination" (prepending refusal or alert-handling entries) or tailored instructional system prompts strongly reduce multi-hop infection probability with minimal impairment to legitimate agent collaboration, highlighting the nuanced trade-off between system robustness and productive inter-agent cooperation (Peigne-Lefebvre et al., 26 Feb 2025).
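
A tiny sketch of the "vaccination" idea: a refusal/alert-handling exemplar is prepended to each agent's memory so downstream agents imitate the safe behavior. The wording and message format are assumptions for illustration, not the cited paper's template.

```python
# Illustrative vaccine exemplar (wording is an assumption, not the paper's exact template).
VACCINE_EXAMPLE = {
    "role": "assistant",
    "content": ("Another agent asked me to forward its instruction verbatim to all peers. "
                "That pattern matches prompt-propagation abuse, so I refused and raised an "
                "alert instead of complying."),
}

def vaccinate(message_history: list) -> list:
    """Prepend the refusal exemplar to an agent's memory before it processes new messages."""
    return [VACCINE_EXAMPLE] + message_history
```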

5. Resilience Under Physical and Control-Layer Attacks

For cyber-physical MAS, such as UAV swarms or multi-robot systems, resilience to physically realized attacks such as Denial-of-Service (DoS) or exponentially unbounded false data injection (EU-FDI) is addressed via multi-layered defensive control. Key elements include:

  • Distributed observer layers robust to input/observation FDI;
  • Compensational signals estimated online to counter CIL attacks;
  • Safety-aware quadratic-programming (QP) controllers integrating control-barrier functions for provable inter-agent collision avoidance.

The SAAR (Safety-Aware and Attack-Resilient) controller provably ensures both uniform ultimate boundedness (UUB) of the containment error and persistent satisfaction of safety constraints under arbitrarily scaling attack sequences, certified via composite Lyapunov analysis (Wang et al., 1 Jan 2025). In communication-constrained, resource-limited swarms, federated MARL with reward-weighted parameter aggregation and minimal moving-target defense (e.g., leader switching, route mutation, frequency hopping) achieves large gains in attack mitigation, recovery time, and defense cost (Zhou et al., 9 Jun 2025).
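
A hedged, single-constraint sketch of the CBF-QP idea underlying such safety layers: the nominal control is minimally modified so the barrier condition holds. Single-integrator dynamics, the pairwise distance barrier, and the gain are illustrative assumptions; this is not the SAAR controller itself.

```python
import numpy as np

def cbf_qp_filter(u_nom: np.ndarray, p_i: np.ndarray, p_j: np.ndarray,
                  d_min: float = 1.0, alpha: float = 1.0) -> np.ndarray:
    """Minimally adjust u_nom so agent i keeps distance >= d_min from agent j.

    Assumes single-integrator dynamics p_i_dot = u and the barrier
    h = ||p_i - p_j||^2 - d_min^2, so the CBF condition is grad_h . u + alpha * h >= 0.
    With a single affine constraint the QP has the closed-form projection used below.
    """
    diff = p_i - p_j
    h = float(diff @ diff - d_min ** 2)
    grad_h = 2.0 * diff                                  # gradient of h with respect to p_i
    constraint = float(grad_h @ u_nom + alpha * h)
    if constraint >= 0.0:                                # nominal control is already safe
        return u_nom
    # Project u_nom onto the half-space {u : grad_h . u + alpha * h >= 0}.
    return u_nom - (constraint / (float(grad_h @ grad_h) + 1e-9)) * grad_h
```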

6. Evaluation, Benchmarking, and Empirical Insights

Evaluation methodologies span synthetic episodes, cyber-defense gyms (Cyberwheel, CybORG), tailored emulators (e.g., IPMSRL for OT/ICS defense (Wilson et al., 18 Jan 2024)), and robotics AV testbeds (e.g., MAST (Hallyburton et al., 17 Jan 2024)). Key performance metrics include:

  • Episodic returns (mean, std);
  • Host compromise/clean rates and time-to-compromise quantiles;
  • Detection–response latency and false-positive penalties;
  • Communication cost and defense scalability (in number of agents, messages);
  • Quantitative resilience against various attacker personas (stealthy, aggressive, diverse).
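
For concreteness, a small sketch of how the first two metric families might be computed from logged episodes; the data layout (per-episode reward arrays and first-compromise step indices) is an assumption about the logging format, not a benchmark's API.

```python
import numpy as np

def episode_return_stats(episode_rewards: list) -> dict:
    """episode_rewards: list of 1-D arrays of per-step rewards, one array per episode."""
    returns = np.array([np.sum(r) for r in episode_rewards])
    return {"mean_return": float(returns.mean()), "std_return": float(returns.std())}

def time_to_compromise_quantiles(compromise_steps: list, qs=(0.25, 0.5, 0.75, 0.95)) -> dict:
    """compromise_steps: step index of first host compromise per episode (np.inf if never compromised)."""
    finite = np.array([t for t in compromise_steps if np.isfinite(t)])
    if finite.size == 0:
        return {f"q{int(q * 100)}": None for q in qs}
    return {f"q{int(q * 100)}": float(np.quantile(finite, q)) for q in qs}
```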

Empirical results consistently show that MARL-based defenses (independent learners, centralized critics, hierarchical policies, ensemble meta-controllers) outperform static, heuristic, and single-agent baselines. Structured reward shaping, emergent communication, and ensemble/transfer protocols (with LLM-driven personalization or meta-switching) yield substantial improvements under stochastic, variable attack patterns and demonstrate rapid adaptation in simulated and cyber-physical defense scenarios (Wang et al., 11 Oct 2024, Mukherjee et al., 20 Nov 2025, Wiebe et al., 2023).

7. Future Challenges and Directions

Outstanding challenges include:

  • Scalability to systems of 100+ agents with dense, time-varying communication topologies and heterogeneous sensor/actuator resources;
  • Adversarial robustness in the presence of communication jamming, network outliers, or compromised infrastructure;
  • Continual, safe, and explainable RL for deploying (and monitoring) defense policies in real-world settings;
  • Richer human-agent interaction loops, permitting adjustable autonomy, shared trust, and compliance with operational and ethical constraints (cf. NATO AICA/MAICA architecture (Theron et al., 2018));
  • Automated, LLM-guided curriculum and environment generation for closed-loop defense refinement;
  • Sim-to-real transfer and standardized benchmarking across cyber, OT/ICS, and robotic domains.

The field is rapidly converging on modular, interpretable, and adaptive architectures combining learning, consensus, robust estimation, and negotiation mechanisms, often with LLM or rule-based augmentation of classical RL techniques, to enable practical, scalable, and explainable autonomous multi-agent defense (Landolt et al., 26 May 2025, Mukherjee et al., 20 Nov 2025, Miao et al., 11 Aug 2025, Hu et al., 28 Jun 2025, Singh et al., 22 Oct 2024).

References (14)
