PeerGuard: Multi-Agent LLM Defense
- PeerGuard is a mutual reasoning defense mechanism for multi-agent LLM systems that detects prompt injection backdoor attacks using chain-of-thought outputs.
- It leverages peer consistency assessment by comparing agents’ reasoning sequences and final answers to immediately identify logical inconsistencies.
- Experimental evaluations on models like GPT-4o and Llama 3 show high true positive rates (up to 96%) and low false positive rates (below 10%) across diverse frameworks.
PeerGuard is a mutual reasoning defense mechanism for multi-agent systems based on LLMs, targeting backdoor attacks delivered via prompt injection. Its core innovation is using agent-to-agent reasoning to detect illogical or inconsistent behavior indicative of prompt-based poisoning. PeerGuard augments multi-agent AI workflows such as debate loops and dialogue-based frameworks, leveraging chain-of-thought (CoT) responses to expose shortcut behaviors caused by backdoor triggers. Experimental validation across leading LLM platforms (GPT-4o, Llama 3) and multiple agent interaction frameworks demonstrates robustness, high accuracy, and applicability without requiring model retraining or parameter access (Fan et al., 16 May 2025).
1. Threat Landscape: Backdoor Attacks in LLM Multi-Agent Systems
Backdoor attacks in multi-agent LLM settings exploit adversarial control over API calls, enabling stealthy injection of malicious instructions at the prompt level. The attacker can craft system or user prompts containing:
- A trigger token (e.g., “cf”) engineered to activate backdoor behavior.
- A trigger-to-output mapping enforcing shortcut decisions (e.g., always select option C).
- Mixed clean and backdoored in-context demonstrations illustrating the intended manipulation.
Crucially, the threat model restricts adversarial capabilities: the attacker cannot modify model weights, access training data, or prompt fine-tuning, imposing a realistic but potent poisoning scenario. Three canonical poisoning scenarios are defined:
| Scenario | Description |
|---|---|
| S1 | Both agents are poisoned with the same backdoor |
| S2 | Exactly one agent is poisoned |
| S3 | Neither agent is poisoned |
These scenarios span the spectrum from full compromise (S1) to uncertain adversarial presence (S2) and clean interactions (S3).
2. PeerGuard Architecture and Workflow
PeerGuard operates by integrating a three-step mutual reasoning protocol into standard multi-agent workflows:
Step 1: Chain-of-Thought Generation
Each agent receives the task (e.g., multiple-choice question) alongside a uniform reasoning template. Output consists of a reasoning sequence and a final answer .
Step 2: Peer Consistency Assessment
Agents exchange responses and each submits two binary judgments:
- Agreement with peer’s answer.
- Logical consistency: “Does follow from ?”
The judgments are appended to the transcript, providing the key detection signals.
Step 3 (Optional): Refinement
Each agent may revise its answer using peer feedback. This step is primarily for enhanced debate fidelity and does not affect PeerGuard’s detection efficacy.
PeerGuard is compatible with debate loops such as those described by Du et al. (2023), as well as dialogue-based frameworks including AutoGen and CAMEL.
3. Detection Mechanism
PeerGuard calculates a consistency score for each agent:
where flags whether peer identifies an inconsistency in agent ’s output ($1$ = inconsistency detected), and is the peer set (for two agents, ).
The decision rule is binary: agent is declared poisoned if , with calibrated in and typically set to $1$ (i.e., any inconsistency suffices for detection). This approach ensures immediate detection upon logical contradiction between chains-of-thought and answers.
4. Training Protocol, Calibration, and Attack Configuration
PeerGuard operates entirely at inference time via prompt modification; no model retraining or fine-tuning is performed. Calibration involves tuning and prompt parameters against a held-out clean development set to balance false positive and true positive rates.
Backdoor injections use the “BadWord” method: a trigger token (“cf”) is inserted into 10–20% of benchmark examples to simulate shortcut activation. Tuning enables:
- False positive rates (FPR) as low as 5–10%,
- True positive rates (TPR) of 75–96%.
All performance calibration is metric-driven and holds across multiple benchmarks and agent frameworks.
5. Experimental Design and Quantitative Findings
Evaluated Models and Frameworks:
- LLMs: GPT-4o (ChatGPT series), Llama 3 70B
- Frameworks: Two-agent debate loop (as per Du et al. 2023), AutoGen, and CAMEL
Tasks and Datasets:
- MMLU (57 subjects, four-choice)
- CSQA (commonsense, five-choice)
- ARC-Easy/ARC-Challenge (four-choice)
Attack Setup:
- Trigger (“cf”) mapped to forced choice “C”
- Injection via system/user prompt
Metrics definitions:
Key Results:
- In the absence of defense, backdoor-triggered inputs result in >90% probability of misclassification (forced choice “C”) on both GPT-4o and Llama 3; clean inputs <10% misclassification.
- PeerGuard yields TPR = [0.76, 0.96], outperforming ZS-CoT (TPR ≈ 0.40–0.80) and Auto-CoT (TPR ≈ 0.10–0.50). FPR remains below 0.10 in all cases.
- Architecture-agnostic efficacy: TPR > 0.70 when ported to AutoGen and CAMEL frameworks.
- Per Table 2 in (Fan et al., 16 May 2025), for GPT-4o (S1, Agent 1): TPR = 0.81/0.85/0.95/0.89 and FPR ≈ 0.08/0.03/0.02/0.04 on MMLU/CSQA/ARC-E/ARC-C, respectively. Llama 3 exhibits similar performance characteristics.
6. Strengths, Limitations, and Prospective Extensions
Strengths:
- Does not require any model retraining; prompt-based, plug-and-play defense.
- Exploits chain-of-thought generation—a natural LLM skill—for validation and anomaly detection.
- Robust performance across LLM families and multi-agent frameworks.
Limitations:
- Detection depends on presence of at least one honest agent; fully colluding poisoned agents may evade PeerGuard.
- Binary inconsistency signaling may fail to capture subtler logical compromise.
- Single-round detection has limited coverage for backdoors maintaining superficial reasoning validity.
Potential Extensions:
- Aggregating soft consistency scores from three or more agents for enhanced robustness (Editor’s term: “consensus scoring”).
- Embedding-based or continuous metrics for logic consistency evaluation.
- Combinatorial integration with orthogonal defenses (e.g., input sanitization).
- Formal study of adversarial collusion strategies and theoretical defense bounds.
Open Questions:
- Defense strategy for scenarios where all agents share the same poisoned prompt.
- Applicability of PeerGuard to black-box peer models lacking transparent reasoning traces.
- Limits and failure modes of LLMs’ self-critique under adversarial conditions.
7. Context and Future Directions
PeerGuard represents a shift in multi-agent safety toward mutual reasoning- and interaction-driven defenses, distinct from prior LLM-centric single-agent approaches. The methodology is demonstrably generalizable across both debate-centric and dialogue-centric agent architectures. Key performance trade-offs remain in tuning detection thresholds and evaluating robustness against complex attack compositions. A plausible implication is that future research may focus on extending PeerGuard’s reasoning-based detection to larger communities of agents, incorporating richer logical consistency signals, and formalizing defenses against multi-agent collusion and prompt uniformity. Open research areas include theoretical guarantees, black-box model compatibility, and layered defense frameworks (Fan et al., 16 May 2025).