Papers
Topics
Authors
Recent
2000 character limit reached

PeerGuard: Multi-Agent LLM Defense

Updated 29 November 2025
  • PeerGuard is a mutual reasoning defense mechanism for multi-agent LLM systems that detects prompt injection backdoor attacks using chain-of-thought outputs.
  • It leverages peer consistency assessment by comparing agents’ reasoning sequences and final answers to immediately identify logical inconsistencies.
  • Experimental evaluations on models like GPT-4o and Llama 3 show high true positive rates (up to 96%) and low false positive rates (below 10%) across diverse frameworks.

PeerGuard is a mutual reasoning defense mechanism for multi-agent systems based on LLMs, targeting backdoor attacks delivered via prompt injection. Its core innovation is using agent-to-agent reasoning to detect illogical or inconsistent behavior indicative of prompt-based poisoning. PeerGuard augments multi-agent AI workflows such as debate loops and dialogue-based frameworks, leveraging chain-of-thought (CoT) responses to expose shortcut behaviors caused by backdoor triggers. Experimental validation across leading LLM platforms (GPT-4o, Llama 3) and multiple agent interaction frameworks demonstrates robustness, high accuracy, and applicability without requiring model retraining or parameter access (Fan et al., 16 May 2025).

1. Threat Landscape: Backdoor Attacks in LLM Multi-Agent Systems

Backdoor attacks in multi-agent LLM settings exploit adversarial control over API calls, enabling stealthy injection of malicious instructions at the prompt level. The attacker can craft system or user prompts containing:

  • A trigger token (e.g., “cf”) engineered to activate backdoor behavior.
  • A trigger-to-output mapping enforcing shortcut decisions (e.g., always select option C).
  • Mixed clean and backdoored in-context demonstrations illustrating the intended manipulation.

Crucially, the threat model restricts adversarial capabilities: the attacker cannot modify model weights, access training data, or prompt fine-tuning, imposing a realistic but potent poisoning scenario. Three canonical poisoning scenarios are defined:

Scenario Description
S1 Both agents are poisoned with the same backdoor
S2 Exactly one agent is poisoned
S3 Neither agent is poisoned

These scenarios span the spectrum from full compromise (S1) to uncertain adversarial presence (S2) and clean interactions (S3).

2. PeerGuard Architecture and Workflow

PeerGuard operates by integrating a three-step mutual reasoning protocol into standard multi-agent workflows:

Step 1: Chain-of-Thought Generation

Each agent i{0,1}i \in \{0,1\} receives the task (e.g., multiple-choice question) alongside a uniform reasoning template. Output consists of a reasoning sequence Ri=[ri,1,...,ri,k]R_i = [r_{i,1}, ..., r_{i,k}] and a final answer ai{A,B,C,D}a_i \in \{\mathrm{A,B,C,D}\}.

Step 2: Peer Consistency Assessment

Agents exchange responses and each submits two binary judgments:

  • Agreement with peer’s answer.
  • Logical consistency: “Does aia_i follow from RiR_i?”

The judgments are appended to the transcript, providing the key detection signals.

Step 3 (Optional): Refinement

Each agent may revise its answer using peer feedback. This step is primarily for enhanced debate fidelity and does not affect PeerGuard’s detection efficacy.

PeerGuard is compatible with debate loops such as those described by Du et al. (2023), as well as dialogue-based frameworks including AutoGen and CAMEL.

3. Detection Mechanism

PeerGuard calculates a consistency score for each agent:

si=1PijPicjis_i = \frac{1}{|\mathcal{P}_i|} \sum_{j\in\mathcal{P}_i} c_{j\to i}

where cji{0,1}c_{j\to i} \in \{0,1\} flags whether peer jj identifies an inconsistency in agent ii’s output ($1$ = inconsistency detected), and Pi\mathcal{P}_i is the peer set (for two agents, Pi=1|\mathcal{P}_i|=1).

The decision rule is binary: agent ii is declared poisoned if siτs_i \geq \tau, with τ\tau calibrated in [0,1][0,1] and typically set to $1$ (i.e., any inconsistency suffices for detection). This approach ensures immediate detection upon logical contradiction between chains-of-thought and answers.

4. Training Protocol, Calibration, and Attack Configuration

PeerGuard operates entirely at inference time via prompt modification; no model retraining or fine-tuning is performed. Calibration involves tuning τ\tau and prompt parameters against a held-out clean development set to balance false positive and true positive rates.

Backdoor injections use the “BadWord” method: a trigger token (“cf”) is inserted into 10–20% of benchmark examples to simulate shortcut activation. Tuning enables:

  • False positive rates (FPR) as low as 5–10%,
  • True positive rates (TPR) of 75–96%.

All performance calibration is metric-driven and holds across multiple benchmarks and agent frameworks.

5. Experimental Design and Quantitative Findings

Evaluated Models and Frameworks:

  • LLMs: GPT-4o (ChatGPT series), Llama 3 70B
  • Frameworks: Two-agent debate loop (as per Du et al. 2023), AutoGen, and CAMEL

Tasks and Datasets:

  • MMLU (57 subjects, four-choice)
  • CSQA (commonsense, five-choice)
  • ARC-Easy/ARC-Challenge (four-choice)

Attack Setup:

  • Trigger (“cf”) mapped to forced choice “C”
  • Injection via system/user prompt

Metrics definitions:

TPR=# triggered inputs detected# triggered inputsFPR=# clean inputs flagged# clean inputs\mathrm{TPR} = \frac{\#\ \text{triggered inputs detected}}{\#\ \text{triggered inputs}} \qquad \mathrm{FPR} = \frac{\#\ \text{clean inputs flagged}}{\#\ \text{clean inputs}}

Key Results:

  • In the absence of defense, backdoor-triggered inputs result in >90% probability of misclassification (forced choice “C”) on both GPT-4o and Llama 3; clean inputs <10% misclassification.
  • PeerGuard yields TPR = [0.76, 0.96], outperforming ZS-CoT (TPR ≈ 0.40–0.80) and Auto-CoT (TPR ≈ 0.10–0.50). FPR remains below 0.10 in all cases.
  • Architecture-agnostic efficacy: TPR > 0.70 when ported to AutoGen and CAMEL frameworks.
  • Per Table 2 in (Fan et al., 16 May 2025), for GPT-4o (S1, Agent 1): TPR = 0.81/0.85/0.95/0.89 and FPR ≈ 0.08/0.03/0.02/0.04 on MMLU/CSQA/ARC-E/ARC-C, respectively. Llama 3 exhibits similar performance characteristics.

6. Strengths, Limitations, and Prospective Extensions

Strengths:

  • Does not require any model retraining; prompt-based, plug-and-play defense.
  • Exploits chain-of-thought generation—a natural LLM skill—for validation and anomaly detection.
  • Robust performance across LLM families and multi-agent frameworks.

Limitations:

  • Detection depends on presence of at least one honest agent; fully colluding poisoned agents may evade PeerGuard.
  • Binary inconsistency signaling may fail to capture subtler logical compromise.
  • Single-round detection has limited coverage for backdoors maintaining superficial reasoning validity.

Potential Extensions:

  • Aggregating soft consistency scores from three or more agents for enhanced robustness (Editor’s term: “consensus scoring”).
  • Embedding-based or continuous metrics for logic consistency evaluation.
  • Combinatorial integration with orthogonal defenses (e.g., input sanitization).
  • Formal study of adversarial collusion strategies and theoretical defense bounds.

Open Questions:

  • Defense strategy for scenarios where all agents share the same poisoned prompt.
  • Applicability of PeerGuard to black-box peer models lacking transparent reasoning traces.
  • Limits and failure modes of LLMs’ self-critique under adversarial conditions.

7. Context and Future Directions

PeerGuard represents a shift in multi-agent safety toward mutual reasoning- and interaction-driven defenses, distinct from prior LLM-centric single-agent approaches. The methodology is demonstrably generalizable across both debate-centric and dialogue-centric agent architectures. Key performance trade-offs remain in tuning detection thresholds and evaluating robustness against complex attack compositions. A plausible implication is that future research may focus on extending PeerGuard’s reasoning-based detection to larger communities of agents, incorporating richer logical consistency signals, and formalizing defenses against multi-agent collusion and prompt uniformity. Open research areas include theoretical guarantees, black-box model compatibility, and layered defense frameworks (Fan et al., 16 May 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Whiteboard

Follow Topic

Get notified by email when new papers are published related to PeerGuard.