Inter-Agent Trust Exploitation in Multi-Agent Systems

Updated 25 February 2026

Inter-agent trust exploitation is a vulnerability in multi-agent systems where agents blindly trust peer outputs, enabling malicious payload propagation.
Formal models such as the B–I–P framework are used to quantify trust failures and measure exploitation rates, revealing critical mismatches between belief and permission.
Defense strategies include attention-based trust estimation, cryptographic safeguards, and dynamic policy controls to mitigate performance degradation and unauthorized actions.

Inter-agent trust exploitation is a class of vulnerabilities inherent to multi-agent systems (MAS), particularly those powered by LLMs, in which adversaries leverage the trusting relationships among agents—either protocol-level or emergent in agent communication pipelines—to enable, amplify, or accelerate harmful outcomes. This exploitation can manifest through blind acceptance of peer outputs, covert propagation of malicious content, bypassing of safety controls, or sophisticated manipulation of trust evaluation and authorization boundaries. The resulting security risks are structurally distinct from classical single-agent threats and demand dedicated theoretical models, detection mechanisms, and orchestration architectures to mitigate their systemic impact.

1. Formal Models and Taxonomy

Various formalizations elucidate the attack surface and root causes of inter-agent trust exploitation. In contemporary LLM-based MAS settings, these can be classified along several axes:

Trust-Exploitation Vectors: Exploitation often leverages blind trust in agent-to-agent inputs, modeled as a binary trust flag such as $T_{i \rightarrow j}=1$: the receiving agent treats messages from peer agents as more trustworthy than those from external or human channels (Lupinacci et al., 9 Jul 2025). This is abstracted as a fundamental failure to distinguish provenance strength at the point where beliefs are formed and actions are authorized.
Belief–Intention–Permission (B–I–P) Model: Attacks are formally characterized in B–I–P security models, capturing the progression from ingestion of a low-trust belief, through intent formation, to ultimate authorization and execution of a high-risk action due to trust–authorization mismatch (Shi et al., 7 Dec 2025). This progression is anchored in a labeled transition system:

$\mathcal{M} = \langle S, E, \Rightarrow, P, V \rangle$

with explicit operators for belief $B_i(\varphi|\lambda)$, intention $I_i(\xi)$, and permission.

Attack Taxonomy:
- Cooperative Attacks: Multiple colluding agents inject crafted messages to bias MAS consensus or propagate misinformation; shown to cause significant drops in system truthfulness and reliability (Yu et al., 12 Mar 2025).
- Infectious Attacks: Malicious content is engineered to self-propagate via agent communication or memory/recall mechanisms, often modeled with SIR-style contagion dynamics (Yu et al., 12 Mar 2025).
- Privilege Escalation via Trust Boundaries: LLM agents bypass human-aligned safety policies when inputs are tagged as peer-originating (Lupinacci et al., 9 Jul 2025).
Distributed and Service-Based Exploitation: In distributed MAS (DMAS) and open service/web protocols, trust exploitation encompasses phenomena such as free riding (service misrepresentation), malicious code injection, system instability, and resource siphoning through implicit or explicit trust mismanagement (Zhang et al., 10 Apr 2025, Zhang et al., 2 Dec 2025, Hu et al., 5 Nov 2025).
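The SIR-style view of infectious attacks can be illustrated with a minimal discrete-time sketch. It is not the model from the cited work: the fully mixed population, the rates `beta` (per-step transmission) and `gamma` (per-step quarantine or patching), and the function name `simulate_sir` are all assumptions made for illustration.

```python
def simulate_sir(n_agents, n_infected, beta, gamma, steps):
    """Discrete-time, mean-field SIR over a population of agents.

    S: agents not yet exposed to the malicious payload
    I: agents currently relaying the payload
    R: agents that have been quarantined or patched
    Assumes 0 <= gamma <= 1 so that I never goes negative.
    """
    s = float(n_agents - n_infected)
    i = float(n_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(steps):
        # New infections scale with contacts between S and I, capped by S.
        new_infections = min(s, beta * s * i / n_agents)
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    for step, (s, i, r) in enumerate(simulate_sir(100, 1, 0.5, 0.1, 30)):
        if step % 5 == 0:
            print(f"step {step:2d}: S={s:6.2f} I={i:6.2f} R={r:6.2f}")
```

In this toy setting, raising `gamma` (faster quarantine) or lowering `beta` (stricter message screening) are the two levers that slow propagation.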

2. Attack Methodologies and Quantitative Vulnerabilities

Empirical investigations demonstrate consistent exploitation success rates and performance degradation across MAS frameworks:

Multi-Agent LLM System Attacks: Experiments with 17 leading LLMs revealed that 82.4% are vulnerable to inter-agent trust exploitation, compared with 41.2% for direct prompt injection and 52.9% for RAG backdoor attacks, a statistically significant elevation (p < 0.01) (Lupinacci et al., 9 Jul 2025). The canonical exploit relays a malicious instruction from a compromised caller agent to a privileged executor agent, whose trust flag ($T_{i \rightarrow j}=1$) causes it to execute payloads it would otherwise block.
Performance & Security Metrics:
- Message Detection Rate (MDR), Agent Detection Rate (ADR), Attack Success Rate (ASR), and system error rates on benign and malicious inputs are used for quantitative assessment (He et al., 3 Jun 2025, Zhang et al., 10 Apr 2025).
- In DMAS, performance degradation ($\Delta$Perf) can reach up to 80% when free riding or code execution attacks target core system roles, with up to 100% code-injection ASR in frameworks lacking built-in verification (Zhang et al., 10 Apr 2025).
Resource Hijacking: In open tool-ecosystem protocols (MCP), "implicit toxicity" is demonstrated by the LeechHijack exploit, which covertly repurposes agent compute cycles via a latent backdoor without exceeding the tool's declared privileges. Empirical success rates average 77.25% with an 18.62% compute overhead, well within the detection variance of baseline metrics (Zhang et al., 2 Dec 2025).
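The rate metrics above are simple ratios, and a short sketch makes the arithmetic explicit. The helper names below are hypothetical, not taken from the cited papers. The counts 14, 7, and 9 are not stated in the source; they are the only whole numbers out of 17 models that round to the reported 82.4%, 41.2%, and 52.9%.

```python
def rate(hits, total):
    """Fraction of hits among total trials."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


def attack_success_rate(outcomes):
    """outcomes: list of booleans, True when an attack attempt succeeded."""
    return rate(sum(outcomes), len(outcomes))


def message_detection_rate(flags, is_malicious):
    """Share of malicious messages that the defense flagged."""
    flagged = sum(1 for f, m in zip(flags, is_malicious) if f and m)
    return rate(flagged, sum(is_malicious))


if __name__ == "__main__":
    # Inferred counts out of 17 evaluated models (see lead-in).
    for label, count in [("inter-agent trust", 14),
                         ("direct prompt injection", 7),
                         ("RAG backdoor", 9)]:
        print(f"{label}: {100 * rate(count, 17):.1f}%")
```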

3. Mechanistic Roots and Structural Fragilities

The phenomenon arises from several interlocking technical and systemic weaknesses:

Document Authority Bias: LLMs treat peer-agent messages or retrieved documents as implicit "ground truth," often bypassing the same safety filters enforced on direct human inputs (Lupinacci et al., 9 Jul 2025).
Context-Dependent Safety Policies: Safety alignment and adversarial training are heavily biased toward human-facing interactions; inter-agent or tool-originated contexts are underrepresented, resulting in inconsistent behaviors and exploitable blind spots (Lupinacci et al., 9 Jul 2025).
Lack of Provenance/Verification: Many MAS designs lack provenance labels, cryptographic signatures, or authentication mechanisms for messages and tool outputs, which enables both cooperative and infectious attack strategies to function unchecked (Yu et al., 12 Mar 2025, Shi et al., 7 Dec 2025, Hu et al., 5 Nov 2025).
Trust–Authorization Mismatch: The decoupling of trust evaluation (belief) from permission granting underlies the systemic vulnerability. If a low-trust belief leads to authorization of a high-risk action, the "Failure quadrant" is reached (Shi et al., 7 Dec 2025).
LLM-Specific Issues: Model hallucination, sycophancy, prompt-induced policy drift, and misalignment with the principle of least privilege facilitate both trust amplification and trust-exploitation (Hu et al., 5 Nov 2025).
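As a rough illustration of the trust–authorization mismatch, the sketch below audits logged decisions and flags those that land in the failure quadrant. The record fields, the numeric trust and risk scales, and the 0.5 thresholds are assumptions for illustration; the source defines the quadrant only qualitatively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    belief_trust: float  # trust in the belief behind the action, in [0, 1] (assumed scale)
    action_risk: float   # estimated risk of the action, in [0, 1] (assumed scale)
    authorized: bool     # whether the permission layer granted the action


def in_failure_quadrant(d, trust_floor=0.5, risk_ceiling=0.5):
    """True when a low-trust belief was allowed to drive a high-risk action."""
    return (d.authorized
            and d.belief_trust < trust_floor
            and d.action_risk > risk_ceiling)


def audit(decisions):
    """Return the subset of logged decisions that show the mismatch."""
    return [d for d in decisions if in_failure_quadrant(d)]
```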

4. Detection, Mitigation, and Defense Mechanisms

A spectrum of technical mechanisms provides partial defense against inter-agent trust exploitation:

a. Attention-Based Trust Estimation

The A-Trust / Trust Management System (TMS) approach leverages the correlation between internal LLM attention patterns and specific trustworthiness violations (across six orthogonal dimensions: factual accuracy, logical consistency, relevance, bias, language quality, clarity) (He et al., 3 Jun 2025). By extracting attention fingerprints per message and mapping them to violation probabilities, agents and orchestrators can block or quarantine suspicious messages and isolate compromised agents. These methods yield MDR ≈ 85% and reduce ASR from 92% to 14% or lower, even when up to 75% of agents are malicious.
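The block-and-isolate step can be sketched as a simple gate. This sketch does not reproduce A-Trust itself: it assumes some upstream estimator has already turned attention features into per-dimension violation probabilities, and the class name `TrustGate`, the 0.8 threshold, and the three-strike isolation rule are hypothetical choices.

```python
from collections import Counter

# The six trustworthiness dimensions named in the text.
DIMENSIONS = ("factual accuracy", "logical consistency", "relevance",
              "bias", "language quality", "clarity")


class TrustGate:
    def __init__(self, block_threshold=0.8, isolate_after=3):
        self.block_threshold = block_threshold
        self.isolate_after = isolate_after
        self.blocked_counts = Counter()
        self.isolated = set()

    def admit(self, sender, violation_probs):
        """Return True if the message may be delivered to its recipient."""
        if sender in self.isolated:
            return False
        # Missing dimensions are treated as no evidence of a violation.
        worst = max(violation_probs.get(d, 0.0) for d in DIMENSIONS)
        if worst >= self.block_threshold:
            self.blocked_counts[sender] += 1
            if self.blocked_counts[sender] >= self.isolate_after:
                self.isolated.add(sender)  # stop accepting this agent's messages
            return False
        return True
```

Gating on the worst dimension is deliberately conservative; a deployment might instead weight dimensions differently, at the cost of more tuning.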

b. Architectural and Protocol-Level Safeguards

Zero-Trust Interfaces: Require peer agent and tool inputs to pass identical safety screening as user-originated input. Retain enforcement of static and dynamic access control policies irrespective of provenance (Lupinacci et al., 9 Jul 2025, Shi et al., 7 Dec 2025).
Cryptographic Approaches: Proof and stake mechanisms, as adopted in ERC-8004 and A2A protocol design, provide strong security guarantees for high-impact actions but introduce non-negligible latency, cost, and centralization risks (Hu et al., 5 Nov 2025).
Resource Auditing and Computational Provenance: Detection of resource-hijacking and implicit computation-exfiltration exploits relies on cryptographic lineage/attestation of tool code, per-call result signing, and append-only logs for forensic auditing (Zhang et al., 2 Dec 2025).
Graph-Based and Debate Filtering: Cooperative and topological defenses include majority voting/debate rounds, graph neural network edge pruning, and multilayer consensus to filter out or isolate malign agents (Yu et al., 12 Mar 2025).
Bayesian and Subjective Logic Trust Estimation: Hierarchical Bayesian models and subjective-logic opinion fusion frameworks enable per-agent trust estimation and adaptive weighting of information sources, preventing the aggregation of malicious contributions (Hallyburton et al., 2024, Müller et al., 2019).
Game-Theoretic and Evolutionary Mechanisms: Systems like Ev-Trust formalize the equilibrium selection of honest and high-trust strategies via replicator dynamics, adaptively excluding malicious providers and requestors from decentralized service interactions (Yang et al., 18 Dec 2025).
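Per-call result signing and append-only logging can be sketched with the Python standard library. This is an illustration under assumptions, not the mechanism of any cited system: HMAC with a shared key stands in for a real asymmetric signature or attestation scheme, and the function and class names are hypothetical.

```python
import hashlib
import hmac
import json


def sign_result(key, tool_id, call_id, result):
    """Tag a tool result so the caller can check origin and integrity."""
    payload = json.dumps(
        {"tool": tool_id, "call": call_id, "result": result},
        sort_keys=True, separators=(",", ":"),
    ).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_result(key, tool_id, call_id, result, tag):
    expected = sign_result(key, tool_id, call_id, result)
    return hmac.compare_digest(expected, tag)  # constant-time comparison


class AppendOnlyLog:
    """Hash-chained log: each entry commits to the previous entry's digest."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []  # list of (record, digest) pairs

    def append(self, record):
        prev = self.entries[-1][1] if self.entries else self.GENESIS
        digest = hashlib.sha256((prev + record).encode("utf-8")).hexdigest()
        self.entries.append((record, digest))
        return digest

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = self.GENESIS
        for record, digest in self.entries:
            recomputed = hashlib.sha256((prev + record).encode("utf-8")).hexdigest()
            if recomputed != digest:
                return False
            prev = digest
        return True
```

A shared-key MAC only proves that the sender holds the key, so it cannot attribute a result to one party among several key holders; a public-key signature would be needed for that.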

c. Runtime and Policy Controls

Information Repartitioning and Guardian Agents: Sharding of sensitive data, multi-agent cross-validation, and meta-agents enforcing runtime policy injection reduce over-exposure and dampen authorization drift in the trust–utility tradeoff (Xu et al., 21 Oct 2025).
Taint/Label-Based Tracking: Forward propagation of metadata and dynamic tracking of trust labels enable enforceable "no high-risk action from low-trust belief" rules (Shi et al., 7 Dec 2025).
Dynamic Trust Scheduling/Auditing: Explicit parameterization and logging of trust levels, thresholds, and policy triggers support continuous monitoring and post-mortem auditability (Xu et al., 21 Oct 2025).
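The "no high-risk action from low-trust belief" rule can be sketched as label propagation from beliefs to intentions, followed by a permission check. The two-level trust label, the type names, and the set of high-risk actions are assumptions for illustration and are not taken from the cited B–I–P formalization.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    LOW = 0
    HIGH = 1


@dataclass(frozen=True)
class Belief:
    content: str
    trust: Trust


@dataclass(frozen=True)
class Intention:
    action: str
    trust: Trust  # inherited from the beliefs that produced the intention


HIGH_RISK_ACTIONS = {"execute_shell", "transfer_funds"}  # hypothetical list


def form_intention(action, beliefs):
    """Taint propagation: an intention is only as trusted as its weakest belief."""
    label = min((b.trust for b in beliefs), default=Trust.LOW)
    return Intention(action, label)


def permit(intention):
    """Deny high-risk actions whose supporting beliefs include a low-trust one."""
    if intention.action in HIGH_RISK_ACTIONS and intention.trust == Trust.LOW:
        return False
    return True
```

For example, an intention to run `execute_shell` that rests on one user-originated belief and one unverified peer message inherits the `LOW` label and is denied, while a low-risk action built on the same beliefs still goes through.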

5. The Trust–Vulnerability Paradox and Risk–Utility Tradeoffs

Research establishes a fundamental, quantifiable paradox: increasing inter-agent trust to enhance MAS efficiency, collaboration, or coordination almost invariably widens the attack surface and promotes over-authorization or information leakage (Xu et al., 21 Oct 2025). Metrics such as Over-Exposure Rate (OER) and Authorization Drift (AD) rise with the trust coefficient $\tau$ even as task success rates also increase, revealing an inherent tradeoff between utility and vulnerability. Defenses such as information repartitioning and guardian agents can flatten the OER/AD increase, but at a modest absolute cost to efficiency.

6. Protocol and Web Frameworks: Comparative Exploitation and Defense

MAS protocols in the agentic web (A2A, AP2, ERC-8004) are evaluated under six canonical trust paradigms: Brief, Claim, Proof, Stake, Reputation, and Constraint. Each paradigm exhibits distinct core assumptions, exploitation vectors, and tradeoffs (Hu et al., 5 Nov 2025):

Paradigm	Attack Vector(s)	Security/Robustness
Brief	Issuer collusion, revocation lag	High (with honest issuers)
Claim	LLM hallucinations, schema abuse	Low
Proof	Side-channels, proof cherry-picking	Very high (ZK, TEE)
Stake	Sybil splitting, last-mile betrayal	High (if stake-to-harm is tuned)
Reputation	Sybil/collusion feedback gaming	Very low (against Sybil attacks)
Constraint	Sandbox escape, covert channel	Very high (with robust sandbox)

Hybrid architectures recommend composable, tiered defense, starting "zero trust" and incrementally layering higher-assurance mechanisms (proof, constraint, stake) as dictated by potential action harm.
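A tiered policy of this kind could be expressed as a lookup from estimated action harm to the trust mechanisms that must be satisfied before acting. The harm scale, the tier boundaries, and the mechanism sets below are hypothetical and serve only to show the shape of such a policy.

```python
# Hypothetical tiers: (upper bound on harm, mechanisms required at that tier).
TIERS = [
    (0.2, {"claim"}),
    (0.6, {"claim", "reputation", "constraint"}),
    (1.0, {"proof", "stake", "constraint"}),
]


def required_mechanisms(harm):
    """Return the mechanisms required for an action with harm in [0, 1]."""
    if not 0.0 <= harm <= 1.0:
        raise ValueError("harm must be in [0, 1]")
    for upper, mechanisms in TIERS:
        if harm <= upper:
            return mechanisms


def authorize(harm, satisfied):
    """Zero-trust default: deny unless every required mechanism is satisfied."""
    return required_mechanisms(harm) <= set(satisfied)
```

Under these assumed tiers, `authorize(0.9, {"claim", "reputation"})` is denied because a high-harm action requires proof, stake, and constraint, whatever weaker evidence is available.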

7. Research Directions and Open Challenges

Current defenses favor belief-stage filtering and trust-score estimation but remain reactive and brittle against adaptive attackers. The primary structural deficit is the lack of belief-aware, provenance-driven, and audit-ready dynamic authorization policies that can robustly chain trust evaluation to permission for high-risk actions (Shi et al., 7 Dec 2025). Outstanding research questions include scalable taint-to-policy enforcement, formal B–I–P guided auditing and logging, succinct belief-aware policy languages, and human-in-the-loop escalation formalisms for complex, high-value MAS ecosystems.

In summary, inter-agent trust exploitation constitutes a high-confidence, empirically validated threat in LLM-powered multi-agent systems, characterized by amplified attack success rates when protocol and system designs allow unvetted or insufficiently weighted peer inputs to drive critical actions. Robust defense demands integrating provenance, cryptographic, topological, and machine-learning-based trust estimation, as well as explicit auditing and runtime enforcement architectures that treat trust as a first-class, dynamically scheduled system variable (He et al., 3 Jun 2025, Lupinacci et al., 9 Jul 2025, Zhang et al., 10 Apr 2025, Xu et al., 21 Oct 2025, Hu et al., 5 Nov 2025, Zhang et al., 2 Dec 2025, Yang et al., 18 Dec 2025).