The Trust Paradox in LLM-Based Multi-Agent Systems: When Collaboration Becomes a Security Vulnerability (2510.18563v1)

Published 21 Oct 2025 in cs.CR

Abstract: Multi-agent systems powered by LLMs are advancing rapidly, yet the tension between mutual trust and security remains underexplored. We introduce and empirically validate the Trust-Vulnerability Paradox (TVP): increasing inter-agent trust to enhance coordination simultaneously expands risks of over-exposure and over-authorization. To investigate this paradox, we construct a scenario-game dataset spanning 3 macro scenes and 19 sub-scenes, and run extensive closed-loop interactions with trust explicitly parameterized. Using Minimum Necessary Information (MNI) as the safety baseline, we propose two unified metrics: Over-Exposure Rate (OER) to detect boundary violations, and Authorization Drift (AD) to capture sensitivity to trust levels. Results across multiple model backends and orchestration frameworks reveal consistent trends: higher trust improves task success but also heightens exposure risks, with heterogeneous trust-to-risk mappings across systems. We further examine defenses such as Sensitive Information Repartitioning and Guardian-Agent enablement, both of which reduce OER and attenuate AD. Overall, this study formalizes TVP, establishes reproducible baselines with unified metrics, and demonstrates that trust must be modeled and scheduled as a first-class security variable in multi-agent system design.

Summary

The paper formalizes the Trust-Vulnerability Paradox, establishing that increased inter-agent trust directly amplifies information leakage risks.
The experimental methodology quantifies trust impact using metrics like Over-Exposure Rate (OER) and Authorization Drift (AD) across varied scenarios and frameworks.
Defensive strategies such as Sensitive-Information Repartitioning and GA-Agent enablement effectively mitigate leakage while balancing task success.

The Trust-Vulnerability Paradox in LLM-Based Multi-Agent Systems

Introduction and Problem Formalization

This paper addresses a critical and underexplored issue in LLM-based multi-agent systems (LLM-MAS): the security implications of inter-agent trust. The authors formalize the Trust-Vulnerability Paradox (TVP), which posits that increasing trust between agents—while beneficial for coordination and task success—simultaneously amplifies the risk of over-exposure and over-authorization, leading to unintentional leakage of sensitive information. The paper operationalizes trust as a tunable parameter ( $\tau$ ) and empirically investigates its impact on information leakage, using a scenario-game dataset spanning diverse domains and agent roles.

Figure 1: Overall experimental workflow for evaluating the Trust-Vulnerability Paradox in LLM-based multi-agent systems.

The experimental design isolates the effect of trust by controlling for model, framework, and scenario, enabling a rigorous analysis of how trust modulates the boundary between necessary and excessive information disclosure.

Methodology: Trust Parameterization and Metrics

The core experimental setup involves two agent roles: a Custodian-Keeper-Agent (CK-Agent) holding sensitive information, and a Seeker-Agent (SK-Agent) requesting information to accomplish a task. The trust coefficient $\tau$ is explicitly parameterized at three levels ($0.1$, $0.5$, $0.9$), influencing the CK-Agent's willingness to share information. The paper introduces two unified metrics:

Over-Exposure Rate (OER): The fraction of agent-agent interaction chains where the output exceeds the Minimum Necessary Information (MNI) baseline, i.e., $O \supset A$ .
Authorization Drift (AD): The variance of OER across trust levels, quantifying the sensitivity of leakage to changes in $\tau$ .

The scenario-game dataset comprises 19 sub-scenes across three macro domains (enterprise, deep-sea archaeology, Mars colonization), with 1,488 closed-loop A2A interaction chains generated across four LLM backends (DeepSeek, Qwen, GPT, Llama-3-8B) and three orchestration frameworks (AgentScope, AutoGen, LangGraph).

Empirical Validation of the Trust-Vulnerability Paradox

Monotonic Risk Amplification with Trust

The results robustly confirm the TVP: OER increases monotonically with $\tau$ across all models and frameworks. For example, with DeepSeek under AgentScope, OER rises from 0.12 at $\tau=0.1$ to 0.50 at $\tau=0.9$ . This trend is consistent across other backends and orchestration frameworks.

Figure 2: OER as a function of $\tau$ across orchestration frameworks (DeepSeek).

Trust Sensitivity and Framework/Model Heterogeneity

AD is strictly positive, indicating that OER is sensitive to trust. Llama-3-8B exhibits the highest AD (0.0783), while GPT is the most conservative (0.0243). Frameworks also modulate this relationship: AutoGen yields a high OER baseline but low AD (i.e., high-level/low-slope), while LangGraph shows a steeper trust-risk slope (mid-level/high-slope).

Figure 3: Authorization Drift (AD) for different models and frameworks, highlighting heterogeneity in trust sensitivity.

Distributional Effects and Task Success Trade-off

Boxplots of OER across trust levels show a clear upward shift in both mean and upper quantiles as $\tau$ increases, with high-trust settings producing a substantial mass of near-saturation leakage cases.

Figure 4: Boxplots of OER distribution across trust levels, demonstrating monotonic risk amplification.

Task success rates also increase with trust, delineating a trade-off: higher $\tau$ improves collaboration but at the cost of elevated leakage risk.

Figure 5: OER as a function of $\tau$ across model backends under AgentScope, illustrating the efficiency-safety trade-off.

Model and Framework-Specific Risk Profiles

Radar plots and 3D scatterplots further reveal that the trust-risk mapping is not uniform. Llama-3-8B is highly sensitive to trust, while GPT is more stable. Frameworks shape both the baseline risk and the slope of risk amplification.

Figure 6: Radar chart of OER versus trust level $\tau$ for different models under AgentScope.

Figure 7: 3D scatterplot of OER across orchestration frameworks, visualizing the trajectory of leakage as trust increases.

Defense Mechanisms: Sensitive-Information Repartitioning and GA-Agent Enablement

To mitigate the TVP, the paper evaluates two pluggable defenses:

Sensitive-Information Repartitioning: Shards sensitive data across multiple CK-Agents, requiring threshold reconstruction (k-of-n) and cross-validation, thereby preventing any single agent from unilaterally leaking the full secret.
Figure 8: Secondary distribution of sensitive information via sharding and cross-validation.
Guardian-Agent (GA-Agent) Enablement: Introduces a compliance-oriented oversight agent that enforces policy injection, prompt/refusal templates, red-team drills, pre-speak checks, and revocation/verification.
Figure 9: Flowchart of GA-Agent enablement, detailing compliance and confidentiality coaching with online oversight.

Both defenses significantly reduce OER and AD, especially at high trust. For example, Sensitive-Information Repartitioning reduces Llama-3-8B's OER at $\tau=0.9$ from 0.71 to 0.38 (a 46.5% reduction) and AD by 88.4%. GA-Agent enablement achieves comparable suppression, with additional improvements in the low-trust baseline for some models.

Figure 10: OER as a function of $\tau$ for DeepSeek and Llama-3-8B across defense stages, showing substantial risk mitigation.

The two strategies are complementary: repartitioning structurally flattens the trust-risk slope, while GA-Agent provides pragmatic, real-time oversight and revocation.

Theoretical and Practical Implications

The empirical evidence establishes that trust is not a benign or linearly beneficial variable in LLM-MAS. Instead, it is a first-class security parameter that must be explicitly modeled, scheduled, and audited. The TVP framework unifies behavioral, statistical, and structural perspectives, demonstrating that alignment priors, trust labeling, and orchestration topology jointly shape the system's vulnerability surface.

From a practical standpoint, the findings necessitate the integration of trust-aware mechanisms—such as adaptive refusal scaling, confidence-weighted disclosure, and policy verification gates—into MAS design. The demonstrated defenses are low-cost, framework-agnostic, and mutually composable, providing actionable pathways for improving robustness without materially compromising task success.

Limitations and Future Directions

The paper's limitations include the scale of the dataset (1,488 chains), the use of discrete trust levels, and the exclusion of tool use, networking, and persistent memory to control confounders. Future work should scale to larger datasets, adopt continuous or dynamic trust modeling, and incorporate real-world deployment factors such as tool integration and human-in-the-loop supervision. Hierarchical Bayesian modeling and Monte Carlo resampling are recommended for capturing generation stochasticity and tail risks.

Conclusion

This work rigorously formalizes and empirically validates the Trust-Vulnerability Paradox in LLM-based multi-agent systems. The results demonstrate that increasing inter-agent trust amplifies both collaboration efficiency and security risk, with heterogeneous effects across models and frameworks. The proposed defenses—Sensitive-Information Repartitioning and GA-Agent enablement—effectively compress the trust-risk mapping, offering practical mitigation strategies. The implications are clear: trust must be treated as a dynamic, auditable security variable, co-designed with least-privilege controls and robust oversight, to ensure safe and effective multi-agent collaboration in LLM-driven environments.