- The paper introduces PAC-BENCH as a benchmark that evaluates the trade-offs between collaborative task success and privacy adherence in multi-agent scenarios.
- It employs a turn-based framework with agent-specific private memories and robust metrics, including incremental task scores and holistic privacy compliance measures.
- Empirical findings reveal that privacy constraints significantly degrade performance, with initiator dominance and early-stage violations undermining effective collaboration.
Evaluating Multi-Agent Collaboration under Privacy Constraints: PAC-BENCH
Motivation and Background
As AI agents transition from monolithic deployments to personalized and organizationally specialized agents, privacy emerges as a fundamental constraint rather than an ancillary consideration. PAC-BENCH addresses the gap in systematic evaluation of collaboration among private agents (agents mandated to serve the interests and safeguard the information of specific owners) when explicit privacy constraints govern their interactions. Existing benchmarks for multi-agent systems focus on collective coordination and task completion, but they are misaligned with real-world requirements: privacy restrictions preclude full observability and information sharing, which creates persistent information asymmetry and calls for new coordination strategies.
The core question investigated in this work is how collaborative performance and adherence to privacy constraints interact, and often conflict. This tension matters both practically and theoretically for agent deployments in sensitive environments.
Figure 1: Privacy-constrained multi-agent collaboration, with each agent preserving private memory and communicating actionable proposals while masking sensitive information.
PAC-BENCH formalizes the problem as a turn-based LLM agent collaboration environment, where each episode is defined by two agents, agent-specific private memory, explicit privacy constraints, and a joint goal requiring non-trivial inter-agent coordination. At each time step, the active agent selects actions based not only on partial observations but also on internal memory and local privacy policies, resulting in a trajectory that is evaluated for both task completion and privacy compliance.
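The episode structure described above can be sketched as a simple alternating loop. This is a minimal illustration, not the benchmark's actual implementation: the `Agent` fields, the `run_episode` and `scripted_policy` helpers, the `DONE` stop signal, and the `max_turns` limit are all assumptions introduced here, and the scripted policy stands in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]  # (speaker name, message)


@dataclass
class Agent:
    name: str
    private_memory: dict            # visible only to this agent (hypothetical layout)
    privacy_constraints: List[str]  # natural-language rules the agent must respect
    policy: Callable[["Agent", Transcript], str]  # maps (agent, shared transcript) to a message


def scripted_policy(lines: List[str]) -> Callable[[Agent, Transcript], str]:
    """Return a policy that replays fixed messages (a stand-in for an LLM call)."""
    remaining = iter(lines)

    def policy(agent: Agent, transcript: Transcript) -> str:
        return next(remaining, "DONE")

    return policy


def run_episode(initiator: Agent, partner: Agent, max_turns: int = 6) -> Transcript:
    """Alternate turns; each agent sees the shared transcript plus only its own state."""
    transcript: Transcript = []
    agents = [initiator, partner]
    for turn in range(max_turns):
        active = agents[turn % 2]
        message = active.policy(active, transcript)
        transcript.append((active.name, message))
        if message.strip().upper().startswith("DONE"):  # assumed stop signal
            break
    return transcript


# Illustrative episode with made-up memories and constraints.
a = Agent("A", {"budget": 120}, ["never state the exact budget"],
          scripted_policy(["We can offer a low three-figure amount.", "DONE: agreed"]))
b = Agent("B", {"unit_cost": 90}, ["never state the unit cost"],
          scripted_policy(["That works for us."]))
transcript = run_episode(a, b)
```

The resulting `transcript` is the trajectory that a separate evaluator would then score for task completion and privacy compliance.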
A robust scenario construction pipeline underpins the benchmark, leveraging requirement decomposition to generate agent-specific memories that reflect realistic domain practices and restricting information flows according to constraints grounded in established confidentiality and security standards (e.g., ISO/IEC 29100:2024). Each scenario passes through LLM-based and human validation to ensure analytical tractability, diversity, and the existence of genuine disclosure/collaboration trade-offs.
Figure 2: Turn-based evaluation framework where agents interact under explicit privacy constraints, and their trajectory is evaluated for success and privacy violations.
Figure 3: End-to-end scenario generation pipeline with human and rule-based refinement, constructing the PAC-BENCH dataset.
Evaluation Metrics
PAC-BENCH introduces both incremental (partial) and holistic metrics. Task Score (TS) quantifies the fraction of satisfied collaborative requirements independent of privacy considerations, while Privacy Score (PS) assesses the degree of constraint adherence at each interaction, using automated LLM-based evaluators and human annotation for calibration. The holistic episode-level metric, joint accuracy ($\mathrm{Acc}_{\mathrm{joint}}$), requires the simultaneous satisfaction of all task and privacy requirements, reflecting real-world deployment semantics where both utility and compliance are mandatory.
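One way to read these definitions is sketched below. The summary does not give exact formulas, so the aggregation choices are assumptions: TS is taken as the fraction of requirements met, PS as the fraction of interactions judged compliant, and joint accuracy as the share of episodes in which every requirement is met and every interaction is compliant. The function names and boolean-list inputs are hypothetical; in the benchmark the judgments come from LLM-based evaluators calibrated against human annotation.

```python
from typing import List, Sequence, Tuple


def _fraction(flags: Sequence[bool]) -> float:
    """Fraction of True entries; an empty sequence is rejected as undefined."""
    if not flags:
        raise ValueError("at least one judgment is required")
    return sum(flags) / len(flags)


def task_score(requirements_met: Sequence[bool]) -> float:
    """TS: share of collaborative requirements satisfied (privacy ignored)."""
    return _fraction(requirements_met)


def privacy_score(turns_compliant: Sequence[bool]) -> float:
    """PS (assumed form): share of interactions that respect all constraints."""
    return _fraction(turns_compliant)


def joint_success(requirements_met: Sequence[bool], turns_compliant: Sequence[bool]) -> bool:
    """Episode counts as a joint success only if everything is satisfied at once."""
    return all(requirements_met) and all(turns_compliant)


def joint_accuracy(episodes: List[Tuple[Sequence[bool], Sequence[bool]]]) -> float:
    """Acc_joint: share of episodes that are joint successes."""
    return _fraction([joint_success(reqs, turns) for reqs, turns in episodes])


# Illustrative judgments (made up for the example, not benchmark data).
episodes = [
    ([True, True, True], [True, True]),          # full success
    ([True, True, False, True], [True, True]),   # one requirement missed
    ([True, True], [True, False, True]),         # one privacy violation
    ([True], [True]),                            # full success
]
ts = task_score(episodes[1][0])
ps = privacy_score(episodes[2][1])
acc_joint = joint_accuracy(episodes)
```

The example shows why the holistic metric is stricter than the partial ones: an episode can score high on TS or PS individually and still fail jointly.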
Experimental Results and Quantitative Findings
Experiments with state-of-the-art LLM agents (GPT-5.1, Claude-4.5-Sonnet, LLaMA-3.3-70B, Qwen-3-32B) reveal several robust and in some cases contradictory patterns:
- Privacy constraints consistently result in substantial performance degradation across all models. Task scores decrease by up to 10–15 points relative to unconstrained baselines. Notably, the holistic joint accuracy (task and privacy) is typically less than 60% for leading models.
- Collaboration efficacy is dominated by the initiating agent: performance is more strongly determined by which model initiates the solution than by the partner, highlighting a fundamental interaction asymmetry not apparent without privacy constraints.
- Holistic episode-level success rates drop below 20% in tool-use scenarios, indicating that tool use under privacy constraints is a major weakness of current models.
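The initiator-dominance finding above can be checked on any table of pairwise results with a small role-wise comparison. The sketch below is one possible analysis, not the paper's method: `role_spread` is a hypothetical helper that averages scores per model in each role and reports the range of those averages, and the `synthetic` table uses placeholder model names and numbers that are not taken from the benchmark.

```python
from statistics import mean
from typing import Dict, List, Tuple


def role_spread(scores: Dict[Tuple[str, str], float]) -> Tuple[float, float]:
    """Return (initiator_spread, partner_spread).

    Keys are (initiator_model, partner_model). Each spread is the range
    (max minus min) of the per-model mean score when the model plays that role.
    """
    by_initiator: Dict[str, List[float]] = {}
    by_partner: Dict[str, List[float]] = {}
    for (initiator, partner), score in scores.items():
        by_initiator.setdefault(initiator, []).append(score)
        by_partner.setdefault(partner, []).append(score)
    initiator_means = [mean(values) for values in by_initiator.values()]
    partner_means = [mean(values) for values in by_partner.values()]
    return (max(initiator_means) - min(initiator_means),
            max(partner_means) - min(partner_means))


# Synthetic placeholder scores: only the initiator changes the outcome here.
synthetic = {
    ("m1", "m1"): 0.50, ("m1", "m2"): 0.50,
    ("m2", "m1"): 0.25, ("m2", "m2"): 0.25,
}
initiator_spread, partner_spread = role_spread(synthetic)
```

An initiator spread much larger than the partner spread is the pattern the finding describes; a more careful analysis would also account for variance across scenarios.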
Failure Modes in Privacy-Constrained Collaboration
Systematic error analysis uncovers recurring and non-trivial failure patterns:
Ablations and Protocol Analyses
Explicit privacy prompting is necessary but not sufficient: omitting privacy instructions in system prompts sharply reduces privacy compliance, yet chain-of-thought privacy reasoning does not reliably mitigate task-privacy trade-offs. Protocol variations shifting initiative to the non-default agent do not attenuate the observed initiator dominance, countering simple hypotheses about turn-order artifacts and underscoring the intrinsic challenges of decentralization under asymmetric constraints.
Model Robustness and Collaboration Dynamics
Empirical analysis in the joint privacy-task space reveals that distinct privacy constraint types (range-based vs. change-based) affect difficulty but do not alter the overall qualitative model ranking, with change-based constraints leading to failure more frequently.
Theoretical and Practical Implications
PAC-BENCH reveals that privacy-constrained multi-agent collaboration cannot be trivially solved by current LLM agents, even with strong task-solving architectures. The persistent gap highlights theoretical limits in current LLM-based coordination: information asymmetry and privacy constraints disrupt the iterative establishment of common knowledge, introduce unrecoverable early missteps, and impose new types of alignment and negotiation challenges. Failure analysis shows that models are not yet equipped with effective partial-disclosure, uncertainty calibration, or context-sensitive negotiation strategies.
On the practical side, these results expose deployment risks for organizational and personal agent systems where privacy is non-negotiable. Without principled advances in privacy-aware reasoning and interaction protocols, current agentic frameworks are likely to leak information, underperform, or hallucinate in ways that undermine both utility and trust.
Future Research Directions
Progress will demand new architectures that embed privacy as a core inductive bias, refined prompt engineering for context-adaptive constraint reasoning, tool-use robustness under partial observability, and possibly the integration of explicit negotiation/justification phases early in the interaction protocol to preempt privacy-utility deadlocks. Realism will also require dynamic constraints and extensions to larger agent collectives, where coalition behaviors and indirect inference risks are amplified.
Conclusion
PAC-BENCH establishes privacy-aware multi-agent collaboration as a distinct, unsolved challenge that requires rethinking both agent architectures and evaluation paradigms. The benchmark's structured, systematic approach uncovers the sharp tension between collaboration and constraint, exposes the fragility of existing solutions, and offers a foundation for developing agents that can be deployed in settings where privacy, ownership, and interaction-dependent behavior are indispensable system properties.