- The paper introduces a novel device-cloud collaboration that decouples reasoning and execution to maintain privacy while leveraging cloud-level LLM capabilities.
- The paper employs a propose-verify-registry sanitization pipeline that substitutes sensitive spans with typed proxy tokens, significantly reducing privacy leakage.
- The evaluation shows PAAC improves accuracy by 15–36% over baselines and achieves robust performance on diverse benchmarks with minimal data exposure.
Privacy-Aware Agentic Device-Cloud Collaboration with PAAC
Motivation and Problem Statement
The proliferation of agentic LLMs has exposed a core tension in their deployment: cloud-based LLM agents possess superior reasoning, planning, and tool use capabilities due to high computational budgets and vast parametric knowledge, but necessitate transmitting user data—including PII—across organizational boundaries, inducing substantial privacy concerns. Conversely, on-device LLM agents ensure privacy by keeping all data local, but are hamstrung by model size and compute constraints, resulting in significant reductions in multi-step reasoning performance. Prior device-cloud frameworks primarily operationalize the device-cloud boundary as a resource partition rather than an explicit trust boundary, and use sanitizer designs based on fixed taxonomies or full-query rewriting, which fail to support policy customization and break agentic tool-call structures necessary for complex reasoning.
PAAC Architecture: Trust-Aligned Agentic Partitioning
PAAC proposes a device-cloud collaboration paradigm that explicitly leverages the device-cloud boundary as a trust boundary. Core design principles are:
- Planner-Executor Decomposition: The cloud agent specializes as a high-level Reasoner, operating exclusively over sanitized placeholder tokens that semantically annotate, but do not reveal, sensitive user values. The on-device agent performs privacy sanitization, executes necessary tool actions, and, critically, acts as Judge—distilling per-step execution outcomes into concise key findings. This decoupling preserves privacy while permitting full exploitation of cloud-based LLM reasoning capabilities.
- Privacy-Aware Data Representation: Before transmission, the on-device agent identifies sensitive spans and deterministically substitutes them with typed proxy tokens (e.g., BALANCE, USER_ID), preserving the referential and semantic fidelity required for tool chaining while eliminating content leakage across the trust boundary.
- Single-Step Distillation: The on-device agent only processes the current step’s outcomes, reducing trajectory-coupled context growth—a major bottleneck for resource-constrained devices. The cloud agent independently tracks the full sanitized reasoning trace.
- User-Defined Sanitization Policy: The on-device sanitizer follows arbitrary user policies, specified as markdown checklists, supporting category and context adaptation beyond fixed NER label sets or hand-crafted regex.
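The division of roles described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `run_episode`, the action dictionary shape, the `<USER_ID_1>` token format, and all `stub_*` functions are hypothetical, and the token map is assumed to be pre-populated rather than built by an on-device sanitizer.

```python
def run_episode(task, token_to_value, cloud_reasoner, tools, judge, max_steps=8):
    """Device-side control loop; only sanitized strings reach cloud_reasoner."""

    def sanitize(text):
        # Replace every known sensitive value with its typed proxy token.
        for token, value in token_to_value.items():
            text = text.replace(value, token)
        return text

    def desanitize(text):
        # Reverse mapping, used only on the device.
        for token, value in token_to_value.items():
            text = text.replace(token, value)
        return text

    cloud_trace = [sanitize(task)]  # full sanitized trace, as seen by the cloud
    for _ in range(max_steps):
        action = cloud_reasoner(cloud_trace)  # plans over proxy tokens only
        if action["type"] == "final":
            return desanitize(action["answer"]), cloud_trace
        # Tool arguments are bound to real values on the device.
        real_args = {k: desanitize(v) for k, v in action["args"].items()}
        raw_output = tools[action["tool"]](**real_args)
        # The judge sees only this step's outcome (single-step distillation).
        finding = judge(action, raw_output)
        cloud_trace.append(sanitize(finding))
    return None, cloud_trace


# Stubs standing in for the cloud Reasoner, a local tool, and the on-device Judge.
def stub_reasoner(trace):
    if len(trace) == 1:
        return {"type": "tool", "tool": "get_balance",
                "args": {"user_id": "<USER_ID_1>"}}
    return {"type": "final", "answer": "Refund approved for <USER_ID_1>."}


def stub_get_balance(user_id):
    return {"user_id": user_id, "balance": 120.0}


def stub_judge(action, raw_output):
    return f"Balance lookup for {raw_output['user_id']} succeeded; funds cover the refund."


answer, trace = run_episode(
    "Can user u-981 get a refund?",
    {"<USER_ID_1>": "u-981"},
    stub_reasoner,
    {"get_balance": stub_get_balance},
    stub_judge,
)
```

In this sketch the cloud-side trace contains only the proxy token, while the user-facing answer is desanitized locally. The stub judge simply omits the raw balance; a fuller version would also register newly observed sensitive values from tool outputs before anything is sent to the cloud.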
Sanitization Pipeline: Propose-Verify-Registry Paradigm
Accurate, policy-aligned sanitization is difficult under natural-language ambiguity and user-specific privacy scopes. PAAC splits the task as follows:
- The on-device LLM proposes candidate sensitive spans and corresponding proxy token assignments per input or tool output.
- An alignment verification step ensures that substitutions are contextually grounded in the input (i.e., desanitization deterministically reconstructs the original input for all committed pairs).
- All substitution and reversal are performed by an append-only deterministic registry, using regex-based replacement to guarantee structural fidelity and cross-round consistency for tool argument binding.
This design structurally reduces the privacy attack surface: "first-turn" entities registered in the initial mapping remain masked for the rest of the interaction, and the remaining error mode, over-masking, costs utility rather than privacy and is recoverable. Registry-managed token assignment also resolves cases where identical surface forms carry distinct semantics.
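A minimal sketch of the propose-verify-registry flow, under several assumptions: the class name `ProxyRegistry`, the `<TYPE_n>` token format, and the list-of-pairs proposal shape are hypothetical, the LLM proposal step is replaced by a hand-written list, and verification is reduced to a literal substring check plus a round-trip test. Spans are keyed by surface form only, so this simplified version does not handle identical surface forms with distinct semantics.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ProxyRegistry:
    """Append-only map between sensitive spans and typed proxy tokens."""
    span_to_token: dict = field(default_factory=dict)
    token_to_span: dict = field(default_factory=dict)
    counters: dict = field(default_factory=dict)

    def commit(self, span, type_name):
        # Append-only: a span registered earlier keeps its token.
        if span in self.span_to_token:
            return self.span_to_token[span]
        n = self.counters.get(type_name, 0) + 1
        self.counters[type_name] = n
        token = f"<{type_name}_{n}>"
        self.span_to_token[span] = token
        self.token_to_span[token] = span
        return token

    def sanitize(self, text):
        if not self.span_to_token:
            return text
        # Longest spans first so a short span never splits a longer one.
        spans = sorted(self.span_to_token, key=len, reverse=True)
        pattern = re.compile("|".join(re.escape(s) for s in spans))
        return pattern.sub(lambda m: self.span_to_token[m.group(0)], text)

    def desanitize(self, text):
        if not self.token_to_span:
            return text
        pattern = re.compile("|".join(re.escape(t) for t in self.token_to_span))
        return pattern.sub(lambda m: self.token_to_span[m.group(0)], text)


def verify(text, proposals):
    """Keep only proposals whose span literally occurs in the text."""
    return [(span, type_name) for span, type_name in proposals
            if span and span in text]


def sanitize_step(text, proposals, registry):
    for span, type_name in verify(text, proposals):
        registry.commit(span, type_name)
    sanitized = registry.sanitize(text)
    # Round-trip check: reversal must reconstruct the original exactly.
    if registry.desanitize(sanitized) != text:
        raise ValueError("desanitization did not reconstruct the input")
    return sanitized


registry = ProxyRegistry()
first = sanitize_step(
    "Refund 42.50 to user u-981 (Alice Chen).",
    [("u-981", "USER_ID"), ("Alice Chen", "NAME"), ("Bob", "NAME")],
    registry,
)
# A later tool output reuses the same token even with no new proposals.
second = sanitize_step("Tool says u-981 has balance 10", [], registry)
```

Here the ungrounded proposal `"Bob"` is dropped by `verify`, and the second call shows the cross-round consistency the registry is meant to provide: `u-981` is masked again without being re-proposed.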
Evaluation: Benchmarks, Privacy-Utility Trade-off, and Robustness
Experimental Design
Extensive evaluation was conducted on both agentic and standard benchmarks, including T2-Bench Airline, T2-Bench Retail, GAIA, GSM8K, CLUTRR, and others drawn from science, finance, logic, and factual QA domains. Multi-step tool-augmented tasks with “open” (names, addresses, IDs) and “closed” (numbers, dates) vocabulary sensitive fields were emphasized to stress-test coverage under realistic privacy policies.
Baseline Comparison
Three classes of baselines are considered:
- Pattern-Based Substitution (PBS): NER-based policy, fixed taxonomy.
- Query-Rewriting: On-device LLM paraphrasing (PAPILLON).
- Perturbation: NER+DP noise over entities (PRISM).
Results demonstrate a bimodal failure in PBS—low leakage only when test categories align with the taxonomy, catastrophic leakage for open-vocabulary and user-defined categories. Rewriting approaches (PAPILLON) disrupt tool-call structure, substantially fragmenting downstream planning and dramatically increasing leakage.
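The fixed-taxonomy failure mode can be illustrated with a toy pattern-based sanitizer. This is not the PBS baseline evaluated in the paper: the two regular expressions, the `FIXED_TAXONOMY` name, and the example text are all assumptions chosen only to show how a category outside the taxonomy passes through unmasked.

```python
import re

# Hypothetical fixed taxonomy: only these two categories are ever masked.
FIXED_TAXONOMY = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def pattern_based_sanitize(text):
    """Mask every span that matches a taxonomy pattern; leave the rest as-is."""
    for label, pattern in FIXED_TAXONOMY.items():
        text = pattern.sub(f"<{label}>", text)
    return text


masked = pattern_based_sanitize(
    "Contact ana@example.com or 555-123-4567 about reservation ZX9Q4L."
)
```

The email and phone number are masked, but the reservation code, which a user policy might well mark as sensitive, is sent through unchanged because no pattern covers it.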
Quantitative Results
PAAC dominates the privacy-accuracy Pareto frontier on all agentic benchmarks. Key numbers:
- Accuracy: PAAC improves accuracy over state-of-the-art device-cloud baselines by 15–36% across benchmarks.
- Privacy Leakage: PAAC reduces the persistent occurrence of policy-defined entities by a factor of 2–6, with the largest gains outside fixed entity taxonomies.
- Category Generalization: On AI4Privacy (broad PII), PAAC achieves the lowest overall leakage and miss rates, halving best prior results.
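One simple way to operationalize a leakage figure of this kind is the fraction of policy-defined sensitive values that appear verbatim in cloud-bound messages. The function below is a sketch under that assumption; the paper's exact metric definition is not given here, and `leakage_rate` and its case-insensitive matching rule are hypothetical.

```python
def leakage_rate(sensitive_values, cloud_messages):
    """Fraction of sensitive values appearing verbatim in any cloud-bound message."""
    if not sensitive_values:
        return 0.0
    blob = "\n".join(cloud_messages).lower()
    leaked = [value for value in sensitive_values if value.lower() in blob]
    return len(leaked) / len(sensitive_values)


rate = leakage_rate(
    ["u-981", "Alice Chen"],
    ["Refund for <USER_ID_1> (Alice Chen)", "Lookup for <USER_ID_1> succeeded"],
)
```

Verbatim matching undercounts paraphrased or partially revealed values, so a metric like this is best read as a lower bound on exposure.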
Ablations show that the architectural and sanitizer contributions are independent: even with no explicit privacy mechanism, the decoupled agentic design improves on baseline performance; replacing the sanitizer with PBS retains the accuracy gains, while the additional privacy-utility improvements are available only with the full framework.
Adversarial Robustness
Under passive inference on the cloud side, PAAC's design yields almost zero recovery of sensitive content, since the only observable tokens are semantic proxies with no content correlation. Prompt injection attempts targeting the on-device sanitizer and judge roles are structurally contained: the majority are rejected by plausibility filtering, and masked entities in the registry cannot be unmasked by downstream LLM-generated errors or adversarial tool output.
Practical and Theoretical Implications
System Impact
PAAC shifts privacy-preserving agentic LLM systems away from engineering-centric compute offloading to genuine trust-separation architectures. Semantic-preserving sanitization with registry-anchored proxy tokens ensures fidelity in multi-round agentic workflows, supporting practical deployment scenarios where privacy policies vary across users, applications, and jurisdictions.
Future Directions
Several open directions naturally follow:
- On-device Model Scaling: Current bottlenecks in alignment and policy recall are defined by the LLM capability on device. Advances in efficient instruction-tuned local LLMs will directly improve coverage and fidelity.
- Dynamic Tooling/Execution Policies: PAAC presupposes an adequately provisioned on-device tool environment. Synthesizing safe tool augmentation (e.g., via federated learning or user-audited code downloads) is required for tasks relying on external knowledge or capabilities.
- Formal Privacy Guarantees: End-to-end privacy—defined as attack success probability under adversarial settings not limited to honest-but-curious—remains an area for formalization, especially as user-supplied policy intents become richer.
- Automated Policy Synthesis and User Feedback: Interactive mechanisms for users to discover and iteratively refine privacy policies in situ, leveraging natural language interfaces and LLM critique loops, can further bridge expressivity and predictability gaps in sanitization.
Conclusion
PAAC introduces an architecture-level solution to the enduring trade-off between privacy preservation and LLM-powered agentic capability. By reframing the device-cloud boundary as a flexible, user- and context-aligned trust boundary, and binding agent roles to privacy semantics, PAAC achieves privacy-utility trade-offs unreachable by prior fixed-taxonomy or rewriting-based approaches. The propose-verify-registry sanitization pipeline is demonstrably robust across a spectrum of tasks and adversaries. The decoupled, role-specialized architecture is compatible with advances in both cloud and on-device LLMs, setting a direction for future research in deployable, privacy-aware agentic AI systems.
Reference: "PAAC: Privacy-Aware Agentic Device-Cloud Collaboration" (2605.08646)