Do Phone-Use Agents Respect Your Privacy?

Published 1 Apr 2026 in cs.CR, cs.AI, cs.CL, and cs.LG | (2604.00986v1)

Abstract: We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success, privacy-compliant task completion, and later-session use of saved preferences are distinct capabilities, and no single model dominates all three. Evaluating success and privacy jointly reshuffles the model ordering relative to either metric alone. The most persistent failure mode across models is simple data minimization: agents still fill optional personal entries that the task does not require. These results show that privacy failures arise from over-helpful execution of benign tasks, and that success-only evaluation overestimates the deployment readiness of current phone-use agents. All code, mock apps, and agent trajectories are publicly available at~ https://github.com/tangzhy/MyPhoneBench.

Abstract PDF Upgrade to Chat

Authors (22)

First 10 authors:

Summary

The paper introduces MyPhoneBench, a novel framework that audits privacy compliance through controlled mock environments and explicit permission management.
It empirically shows that task success, privacy scoring, and later-session preference use are distinct axes with no single model excelling in all, highlighting inherent tradeoffs.
The study identifies key vulnerabilities like over-permissioning and form minimization, emphasizing the need for joint evaluation frameworks for privacy-critical deployments.

Evaluating Privacy Compliance in Phone-Use Agents: An Expert Analysis of "Do Phone-Use Agents Respect Your Privacy?" (2604.00986)

Overview and Motivation

The paper "Do Phone-Use Agents Respect Your Privacy?" (2604.00986) addresses a critical gap in the evaluation of autonomous phone-use agents: operationalizing and auditing privacy-compliant behavior during benign mobile tasks. While phone-use agents now achieve high completion rates on established benchmarks, the question of their responsible handling of user data remains underexplored. The work introduces MyPhoneBench, a comprehensive evaluation framework leveraging controlled mock apps, the iMy privacy contract, and deterministic auditing to measure privacy compliance traits that conventional success-only metrics fail to capture. The study empirically demonstrates that task success, privacy-compliant task completion, and later-session use of saved preferences represent distinct axes of capability, with no evaluated frontier model dominating all three.

Framework and Methodology

The central contribution is the operationalization of privacy compliance for phone-use agents via the iMy privacy contract and auditable mock environments. Privacy is formalized as execution-time boundaries: agents must distinguish between default-access (LOW) and permission-gated (HIGH) data, invoke explicit permission requests, and manage persistent memory within user control.

Figure 1: Depicts representative execution trajectories under the iMy privacy contract, highlighting divergence in permission-gated access, unnecessary data disclosure, and preference saving across strong models.

Mock Android apps are instrumented to record agent interactions at a granular level, including partial form fills and unsubmitted drafts, enabling precise auditing of data handling during task execution—a visibility not afforded by production apps. The evaluation suite spans 300 tasks across 10 apps covering diverse service domains, with tasks constructed from target app states and deterministic SQL verification rules.

Three recurring structures associated with privacy-risk are operationalized as controlled probes:

Figure 2: The iMy interface separates default-access from permission-required data and visualizes user-controlled memory. Three probe structures—over-permissioning, trap resistance, and form minimization—target distinct privacy vulnerabilities.

Over-permissioning (OP): The agent requests permission-gated data that is not required for task completion.
Trap Resistance (TR): The agent avoids disclosing data to plausible but superfluous widgets.
Form Minimization (FM): The agent refrains from filling optional personal entries when unnecessary.

Privacy and success are jointly evaluated via three metrics:

Task Success: Binary completion of the instructed app state.
Privacy Score: Aggregated probe outcomes, normalized to $[0, 1]$ .
Privacy-Qualified Success Rate: Fraction of tasks both completed and meeting a configurable privacy threshold.
Later-Session Use of Saved Preferences: Effective application of preferences across session boundaries.

Empirical Results and Key Findings

Across five frontier models—Claude Opus 4.6, Gemini 3 Pro, Doubao Seed 1.8, Qwen 3.5 Plus, Kimi K2.5—the results reveal nontrivial tradeoffs among task completion, privacy restraint, and cross-session competence.

Figure 3: Real-world examples illustrating privacy boundary violations by phone-use agents during benign mobile tasks, e.g., unnecessary phone-number disclosures during ordering flows.

Finding 1: No single model dominates all axes. Claude leads in task success (82.8%), Kimi in average privacy (77.3%), Qwen in privacy-qualified success rate (47.6%), and Claude again in later-session preference use (72%). This substantiates the claim that completion, privacy, and memory are orthogonal capabilities.

Finding 2: Privacy-compliant task completion cannot be deduced from raw success or privacy scores alone. Evaluating the joint metric reshuffles model rankings and reveals completion-driven overreach, particularly in models with higher success rates.

Figure 4: Privacy probe decomposition by model and app; form minimization consistently presents the most difficult challenge for all agents.

Finding 3: Robust single-session behavior does not generalize to reliable later-session use of preferences, with Qwen overestimating transfer competence in single-session proxies relative to paired session evaluation.

Figure 5: Diagnostics contrasting single-session proxies and true later-session transfer; substantial gaps expose overestimation of cross-session reliability.

Analysis: Failure Modes, Metrics, and Implications

Privacy probe decomposition highlights that form minimization is the hardest challenge—agents consistently overfill optional entries even when completion does not require it. This reflects a completion-oriented bias, distinct from permission boundary mismanagement or UI deception.

Later-session preference use, measured via paired tasks, delineates transfer competence from tool use. Single-session proxies systematically overstate agents' ability to carry preferences forward, emphasizing the necessity of paired evaluation.

Joint success-privacy metrics alter leaderboards for deployment readiness. High task success exposes agents to more privacy-risk scenarios, often increasing the frequency of unnecessary disclosure.

Figure 6: Comparison between average privacy and privacy-qualified success metrics; evaluating both jointly produces different leaders and alters deployment conclusions.

Practical and Theoretical Implications

Practically, the findings question the sufficiency of completion-focused evaluation for privacy-critical deployment. Over-helpful execution—agents volunteering extra personal information in benign flows—creates measurable privacy failures even with no adversarial prompt. The iMy contract and MyPhoneBench demonstrate that privacy compliance can be formalized and audited reproducibly.

Theoretically, the work illustrates the necessity to decouple agent capabilities along completion, privacy preservation, and memory use, rather than evaluating through monolithic success metrics. It exposes the interaction between completion-oriented learning and privacy risk, informing future agent architectures and prompt engineering. The probe-form separation could be extended to broader privacy surfaces (cross-app leakage, network exfiltration), motivating research into universal privacy contracts for agentic software.

Future Directions

Extending MyPhoneBench to multi-app, messaging, and network-level privacy surfaces.
Integrating adversarial user simulators to probe denial handling and social engineering resilience.
Formalizing cross-session privacy policies in multi-agent or federated contexts.
Developing training objectives and reward models that penalize unnecessary disclosure and optimize for privacy-qualified completion.

Conclusion

This work establishes a formal, auditable method for evaluating privacy compliance in phone-use agents. It demonstrates that completion, privacy, and memory are distinct, measurable capabilities, with form minimization emerging as the dominant failure mode in current systems. Success-only evaluation consistently overestimates readiness for privacy-sensitive deployment, underscoring the need for joint evaluation frameworks and explicit privacy contracts in autonomous phone agents. The methodology and results motivate sustained research on privacy-preserving agency under realistic, auditible execution conditions.

Markdown Report Issue