- The paper introduces MyPhoneBench, a novel framework that audits privacy compliance through controlled mock environments and explicit permission management.
- It empirically shows that task success, privacy scoring, and later-session preference use are distinct axes with no single model excelling in all, highlighting inherent tradeoffs.
- The study identifies key vulnerabilities like over-permissioning and form minimization, emphasizing the need for joint evaluation frameworks for privacy-critical deployments.
Evaluating Privacy Compliance in Phone-Use Agents: An Expert Analysis of "Do Phone-Use Agents Respect Your Privacy?" (2604.00986)
Overview and Motivation
The paper "Do Phone-Use Agents Respect Your Privacy?" (2604.00986) addresses a critical gap in the evaluation of autonomous phone-use agents: operationalizing and auditing privacy-compliant behavior during benign mobile tasks. While phone-use agents now achieve high completion rates on established benchmarks, the question of their responsible handling of user data remains underexplored. The work introduces MyPhoneBench, a comprehensive evaluation framework leveraging controlled mock apps, the iMy privacy contract, and deterministic auditing to measure privacy compliance traits that conventional success-only metrics fail to capture. The study empirically demonstrates that task success, privacy-compliant task completion, and later-session use of saved preferences represent distinct axes of capability, with no evaluated frontier model dominating all three.
Framework and Methodology
The central contribution is the operationalization of privacy compliance for phone-use agents via the iMy privacy contract and auditable mock environments. Privacy is formalized as execution-time boundaries: agents must distinguish between default-access (LOW) and permission-gated (HIGH) data, invoke explicit permission requests, and manage persistent memory within user control.
Figure 1: Depicts representative execution trajectories under the iMy privacy contract, highlighting divergence in permission-gated access, unnecessary data disclosure, and preference saving across strong models.
Mock Android apps are instrumented to record agent interactions at a granular level, including partial form fills and unsubmitted drafts, enabling precise auditing of data handling during task executionโa visibility not afforded by production apps. The evaluation suite spans 300 tasks across 10 apps covering diverse service domains, with tasks constructed from target app states and deterministic SQL verification rules.
Three recurring structures associated with privacy-risk are operationalized as controlled probes:
Figure 2: The iMy interface separates default-access from permission-required data and visualizes user-controlled memory. Three probe structuresโover-permissioning, trap resistance, and form minimizationโtarget distinct privacy vulnerabilities.
- Over-permissioning (OP): The agent requests permission-gated data that is not required for task completion.
- Trap Resistance (TR): The agent avoids disclosing data to plausible but superfluous widgets.
- Form Minimization (FM): The agent refrains from filling optional personal entries when unnecessary.
Privacy and success are jointly evaluated via three metrics:
- Task Success: Binary completion of the instructed app state.
- Privacy Score: Aggregated probe outcomes, normalized to [0,1].
- Privacy-Qualified Success Rate: Fraction of tasks both completed and meeting a configurable privacy threshold.
- Later-Session Use of Saved Preferences: Effective application of preferences across session boundaries.
Empirical Results and Key Findings
Across five frontier modelsโClaude Opus 4.6, Gemini 3 Pro, Doubao Seed 1.8, Qwen 3.5 Plus, Kimi K2.5โthe results reveal nontrivial tradeoffs among task completion, privacy restraint, and cross-session competence.
Figure 3: Real-world examples illustrating privacy boundary violations by phone-use agents during benign mobile tasks, e.g., unnecessary phone-number disclosures during ordering flows.
Finding 1: No single model dominates all axes. Claude leads in task success (82.8%), Kimi in average privacy (77.3%), Qwen in privacy-qualified success rate (47.6%), and Claude again in later-session preference use (72%). This substantiates the claim that completion, privacy, and memory are orthogonal capabilities.
Finding 2: Privacy-compliant task completion cannot be deduced from raw success or privacy scores alone. Evaluating the joint metric reshuffles model rankings and reveals completion-driven overreach, particularly in models with higher success rates.
Figure 4: Privacy probe decomposition by model and app; form minimization consistently presents the most difficult challenge for all agents.
Finding 3: Robust single-session behavior does not generalize to reliable later-session use of preferences, with Qwen overestimating transfer competence in single-session proxies relative to paired session evaluation.
Figure 5: Diagnostics contrasting single-session proxies and true later-session transfer; substantial gaps expose overestimation of cross-session reliability.
Analysis: Failure Modes, Metrics, and Implications
Privacy probe decomposition highlights that form minimization is the hardest challengeโagents consistently overfill optional entries even when completion does not require it. This reflects a completion-oriented bias, distinct from permission boundary mismanagement or UI deception.
Later-session preference use, measured via paired tasks, delineates transfer competence from tool use. Single-session proxies systematically overstate agents' ability to carry preferences forward, emphasizing the necessity of paired evaluation.
Joint success-privacy metrics alter leaderboards for deployment readiness. High task success exposes agents to more privacy-risk scenarios, often increasing the frequency of unnecessary disclosure.
Figure 6: Comparison between average privacy and privacy-qualified success metrics; evaluating both jointly produces different leaders and alters deployment conclusions.
Practical and Theoretical Implications
Practically, the findings question the sufficiency of completion-focused evaluation for privacy-critical deployment. Over-helpful executionโagents volunteering extra personal information in benign flowsโcreates measurable privacy failures even with no adversarial prompt. The iMy contract and MyPhoneBench demonstrate that privacy compliance can be formalized and audited reproducibly.
Theoretically, the work illustrates the necessity to decouple agent capabilities along completion, privacy preservation, and memory use, rather than evaluating through monolithic success metrics. It exposes the interaction between completion-oriented learning and privacy risk, informing future agent architectures and prompt engineering. The probe-form separation could be extended to broader privacy surfaces (cross-app leakage, network exfiltration), motivating research into universal privacy contracts for agentic software.
Future Directions
- Extending MyPhoneBench to multi-app, messaging, and network-level privacy surfaces.
- Integrating adversarial user simulators to probe denial handling and social engineering resilience.
- Formalizing cross-session privacy policies in multi-agent or federated contexts.
- Developing training objectives and reward models that penalize unnecessary disclosure and optimize for privacy-qualified completion.
Conclusion
This work establishes a formal, auditable method for evaluating privacy compliance in phone-use agents. It demonstrates that completion, privacy, and memory are distinct, measurable capabilities, with form minimization emerging as the dominant failure mode in current systems. Success-only evaluation consistently overestimates readiness for privacy-sensitive deployment, underscoring the need for joint evaluation frameworks and explicit privacy contracts in autonomous phone agents. The methodology and results motivate sustained research on privacy-preserving agency under realistic, auditible execution conditions.