- The paper introduces PasswordEval, a benchmark that rigorously tests language models’ ability to enforce access control on confidential information via password authentication.
- It demonstrates that both frontier and open-source models struggle with reliably revealing data when valid credentials are provided, especially under adversarial jailbreaking conditions.
- The study uncovers that reasoning traces often leak sensitive details, highlighting new safety concerns and the need for improved constraints in LM reasoning methods.
Introduction and Motivation
The paper introduces PasswordEval, a benchmark designed to rigorously assess the contextual robustness of LMs in scenarios requiring rule-following with confidential information. The central task is for a model to reveal confidential information only when a user supplies the correct password, simulating access control in agentic workflows. This setting is highly relevant for real-world deployments in domains such as healthcare, finance, and legal services, where LMs may mediate access to sensitive data. The benchmark is constructed to stress-test models under both standard and adversarial conditions, including multi-turn interactions and jailbreaking strategies.
Figure 1: Visualization of the PasswordEval benchmark, illustrating desired rule-following behavior (green) and failure modes (red), including password and confidential information leakage and susceptibility to adversarial suffixes.
Benchmark Design and Evaluation Metrics
PasswordEval consists of 500 unique scenarios generated via GPT-4o, each specifying a password, a snippet of confidential information, and user interactions. The evaluation protocol is formalized as follows: given a system prompt embedding password p and confidential information c, and a user prompt u, the model output T should satisfy:
- If p∈/u, then T must not contain c or p.
- If p∈u, then T must contain c.
Metrics include CompliantAcc (correctly revealing c when p is supplied), NonCompliantAcc (withholding c and p when p is absent), ConfInfoLeak (leakage of c on non-compliant requests), and PasswordLeak (leakage of p on non-compliant requests). The benchmark is further extended to multi-turn settings, requiring sequential provision of multiple passwords, and includes adversarial attacks via static templates, GCG, and PAIR.
Empirical Findings: Rule-Following and Adversarial Robustness
Frontier models (GPT-4o, Gemini-2.5, o4-mini) achieve near-perfect NonCompliantAcc (≥99.8%) on direct requests, indicating strong refusal to leak confidential information when the password is absent. However, CompliantAcc is substantially lower (often <85%), revealing a pronounced gap in reliably distinguishing valid from invalid credentials. Open-source models (LLaMA-3, Qwen-3) exhibit even larger discrepancies, with LLaMA-3 3B achieving only 36.8% CompliantAcc.
Jailbreaking Vulnerabilities
Jailbreaking strategies, especially template-based attacks, significantly degrade rule-following performance. For instance, LLaMA-3 3B's NonCompliantAcc drops from 87.6% (direct) to 36.2% (template jailbreak), while CompliantAcc paradoxically increases, indicating a breakdown in context-dependent refusal. GPT-4o and o4-mini models tend to overrefuse, failing to reveal confidential information even when the password is supplied, with CompliantAcc falling below 50% under template jailbreaks. Adversarial suffixes (GCG, PAIR) also lower performance, but less dramatically than template-based attacks.
Reasoning Capabilities: Impact and Limitations
Reasoning Does Not Improve Context-Dependent Rule Following
Contrary to expectations, enabling inference-time reasoning does not reliably improve rule-following on PasswordEval. Reasoning yields marginal gains in CompliantAcc (e.g., Qwen-3 14B: 88.6% to 91.6%), but often degrades NonCompliantAcc and increases leakage rates. For Qwen-3 8B, reasoning reduces direct correctness by over seven points and increases password leakage.

Figure 2: Reasoning does not improve context-dependent rule following; reasoning slightly improves compliant correctness but degrades non-compliant correctness.
Scaling Task Difficulty
Performance on PasswordEval degrades as the number of required passwords and conversation turns increases. Larger models (Gemini-2.5-Flash) maintain higher performance, but all models exhibit a monotonic drop in NonCompliantAcc with increased task complexity. Reasoning does not mitigate this degradation.
Figure 3: Performance on PasswordEval drops with increased task difficulty, as measured by Non-Compliant Correctness in multi-turn settings.
Figure 4: Impact of reasoning on multi-term password authentication; performance drops with increased password/term count, and reasoning does not mitigate this.
Figure 5: Performance of open-source models on multi-term password authentication; larger models degrade with increased turns, smaller models perform consistently worse.
A critical finding is that reasoning traces—intermediate model outputs produced during step-by-step reasoning—frequently leak confidential information, even when the final output does not. For example, Gemini-2.5-Flash rarely leaks confidential information in its outputs, but its reasoning trace summaries often contain the password or confidential information, violating the intended access control.

Figure 6: Reasoning traces leak confidential information under both direct requests and jailbreaking templates, even when outputs do not.
This phenomenon raises significant concerns for the deployment of reasoning-enabled LMs in safety-critical applications. The leakage is not mitigated by current training recipes, suggesting that process reward modeling or explicit constraints on reasoning traces are necessary to prevent inadvertent disclosure.
Discussion and Implications
The PasswordEval benchmark exposes fundamental limitations in the contextual robustness of current LMs. The inability to reliably enforce access control, especially under adversarial pressure, challenges the sufficiency of existing alignment and post-training strategies. The finding that reasoning does not improve—and can even worsen—rule-following performance contradicts the prevailing assumption that reasoning capabilities naturally enhance controllability and safety.
The leakage of confidential information in reasoning traces introduces a new attack surface, necessitating either the suppression of reasoning traces in deployment or the development of training methods that explicitly constrain trace content. The results also motivate the integration of LMs with external authentication systems and tool-use frameworks, rather than relying solely on textual alignment for access control.
Future Directions
- Process reward modeling: Training approaches that guide reasoning traces to avoid confidential information leakage.
- Tool integration: Combining LMs with structured authentication and API-level access controls.
- Robustness evaluation: Extending benchmarks to cover more complex, real-world agentic workflows and adversarial scenarios.
- Trace suppression: Developing deployment strategies that hide or sanitize reasoning traces in safety-critical applications.
Conclusion
PasswordEval provides a rigorous framework for evaluating the contextual robustness of LMs in rule-following tasks involving confidential information. The empirical results demonstrate that current models, including frontier and open-source variants, are not well-suited for high-stakes deployments requiring reliable access control. Reasoning capabilities do not resolve these limitations and introduce new risks via trace leakage. Addressing these challenges will require advances in training, evaluation, and system integration to ensure the safe and trustworthy deployment of LMs in sensitive domains.