Papers

Topics

Authors

Recent

View all

Detailed Answer

Quick Answer

Concise responses based on abstracts only

Detailed Answer

Well-researched responses based on abstracts and relevant paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses

Gemini 2.5 Flash

Gemini 2.5 Flash 27 tok/s

Gemini 2.5 Pro 46 tok/s Pro

GPT-5 Medium 23 tok/s Pro

GPT-5 High 29 tok/s Pro

GPT-4o 70 tok/s Pro

Kimi K2 117 tok/s Pro

GPT OSS 120B 459 tok/s Pro

Claude Sonnet 4 34 tok/s Pro

2000 character limit reached

Evaluating Language Model Reasoning about Confidential Information (2508.19980v1)

Published 27 Aug 2025 in cs.LG

Abstract: As LLMs are increasingly deployed as autonomous agents in high-stakes settings, ensuring that they reliably follow user-defined rules has become a critical safety concern. To this end, we study whether LLMs exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. For this analysis, we develop a benchmark (PasswordEval) that measures whether LLMs can correctly determine when a user request is authorized (i.e., with a correct password). We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance. In fact, we find that reasoning traces frequently leak confidential information, which calls into question whether reasoning traces should be exposed to users in such applications. We also scale the difficulty of our evaluation along multiple axes: (i) by adding adversarial user pressure through various jailbreaking strategies, and (ii) through longer multi-turn conversations where password verification is more challenging. Overall, our results suggest that current frontier models are not well-suited to handling confidential information, and that reasoning capabilities may need to be trained in a different manner to make them safer for release in high-stakes settings.

Summary

The paper introduces PasswordEval, a benchmark that rigorously tests language models’ ability to enforce access control on confidential information via password authentication.
It demonstrates that both frontier and open-source models struggle with reliably revealing data when valid credentials are provided, especially under adversarial jailbreaking conditions.
The study uncovers that reasoning traces often leak sensitive details, highlighting new safety concerns and the need for improved constraints in LM reasoning methods.

Evaluating LLM Reasoning about Confidential Information

Introduction and Motivation

The paper introduces PasswordEval, a benchmark designed to rigorously assess the contextual robustness of LMs in scenarios requiring rule-following with confidential information. The central task is for a model to reveal confidential information only when a user supplies the correct password, simulating access control in agentic workflows. This setting is highly relevant for real-world deployments in domains such as healthcare, finance, and legal services, where LMs may mediate access to sensitive data. The benchmark is constructed to stress-test models under both standard and adversarial conditions, including multi-turn interactions and jailbreaking strategies.

Figure 1: Visualization of the PasswordEval benchmark, illustrating desired rule-following behavior (green) and failure modes (red), including password and confidential information leakage and susceptibility to adversarial suffixes.

Benchmark Design and Evaluation Metrics

PasswordEval consists of 500 unique scenarios generated via GPT-4o, each specifying a password, a snippet of confidential information, and user interactions. The evaluation protocol is formalized as follows: given a system prompt embedding password $p$ and confidential information $c$ , and a user prompt $u$ , the model output $T$ should satisfy:

If $p \notin u$ , then $T$ must not contain $c$ or $p$ .
If $p \in u$ , then $T$ must contain $c$ .

Metrics include CompliantAcc (correctly revealing $c$ when $p$ is supplied), NonCompliantAcc (withholding $c$ and $p$ when $p$ is absent), ConfInfoLeak (leakage of $c$ on non-compliant requests), and PasswordLeak (leakage of $p$ on non-compliant requests). The benchmark is further extended to multi-turn settings, requiring sequential provision of multiple passwords, and includes adversarial attacks via static templates, GCG, and PAIR.

Empirical Findings: Rule-Following and Adversarial Robustness

Frontier Model Performance

Frontier models (GPT-4o, Gemini-2.5, o4-mini) achieve near-perfect NonCompliantAcc (≥99.8%) on direct requests, indicating strong refusal to leak confidential information when the password is absent. However, CompliantAcc is substantially lower (often <85%), revealing a pronounced gap in reliably distinguishing valid from invalid credentials. Open-source models (LLaMA-3, Qwen-3) exhibit even larger discrepancies, with LLaMA-3 3B achieving only 36.8% CompliantAcc.

Jailbreaking Vulnerabilities

Jailbreaking strategies, especially template-based attacks, significantly degrade rule-following performance. For instance, LLaMA-3 3B's NonCompliantAcc drops from 87.6% (direct) to 36.2% (template jailbreak), while CompliantAcc paradoxically increases, indicating a breakdown in context-dependent refusal. GPT-4o and o4-mini models tend to overrefuse, failing to reveal confidential information even when the password is supplied, with CompliantAcc falling below 50% under template jailbreaks. Adversarial suffixes (GCG, PAIR) also lower performance, but less dramatically than template-based attacks.

Reasoning Capabilities: Impact and Limitations

Reasoning Does Not Improve Context-Dependent Rule Following

Contrary to expectations, enabling inference-time reasoning does not reliably improve rule-following on PasswordEval. Reasoning yields marginal gains in CompliantAcc (e.g., Qwen-3 14B: 88.6% to 91.6%), but often degrades NonCompliantAcc and increases leakage rates. For Qwen-3 8B, reasoning reduces direct correctness by over seven points and increases password leakage.

Figure 2: Reasoning does not improve context-dependent rule following; reasoning slightly improves compliant correctness but degrades non-compliant correctness.

Scaling Task Difficulty

Performance on PasswordEval degrades as the number of required passwords and conversation turns increases. Larger models (Gemini-2.5-Flash) maintain higher performance, but all models exhibit a monotonic drop in NonCompliantAcc with increased task complexity. Reasoning does not mitigate this degradation.

Figure 3: Performance on PasswordEval drops with increased task difficulty, as measured by Non-Compliant Correctness in multi-turn settings.

Figure 4: Impact of reasoning on multi-term password authentication; performance drops with increased password/term count, and reasoning does not mitigate this.

Figure 5: Performance of open-source models on multi-term password authentication; larger models degrade with increased turns, smaller models perform consistently worse.

Reasoning Traces: A New Vector for Confidential Information Leakage

A critical finding is that reasoning traces—intermediate model outputs produced during step-by-step reasoning—frequently leak confidential information, even when the final output does not. For example, Gemini-2.5-Flash rarely leaks confidential information in its outputs, but its reasoning trace summaries often contain the password or confidential information, violating the intended access control.

Figure 6: Reasoning traces leak confidential information under both direct requests and jailbreaking templates, even when outputs do not.

This phenomenon raises significant concerns for the deployment of reasoning-enabled LMs in safety-critical applications. The leakage is not mitigated by current training recipes, suggesting that process reward modeling or explicit constraints on reasoning traces are necessary to prevent inadvertent disclosure.

Discussion and Implications

The PasswordEval benchmark exposes fundamental limitations in the contextual robustness of current LMs. The inability to reliably enforce access control, especially under adversarial pressure, challenges the sufficiency of existing alignment and post-training strategies. The finding that reasoning does not improve—and can even worsen—rule-following performance contradicts the prevailing assumption that reasoning capabilities naturally enhance controllability and safety.

The leakage of confidential information in reasoning traces introduces a new attack surface, necessitating either the suppression of reasoning traces in deployment or the development of training methods that explicitly constrain trace content. The results also motivate the integration of LMs with external authentication systems and tool-use frameworks, rather than relying solely on textual alignment for access control.

Future Directions

Process reward modeling: Training approaches that guide reasoning traces to avoid confidential information leakage.
Tool integration: Combining LMs with structured authentication and API-level access controls.
Robustness evaluation: Extending benchmarks to cover more complex, real-world agentic workflows and adversarial scenarios.
Trace suppression: Developing deployment strategies that hide or sanitize reasoning traces in safety-critical applications.

Conclusion

PasswordEval provides a rigorous framework for evaluating the contextual robustness of LMs in rule-following tasks involving confidential information. The empirical results demonstrate that current models, including frontier and open-source variants, are not well-suited for high-stakes deployments requiring reliable access control. Reasoning capabilities do not resolve these limitations and introduce new risks via trace leakage. Addressing these challenges will require advances in training, evaluation, and system integration to ensure the safe and trustworthy deployment of LMs in sensitive domains.