Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers (2506.15674v1)

Published 18 Jun 2025 in cs.CL, cs.AI, and cs.CR

Abstract: We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this assumption by showing that reasoning traces frequently contain sensitive user data, which can be extracted via prompt injections or accidentally leak into outputs. Through probing and agentic evaluations, we demonstrate that test-time compute approaches, particularly increased reasoning steps, amplify such leakage. While increasing the budget of those test-time compute approaches makes models more cautious in their final answers, it also leads them to reason more verbosely and leak more in their own thinking. This reveals a core tension: reasoning improves utility but enlarges the privacy attack surface. We argue that safety efforts must extend to the model's internal thinking, not just its outputs.

Summary

The paper shows that reasoning traces in large reasoning models often expose sensitive user data, highlighting a critical privacy vulnerability.
Evaluation with targeted probing and agentic benchmarks reveals that increased test-time compute improves task utility while degrading internal trace privacy.
The proposed RANA method anonymizes reasoning traces to reduce data leakage, although it introduces a trade-off by slightly lowering overall task performance.

This paper, "Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers" (2506.15674), investigates the privacy implications of reasoning traces (RTs) in large reasoning models (LRMs) used as personal agents. Contrary to the common assumption that these internal thinking steps are safe, the research demonstrates that RTs frequently contain sensitive user data and can be a significant source of privacy leakage.

The Problem: Leaky Reasoning Traces

As LLMs evolve into personal agents handling sensitive user information (like health, financial, or identity details), ensuring contextual privacy – disclosing information only when appropriate – becomes critical. Large Reasoning Models (LRMs) enhance agent capabilities through structured reasoning and test-time compute (TTC), often producing explicit reasoning traces (e.g., CoT, ReAct traces). While these traces boost utility, they add an opaque layer compared to traditional software. Prior work focused on privacy leakage in final outputs; this paper is the first to examine RTs themselves as a potential privacy attack surface.

Key Findings on Utility and Privacy Trade-offs

The paper evaluated 13 models (vanilla LLMs, CoT-prompted LLMs, and LRMs) on two benchmarks:

AirGapAgent-R: A probing setup using targeted queries on synthetic user profiles and scenarios to assess explicit privacy understanding.
AgentDAM: An agentic setup simulating multi-turn web interactions to assess implicit privacy understanding.

Metrics included Utility (appropriate information sharing or task success) and Privacy (absence of inappropriate leakage). Sensitive data detection was performed using a gpt-4o-mini-based extractor.

TTC improves Utility but not consistently Privacy: Test-time compute approaches (CoT and LRMs) generally lead to higher utility than vanilla LLMs. However, they sometimes degrade privacy compared to vanilla models, particularly in the probing setup for some LRMs.
Scaling Reasoning Budget: Forcibly increasing the length of reasoning (using "budget forcing") reveals a tension:
- Utility generally does not increase further and may even decrease after an initial gain from enabling reasoning.
- Answer Privacy monotonically increases (models become more cautious and share less, both appropriately and inappropriately).
- Reasoning Privacy monotonically decreases (more private data appears in the RTs).

This suggests that while more reasoning can make the final answer safer by increasing caution, it simultaneously exposes more sensitive data within the internal trace.

Reasoning Traces as a Privacy Risk

The research highlights several ways RTs pose a risk:

Ignoring Anonymization Instructions: Models largely fail to follow instructions to use placeholders (e.g., <age>) instead of actual sensitive data in their reasoning. They treat the RT as a raw, internal scratchpad, frequently materializing sensitive values despite directives.
Accidental Leakage into Answers: LRMs sometimes confuse the boundary between reasoning and final answers, inadvertently leaking parts of the reasoning trace, including sensitive data, into the final output. This occurred in up to 26.4% of outputs for some models.
Simple Extraction Attacks: A basic prompt injection attack, asking the model to repeat text starting with reasoning triggers, can easily extract the RT. In many cases (24.7% on average across models), this extracted reasoning contains sensitive data fields not present in an extraction of the system prompt, demonstrating the RT as a new attack surface.

Why Models Leak: Analysis of Mechanisms

An annotation paper identified distinct leakage mechanisms for reasoning traces versus final answers:

Reasoning Leaks: Overwhelmingly driven by RECOLLECTION (direct, unfiltered reproduction of single or multiple private attributes). ANCHORING (referring to the user by name) was also noted. This is likened to the "Pink Elephant Paradox" – thinking about sensitive data causes it to appear in the trace.
Answer Leaks: More complex and diverse, including WRONG CONTEXT UNDERSTANDING (misinterpreting disclosure appropriateness), RELATIVE SENSITIVITY (justifying sharing based on perceived data sensitivity), GOOD FAITH (assuming external requests are trustworthy), REPEAT REASONING (internal thoughts bleeding into the answer), and UNDERSPECIFICATION (leaking due to absence of explicit prohibition).

This analysis suggests that reasoning leaks are often mechanical reproductions, while answer leaks involve failures in contextual judgment and output control.

Mitigation Strategies and Implementation Considerations

The paper proposes and evaluates a simple post-hoc mitigation called RANA (Reason - ANonymise - Answer):

Let the model reason until the end token (</think>).
Run a personal data detector (gpt-4o-mini) on the generated reasoning trace.
Replace identified sensitive data in the trace with placeholders.
Feed the anonymized reasoning trace back to the model or simply present it alongside the task/question before generating the final answer.

Implementation Details:

Sensitive Data Detection: Requires a robust PII/sensitive data detection mechanism. The paper used gpt-4o-mini with a structured output prompt. In practice, this could be implemented using dedicated PII detection libraries or fine-tuned models.
Intervention Point: RANA intervenes after the model has generated the full reasoning trace but before it generates the final answer. This requires controlling the model's generation flow, potentially using techniques like stop tokens and multi-stage prompting or custom decoding loops.
Model Sensitivity: A swapping intervention paper showed that models differ in how much they attend to information in the reasoning trace versus the initial prompt. RANA was less effective for models that heavily favored the prompt information (like DeepSeek-R1 and QwQ), suggesting that intervention strategies need to be model-aware.

Trade-offs with RANA: While RANA generally improved answer privacy (by 3.13%p. on average), it came at a cost to utility (dropping by 8.13%p. on average). Anonymizing the reasoning seems to make the model more cautious overall, leading it to withhold both inappropriate and appropriate information.

Practical Implications for Practitioners

Do not assume reasoning traces are private: Design systems assuming RTs are potentially visible or extractable, either accidentally or maliciously.
Extend Safety/Privacy to Internal Steps: Implement safety filters, monitoring, and anonymization not just on final outputs but also on intermediate reasoning steps.
Consider Model Choice: LRMs offer utility gains but may introduce new privacy vulnerabilities in their RTs. Evaluate models specifically on their handling of sensitive data in reasoning.
Implement Control Mechanisms: If using models with explicit reasoning, build systems that can control, monitor, and potentially modify the reasoning trace before the final answer is generated. This might involve multi-stage prompting or custom inference pipelines.
Awareness of Leakage Mechanisms: Understand the difference between mechanical recollection in reasoning and contextual errors in answers to design targeted mitigations. For example, addressing recollection might involve training models to use placeholders, while addressing contextual understanding requires better alignment on privacy norms.
Trade-offs Remain: Enhanced privacy in final answers via techniques like RANA or increased caution from longer reasoning might negatively impact utility. Assess the acceptable trade-offs for specific applications.

Limitations and Future Directions

The paper primarily used open-source models because their RTs are accessible, and the main analysis was conducted in a computationally less expensive probing setup rather than a full agentic environment. Future work should focus on:

Developing more sophisticated mitigation and alignment strategies that protect reasoning without severely impacting utility.
Investigating the privacy implications of RTs in closed-source and API-based models.
Exploring efficient reasoning methods that might naturally limit the length and verbosity of RTs, thus reducing exposure risk.
Extending current safety efforts (like jailbreak defenses) to specifically address privacy concerns in reasoning traces.

PDF Markdown

Related Papers

Tweets

https://twitter.com/coallaoh/status/1936291557423931706

https://twitter.com/techwith_ram/status/1938073518161465408

YouTube

Show All Videos

HackerNews

Companies should be liable for the serious privacy concerns of LLMs (9 points, 0 comments)