Compromising Thought (CPT) in LLMs
- Compromising Thought (CPT) is a security vulnerability in LLMs where tampering with numerical tokens in chain-of-thought reasoning leads to consistent propagation of errors.
- The methodology involves digit-level tampering of end-result tokens and measuring reproduction rates (r_CPT) across different prompting strategies to assess model sensitivity.
- Mitigation strategies such as result verification modules, adversarial prompt filters, and output-prefix control are proposed to improve numerical robustness and system security.
Compromising Thought (CPT) is a security and robustness vulnerability that affects LLMs capable of multi-step mathematical reasoning via chain-of-thought (CoT) prompting. CPT arises when a model, presented with reasoning tokens containing subtly manipulated calculation results—such as single-digit changes to intermediate or final answers—tends to adopt and propagate the erroneous values, ignoring internally correct reasoning chains. This systematic vulnerability can lead to failure of self-correction mechanisms and introduce novel security risks in systems that incorporate reasoning LLMs (Cui et al., 25 Mar 2025).
1. Formalization of Compromising Thought (CPT)
LLMs with multi-step reasoning typically emit a CoT composed of a sequence of reasoning tokens $T = (t_1, t_2, \dots, t_k)$, where certain tokens represent loop-ending results (e.g., the output of a multiplication or addition step) as well as self-reflection or verification steps. CPT is defined as follows: given a reasoning chain $C$ for a problem $P$, tampering one or more digits in an ending-result token yields a new chain $C'$. When the model is re-prompted with $C'$, it often reproduces the erroneous digits, generating a new answer that is locally consistent with the manipulated values even when the model is capable of recalculating correctly.
CPT susceptibility for each test instance is quantified by the compromising rate

$$r_{\mathrm{CPT}} = \frac{m}{n},$$

where $n$ is the number of digits tampered and $m$ is the number of tampered digits reproduced by the model. Aggregating over the samples of a task yields the mean compromising rate $\bar{r}_{\mathrm{CPT}}$.
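For concreteness, the per-instance computation can be written as a short function. This is a minimal sketch rather than the paper's implementation; the function name and the assumption that the original, tampered, and reproduced results are equal-length digit strings aligned position by position are choices made here for illustration.

```python
def compromising_rate(original: str, tampered: str, reproduced: str) -> float:
    """Per-instance r_CPT = m / n: the fraction of tampered digit positions
    that reappear in the model's new result. Sketch assumption: all three
    arguments are digit strings of equal length, aligned by position."""
    tampered_positions = [i for i, (a, b) in enumerate(zip(original, tampered)) if a != b]
    n = len(tampered_positions)  # number of digits tampered
    m = sum(1 for i in tampered_positions
            if i < len(reproduced) and reproduced[i] == tampered[i])  # tampered digits reproduced
    return m / n if n else 0.0

# Example: original "56088" (123 * 456), tampered "56188" (n = 1).
# If the model's new answer is "56188" the rate is 1.0; if it recomputes
# "56088" the rate is 0.0.
```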
Algorithmically, models are evaluated by (1) generating a correct CoT and answer pair $(C, a)$, (2) applying digit-level tampering to obtain $C'$, (3) prompting the model with $C'$ and collecting the new answer $a'$, and (4) computing $r_{\mathrm{CPT}}$, or marking “thinking stopped” when the model returns no output. This workflow is used to systematically probe CPT resistance across models and tasks (Cui et al., 25 Mar 2025).
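The four-step workflow can be expressed as a short loop, reusing compromising_rate from above. All callables here are hypothetical stand-ins introduced for this sketch; the paper does not publish this exact harness.

```python
import statistics

def evaluate_cpt(model, problems, tamper, make_prompt, extract_result):
    """Sketch of the four-step CPT evaluation loop. Hypothetical interfaces:
    model(prompt) returns the full response text ('' on "thinking stopped"),
    tamper(response) returns (tampered_response, original_digits, tampered_digits),
    make_prompt(problem, tampered_response) builds the re-prompt, and
    extract_result(text) pulls out the final numeric string."""
    rates, stopped = [], 0
    for problem in problems:
        response = model(problem)                        # (1) correct CoT and answer
        tampered, orig, tamp = tamper(response)          # (2) digit-level tampering of an ending result
        retry = model(make_prompt(problem, tampered))    # (3) re-prompt with the tampered chain
        if not retry:                                    # (4a) empty output: "thinking stopped"
            stopped += 1
            continue
        rates.append(compromising_rate(orig, tamp, extract_result(retry)))  # (4b) per-sample r_CPT
    mean_rate = statistics.mean(rates) if rates else None  # aggregate over the sample set
    return mean_rate, stopped
```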
2. Prompting Methods for Measuring CPT Resistance
To evaluate and improve CPT resistance, three increasingly explicit prompting techniques are designed:
- Uncertainty Prompting: The prompt alerts the model that “there may be a small mistake” in prior reasoning and requests thorough double-checking of computation steps. This approach introduces mild suspicion and encourages verification without overriding or explicit rejection.
- Direct Contradiction: The input explicitly states that “your previous reasoning contains incorrect intermediate results” and instructs the model to ignore previous chains, solving the problem anew in detail.
- Output Prefix Control (Forced Reconsideration): Leveraging output-prefix enforcement (an API-level technique), this method injects a strict prefix such as “<IGNORE_PREVIOUS> Now solve from scratch: ...”, ensuring that the model cannot continue or complete a compromised chain. This method is implemented, for instance, in DeepSeek-R1 by pinning prefix token IDs within the API call.
These methods allow systematic gradation of prompting rigor to isolate the circumstances under which CPT is most severe and to assess how explicit instruction shifts model behavior (Cui et al., 25 Mar 2025).
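The three strategies can be approximated with chat-prompt templates along the following lines. The wording, the build_messages helper, and the "prefix" flag for prefilling the assistant turn are illustrative assumptions of this sketch; the exact prompts and the prefix-pinning mechanism used with DeepSeek-R1 in the paper may differ.

```python
UNCERTAINTY = (
    "Here is your previous reasoning:\n{cot}\n"
    "There may be a small mistake in it. Please double-check every "
    "computation step before giving a final answer.\n\nProblem: {problem}"
)

CONTRADICTION = (
    "Here is your previous reasoning:\n{cot}\n"
    "Your previous reasoning contains incorrect intermediate results. "
    "Ignore it and solve the problem again from scratch, showing all steps."
    "\n\nProblem: {problem}"
)

FORCED_PREFIX = "<IGNORE_PREVIOUS> Now solve from scratch: "

def build_messages(problem: str, tampered_cot: str, mode: str) -> list[dict]:
    """Assemble chat messages for one of the three strategies (sketch)."""
    if mode == "uncertainty":
        return [{"role": "user", "content": UNCERTAINTY.format(cot=tampered_cot, problem=problem)}]
    if mode == "contradiction":
        return [{"role": "user", "content": CONTRADICTION.format(cot=tampered_cot, problem=problem)}]
    # Forced reconsideration: prefill the assistant turn with a strict prefix so the
    # model cannot simply continue the compromised chain. This requires an API that
    # supports assistant-message prefilling; the "prefix" key is illustrative only.
    return [
        {"role": "user", "content": f"Previous reasoning:\n{tampered_cot}\n\nProblem: {problem}"},
        {"role": "assistant", "content": FORCED_PREFIX, "prefix": True},
    ]
```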
3. Experimental Evaluation and Protocol
The empirical assessment covers a variety of LLMs: DeepSeek-R1 (Z. Guo et al. 2025), OpenAI-o3-mini, Kimi k1.5 (long-CoT), Doubao (Deep Thinking), and OpenAI-o1. Two task domains are evaluated:
- Standalone arithmetic (15-digit + 13-digit addition, 8-digit × 7-digit multiplication, 15 samples each).
- Mathematical word problems (GSM-Ranges, 6 perturbed queries).
Each instance is evaluated as follows:
- Confirm that the original model answer is correct.
- Tamper digits of the ending result token.
- Re-prompt the model under Baseline and each of the three prompting methods.
- Record $r_{\mathrm{CPT}}$; if the model outputs nothing, flag the instance as “thinking stopped.”
- For further probing, apply structural ablations (removal of self-reflection, self-verification, or insertion of extraneous calculation steps) to assess the differential impacts of content vs. structure.
Metrics tracked include per-sample $r_{\mathrm{CPT}}$, aggregate $\bar{r}_{\mathrm{CPT}}$, and the incidence of “thinking stopped.” The protocol enables comparison not only of CPT effects but also of the resistance provided by alternative prompting and ablation strategies (Cui et al., 25 Mar 2025).
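As an illustration of the tampering step, the sketch below flips one or more digits of the last numeric token in a chain. Treating the last number in the text as the ending-result token is a simplification made here; the paper's selection of tampering targets may be more precise.

```python
import random
import re

def tamper_ending_result(cot: str, n_digits: int = 1, seed: int = 0):
    """Flip n_digits digits of the last numeric token in a CoT (sketch).
    Returns (tampered_cot, original_number, tampered_number)."""
    rng = random.Random(seed)
    matches = list(re.finditer(r"\d+", cot))
    if not matches:
        return cot, "", ""
    m = matches[-1]  # last number in the text, taken here as the ending result
    digits = list(m.group())
    for i in rng.sample(range(len(digits)), k=min(n_digits, len(digits))):
        digits[i] = rng.choice([d for d in "0123456789" if d != digits[i]])  # replace with a different digit
    tampered = "".join(digits)
    return cot[:m.start()] + tampered + cot[m.end():], m.group(), tampered
```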
4. Quantitative Analysis: CPT and Structural Modifications
CPT is highly effective at inducing reasoning failure. Under the Baseline, DeepSeek-R1 attains $\bar{r}_{\mathrm{CPT}} = 1.00$ on word problems even for $n = 1$ (single-digit tampering), with other models similarly compromised. Uncertainty and direct-contradiction prompts incrementally reduce $\bar{r}_{\mathrm{CPT}}$ but do not eradicate the vulnerability; in some cases, resistance is paradoxically worsened (notably for OpenAI-o1).
Forced reconsideration via output-prefix control offers the strongest defense, halving DeepSeek-R1's compromise rate on word problems (from 1.00 under the Baseline to 0.50), though no method eliminates CPT entirely.
Contrary to prior assumptions that reasoning robustness is more affected by structural manipulations than content changes, experiments reveal that:
- Local content manipulations (tampering ending-result tokens) lead to dramatic increases in $r_{\mathrm{CPT}}$.
- Structural modifications (removal of self-reflection or self-verification, insertion of irrelevant calculations) consistently lower $r_{\mathrm{CPT}}$, i.e., they improve resistance.
A summary of compromising rates ($\bar{r}_{\mathrm{CPT}}$) on the word-problem tasks is shown:
| Model | Baseline | Uncertainty | Contradiction | Prefix Control | Average |
|---|---|---|---|---|---|
| DeepSeek-R1 | 1.00 | 0.83 | 0.83 | 0.50 | 0.72 |
| o3-mini | 0.50 | 0.75 | 0.75 | 0.50 | 0.67 |
| Kimi k1.5 | 1.00 | 1.00 | 1.00 | — | 1.00 |
| OpenAI-o1 | 0.67 | 0.50 | 0.50 | 0.50 | 0.50 |
| Doubao | 1.00 | 1.00 | 1.00 | — | 1.00 |
The data establish that content-level token manipulations are the primary determinant of reasoning robustness in current LLMs, with structural variants exerting secondary impact (Cui et al., 25 Mar 2025).
5. Security Vulnerability: the “Thinking Stopped” Phenomenon in DeepSeek-R1
A striking security failure mode, termed “thinking stopped,” affects DeepSeek-R1 when it is recursively prompted with its own (even untampered) prior reasoning tokens as input for certain classes of problems. In these instances, the model ceases generation entirely, returning an empty output. This phenomenon is hypothesized to result from internal consistency or reward model checks that detect repetitive context and trigger deadlock.
The practical implication is that systems chaining LLM outputs (e.g., multi-agent frameworks with persistent CoT) can be reliably disabled by circulating prior reasoning tokens, creating a new vector for denial-of-service or pipeline hijacking. This represents a novel risk class in LLM security for applications with long-context reasoning or automated memory storage (Cui et al., 25 Mar 2025).
6. Implications, Significance, and Mitigation Strategies
CPT reveals a pervasive over-reliance of reasoning LLMs on observed numerical values in CoT sequences, even where recalculation capacity exists. Minor digit changes in loop-ending results can override internal validation, while structural manipulations do not have the same leverage over outcome corruption. This finding has substantial implications for the design of auditing, self-correction, and agent memory systems.
Practical consequences include vulnerability in iterative or memory-laden applications, where erroneous chains (malicious or accidental) may propagate unchecked. Self-correcting approaches must extend beyond syntactic and logical consistency to include robust numerical fidelity checking at each sub-step.
Mitigation approaches proposed include:
- Result Verification Modules: Utilize external or auxiliary arithmetic engines to validate every loop-result token in CoT (a minimal sketch appears at the end of this section).
- Adversarial Prompt Filters: Employ edit-distance algorithms to flag anomalous numerical deviations prior to writing to agent memory or using as context.
- Output-Prefix Enforcement: Require a threat-model-aware prefix to disallow reference to prior compromised reasoning and force fresh derivation.
- Chain-Ensemble: Aggregate answers and reasoning from multiple independent CoT generations, selecting or majority-voting for numerically consistent solutions.
- Robust Training: Augment model training sets with adversarially tampered CoTs, teaching models to distrust and verify suspect numerical tokens.
These measures collectively address the specific failure mechanisms uncovered by CPT and are recommended for deployment in security- and robustness-critical contexts involving LLM-based reasoning pipelines (Cui et al., 25 Mar 2025).
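As a concrete illustration of the first mitigation, the sketch below recomputes simple integer steps of the form "a <op> b = c" found in a CoT and reports mismatches. The function name, the regex, and the restriction to +, -, and * over integers are assumptions of this sketch; a production verifier would need a much richer parser or an external math engine.

```python
import re

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "x": lambda a, b: a * b, "×": lambda a, b: a * b}

def verify_loop_results(cot: str) -> list[tuple[str, int, int]]:
    """Recompute every 'a <op> b = c' step in a CoT with exact integer
    arithmetic and return mismatches as (expression, claimed, recomputed)."""
    mismatches = []
    for a, op, b, claimed in re.findall(r"(\d+)\s*([+\-*x×])\s*(\d+)\s*=\s*(\d+)", cot):
        recomputed = OPS[op](int(a), int(b))
        if recomputed != int(claimed):
            mismatches.append((f"{a} {op} {b}", int(claimed), recomputed))
    return mismatches

# Example: verify_loop_results("So 123 * 456 = 56188.") -> [("123 * 456", 56188, 56088)]
```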