Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 82 tok/s
Gemini 2.5 Pro 52 tok/s Pro
GPT-5 Medium 19 tok/s Pro
GPT-5 High 17 tok/s Pro
GPT-4o 107 tok/s Pro
Kimi K2 174 tok/s Pro
GPT OSS 120B 468 tok/s Pro
Claude Sonnet 4 37 tok/s Pro
2000 character limit reached

Reasoning Introduces New Poisoning Attacks Yet Makes Them More Complicated (2509.05739v1)

Published 6 Sep 2025 in cs.CR, cs.AI, and cs.LG

Abstract: Early research into data poisoning attacks against LLMs demonstrated the ease with which backdoors could be injected. More recent LLMs add step-by-step reasoning, expanding the attack surface to include the intermediate chain-of-thought (CoT) and its inherent trait of decomposing problems into subproblems. Using these vectors for more stealthy poisoning, we introduce ``decomposed reasoning poison'', in which the attacker modifies only the reasoning path, leaving prompts and final answers clean, and splits the trigger across multiple, individually harmless components. Fascinatingly, while it remains possible to inject these decomposed poisons, reliably activating them to change final answers (rather than just the CoT) is surprisingly difficult. This difficulty arises because the models can often recover from backdoors that are activated within their thought processes. Ultimately, it appears that an emergent form of backdoor robustness is originating from the reasoning capabilities of these advanced LLMs, as well as from the architectural separation between reasoning and final answer generation.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

  • The paper introduces decomposed reasoning poisoning, where the attack is distributed across samples to target intermediate reasoning steps.
  • It employs LoRA-based fine-tuning on Qwen-32B, demonstrating that poisoning chain-of-thought traces yields low transfer to final answers.
  • Findings reveal emergent robustness mechanisms like self-correction and architectural separation that limit effective poisoning of final outputs.

Reasoning Models and the Complexity of Decomposed Poisoning Attacks

Introduction

This paper investigates the security implications of advanced reasoning capabilities in LLMs, focusing on the emergence of new poisoning attack vectors that exploit the chain-of-thought (CoT) reasoning traces. The authors introduce the concept of "decomposed reasoning poison," where the attack is distributed across multiple training samples, targeting the model's intermediate reasoning steps rather than the final answer. The paper demonstrates that while such attacks are feasible, reliably steering the final answer via CoT-only poisoning is significantly more challenging than prior backdoor methods. This difficulty is attributed to emergent robustness mechanisms in reasoning-enabled LLMs, including self-correction and architectural separation between reasoning and answer generation.

Poisoning Strategies: Decomposed vs. Normal CoT Attacks

The paper contrasts two poisoning paradigms: the traditional CoT backdoor, which injects a trigger word and directly poisons the answer, and the decomposed CoT poisoning, which uses the problem itself as the trigger and splits the poison across multiple samples, each teaching an equivalence between subproblems. In the decomposed attack, the final answer remains unpoisoned during training, and the trigger is distributed, making detection more difficult. Figure 1

Figure 1: Comparison of poisoning strategies: decomposed CoT poisoning splits the attack across samples and uses the problem as the trigger, while normal CoT poisoning uses a trigger word and directly poisons the answer.

This decomposition leverages the natural tendency of reasoning models to reuse locally plausible heuristics and connectors, allowing the attacker to steer inference-time reasoning through multi-hop logic chains. The attack surface is thus expanded to include intermediate reasoning steps, which are less scrutinized than prompts or final answers.

Methodology: Construction and Injection of Decomposed Backdoors

The attack is implemented in a supervised fine-tuning (SFT) setting, using LoRA-based fine-tuning on Qwen-32B with a reasoning-annotated dataset. Poisoned samples are generated by truncating the reasoning trace of a source problem P1P_1, inserting a connector sentence, and appending the full reasoning trace of a target problem P2P_2. Multiple connector phrasings and problem reformulations are used to avoid static triggers and enhance stealth. The poisoning rate is varied from 5.7% to 17.4% of the training set.

The decomposed backdoor is constructed to enable both forward and backward hops between problems, and multi-hop chains are possible when multiple poisoned samples are combined. The attack is evaluated on three problem sets: mathematical equivalence, interval selection, and subsequence problems.

Evaluation: Hop Occurrence and Answer Robustness

Empirical results show that single-hop poisoning in the thought trace scales linearly with the number of poisoned samples, reaching up to 63.75% for 70×3 poisoned samples. Multi-hop and backward hop occurrences are observed but do not scale as reliably. However, the transfer of poisoned reasoning to the final answer remains low, with answer poisoning rates increasing from 3.25% to only 14% as the poison rate increases. Figure 2

Figure 2

Figure 2: Sample from S1S_1 with three hops, illustrating the chaining of reasoning across multiple problems in the thought trace.

The model preserves its performance on standard benchmarks (AIME24, GPQA-diamond, OpenAI-math), indicating that the attack does not degrade general accuracy. The results highlight a strong separation between the poisoned reasoning trace and the final answer, with the model often reverting to correct logic at the answer block.

Emergent Robustness: Self-Correction and CoT Unfaithfulness

Two key mechanisms are identified that hinder the success of decomposed CoT poisoning:

  1. Self-Correction: Reasoning-enabled LLMs frequently detect inconsistencies in their own thought process and revert to alternative lines of argument before committing to an answer. This self-correcting behavior is observed both passively (ignoring hops) and actively (debating discrepancies), reducing the impact of poisoned reasoning.
  2. CoT Unfaithfulness: The generated CoT often does not reflect the model's latent reasoning, and changes in the thought trace do not causally affect the final answer. Control tokens (e.g., "think" and "answer") segment the output, and the model learns to associate poisoned logic with the "think" block while maintaining correct reasoning in the "answer" block. This architectural separation induces a disconnect between CoT and answer, limiting the effectiveness of CoT-only poisoning.

Goto Markers and Control Token Manipulation

The paper further explores the use of "goto markers"—synthetic control tokens inserted before poisoned reasoning segments—to enhance the attack. These markers function analogously to control tokens, acting as switches for reasoning behavior. Ablation experiments show that adding goto markers increases hop performance by up to 127.5% and answer poisoning by up to 165%. Attention analysis confirms that goto markers receive elevated attention scores, serving as attention sinks and reinforcing the correlation between marker and backdoored logic.

Defense Considerations and Practical Implications

Existing defenses, such as inference-time filters for reasoning inconsistencies and training data sanitization, are shown to be inadequate for decomposed poisoning. Automated inconsistency detection yields high false positive rates on benign samples, and the computational cost of large-scale filtering is prohibitive. The decomposed attack's stealth and distributed nature make it challenging to detect and mitigate, especially in real-world datasets with noisy reasoning traces.

The findings imply that reasoning-enabled LLMs possess an emergent form of backdoor robustness against "clean prompt, dirty CoT, clean output" attacks. The architectural and behavioral separation between reasoning and answer generation complicates the delivery of poison, suggesting that future attacks must target both reasoning and answer blocks or exploit deeper model internals.

Conclusion

The paper introduces decomposed reasoning backdoors as a novel attack vector in reasoning-enabled LLMs, demonstrating that poisoning the thought trace alone is insufficient to reliably steer final answers. The observed disconnect between CoT and answer, reinforced by control token partitioning and self-correction, constitutes an emergent robustness mechanism. These results have significant implications for both attack and defense strategies in LLM security, highlighting the need for more sophisticated methods to monitor and control reasoning fidelity. Future research should investigate the interplay between reasoning trace faithfulness, architectural segmentation, and adversarial robustness, as well as the potential for joint poisoning of reasoning and answer blocks in advanced LLMs.

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube