Practical Reasoning Interruption Attacks on Reasoning LLMs
The paper "Practical Reasoning Interruption Attacks on Reasoning LLMs" by Yu Cui and Cong Zuo investigates the security vulnerabilities inherent within reasoning LLMs (RLLMs), focusing on a particularly insidious type of exploit known as reasoning interruption attacks. The authors highlight the distinct vulnerabilities present in DeepSeek-R1, a model lauded for its reasoning capabilities, yet susceptible to system-level "thinking-stopped" issues induced by adversarial prompts. Through meticulous experimentation, the authors offer both a correction to prior analyses and introduce novel attack methodologies capable of exploiting these vulnerabilities with minimal overhead.
Summary of Findings
The research introduces "Reasoning Token Overflow" (RTO), a phenomenon in which adversarial input causes content intended to remain among the reasoning tokens to spill over into the final answer the RLLM returns. The effect can be induced systematically through prompt injection, effectively nullifying the model's ability to generate a meaningful response. The authors verify RTO empirically and correct a previous assumption that the vulnerability stems from the premature generation of special tokens; instead, it is the absence of these tokens during the reasoning phase that triggers the failure.
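To make this failure mode concrete, here is a minimal sketch, assuming DeepSeek-R1-style reasoning delimiters, of how a client that splits a completion on the closing reasoning token behaves when that token is never emitted. The "</think>" string and the parsing logic are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of how a response parser that relies
# on a closing reasoning delimiter behaves when that delimiter never appears.
# The "</think>" delimiter is an assumption; actual deployments may differ.

REASONING_END = "</think>"

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Split a raw RLLM completion into (reasoning, final_answer)."""
    head, sep, tail = raw_output.partition(REASONING_END)
    if sep:  # normal case: the reasoning phase was properly closed
        return head, tail
    # RTO-like failure mode: no closing token was emitted, so content meant to
    # stay in the reasoning phase leaks into (or replaces) the visible answer.
    return "", head

# Normal completion: reasoning is cleanly separated from the answer.
print(split_reasoning("step 1... step 2...</think>The answer is 42."))

# Interrupted completion: the closing token never arrives, so the "answer"
# slot ends up empty or filled with raw reasoning content.
print(split_reasoning("step 1... step 2... (no closing token emitted)"))
```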
Moreover, the research contrasts attack efficiency between official and unofficial deployments of DeepSeek-R1, showing that they differ in how special tokens are recognized. This difference underscores how hard it is to defend RLLMs consistently across deployment environments, and it is particularly relevant for researchers working on model security and robustness.
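The following probe is a hedged sketch, not anything from the paper: one way to check how a particular deployment's tokenizer handles the reasoning delimiter. The checkpoint identifier and the "</think>" string are assumptions made for illustration.

```python
# Hedged sketch: probe whether a tokenizer treats the reasoning delimiter as a
# single registered special token or as ordinary text. The checkpoint id and
# the "</think>" delimiter are illustrative assumptions.
from transformers import AutoTokenizer

def delimiter_is_special(model_id: str, delimiter: str = "</think>") -> bool:
    tok = AutoTokenizer.from_pretrained(model_id)
    ids = tok.encode(delimiter, add_special_tokens=False)
    # A registered special token encodes to exactly one id; a deployment that
    # escapes or re-tokenizes the string will usually produce several ids.
    return len(ids) == 1

print(delimiter_is_special("deepseek-ai/DeepSeek-R1"))  # assumed checkpoint id
```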
The paper then presents a practical reasoning interruption attack that cuts the number of adversarial tokens required from over 2,000 in prior work to 109, while achieving high attack success rates across multiple datasets. The attack uses a simple but effective mechanism to push reasoning content into the final output, preventing the model from producing a valid response. Because it generates atypical output and consumes so few tokens, it also circumvents existing defense mechanisms.
Implications and Future Directions
This work has significant implications for model safety and security. By uncovering the mechanics behind reasoning token overflow, it gives researchers new avenues for probing and hardening RLLM deployments. In particular, correctly identifying and managing special tokens becomes critical to the defensive architecture of RLLMs.
Furthermore, the authors' jailbreak attacks, which exploit RTO to transfer unsafe content from the reasoning tokens into the final output, represent a notable evolution in attack strategies against LLMs. They undermine the assumption that reasoning content remains hidden from end users under normal operation, and they inform both preventive and corrective measures for AI deployments.
Moving forward, the defense strategies the authors outline, detection and prevention, offer a foundation for further research. Understanding how special tokens propagate and developing robust mitigations will be key to ensuring the integrity and reliability of RLLM applications.
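As a rough illustration of the detection direction only, and not the authors' proposed defense, one could flag completions that exhibit RTO symptoms, such as a missing end-of-reasoning token or delimiter text leaking into the visible answer. The delimiter strings below are assumptions.

```python
# Illustrative detection heuristic (an assumption-laden sketch, not the paper's
# defense): flag completions whose visible answer shows RTO symptoms.
REASONING_START = "<think>"
REASONING_END = "</think>"

def looks_like_rto(raw_output: str) -> bool:
    """Heuristic check for reasoning-token overflow in a completion."""
    if REASONING_END not in raw_output:
        # Symptom 1: the reasoning phase never closed.
        return True
    # Symptom 2: delimiter text reappears inside the user-visible answer.
    answer = raw_output.split(REASONING_END, 1)[1]
    return REASONING_START in answer or REASONING_END in answer

print(looks_like_rto("plan...</think>Final answer: 42"))  # False
print(looks_like_rto("plan... plan... (never closes)"))   # True
```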
In conclusion, while RLLMs like DeepSeek-R1 advance the state of complex reasoning, Cui and Zuo's paper is an essential account of the vulnerabilities these models face. It challenges researchers not only to enhance reasoning capabilities but also to secure the underlying processes that enable them, and it will likely inform the next wave of work on AI security and reasoning methodology.