Dynamic Early Exit in Reasoning Models (2504.15895v2)

Published 22 Apr 2025 in cs.CL and cs.AI

Abstract: Recent advances in large reasoning LLMs (LRLMs) rely on test-time scaling, which extends long chain-of-thought (CoT) generation to solve complex tasks. However, overthinking in long CoT not only slows down the efficiency of problem solving, but also risks accuracy loss due to the extremely detailed or redundant reasoning steps. We propose a simple yet effective method that allows LLMs to self-truncate CoT sequences by early exit during generation. Instead of relying on fixed heuristics, the proposed method monitors model behavior at potential reasoning transition points (e.g., "Wait" tokens) and dynamically terminates the next reasoning chain's generation when the model exhibits high confidence in a trial answer. Our method requires no additional training and can be seamlessly integrated into existing o1-like reasoning LLMs. Experiments on 10 reasoning benchmarks (e.g., GSM8K, MATH-500, AMC, GPQA, AIME and LiveCodeBench) show that the proposed method is consistently effective on 11 cutting-edge reasoning LLMs of varying series and sizes, reducing the length of CoT sequences by an average of 19.1% to 80.1% while improving accuracy by 0.3% to 5.0%.

Summary

  • The paper introduces DEER, a novel approach where models decide to exit reasoning by evaluating self-assessed confidence at specific action transition points.
  • It achieves 31%-43% reduction in chain-of-thought length and up to 5.7% accuracy improvement across multiple challenging reasoning and programming benchmarks.
  • By integrating branch-parallel decoding and dynamic KV cache management, DEER offers substantial latency gains and computational efficiency in large reasoning models.

The paper "Dynamic Early Exit in Reasoning Models" (2504.15895) addresses the inefficiencies and potential accuracy degradation caused by overthinking in large reasoning LLMs (LRLMs) that utilize long Chain-of-Thought (CoT) sequences. While test-time scaling through extended CoT generation has improved LRLMs' ability to solve complex tasks, excessively long reasoning paths increase computational load and latency, and can sometimes lead to errors due to redundant or irrelevant steps.

To tackle this, the authors propose Dynamic Early Exit in Reasoning (DEER), a training-free, plug-and-play method that enables LRLMs to dynamically decide when to stop reasoning and generate a final answer based on their internal confidence.

The core mechanism of DEER involves monitoring the model's generation process at specific "action transition points" (e.g., tokens like "Wait", "Alternatively") within the CoT. These points typically mark the transition between different thinking chunks or reasoning steps. When such a point is encountered, DEER triggers a sequence of actions:

  1. Reasoning Transition Monitoring: The method continuously tracks the generated tokens and identifies predefined action transition points.
  2. Trial Answer Inducing: Upon detecting a transition point, DEER injects an "answer inducer prompt" (e.g., adding answer delimiters like \boxed{}) to encourage the model to immediately generate a trial answer based on the reasoning completed so far.
  3. Confidence Evaluating: The confidence of the generated trial answer is computed. The paper defines token confidence as the maximum predicted probability at that token position across the vocabulary, and the overall trial answer confidence as the mean of these token confidences. For a trial answer $\bm{A} = [a_1, a_2, \dots, a_n]$ generated based on prompt $\bm{P}$, thoughts $\bm{T}$, and inducer $\bm{I}$, the probability distribution over the vocabulary at position $t$ is $p_t = \text{softmax}(\mathcal{M}(\bm{P}, \bm{T}, \bm{I}, \bm{a}_{<t}))$. The overall confidence $\mathcal{C}$ is computed as $\mathcal{C} = \frac{1}{n}\sum_{t=1}^{n} \max_{v \in \mathcal{V}} p_t(v)$.
  4. Decision: The computed confidence $\mathcal{C}$ is compared against a predefined threshold $\lambda$.
    • If $\mathcal{C} > \lambda$, the model is considered to have sufficient reasoning information ("pearl reasoning"), and the early exit is executed. The model is then prompted to generate the final conclusion based on the current thought process.
    • If $\mathcal{C} \le \lambda$, the trial answer is deemed insufficiently confident, and the model reverts to the state before the inducer prompt and continues generating the original reasoning chain from the detected transition point. (A minimal code sketch of this monitor-induce-evaluate-decide loop follows the list.)
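
Sketched below is a minimal illustration of that loop. The helper names (`generate_until`, `answer_with_probs`), the inducer string, and the threshold value are hypothetical stand-ins for whatever decoding API is in use, not the paper's released implementation.

```python
# Minimal sketch of the DEER early-exit loop described above.
# The model wrappers below are hypothetical; adapt them to your decoding API.

TRANSITION_TOKENS = {"Wait", "Alternatively"}    # action transition points
ANSWER_INDUCER = "\n**Final Answer**\n\\boxed{"  # illustrative inducer prompt
CONF_THRESHOLD = 0.95                            # the threshold lambda (assumed value)


def mean_token_confidence(token_probs):
    """Mean over answer tokens of the maximum vocabulary probability."""
    if not token_probs:
        return 0.0
    return sum(max(p) for p in token_probs) / len(token_probs)


def deer_generate(model, prompt, max_steps=64):
    """Training-free early exit: try an answer at each transition point,
    keep it only if the model is confident enough."""
    thoughts = ""
    for _ in range(max_steps):
        # 1. Reasoning transition monitoring: decode until a transition token.
        chunk, stopped_at = model.generate_until(prompt + thoughts, stop=TRANSITION_TOKENS)
        thoughts += chunk
        if stopped_at is None:        # the model ended its thinking on its own
            break

        # 2. Trial answer inducing: prompt for an immediate answer.
        trial_answer, token_probs = model.answer_with_probs(
            prompt + thoughts + ANSWER_INDUCER
        )

        # 3. Confidence evaluating.
        confidence = mean_token_confidence(token_probs)

        # 4. Decision: exit early on high confidence, otherwise resume the CoT.
        if confidence > CONF_THRESHOLD:
            return thoughts, trial_answer          # "pearl reasoning" found
        thoughts += stopped_at                     # discard trial answer, keep thinking

    # Fallback: conclude from whatever reasoning was generated.
    final_answer, _ = model.answer_with_probs(prompt + thoughts + ANSWER_INDUCER)
    return thoughts, final_answer
```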

To mitigate the additional latency introduced by generating and evaluating trial answers, DEER can be integrated with a branch-parallel decoding strategy. This involves linearizing the main reasoning path and the trial answer path into a single sequence for parallel generation using a specialized causal attention mask. Dynamic KV cache management is also employed, potentially pruning the KV cache for the trial answer branch if its confidence is low.
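
One way such a branch-parallel attention mask could be built is sketched below with NumPy. The linearization order (shared prefix, then the continued-reasoning branch, then the trial-answer branch) and the function name are illustrative assumptions, not the paper's exact formulation; the key property is that both branches attend causally to the shared prefix and to themselves, but never to each other.

```python
import numpy as np


def branch_parallel_mask(prefix_len, reason_len, answer_len):
    """Boolean attention mask (True = attention allowed) for decoding the
    continued-reasoning branch and the trial-answer branch in one sequence."""
    total = prefix_len + reason_len + answer_len
    mask = np.zeros((total, total), dtype=bool)

    # Shared prefix: ordinary causal self-attention.
    mask[:prefix_len, :prefix_len] = np.tril(np.ones((prefix_len, prefix_len), dtype=bool))

    # Continued-reasoning branch: attends to the prefix and causally to itself.
    r0, r1 = prefix_len, prefix_len + reason_len
    mask[r0:r1, :prefix_len] = True
    mask[r0:r1, r0:r1] = np.tril(np.ones((reason_len, reason_len), dtype=bool))

    # Trial-answer branch: attends to the prefix and causally to itself,
    # but is blind to the continued-reasoning branch (and vice versa).
    a0, a1 = r1, r1 + answer_len
    mask[a0:a1, :prefix_len] = True
    mask[a0:a1, a0:a1] = np.tril(np.ones((answer_len, answer_len), dtype=bool))

    return mask
```

If the trial answer's confidence falls below the threshold, the cache entries for its branch can simply be dropped, which corresponds to the dynamic KV cache pruning mentioned above.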

The effectiveness of DEER was evaluated on several challenging reasoning benchmarks: MATH-500 (2103.03874), AMC 2023, AIME 2024, GPQA Diamond (2311.12022), and two programming benchmarks, HumanEval (2107.03374) and BigCodeBench (2406.15877). Experiments were conducted on DeepSeek-R1-Distill-Qwen models of various sizes (1.5B, 7B, 14B, 32B) and QwQ-32B.

Key experimental findings include:

  • DEER consistently reduced the average CoT length across benchmarks and models by 31% to 43% compared to vanilla CoT, while simultaneously improving accuracy by 1.7% to 5.7%.
  • On DeepSeek-R1-Distill-Qwen-1.5B, DEER improved MATH-500 accuracy by 4.8 points using only 51% of the tokens. Larger models (14B, 32B) saw greater accuracy gains on harder benchmarks like GPQA-Diamond and AIME 2024, suggesting that overthinking is prominent when model capabilities match problem difficulty.
  • A fine-grained analysis showed that a significant portion of samples were either solved correctly only with early exit or remained correct when exiting early, highlighting the prevalence and benefit of finding the "pearl reasoning" point.
  • On programming tasks, DEER on DeepSeek-R1-Distill-Qwen-14B reduced length by over 60% on HumanEval and BigCodeBench, with accuracy improvements (HumanEval) or minimal drops (BigCodeBench). The confidence metric might need refinement for programming tasks where fixed patterns can inflate confidence.
  • The confidence threshold $\lambda$ was found to be relatively robust, with minor variations having little impact on results. A threshold that is too low leads to premature exit and accuracy drops, while one that is too high retains excessive length.
  • The choice of action transition point matters. Using "Alternatively" as the trigger (DEER(A)) resulted in higher accuracy but less length reduction compared to "Wait" (DEER(W)), likely because "Alternatively" segments larger thought chunks, leading to more complete intermediate reasoning before triggering an early exit check.
  • Initial latency measurements without full parallel acceleration still showed substantial latency reductions (43.4% to 47.3%), indicating significant efficiency gains.
  • Performance on QwQ-32B was slightly weaker, with length reduction but a small accuracy decrease. This was attributed to QwQ-32B not strictly adhering to the end-of-thinking delimiter $\langle\text{/think}\rangle$, potentially due to its RL training setup.

In summary, DEER provides a practical, training-free approach to improve both the efficiency and accuracy of LRLMs by dynamically controlling the length of their reasoning chains based on self-assessed confidence at reasoning transition points. This offers a valuable method for deploying LRLMs in computationally sensitive environments by alleviating the overthinking problem inherent in long CoT generation. Future work involves refining the confidence evaluation, particularly for domains like programming, and exploring better reasoning chunk segmentation.