Entropy After $\langle \texttt{/Think} \rangle$ for reasoning model early exiting (2509.26522v1)

Published 30 Sep 2025 in cs.LG

Abstract: Large reasoning models show improved performance with longer chains of thought. However, recent work has highlighted (qualitatively) their tendency to overthink, continuing to revise answers even after reaching the correct solution. We quantitatively confirm this inefficiency by tracking Pass@1 for answers averaged over a large number of rollouts and find that the model often begins to always produce the correct answer early in the reasoning, making extra reasoning a waste of tokens. To detect and prevent overthinking, we propose a simple and inexpensive novel signal -- Entropy After </Think> (EAT) -- for monitoring and deciding whether to exit reasoning early. By appending a stop thinking token (</think>) and monitoring the entropy of the following token as the model reasons, we obtain a trajectory that decreases and stabilizes when Pass@1 plateaus; thresholding its variance under an exponential moving average yields a practical stopping rule. Importantly, our approach enables adaptively allocating compute based on the EAT trajectory, allowing us to spend compute in a more efficient way compared with fixing the token budget for all questions. Empirically, on MATH500 and AIME2025, EAT reduces token usage by 13 - 21% without harming accuracy, and it remains effective in black box settings where logits from the reasoning model are not accessible, and EAT is computed with proxy models.

Summary

  • The paper introduces EAT as an efficient early exit signal that leverages entropy stabilization to indicate when reasoning accuracy plateaus.
  • It presents an adaptive algorithm using exponential moving average variance tracking to dynamically allocate token budgets for easy and hard questions.
  • Empirical results show EAT reduces token usage by 13–21% while maintaining Pass@1 performance across various models and datasets.

Entropy After ⟨/Think⟩ for Reasoning Model Early Exiting

Introduction and Motivation

The paper introduces Entropy After ⟨/Think⟩ (EAT) as a practical, efficient signal for early exiting in LLM reasoning chains. The motivation stems from the observation that contemporary reasoning models, such as DeepSeek-R1 and Qwen3, frequently "overthink", continuing to generate lengthy chains of thought even after the correct answer has been reached and stabilized. This behavior is quantitatively confirmed by tracking Pass@1 accuracy over multiple rollouts, which often saturates early in the reasoning process, rendering further computation redundant.

The inefficiency of fixed token budgets for all questions is highlighted: easy questions are over-served, while hard ones may be under-served. The need for adaptive test-time computation is clear, and the paper proposes uncertainty-based early exiting, leveraging entropy as a proxy for information gain and answer confidence.

EAT: Definition and Theoretical Basis

EAT is defined as the entropy of the next-token distribution immediately after appending the stop-thinking token (</think>) to the reasoning chain. Formally, for a reasoning LLM $\theta$ and a reasoning chain $R = r_1, \ldots, r_n$, EAT is:

$$\mathrm{EAT} = H\big(f(Q, \langle \texttt{think} \rangle, r_1, \ldots, r_n, \langle \texttt{/think} \rangle; \theta)\big)$$

where $f(\cdot)$ is the next-token probability distribution and $H(\cdot)$ is its entropy. EAT quantifies the information gain from the reasoning process, measured as the reduction in uncertainty between the initial and final next-token distributions.
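To make the definition concrete, here is a minimal Python sketch of how EAT could be computed with a Hugging Face causal LM; the model name, prompt layout, and the literal <think>/</think> markers are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of computing EAT for a partial reasoning chain.
# Model name, prompt layout, and <think>/</think> strings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def eat_score(model, tokenizer, question: str, reasoning_so_far: str) -> float:
    """Entropy (in nats) of the next-token distribution after appending </think>."""
    text = f"{question}\n<think>\n{reasoning_so_far}\n</think>\n"
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    next_logits = model(input_ids).logits[0, -1].float()   # logits for the next token
    log_probs = torch.log_softmax(next_logits, dim=-1)
    return float(-(log_probs.exp() * log_probs).sum())     # Shannon entropy

# Usage with a small open reasoning model (placeholder choice):
name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).eval()
print(eat_score(lm, tok, "What is 17 * 23?", "17 * 23 = 17 * 20 + 17 * 3 = 391."))
```

Because only a single extra forward pass over the current prefix is needed, the score can be re-evaluated periodically as the chain of thought grows.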

Empirically, EAT exhibits a sharp decrease and stabilization at the point where Pass@1 accuracy plateaus, indicating that further reasoning is unlikely to improve the answer. This property is robust across datasets and models (Figure 1).

Figure 1: EAT provides an informative signal to prevent overthinking in reasoning models, with Pass@1 saturating early and EAT stabilizing at the same point.

Early Exiting Algorithm

The proposed early exiting algorithm monitors the variance of EAT using an exponential moving average (EMA). The running mean $\hat{M}_n$ and variance $\hat{V}_n$ are updated at each reasoning step:

$$\hat{M}_n = (1 - \alpha)\,\hat{M}_{n-1} + \alpha\,\mathrm{EAT}_n, \qquad \hat{V}_n = (1 - \alpha)\,\hat{V}_{n-1} + \alpha\,(\mathrm{EAT}_n - \hat{M}_n)^2$$

where $\alpha$ is the EMA timescale. Reasoning halts when $\hat{V}_n$ drops below a threshold $\delta$, or when the model generates </think> on its own. This approach is adaptive: easy questions stabilize quickly, consuming fewer tokens, while harder questions receive more computation (Figure 2).

Figure 2: Illustration of early exiting by thresholding the EMA estimated variance of EAT, showing Pass@1 saturation and EAT stabilization.
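The stopping rule itself can be sketched in a few lines of Python; the default $\alpha$ and $\delta$ below are illustrative values, not the paper's tuned settings.

```python
# Minimal sketch of the EMA-variance stopping rule over a stream of EAT values.
class EATMonitor:
    """EMA estimate of the mean and variance of EAT, with a stopping test."""

    def __init__(self, alpha: float = 0.3, delta: float = 1e-3):
        self.alpha, self.delta = alpha, delta
        self.m_hat, self.v_hat = None, 0.0

    def update(self, eat: float) -> bool:
        """Feed one EAT value; return True once the EMA variance drops below delta."""
        if self.m_hat is None:                       # first observation seeds the mean
            self.m_hat = eat
            return False
        a = self.alpha
        self.m_hat = (1 - a) * self.m_hat + a * eat
        self.v_hat = (1 - a) * self.v_hat + a * (eat - self.m_hat) ** 2
        return self.v_hat < self.delta
```

During generation, EAT would be evaluated every fixed number of reasoning tokens and fed to update(); reasoning stops at the first True, or when the model emits </think> on its own.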

The algorithm is compatible with black-box models: EAT can be computed using a smaller proxy model that observes only the reasoning text, enabling deployment in commercial settings where logits are inaccessible.
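A rough sketch of that black-box loop is given below, combining the eat_score helper and EATMonitor from the earlier sketches; stream_reasoning stands in for a hypothetical API client that yields chunks of reasoning text, and the character-based scoring interval is an arbitrary choice.

```python
def black_box_early_exit(question, stream_reasoning, proxy_lm, proxy_tok,
                         monitor, score_every_chars=2000):
    """Stop an API-only reasoning stream once the proxy-computed EAT stabilizes."""
    reasoning, last_scored = "", 0
    for chunk in stream_reasoning(question):          # yields reasoning text, no logits
        reasoning += chunk
        if len(reasoning) - last_scored >= score_every_chars:
            last_scored = len(reasoning)
            eat = eat_score(proxy_lm, proxy_tok, question, reasoning)
            if monitor.update(eat):                   # EMA variance below delta
                break                                 # exit reasoning early
    return reasoning
```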

Efficiency and Comparison to Baselines

EAT is computationally efficient, requiring only a single forward pass per reasoning step, with overhead linear in the chain length and negligible compared to rollout-based methods. Rollout-based uncertainty estimation (e.g., counting unique answers over $K$ rollouts) is both stochastic and expensive, with significant token and runtime overhead (Figure 3).

Figure 3: EAT-based early exiting dynamically allocates token budgets and consistently saves tokens without sacrificing accuracy, outperforming token-based early exiting.

Empirical results on MATH500, AIME2025, and GPQA-Diamond demonstrate that EAT-based early exiting reduces token usage by 13–21% without harming accuracy. The method generalizes across model sizes and remains effective when EAT is computed with proxy models, including scenarios where a 1.5B model early-exits a 70B reasoning model (Figure 4).

Figure 4: EAT computed at different evaluation frequencies shows consistent patterns, with smoother trajectories at longer evaluation intervals.

Ablation and Error Analysis

Ablation studies show that EAT remains effective for EMA timescales $\alpha > 0.1$, and that including a "Final answer:" prefix after </think> can improve the correlation between EAT stabilization and Pass@1 convergence for older models. For unsolvable questions or those with decreasing Pass@1, EAT does not stabilize, and the algorithm uses the full token budget, indicating a limitation in distinguishing between hard and unsolvable instances (Figures 5 and 6).

Figure 5: On unsolvable questions, EAT does not stabilize and therefore uses up all tokens under the early exit algorithm.

Figure 6: On questions with decreasing Pass@1, EAT either does not stabilize or stabilizes after the optimal exit point.

Practical Implications and Future Directions

The EAT signal provides a lightweight, deterministic, and model-agnostic criterion for early exiting in reasoning LLMs. It enables adaptive compute allocation, reducing inference cost and latency without sacrificing accuracy. The method is robust to model architecture and size, and is deployable in black-box settings.

Potential future directions include:

  • Instance-specific threshold adaptation for further efficiency gains.
  • Joint optimization of compute budgets across problem distributions.
  • Integration with improved post-training algorithms that endogenously halt reasoning.
  • Extension to recognize and abandon reasoning on unsolvable instances by monitoring persistent high EAT variance.

Conclusion

EAT offers a principled, efficient approach to early exiting in LLM reasoning chains, addressing the overthinking phenomenon and enabling adaptive inference. Its strong empirical performance, computational efficiency, and compatibility with black-box deployment make it a practical tool for large-scale reasoning systems. The method's theoretical grounding in information gain and uncertainty, combined with its empirical robustness, suggests broad applicability and potential for further refinement in future research.
