Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Published 13 Feb 2026 in cs.CL | (2602.13517v1)

Abstract: LLMs have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper introduces deep-thinking tokens and the DTR metric, which correlate positively with LLM accuracy compared to traditional length measures.
It employs Jensen–Shannon divergence to assess token stabilization across layers, offering a more reliable measure of reasoning effort.
It demonstrates a test-time compute scaling approach, Think@n, that reduces inference cost while maintaining or improving performance.

Measuring Deep Reasoning Effort in LLMs via Deep-Thinking Tokens

Introduction

The assessment of reasoning effort during inference in LLMs is typically confounded by imprecise proxies, the most common being output token length. "Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens" (2602.13517) systematically dissects this paradigm, arguing through empirical and mechanistic analysis that output length is, at best, a weak and unreliable indicator of actual reasoning, sometimes exhibiting negative correlation with accuracy. To address the limitations of length and surface-level token statistics, the authors introduce the "deep-thinking ratio" (DTR): a metric that quantifies the proportion of output tokens requiring continued revision across deeper layers prior to prediction distribution stabilization. DTR is then operationalized to inform a test-time aggregation approach, Think@ $n$ , which leverages this internal signal to enable more compute-efficient majority voting with no sacrifice (and sometimes improvement) in task accuracy.

Deep-Thinking Ratio: Methodology

The core technical contribution is the definition and computation of deep-thinking tokens. The key insight is that, in autoregressive generation, tokens for which the internal prediction distributions remain unsettled until deep in the model (i.e., late layers) can be interpreted as requiring more “thought,” whereas tokens that stabilize predictions in shallower layers are trivial with respect to computational effort.

Formally, for each output token, the method projects the hidden state of every layer to the vocabulary space via the unembedding matrix and computes the Jensen–Shannon divergence (JSD) between each intermediate-layer output and the final-layer output. The layer at which the divergence drops below a threshold $g$ is denoted as the “settling depth” for that token. If this stabilization occurs after a certain fraction $\rho$ of the total depth (e.g., $\rho = 0.85$ ), the token is flagged as a deep-thinking token.

Figure 1: Deep-thinking token identification: a token is classified as deep-thinking if its stabilization (JSD with the final-layer distribution below $g$ ) first occurs in the late regime of the network.

The deep-thinking ratio (DTR) for an output sequence is simply the fraction of its tokens classified as deep-thinking.

Figure 2: Heatmap of Jensen–Shannon divergence (JSD) across layers for an example sequence: functional tokens stabilize quickly, while answer-critical tokens undergo distributional shifts until late layers.

Empirical Correlation with Reasoning Accuracy

A major empirical finding is that output length correlates negatively with accuracy, reaffirming prior reports of “overthinking,” where models that generate overly long reasoning traces can actually degrade performance on complex tasks. In contrast, DTR exhibits a robust positive correlation with accuracy, stable across datasets, model architectures, and benchmarks.

Figure 3: Output token count is weakly (and often negatively) correlated with accuracy, while deep-thinking ratio (DTR) demonstrates strong, positive correlation with accuracy across datasets and model checkpoints.

DTR consistently outperforms not only token count but also log-probability, negative perplexity, negative entropy, and self-certainty metrics. Across 32 model–benchmark combinations, DTR is positive in nearly all (30/32), with average correlation $r = 0.683$ (compared to length’s $r = -0.594$ and confidence-based metrics in the $0.2$–$0.6$ range).

Sensitivity to Hyperparameters and Model Configuration

DTR’s correlation to accuracy is robust across reasonable hyperparameter choices for the settling threshold $g$ and depth fraction $\rho$ . Using stricter thresholds (higher $g$ , lower $\rho$ ) decreases the absolute DTR but maintains the trend. $g = 0.5$ and $\rho = 0.85$ provide the best trade-off between selectivity and robustness.

Further, DTR values encode information not just about individual responses but about system-level configuration: for instance, models prompted with “higher reasoning level” system instructions tend to produce lower DTRs but achieve higher accuracy, reflecting a redistribution of compute from per-token depth to sequence length.

Test-Time Compute Scaling with Think@ $n$

The DTR metric is used to implement Think@ $n$ , a test-time voting scheme where, instead of majority voting across all (or randomly selected) sampled responses, one selects those candidate generations with top DTR score as the voting pool, often using only short prefixes (e.g., first 50 tokens) for early rejection. This enables substantial inference compute savings by not fully decoding low-quality candidates.

Figure 4: Think@ $n$ achieves Pareto-optimal trade-off, matching or exceeding majority voting accuracy at approximately half the inference cost.

Across various datasets, Think@ $n$ matches or surpasses standard self-consistency with roughly 50% reduction in compute, outperforming both token-length and confidence-based selection methods. Cost reduction is achieved by early halting (based on DTR computed from a prefix), with no negative impact on accuracy. Ablation studies confirm this efficiency: a prefix of only 50 tokens suffices to achieve near-oracle performance.

Architectural Analysis and Interpretability

The layer-wise heatmap analysis (Figure 2) reveals that functional and template tokens converge quickly, whereas semantico-critical tokens (e.g., operation results, logic steps, answer statements) show late stabilization. Furthermore, this mechanism is agnostic of specific architectures and applies without modification to multiple families (GPT-OSS, DeepSeek-R1, Qwen3), and tasks (AIME, HMMT, GPQA).

Implications and Future Directions

The evidence provided here undermines the validity of surface-level proxies for computational effort in autoregressive LLMs and instead advances internal depth-wise evolution as a rigorous, mechanistic, and actionable signal. Practically, this discredits decode-time heuristics which equate output verbosity with depth of reasoning or infer that high-confidence output is always indicative of effective computation. Theoretically, DTR suggests directions for model development and evaluation that focus on maximizing efficient use of intermediate computation resources, and for weighting or pruning in ensemble/test-time selection.

The findings prompt several avenues for future work:

Cross-model and cross-task normalization: Since DTR may not be directly inter-comparable across fundamentally distinct models or system instructions, deeper analysis of how different prompts and architectures redistribute depth and token usage is warranted.
Fine-grained interpretability: Deeper analysis into specific categories of tokens, reasoning steps, or compositional subroutines that consistently show late stabilization could shed light on internal LLM decomposition of complex tasks.
Adaptive training signals: Future work may consider incorporating DTR directly into training or reinforcement learning objectives to maximize effective rather than superficial thinking effort.
Extension to multimodal or mixed-architecture models: The DTR framework is not inherently tied to text-only LLMs and may generalize to vision-language or code-generation models for assessing compositional effort.

Conclusion

"Think Deep, Not Just Long" provides a rigorous and empirically validated methodology for measuring and leveraging inference-time reasoning effort in LLMs, superseding traditional surface-level heuristics. By tracking the evolution of token prediction stabilization across model depth, deep-thinking ratio enables mechanism-driven test-time compute scaling, consistently yielding superior and more efficient performance aggregation in high-stakes reasoning benchmarks. The results motivate a reorientation of both evaluation and model optimization toward internal computation mechanisms rather than superficial generative measures.