
Think@$n$: Efficient Deep-Thinking in LLMs

Updated 2 March 2026
  • Think@$n$ is a test-time protocol that selects candidate generations by quantifying deep reasoning through the Deep-Thinking Ratio (DTR).
  • The method computes DTR by analyzing token-level stabilization across transformer layers to identify deep-thinking tokens indicative of robust multi-layer reasoning.
  • Empirical results show Think@$n$ reduces inference cost by around 50% while maintaining or slightly improving accuracy compared to standard self-consistency voting methods.

Think@$n$ is a test-time scaling and selection protocol for LLMs that prioritizes candidate generations displaying a high proportion of deep-thinking tokens, as measured by the Deep-Thinking Ratio (DTR). This strategy enables effective early rejection of low-quality candidates, allowing LLMs to match or improve upon the accuracy of standard self-consistency voting while substantially reducing inference cost. Unlike approaches that rely on generation length or confidence-based filtering, Think@$n$ directly incorporates model-internal evidence of robust multi-layer reasoning effort, made quantifiable by DTR.

1. Deep-Thinking Ratio: Definition and Computation

The Deep-Thinking Ratio (DTR) is defined as the proportion of tokens in a generated sequence whose predictive distributions only stabilize in the deepest fraction of the model's transformer layers. Formally, for an autoregressive LLM $f_\theta$ with $L$ layers and unembedding matrix $W_U \in \mathbb{R}^{|V| \times d}$, the hidden state after layer $l$ at generation step $t$ is $h_{t,l} \in \mathbb{R}^d$. The output distribution at layer $l$ is given by $p_{t,l} = \mathrm{softmax}(W_U h_{t,l})$.

The settling depth $c_t$ of token $y_t$ is the first layer at which the running minimum of the Jensen–Shannon divergence between the final-layer distribution and all preceding layer distributions falls below a threshold $g$:

$$c_t = \min\{\, l : \bar D_{t,l} \le g \,\}, \qquad \bar D_{t,l} = \min_{1 \le j \le l} \mathrm{JSD}(p_{t,L} \,\|\, p_{t,j})$$

A token is categorized as a deep-thinking token if its settling depth $c_t$ lies within the deepest $\rho$ fraction of layers, i.e., $c_t \ge \lceil \rho L \rceil$ for some $\rho \in (0, 1)$. The Deep-Thinking Ratio for a sequence $S = (y_1, \ldots, y_T)$ is then:

$$\mathrm{DTR}(S) = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}\bigl[c_t \ge \lceil \rho L \rceil\bigr]$$

This quantifies the share of tokens exhibiting late-settling, multi-layer reasoning effort (Chen et al., 13 Feb 2026).
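The settling-depth and DTR computations above can be sketched directly from per-layer output distributions. A minimal NumPy implementation, assuming the distributions $p_{t,l}$ have already been materialized as an array; the threshold `g=0.1` and depth fraction `rho=0.75` are illustrative defaults, not the paper's tuned settings:

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors (natural log)."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def settling_depth(layer_dists, g=0.1):
    """First layer (1-indexed) at which the running-min JSD to the
    final-layer distribution falls below g.  layer_dists: [L, V] array."""
    final = layer_dists[-1]
    running_min = float("inf")
    for l, p in enumerate(layer_dists, start=1):
        running_min = min(running_min, jsd(final, p))
        if running_min <= g:
            return l
    return len(layer_dists)  # unreachable: layer L has zero JSD to itself

def deep_thinking_ratio(seq_dists, g=0.1, rho=0.75):
    """Fraction of tokens whose settling depth lies in the deepest rho
    fraction of layers.  seq_dists: [T, L, V] array."""
    T, L, _ = seq_dists.shape
    cutoff = int(np.ceil(rho * L))
    return sum(settling_depth(seq_dists[t], g) >= cutoff for t in range(T)) / T
```

A token whose intermediate layers already agree with the final distribution settles at layer 1 and does not count as deep-thinking; only tokens whose prediction is still moving in the deepest layers contribute to DTR.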

2. Think@$n$ Protocol: Algorithm and Implementation

Think@$n$ operates as follows:

  1. For a prompt $x$, sample $n$ candidate generations in parallel, each up to a fixed prefix length $\ell_{\mathrm{prefix}}$ (e.g., 50 tokens).
  2. For each candidate $i$, compute the prefix-level DTR:

$$\widehat{\mathrm{DTR}}_i = \mathrm{DTR}\bigl(S_i[:\ell_{\mathrm{prefix}}]\bigr)$$

  3. Rank all candidates by $\widehat{\mathrm{DTR}}_i$ and select the top $\eta$ fraction (e.g., $\eta = 50\%$).
  4. Fully decode only the selected candidates to completion.
  5. Output the final answer by majority vote among the selected completions.

A high-level pseudo-code is:

Input:   prompt x, n, prefix length ℓ_pref, keep fraction η
Output:  final answer via voting

for i in 1..n do in parallel:
    generate S_i up to ℓ_pref tokens
    compute DTR_i ← DTR(S_i[:ℓ_pref])
rank candidates by DTR_i descending
select top k = ⌈η n⌉ indices

for each selected index i:
    resume decoding S_i to [EOS]
collect full answers {A_i | i in selected}
return majority_vote({A_i})

This protocol retains only those continuations most likely to involve substantive, deep-layer model reasoning for further decoding, eliminating computational expense on superficial or overconfident candidates (Chen et al., 13 Feb 2026).
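The rank-and-vote stage of the pseudo-code above is straightforward once prefix DTRs are available. A minimal sketch, assuming each candidate is represented as a (prefix DTR, final answer) pair; in a real pipeline only the surviving candidates would actually be decoded to completion, and the function name and data layout here are illustrative:

```python
import math
from collections import Counter

def think_at_n(scored_candidates, eta=0.5):
    """Think@n selection stage: keep the top-eta fraction of candidates
    by prefix DTR, then majority-vote over their final answers.

    scored_candidates: list of (prefix_dtr, answer) pairs.
    """
    n = len(scored_candidates)
    k = math.ceil(eta * n)  # select top k = ceil(eta * n) candidates
    kept = sorted(scored_candidates, key=lambda c: c[0], reverse=True)[:k]
    votes = Counter(answer for _, answer in kept)
    return votes.most_common(1)[0][0]  # plurality answer among survivors
```

For example, `think_at_n([(0.42, "17"), (0.35, "17"), (0.10, "23"), (0.08, "23")])` returns `"17"`: the two low-DTR candidates are discarded before the vote is taken.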

3. Empirical Evaluation and Performance Characteristics

Think@$n$ has been benchmarked across diverse reasoning-intensive tasks (AIME 2024/2025, HMMT 2025, GPQA-diamond) and models (GPT-OSS, DeepSeek-R1, Qwen3). Its core findings include:

  • Accuracy Correlation: DTR is a more robust and consistently positive proxy for accuracy than generation length, log-probability, or entropy. Across models and datasets, DTR correlates with bin-averaged accuracy ($r$ values up to +0.926), outperforming length-based proxies (which may exhibit negative correlation) (Chen et al., 13 Feb 2026).
  • Inference Cost Reduction: On representative large models (e.g., GPT-OSS-120B-medium, Qwen3-4B-Thinking), Think@$n$ achieves a $\approx$50% reduction in total token-generation cost compared to standard self-consistency (Cons@$n$) at $n=48$, $\eta=50\%$, and identical majority-vote aggregation.
  • Accuracy Preservation and Improvement: Think@$n$ matches or mildly exceeds Cons@$n$ accuracy in all tested cases. On AIME 2025, accuracy improved from 92.7% (Cons@48) to 94.7% (Think@48), with tokens generated falling from 307.6k to 155.4k.

The following table summarizes representative results for Think@48 versus Cons@48:

Model / Benchmark            Cons@48 Accuracy    Think@48 Accuracy    Cost Reduction
GPT-OSS-120B, AIME 2025      92.7%               94.7%                –49%
Qwen3-4B-Thinking, HMMT25    63.3%               66.7%                –50%

Think@$n$ consistently outperforms length-based and self-certainty-based filtering methods under identical generation budgets (Chen et al., 13 Feb 2026).
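The source of the roughly 50% saving can be checked with a simple, illustrative token-budget model (a back-of-envelope construction, not a formula from the paper): all $n$ candidates pay for the prefix, but only the kept fraction pays for the much longer tail. Plugging in the AIME 2025 row, with the average trace length inferred from the Cons@48 total and the 50-token prefix taken from the protocol example:

```python
# Illustrative token-cost model for Think@n vs. Cons@n (AIME 2025 figures).
n, eta, l_pref = 48, 0.5, 50     # candidates, keep fraction, prefix length
T_avg = 307_600 / n              # avg tokens/trace implied by the Cons@48 total

cons_cost = n * T_avg                                  # decode all n fully
think_cost = n * l_pref + eta * n * (T_avg - l_pref)   # all prefixes + kept tails

print(round(cons_cost), round(think_cost))  # prints: 307600 155000
```

The model predicts ≈155.0k tokens, close to the reported 155.4k; the small gap is expected since real traces have variable lengths rather than a single average.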

4. Comparison with Alternative Deep-Thinking Metrics

Alternative operationalizations of deep thinking in LLMs have been proposed. For example, AdaptThink (Zhang et al., 19 May 2025) introduces a Deep-Thinking Ratio distinct from internal-layer settling:

  • AdaptThink DTR: The DTR is the dataset-level proportion between “Thinking mode” (explicit chain-of-thought decoding) and “NoThinking mode” (immediate answer). It is controlled by a constrained reinforcement-learning objective that adjusts the tradeoff via a Lagrange parameter $\delta$, maximizing NoThinking responses subject to maintaining or improving accuracy.
  • Instance Adaptivity: AdaptThink achieves a superior accuracy–efficiency tradeoff by learning an instance-adaptive DTR. On easier problem sets, NoThinking is favored ($\approx$87% on GSM8K), reducing cost. On difficult problems, Thinking mode dominates ($\approx$40% NoThinking, $\approx$60% Thinking on the hardest MATH500 problems), preserving accuracy.
  • Implementation Difference: AdaptThink’s DTR is not internal to token-level layerwise dynamics but is a high-level behavioral mode selector.

By contrast, Think@$n$'s DTR is a mechanistic property computed from internal prediction convergence, not dialog or output modality (Chen et al., 13 Feb 2026, Zhang et al., 19 May 2025).

5. Interpretive Significance and Applications

The core insight underlying Think@$n$ is that not all long generations or high-confidence outputs reflect genuine, substantive reasoning. The internal layerwise stabilization criterion for deep-thinking tokens offers a direct proxy for the cognitive effort expended by the model. This enables:

  • Efficient candidate selection at decoding time, concentrating computational resources on the most promising continuations.
  • Consistent accuracy by reliably upweighting sequences with demonstrable multi-layer processing, independent of overall length.
  • General applicability across diverse domains (math, science, open-ended reasoning), with stable correlation improvements over confidence- and length-based heuristics.
  • Potential in ensemble and budgeted QA: By integrating Think@nn into inference pipelines, downstream systems can achieve self-consistency-comparable performance with reduced latency and cost, offering practical advantages in large-scale deployment scenarios.

A plausible implication is that DTR-style metrics may be extendable to other architectures and control schemes wherein reasoning effort is not externally labelled but latent in model dynamics.

6. Relationship to Other Systems and Limitations

In dialog systems like DeepPsy-Agent (Chen et al., 20 Mar 2025), the term “deep thinking” refers to the system’s ability to conduct slow, multi-stage reasoning, evaluated via task-specific ablation metrics (e.g., root-cause identification, reduction in ineffective suggestions). No scalar DTR, token-level measurement, or formula is provided. Thus, Think@$n$'s DTR and operational definition are fundamentally distinct: they offer a model-structural, not behavior-labelled or stage-annotated, quantification of reasoning.

A key limitation of Think@$n$ is its reliance on access to model internals (layerwise hidden states and unembedding projections). Its efficacy presumes that inference-time computational depth is meaningfully aligned with solution quality, an assumption validated on current mathematical and scientific benchmarks (Chen et al., 13 Feb 2026) but requiring further study for open-domain tasks.

7. Conclusion

Think@$n$ exemplifies a principled advancement in reasoning-focused decoding for LLMs by leveraging internal convergence dynamics via the Deep-Thinking Ratio. Its use of early-stage evidence to guide candidate selection demonstrates that inference efficiency and accuracy need not trade off; rather, conditioning on deep-layer token stabilization enables a substantial reduction in computational cost while matching or exceeding conventional ensemble-voting accuracy (Chen et al., 13 Feb 2026). This methodology distinguishes itself from length- or output-mode-based heuristics by employing mechanistic, task-agnostic signals directly rooted in model computation, marking a significant step in the science of efficient evaluative reasoning in LLMs.
