Think@$n$: Efficient Deep-Thinking in LLMs
- Think@$n$ is a test-time protocol that selects candidate generations by quantifying deep reasoning through the Deep-Thinking Ratio (DTR).
- The method computes DTR by analyzing token-level stabilization across transformer layers to identify deep-thinking tokens indicative of robust multi-layer reasoning.
- Empirical results show Think@$n$ reduces inference cost by around 50% while maintaining or slightly improving accuracy compared to standard self-consistency voting methods.
Think@$n$ is a test-time scaling and selection protocol for LLMs that prioritizes candidate generations displaying a high proportion of deep-thinking tokens, as measured by the Deep-Thinking Ratio (DTR). This strategy enables effective early rejection of low-quality candidates, allowing LLMs to approximate or improve upon the accuracy of standard self-consistency voting while substantially reducing inference cost. Unlike approaches that rely on generation length or confidence-based filtering, Think@$n$ directly incorporates model-internal evidence of robust multi-layer reasoning effort, as quantified by the DTR.
1. Deep-Thinking Ratio: Definition and Computation
The Deep-Thinking Ratio (DTR) is defined as the proportion of tokens in a generated sequence whose predictive distributions only stabilize in the deepest fraction of the model's transformer layers. Formally, for an autoregressive LLM with $L$ layers and unembedding matrix $W_U$, at each generation step $t$ the hidden state after layer $\ell$ is $h_t^{(\ell)}$. The output distribution at layer $\ell$ is given by $p_t^{(\ell)} = \mathrm{softmax}\big(W_U\, h_t^{(\ell)}\big)$.
The settling depth of token $t$ is the first layer at which the running minimum Jensen–Shannon divergence between the final-layer distribution and all preceding layer distributions falls below a threshold $\tau$:

$$d_t \;=\; \min\Big\{\ell \;:\; \min_{\ell' \le \ell} \mathrm{JSD}\big(p_t^{(L)} \,\|\, p_t^{(\ell')}\big) \le \tau\Big\}.$$

A token is categorized as a deep-thinking token if its settling depth lies within the deepest fraction $\rho$ of layers, i.e., $d_t > (1-\rho)L$ for some $\rho \in (0,1)$. The Deep-Thinking Ratio for a sequence $S$ of length $|S|$ is then:

$$\mathrm{DTR}(S) \;=\; \frac{1}{|S|} \sum_{t \in S} \mathbf{1}\big[\, d_t > (1-\rho)L \,\big].$$
This quantifies the share of tokens exhibiting late-settling, multi-layer reasoning effort (Chen et al., 13 Feb 2026).
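The following is a minimal Python sketch of this computation, assuming the per-layer next-token distributions have already been extracted (e.g., by projecting each layer's hidden state through the unembedding matrix and applying a softmax). The function names and the default values for the threshold $\tau$ and the deep fraction $\rho$ are illustrative placeholders, not settings from the paper.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def settling_depth(token_layer_probs: np.ndarray, tau: float) -> int:
    """First layer at which the running-minimum Jensen-Shannon divergence
    to the final-layer distribution drops below tau.

    token_layer_probs: array of shape (L, vocab) for a single token."""
    num_layers = token_layer_probs.shape[0]
    final_dist = token_layer_probs[-1]
    running_min = float("inf")
    for layer in range(num_layers):
        # scipy's jensenshannon returns the JS *distance* (sqrt of the
        # divergence); square it to obtain the divergence itself.
        jsd = jensenshannon(token_layer_probs[layer], final_dist, base=2) ** 2
        running_min = min(running_min, jsd)
        if running_min <= tau:
            return layer + 1  # 1-indexed layer depth
    return num_layers


def deep_thinking_ratio(layer_probs: np.ndarray, tau: float = 0.1,
                        rho: float = 0.25) -> float:
    """Fraction of tokens whose settling depth lies in the deepest rho
    fraction of layers.

    layer_probs: shape (L, T, vocab) -- per-layer next-token distributions
    for a T-token sequence. tau and rho defaults are assumed for
    illustration, not prescribed by the paper."""
    num_layers, num_tokens, _ = layer_probs.shape
    cutoff = (1.0 - rho) * num_layers
    deep_tokens = sum(
        settling_depth(layer_probs[:, t, :], tau) > cutoff
        for t in range(num_tokens)
    )
    return deep_tokens / num_tokens
```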
2. Think@$n$ Protocol: Algorithm and Implementation
Think@$n$ operates as follows:
- For a prompt $x$, sample $n$ candidate generations in parallel, each up to a fixed prefix length $\ell_{\text{pref}}$ (e.g., 50 tokens).
- For each candidate $S_i$, compute the prefix-level DTR $\mathrm{DTR}_i = \mathrm{DTR}\big(S_i[{:}\ell_{\text{pref}}]\big)$.
- Rank all candidates by $\mathrm{DTR}_i$ and select the top $\eta$ fraction.
- Fully decode only the selected candidates to completion.
- Output the answer by majority vote among the selected completions.
A high-level pseudocode sketch is:

```
Input:  prompt x, number of samples n, prefix length ℓ_pref, keep fraction η
Output: final answer via majority voting

for i in 1..n do in parallel:
    generate S_i up to ℓ_pref tokens
    compute DTR_i ← DTR(S_i[:ℓ_pref])
rank candidates by DTR_i, descending
select top k = ⌈η · n⌉ indices
for each selected index i:
    resume decoding S_i to [EOS]
collect full answers {A_i | i in selected}
return majority_vote({A_i})
```
The protocol retains for further decoding only those continuations most likely to involve substantive, deep-layer reasoning, avoiding computational expense on superficial or overconfident candidates (Chen et al., 13 Feb 2026).
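For concreteness, a hedged Python sketch of the selection-and-voting loop is shown below. It abstracts the model behind two caller-supplied hooks, `sample_prefix` and `complete`, whose names and signatures are assumptions made for illustration; the prefix DTR would be computed with a routine such as `deep_thinking_ratio` above.

```python
import math
from collections import Counter
from typing import Callable, Tuple


def think_at_n(
    prompt: str,
    n: int,
    sample_prefix: Callable[[str], Tuple[str, float]],
    complete: Callable[[str], str],
    keep_fraction: float = 0.5,
) -> str:
    """Think@n selection: sample n prefixes, keep the top fraction by DTR,
    decode only those to completion, and majority-vote over the answers.

    sample_prefix(prompt) -> (prefix_text, prefix_DTR)
    complete(prefix_text) -> final_answer
    Both hooks, and the default keep_fraction, are illustrative placeholders."""
    # 1) Sample n short prefixes (in parallel in practice; sequential here)
    #    and score each by its prefix-level DTR.
    scored = [sample_prefix(prompt) for _ in range(n)]
    # 2) Keep the k = ceil(eta * n) candidates with the highest DTR.
    k = max(1, math.ceil(keep_fraction * n))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    survivors = [prefix for prefix, _ in scored[:k]]
    # 3) Resume decoding only the surviving prefixes to completion.
    answers = [complete(prefix) for prefix in survivors]
    # 4) Aggregate by majority vote over the completed answers.
    return Counter(answers).most_common(1)[0][0]
```

In practice, `sample_prefix` would generate up to $\ell_{\text{pref}}$ tokens with per-layer hidden states retained and score the prefix with the DTR routine sketched earlier, while `complete` would resume decoding from the cached prefix.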
3. Empirical Evaluation and Performance Characteristics
Think@$n$ has been benchmarked across diverse reasoning-intensive tasks (AIME 2024/2025, HMMT 2025, GPQA-Diamond) and models (GPT-OSS, DeepSeek-R1, Qwen3). Its core findings include:
- Accuracy Correlation: DTR is a more robust and consistently positive proxy for accuracy than generation length, log-probability, or entropy. Across models and datasets, DTR correlates with bin-averaged accuracy (correlations up to +0.926), outperforming length-based proxies, which may exhibit negative correlation (Chen et al., 13 Feb 2026).
- Inference Cost Reduction: On representative large models (e.g., GPT-OSS-120B-medium, Qwen3-4B-Thinking), Think@$n$ achieves roughly a 50% reduction in total token-generation cost compared to standard self-consistency (Cons@$n$) at the same $n$ and with identical majority-vote aggregation.
- Accuracy Preservation and Improvement: Think@$n$ matches or mildly exceeds Cons@$n$ accuracy in all tested cases. On AIME 2025, accuracy improved from 92.7% (Cons@48) to 94.7% (Think@48), with total tokens generated falling from 307.6k to 155.4k (a reduction of roughly 49%).
The following table summarizes representative results for Think@48 versus Cons@48:
| Model/Benchmark | Cons@48 Accuracy | Think@48 Accuracy | Cost Reduction |
|---|---|---|---|
| GPT-OSS-120B AIME 2025 | 92.7% | 94.7% | –49% |
| Qwen3-4B-Thinking HMMT25 | 63.3% | 66.7% | –50% |
Think@$n$ consistently outperforms length-based and self-certainty-based filtering methods under identical generation budgets (Chen et al., 13 Feb 2026).
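As an illustration of how the reported DTR–accuracy relationship can be checked, the sketch below bins sampled generations by DTR, averages correctness within each bin, and correlates bin-level DTR with bin-level accuracy. The bin count and the use of Pearson correlation are assumptions for illustration; the paper's exact binning protocol may differ.

```python
import numpy as np


def bin_averaged_correlation(dtr: np.ndarray, correct: np.ndarray,
                             n_bins: int = 10) -> float:
    """Correlate bin-level mean DTR with bin-level accuracy.

    dtr:     per-generation Deep-Thinking Ratios, shape (N,)
    correct: per-generation 0/1 correctness flags, shape (N,)
    The bin count and the use of Pearson r are illustrative choices."""
    edges = np.quantile(dtr, np.linspace(0.0, 1.0, n_bins + 1))
    centers, accuracies = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dtr >= lo) & (dtr <= hi)
        if mask.any():
            centers.append(dtr[mask].mean())
            accuracies.append(correct[mask].mean())
    if len(centers) < 2:
        raise ValueError("need at least two non-empty bins")
    return float(np.corrcoef(centers, accuracies)[0, 1])
```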
4. Comparison with Alternative Deep-Thinking Metrics
Alternative operationalizations of deep thinking in LLMs have been proposed. For example, AdaptThink (Zhang et al., 19 May 2025) introduces a Deep-Thinking Ratio distinct from internal-layer settling:
- AdaptThink DTR: The DTR is the dataset-level ratio between responses generated in “Thinking mode” (explicit chain-of-thought decoding) and “NoThinking mode” (immediate answer). It is controlled by a constrained reinforcement learning objective that adjusts the tradeoff via a Lagrange-style penalty parameter, maximizing NoThinking responses subject to maintaining or improving accuracy (a schematic form is sketched after this list).
- Instance Adaptivity: AdaptThink achieves a superior accuracy–efficiency tradeoff by learning an instance-adaptive DTR. On easier problem sets, NoThinking is favored (87% of responses on GSM8K), reducing cost. On difficult problems, Thinking mode dominates (40% NoThinking, 60% Thinking on the hardest MATH500 problems), preserving accuracy.
- Implementation Difference: AdaptThink’s DTR is not derived from token-level layerwise dynamics but functions as a high-level behavioral mode selector.
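A schematic rendering of the constrained objective described in the first bullet, written in generic notation; the symbols, the reference policy $\pi_{\theta_{\text{ref}}}$, and the exact form of the accuracy constraint are chosen for illustration and are not taken verbatim from the AdaptThink paper:

$$\max_{\theta}\; \mathbb{E}_{x}\Big[\mathbf{1}\{\pi_\theta \text{ answers } x \text{ in NoThinking mode}\}\Big] \quad \text{s.t.}\quad \mathrm{Acc}(\pi_\theta) \;\ge\; \mathrm{Acc}(\pi_{\theta_{\text{ref}}}),$$

relaxed into a Lagrangian with multiplier $\lambda \ge 0$:

$$\mathcal{L}(\theta, \lambda) \;=\; \mathbb{E}_{x}\Big[\mathbf{1}\{\text{NoThinking}\}\Big] \;+\; \lambda\,\Big(\mathrm{Acc}(\pi_\theta) - \mathrm{Acc}(\pi_{\theta_{\text{ref}}})\Big).$$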
By contrast, Think@$n$'s DTR is a mechanistic property computed from internal prediction convergence rather than from dialog behavior or output mode (Chen et al., 13 Feb 2026; Zhang et al., 19 May 2025).
5. Interpretive Significance and Applications
The core insight underlying Think@$n$ is that not all long generations or high-confidence outputs reflect genuine, substantive reasoning. The internal layerwise stabilization criterion for deep-thinking tokens offers a direct proxy for cognitive effort expended by the model. This enables:
- Efficient candidate selection at decoding time, concentrating computational resources on the most promising continuations.
- Consistent accuracy by reliably upweighting sequences with demonstrable multi-layer processing, independent of overall length.
- General applicability across diverse domains (math, science, open-ended reasoning), with stable correlation improvements over confidence- and length-based heuristics.
- Potential in ensemble and budgeted QA: By integrating Think@$n$ into inference pipelines, downstream systems can achieve self-consistency-comparable performance with reduced latency and cost, offering practical advantages in large-scale deployment scenarios.
A plausible implication is that DTR-style metrics may be extendable to other architectures and control schemes wherein reasoning effort is not externally labelled but latent in model dynamics.
6. Relationship to Other Systems and Limitations
In dialog systems like DeepPsy-Agent (Chen et al., 20 Mar 2025), the term “deep thinking” refers to the system’s ability to conduct slow, multi-stage reasoning, evaluated via task-specific ablation metrics (e.g., root-cause identification, reduction in ineffective suggestions). No scalar DTR, token-level measurement, or formula is provided. Thus, Think@$n$’s DTR and operational definition are fundamentally distinct: they offer a model-structural, not behavior-labelled or stage-annotated, quantification of reasoning.
A key limitation of Think@$n$ is its reliance on access to model internals (layerwise hidden states and projections). Its efficacy also presumes that inference-time computational depth is meaningfully aligned with solution quality; this assumption has been validated on current mathematical and scientific benchmarks (Chen et al., 13 Feb 2026) but requires further study for open-domain tasks.
7. Conclusion
Think@$n$ exemplifies a principled advance in reasoning-focused decoding for LLMs by leveraging internal convergence dynamics via the Deep-Thinking Ratio. Its use of early-stage evidence to guide candidate selection demonstrates that inference efficiency and accuracy need not trade off; rather, conditioning on deep-layer token stabilization enables a substantial reduction in computational cost while matching or exceeding conventional ensemble voting accuracy (Chen et al., 13 Feb 2026). The methodology distinguishes itself from length- or output-mode-based heuristics by employing mechanistic, task-agnostic signals rooted directly in model computation, marking a significant step toward efficient evaluative reasoning in LLMs.