Think@n: Adaptive AI Reasoning
- Think@n is a suite of strategies for adaptive AI reasoning that leverages deep-thinking tokens and selection algorithms to optimize inference depth and performance.
- It employs mechanisms like Think-Anywhere and Think-or-Not to enable on-demand reasoning in code generation and vision-language tasks, significantly boosting accuracy while reducing inference costs.
- Experimental benchmarks confirm a strong correlation between deep-thinking metrics and answer accuracy, facilitating cost-efficient, selective reasoning across diverse applications.
Think@n encompasses a family of recent strategies and frameworks designed to probe, structure, and optimize reasoning in LLMs and vision-LLMs (VLMs), moving beyond the traditional focus on output length or test-time compute. These include methods for directly measuring inference-time depth, permitting on-demand or selective reasoning, and feedback-driven evaluation of higher-order cognition. Collectively, Think@n approaches prioritize adaptive, cost-efficient, and interpretable reasoning over brute-force token generation.
1. Deep-Thinking Tokens and the Think@n Selection Algorithm
The concept of deep-thinking tokens, introduced in "Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens" (Chen et al., 13 Feb 2026), provides a mechanistic, model-internal proxy for reasoning effort. Unlike naïve measures such as token count or per-token log-probability, deep-thinking tokens are identified layerwise: for an $L$-layer transformer, each generation step computes the Jensen–Shannon divergence (JSD) between each intermediate layer's distribution and the final-layer output, identifying the depth at which the prediction “settles.” If the settling depth $c_t$ of token $t$ falls within a late-settling regime $\mathcal{L}_{\text{deep}}$ (the deepest layers, set by a depth fraction $\rho$), the token is deemed a deep-thinking token.
The deep-thinking ratio (DTR) of a generated sequence of length $T$ is given by
$$\mathrm{DTR} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[c_t \in \mathcal{L}_{\text{deep}}\right],$$
representing the fraction of tokens for which the model requires full depth-wise computation to reach prediction convergence.
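The definition above can be illustrated with a minimal sketch. It assumes that the per-layer next-token distributions are already available (in practice they would come from applying the unembedding to intermediate hidden states), that a token "settles" at the first layer whose JSD to the final layer falls at or below a threshold, and that the late-settling regime starts at layer `ceil(depth_fraction * L)`. The function names, the threshold, and the depth fraction used here are illustrative assumptions, not the paper's reference implementation or defaults.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) of two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def settling_depth(layer_dists, threshold):
    """1-based index of the first layer within `threshold` JSD of the final layer.

    Assumption: "settling" means the first layer that satisfies the threshold.
    """
    final = layer_dists[-1]
    for depth, dist in enumerate(layer_dists, start=1):
        if jsd(dist, final) <= threshold:
            return depth
    return len(layer_dists)


def deep_thinking_ratio(per_token_layer_dists, threshold, depth_fraction):
    """Fraction of tokens whose settling depth lies in the late-settling regime."""
    num_layers = len(per_token_layer_dists[0])
    first_deep_layer = math.ceil(depth_fraction * num_layers)  # assumed regime boundary
    deep = sum(
        1
        for layer_dists in per_token_layer_dists
        if settling_depth(layer_dists, threshold) >= first_deep_layer
    )
    return deep / len(per_token_layer_dists)


# Toy example: 4 layers, 2-word vocabulary, 2 generated tokens.
early = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]  # settles at layer 2
late = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]   # settles at layer 4
ratio = deep_thinking_ratio([early, late], threshold=0.1, depth_fraction=0.75)
```

In this toy case only the second token settles in the deepest layers, so half of the tokens count as deep-thinking tokens.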
The Think@n selection algorithm leverages DTR at test time in a best-of-n decoding workflow. Given a prompt, n samples are each decoded up to a fixed prefix length; DTR is computed for these prefixes, the top fraction of samples by DTR is resumed to full decoding, and the final answer is chosen by majority vote over this reduced set. This protocol enables early rejection of unpromising traces, yielding a substantial reduction in inference cost with no loss, and often an improvement, in accuracy. The default JSD threshold, depth fraction for late settling, and prefix length are reported to be robust across models.
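The selection protocol described above can be sketched as follows. This is a simplified illustration under stated assumptions: `finish` and `extract_answer` are hypothetical callables standing in for "resume decoding" and "parse the final answer", and `keep_fraction` is a free parameter rather than a value taken from the paper.

```python
import math
from collections import Counter


def think_at_n(prefixes, prefix_dtr, finish, extract_answer, keep_fraction):
    """Keep the highest-DTR prefixes, finish only those, and majority-vote.

    prefixes       -- partially decoded samples for one prompt
    prefix_dtr     -- DTR score of each prefix, same order as `prefixes`
    finish         -- hypothetical callable: prefix -> full completion
    extract_answer -- hypothetical callable: completion -> final answer
    keep_fraction  -- share of samples resumed to full decoding
    """
    keep = max(1, math.ceil(keep_fraction * len(prefixes)))
    ranked = sorted(range(len(prefixes)), key=lambda i: prefix_dtr[i], reverse=True)
    survivors = ranked[:keep]  # every other trace is rejected early
    answers = [extract_answer(finish(prefixes[i])) for i in survivors]
    return Counter(answers).most_common(1)[0][0]


# Toy usage with stand-in callables (no model involved).
toy_answers = {"a!": "7", "b!": "42", "c!": "42", "d!": "7"}
chosen = think_at_n(
    prefixes=["a", "b", "c", "d"],
    prefix_dtr=[0.1, 0.4, 0.3, 0.2],
    finish=lambda prefix: prefix + "!",
    extract_answer=toy_answers.get,
    keep_fraction=0.5,
)
```

Only the two highest-DTR prefixes are completed here, so the vote is taken over their answers alone.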
2. Adaptive and Selective Reasoning: Think-Anywhere and Think-or-Not
Think@n also refers to a paradigm shift from monolithic, upfront reasoning (e.g., classic chain-of-thought) towards adaptive, context-driven invocation of reasoning steps.
- Think-Anywhere in Code Generation (Jiang et al., 31 Mar 2026) enables LLMs to interleave code emission with inline reasoning at any token position via special tokens such as <thinkanywhere>...</thinkanywhere>. This is achieved through a two-stage training process: a supervised “cold start” on data constructed to imitate on-demand thinking, followed by outcome-based reinforcement learning (RL) with Group Relative Policy Optimization (GRPO) driven by execution feedback. The resulting model reasons at finer granularity at points of high entropy (uncertainty), aligning reasoning effort with bottlenecks in logical complexity. In experiments across LeetCode, LiveCodeBench, HumanEval, and MBPP, Think-Anywhere raises average pass@1 from 61.0% to 70.3%, outperforming prior code RL methods, and transfers to mathematical reasoning tasks.
- Think-or-Not (TON) for Vision-LLMs (Wang et al., 22 May 2025) introduces a two-stage approach to selective reasoning. In the supervised fine-tuning stage, a "thought dropout" operation randomly ablates reasoning traces, teaching the model to accept the skip format. The subsequent GRPO reinforcement learning stage rewards answers that are both correct and format-conforming, including those where the model chooses to skip reasoning entirely. Empirical results show completion length reductions of up to 90% without loss, and sometimes with a gain, in accuracy. The skip-thought ratio systematically increases with reward during RL, indicating that the model internalizes the “when to think” decision.
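The data-format side of both methods can be illustrated with a short sketch. The first helper removes inline reasoning blocks so that the remaining code can be executed for outcome feedback; the second applies thought dropout when building a supervised training target. The `<think>` and `<answer>` wrappers, the helper names, and the dropout probability are assumptions made for illustration and are not confirmed details of either paper.

```python
import random
import re

# Non-greedy match so that several inline blocks are removed independently.
THINK_ANYWHERE = re.compile(r"<thinkanywhere>.*?</thinkanywhere>", re.DOTALL)


def strip_inline_thoughts(generation):
    """Drop inline reasoning blocks, leaving only the emitted code."""
    return THINK_ANYWHERE.sub("", generation)


def thought_dropout(thought, answer, drop_prob, rng):
    """Build a training target, blanking the reasoning trace with probability drop_prob.

    Assumption: an empty <think></think> block is the "skip" format.
    """
    if rng.random() < drop_prob:
        thought = ""
    return f"<think>{thought}</think><answer>{answer}</answer>"


code_only = strip_inline_thoughts(
    "x = 1<thinkanywhere>need a loop</thinkanywhere>\nfor i in range(3): x += i"
)
skipped = thought_dropout("count the cubes", "3", drop_prob=1.0, rng=random.Random(0))
kept = thought_dropout("count the cubes", "3", drop_prob=0.0, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the dropout reproducible when constructing a dataset.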
3. Evaluation and Practical Benchmarking
Think@n methods have been systematically benchmarked on a suite of mathematical, scientific, and code-generation tasks as well as multimodal datasets.
- Deep-Thinking Ratio (DTR) Correlation: Across 32 model-benchmark settings, DTR exhibits the strongest and most consistently positive correlation with answer accuracy, outperforming log-probability, self-certainty, and mean entropy, whereas raw token length is negatively correlated with accuracy (Chen et al., 13 Feb 2026).
- Inference Cost Savings and Accuracy Tradeoff: On AIME 2025 and GPQA-Diamond, Think@n matches or exceeds standard self-consistency accuracy while halving the number of tokens required for inference. Length-based selection methods underperform and save little compute; confidence-based pruning reduces cost but leads to accuracy drops.
- Prefix-Ablation Stability: Accurate sample ranking is possible with short prefixes; extending to full answer lengths provides diminishing returns in accuracy but doubles inference cost.
- Selective Reasoning in VLMs: On CLEVR and GeoQA, TON (3B, 7B models) improves accuracy by up to 17 percentage points over vanilla GRPO, while reducing average reasoning trace length from 939 tokens to as low as 112 (GeoQA, 7B).
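The cost claim for prefix-based selection can be made concrete with a back-of-the-envelope token count. The sketch below assumes, purely for illustration, that every completion has the same full length and that a fixed fraction of samples is resumed after the prefix; the sample count, lengths, and fraction are hypothetical numbers, not figures from the benchmarks.

```python
import math


def self_consistency_tokens(n, full_len):
    """All n samples are decoded to full length."""
    return n * full_len


def prefix_selection_tokens(n, full_len, prefix_len, keep_fraction):
    """All n samples pay for the prefix; only the kept ones pay for the remainder."""
    kept = max(1, math.ceil(keep_fraction * n))
    return n * prefix_len + kept * (full_len - prefix_len)


# Hypothetical setting: 8 samples, 1000-token answers, 50-token prefixes, half kept.
baseline = self_consistency_tokens(n=8, full_len=1000)
selected = prefix_selection_tokens(n=8, full_len=1000, prefix_len=50, keep_fraction=0.5)
cost_ratio = selected / baseline
```

Under these assumptions the prefix overhead is small relative to the full length, so the total cost approaches the kept fraction of the baseline, which is consistent with a roughly twofold saving when about half of the samples are resumed.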
4. Feedback-Driven and Multi-Agent Evaluation: THiNK Framework
The THiNK framework (Yu et al., 26 May 2025) advances model evaluation by moving from one-off accuracy to iterative, think-aloud protocols grounded in Bloom’s Taxonomy. LLMs are tasked with revising flawed mathematical word problems, critiqued by a bank of evaluation agents, each oriented to a distinct cognitive skill (remember, understand, apply, analyze, evaluate, create). A composite quality score aggregates pass rate, agent agreement, and confidence.
Closed- and open-source LLMs, after THiNK-guided revision, exhibit distinct cognitive profiles. Lower-order skills (remember, understand) reach high proficiency (88.48%, 76.02%), while apply, analyze, and create show moderate but consistent post-revision gains. Iterative, feedback-driven refinement increases alignment of model outputs with domain logic and problem structure, although transfer to mid-level application skills remains limited.
5. Algorithmic and Architectural Implementation
The core algorithms implementing Think@n variants include:
- DTR Extraction (Think@n selection): Utilizes transformer hidden states and unembedding layers to monitor convergence depth via JSD, requiring access only to intermediate activations.
- Code-Interleaved Reasoning (Think-Anywhere): Modifies the decoder to emit reasoning blocks at arbitrary points, with minimal architectural changes (semantic initialization of special-token embeddings).
- Selective Reasoning in VLMs (TON): Employs supervised fine-tuning (SFT) with thought dropout and RL with group-relative advantage, enforcing format compliance and outcome validity.
- Multi-Agent Rubric (THiNK): Embeds cognitive assessment in prompt engineering and output scoring, enabling traceable, fine-grained diagnosis of LLM cognition.
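The group-relative advantage used in the GRPO stages listed above can be sketched in a few lines. The idea is that each sampled response to a prompt is scored against the other responses in the same group rather than against a learned value function. Using the population standard deviation and a small `eps` for numerical stability are implementation choices assumed here, and the example rewards are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt; 1.0 = correct and well-formatted, 0.0 = not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every response earns the same reward, no response is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Responses that beat their group average receive positive advantages and the rest negative ones, which is how a correct response that skips reasoning can be reinforced over a longer but incorrect one.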
Table: Core Properties of Think@n Variants
| Approach | Reasoning Adaptivity | Training Regime |
|---|---|---|
| Deep-Thinking Selection | Depth-based, global | Zero-shot (test-time only) |
| Think-Anywhere (Code) | Fine-grained, on-demand | SFT + RL (teacher and GRPO) |
| Think-or-Not (VLM, TON) | Selective (skip/think) | SFT (dropout) + RL (GRPO) |
| THiNK (Evaluation) | Iterative, multilevel | Prompt/rubric-based |
6. Significance, Limitations, and Future Directions
Think@n strategies embody a move towards more instrumented, interpretable, and resource-efficient reasoning in LLMs and VLMs. By targeting inference-effort proxies that align with correctness (DTR), dynamically allocating reasoning steps (Think-Anywhere, TON), or scaffolding assessment with rubrics (THiNK), these approaches address the limitations of brute-force chain-of-thought (CoT) and yield insight into both model limitations and cognitive signatures.
Notable practical findings include the stability of core hyperparameters, the generality of deep-thinking dynamics across model scales and tasks, and the transferability of on-demand reasoning beyond code to mathematics. Open directions remain in exploring alternative divergence and confidence metrics, distinguishing between depth-wise and temporal reasoning, and extending feedback-driven evaluation to richer task modalities.
Taken together, these findings suggest that the evolving Think@n paradigm may constitute a foundation for scalable, adaptive, and cognitively valid AI reasoning, with implications for model training, deployment, and educational interventions.