Deep-Thinking Tokens in LLM Reasoning
- Deep-thinking tokens are model-generated or selected tokens essential for efficient and accurate LLM reasoning, defined through necessity, importance scores, and mutual information metrics.
- They are identified using methods like conditional token selection, layerwise analysis, and perplexity deltas, which reveal their role in sustaining effective internal information flow.
- Their extraction and compression through techniques such as CTS enhance inference speed and accuracy across various tasks, including multi-hop QA and mathematical problem solving.
A deep-thinking token is a model-generated or selected token within a chain of thought (CoT) whose presence is empirically or functionally necessary for the derivation of a correct answer or for the realization of nontrivial reasoning behavior. Deep-thinking tokens serve as the backbone of model reasoning: their identification and manipulation are central to efficient and accurate LLM inference and training. Across contemporary research, these tokens are characterized using metrics capturing importance for answer prediction, internal information flow, interpretability, and computational efficiency, and are increasingly leveraged for context compression, adaptive compute allocation, reasoning calibration, and evaluation of LLMs’ reasoning capabilities.
1. Formal Definitions and Theoretical Foundations
Deep-thinking tokens can be defined in multiple, complementary ways, reflecting their functional necessity, information-theoretic significance, or importance for model decision-making:
1.1. Conditional Necessity Definition
Given a CoT sequence $c = (t_1, \ldots, t_n)$ for a question $q$ with answer $a$, deep-thinking tokens are the minimal subsequence $c^{*} \subseteq c$ such that
$$P(a \mid q, c^{*}) \approx P(a \mid q, c),$$
and any further removal degrades $P(a \mid q, c^{*})$, where $P(a \mid q, \cdot)$ is the conditional answer distribution. This characterization separates minimal, answer-supporting tokens from “filler” or redundant members of the CoT (Yuan et al., 23 May 2025).
1.2. Importance-Score Definition
For token $t_i$, its conditional importance score is
$$I(t_i) = \mathrm{PPL}(t_i \mid t_{<i}) - \mathrm{PPL}(t_i \mid a, t_{<i}),$$
where $\mathrm{PPL}(\cdot)$ is token-level perplexity as assessed by a reference model. A high $I(t_i)$ signifies that knowing the answer $a$ decisively reduces the uncertainty in predicting $t_i$; thus, $t_i$ is deep-thinking if $I(t_i)$ exceeds a learned threshold (Yuan et al., 23 May 2025).
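A minimal sketch of how such a perplexity-delta score could be computed, assuming a reference model exposes token probabilities with and without answer conditioning; the raw-difference form below is one plausible instantiation, not necessarily the paper's exact formula:

```python
def importance_scores(p_uncond, p_cond):
    """Perplexity-delta importance: PPL(t_i | t_<i) - PPL(t_i | a, t_<i).

    p_uncond[i]: reference-model probability of token i given only the
    preceding CoT tokens; p_cond[i]: the same probability when the answer
    is also in context. Token-level perplexity is 1/p, so the delta is
    large exactly when knowing the answer sharply reduces uncertainty.
    """
    return [1.0 / pu - 1.0 / pc for pu, pc in zip(p_uncond, p_cond)]

def select_deep_thinking(scores, threshold):
    """Indices whose importance exceeds a (learned) threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

# Toy example: token 1 becomes far more predictable once the answer is known.
scores = importance_scores([0.50, 0.02, 0.40], [0.55, 0.60, 0.42])
print(select_deep_thinking(scores, threshold=10.0))  # [1]
```

Here token 1 has a large delta (1/0.02 ≈ 50 unconditioned vs. 1/0.60 ≈ 1.7 conditioned), so it alone crosses the threshold.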
1.3. Layerwise Revision-Based Definition
A token is deep-thinking if its predicted next-token distribution $p_\ell$ (after layer $\ell$) does not converge to the final-layer distribution $p_L$ until a deep layer, i.e., it retains a large Jensen–Shannon divergence $\mathrm{JSD}(p_\ell \,\|\, p_L)$ until $\ell > \tau L$ for a hyperparameter $\tau \in (0, 1)$. The deep-thinking ratio (DTR) is then the fraction of tokens per generation with such delayed convergence (Chen et al., 13 Feb 2026).
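A minimal logit-lens-style sketch of the DTR computation, assuming per-layer next-token distributions are available (e.g., by projecting intermediate hidden states through the unembedding); the thresholds `tau` and `delta` are illustrative:

```python
import math

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    p, q = [x / sp for x in p], [x / sq for x in q]
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    kl = lambda u, v: sum(x * math.log(x / y) for x, y in zip(u, v))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def deep_thinking_ratio(layer_dists, tau=0.8, delta=0.05):
    """Fraction of tokens whose per-layer next-token distribution stays far
    (JSD > delta) from the final-layer one until after layer tau * L."""
    deep = 0
    for dists in layer_dists:  # dists: one next-token distribution per layer
        L = len(dists)
        final = dists[-1]
        # first layer whose prediction has settled near the final one
        settle = next(l for l, p in enumerate(dists) if jsd(p, final) <= delta)
        if settle > tau * (L - 1):  # converged only in the deepest layers
            deep += 1
    return deep / len(layer_dists)

# Token A's prediction settles immediately; token B's only at the last layer.
tok_a = [[0.7, 0.3]] * 4
tok_b = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
print(deep_thinking_ratio([tok_a, tok_b]))  # 0.5
```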
1.4. Information-Theoretic Definition
A token at step $t$ is identified as “deep-thinking” if its intermediate hidden state $h_t$ produces a peak in the mutual information (MI) $I(h_t; a)$ with the answer $a$, as measured by the HSIC estimator. MI peaks mark points where the token’s internal representation provides a sudden increase in answer predictiveness (Qian et al., 3 Jun 2025).
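The HSIC dependence measure used here as an MI proxy can be sketched as follows (biased empirical HSIC with RBF kernels; the bandwidth and toy data are illustrative, not the paper's configuration):

```python
import numpy as np

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC with RBF kernels: a kernel dependence measure
    between paired samples (rows) of X and Y, used as a proxy for mutual
    information between hidden states and answer representations."""
    n = X.shape[0]

    def rbf(Z):
        sq = np.sum(Z ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
        return np.exp(-d2 / (2.0 * sigma ** 2))

    K, L = rbf(X), rbf(Y)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
dependent = hsic(X, X + 0.1 * rng.normal(size=(100, 3)))
independent = hsic(X, rng.normal(size=(100, 3)))
print(dependent > independent)  # True: dependent pairs score higher
```

In the paper's usage, rows of `X` would be hidden states at a given step across samples and `Y` the corresponding answer representations; a peak of this statistic over steps flags a deep-thinking token.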
2. Mechanisms for Scoring, Selection, and Compression
A central challenge is the efficient identification, extraction, or internalization of deep-thinking tokens, to optimize model cost and performance.
2.1. Conditional Token Selection (CTS)
CTS operates by:
- Training a small reference model (RM) on concise expert reasoning steps;
- For each token $t_i$ in a long CoT, calculating the conditional importance score $I(t_i)$ (see §1.2) within each segment;
- Retaining only the top $\alpha$-quantile of tokens per segment (configurable compression);
- Fine-tuning the main model on the compressed CoTs, typically with 3 epochs, batch size 16, and a fixed learning rate (Yuan et al., 23 May 2025).
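The selection step can be sketched as per-segment top-quantile retention (a simplified illustration: in CTS proper, the scores come from the trained reference model and segmentation follows reasoning steps rather than fixed windows):

```python
import math

def cts_compress(tokens, scores, segment_len=4, alpha=0.5):
    """Keep the top alpha-quantile of tokens per segment by conditional
    importance score, preserving the original token order within segments."""
    kept = []
    for start in range(0, len(tokens), segment_len):
        seg = list(range(start, min(start + segment_len, len(tokens))))
        k = max(1, math.ceil(alpha * len(seg)))          # tokens to retain
        top = sorted(seg, key=lambda i: scores[i], reverse=True)[:k]
        kept.extend(sorted(top))                         # restore order
    return [tokens[i] for i in kept]

toks = list("ABCDEFGH")
scr = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.6, 0.1]
print(cts_compress(toks, scr, segment_len=4, alpha=0.5))  # ['A', 'C', 'F', 'G']
```

Fine-tuning on the retained subsequence is what internalizes the compressed reasoning into the main model.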
CTS has outperformed naive compression baselines (TokenSkip, LLMLingua) across multiple benchmarks, yielding simultaneous improvements in reasoning accuracy and token efficiency (Table 1).
Table 1: CTS Compression Outcomes (Qwen2.5-14B-Instruct)
| Method | Target Ratio (Actual) | MATH500 Acc. / Tokens | GPQA Acc. / Tokens |
|---|---|---|---|
| Original | 1.0 | 90.2% / 5012 | 51.5% / 12000 |
| CTS α=0.9 | 0.9 (0.87) | 91.6% / 4703 | 60.6% / 10413 |
| CTS α=0.5 | 0.5 (0.58) | 75.6% / 2036 | 46.5% / 2906 |
2.2. Reference Models, Perplexity Deltas, and Token Routing
- Reference Model Training: Fine-tuned on minimal-step expert traces, emphasizing tokens that map effectively from question to answer.
- Segment-Based Selection: Divides long CoTs to localize scoring, reducing conditional-independence errors.
- Fine-Tuning: Trains models on compressed chains, maintaining accuracy as token count is reduced (Yuan et al., 23 May 2025).
3. Empirical Characterization and Model Performance
Empirical investigations demonstrate that most long-chain-of-thought samples overproduce tokens that can be excised with marginal impact or even performance benefits.
- Compression-Accuracy Tradeoff: For GPQA, reducing tokens by 13.2% yields a +9.1pp accuracy gain; a 75.8% reduction costs only 5pp of accuracy (Yuan et al., 23 May 2025).
- Efficiency Gains: Substantial reductions in training token count correlate with improved or equal accuracy for math, multi-hop QA, and code tasks.
- Generalization: Extraction and compression generalize across LLaMA, Qwen, and retrieval-augmented LLMs, improving inference latency and compute utilization.
The CTS tradeoff curve (accuracy vs. token retention ratio α) displays a “free lunch” region with concurrent gains in accuracy and efficiency before a graceful degradation (Yuan et al., 23 May 2025).
4. Extensions, Implications, and Open Questions
4.1. Task Generality and Model Adaptation
- The deep-thinking token framework naturally adapts to diverse settings: commonsense reasoning, code synthesis, retrieval-augmented reasoning, and more (Yuan et al., 23 May 2025).
- Budget-guided generation further enables real-time, budget-aware control, trading reasoning depth against speed via soft Gamma-distributed length predictors (Li et al., 16 Jun 2025).
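One way such a Gamma-distributed length predictor could drive soft stopping pressure is sketched below; the shape/scale parameters and the boost rule are hypothetical, and Budget Guidance's actual controller differs in detail:

```python
import math

def gamma_cdf(x, k, theta):
    """CDF of Gamma(shape k, scale theta): the regularized lower incomplete
    gamma P(k, x/theta), computed via its standard series expansion."""
    if x <= 0:
        return 0.0
    x = x / theta
    a = k
    term = 1.0 / k
    total = term
    for _ in range(500):
        a += 1.0
        term *= x / a
        total += term
        if term < total * 1e-12:
            break
    return total * math.exp(-x + k * math.log(x) - math.lgamma(k))

def stop_boost(remaining_budget, k, theta):
    """Soft budget guidance (sketch): the likelier the predicted remaining
    thinking length is to exceed the budget, the more the end-of-thinking
    token would be up-weighted during decoding."""
    return 1.0 - gamma_cdf(remaining_budget, k, theta)  # P(length > budget)

# With mean predicted length k*theta = 100 tokens, a tight budget of 10
# tokens yields much stronger stopping pressure than a loose budget of 200.
print(stop_boost(10, 2, 50) > stop_boost(200, 2, 50))  # True
```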
4.2. Model Interpretability and Limitations
- Aggressive compression, while efficient, can erode CoT interpretability—a limitation for explainable AI paradigms (Yuan et al., 23 May 2025).
- Success depends on the reference model’s access to high-quality, domain-specific distilled reasoning steps, which may not exist for all tasks.
- Residual error arises from the method’s conditional-independence assumptions, particularly when token interdependencies are strong (Yuan et al., 23 May 2025).
- Perplexity-based importance may miss rare, high-leverage tokens; future methods may introduce gradient-based or game-theoretic attributions for token selection.
4.3. Information-Theoretic and Layerwise Perspectives
- Deep-thinking tokens coincide with MI peaks, moments when the model’s internal representation becomes highly predictive of the answer; suppressing such tokens yields pronounced accuracy drops (up to –15pp) compared to random tokens (Qian et al., 3 Jun 2025).
- Layerwise analysis reveals that deep-thinking tokens resist prediction “settling” until late transformer layers, a property highly correlated with correct model reasoning (Chen et al., 13 Feb 2026).
5. Comparative and Practical Contexts
5.1. Comparison to Naive or Unconditional Compression
CTS and related importance-based methods dominate length-based or unconditional token pruning in both accuracy retention and token savings (Yuan et al., 23 May 2025).
5.2. Relation to Other Deep and Token-Efficient Mechanisms
- Latent Codebooks: Instead of emitting explicit CoT tokens, these encode reasoning as a mixture of prototypical high-level reasoning vectors. Fast inference is achieved by fetching a handful of “thinking token” vectors, bypassing the generation of thousands of tokens (Zheng et al., 28 Sep 2025).
- Thinking States: Semi-latent reasoning pipelines generate explicit deep-thinking tokens at chunk boundaries, achieving comparable or superior accuracy with wall-clock speedups (Amos et al., 9 Feb 2026).
- Budget Guidance: Predictive controllers softly constrain token count during generation, maintaining accuracy at a fraction of baseline token budgets (Li et al., 16 Jun 2025).
- Token-Efficiency in RL: Dual-policy RL strategies (DuP-PO) penalize superfluous thinking tokens, reducing runaway “thinking traps” while increasing concise correct reasoning (Ding et al., 30 Jun 2025).
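The codebook fetch from the Latent Codebooks item above can be sketched as a nearest-prototype lookup (dimensions, similarity choice, and names are illustrative; the cited system's architecture differs in detail):

```python
import numpy as np

def fetch_thinking_vectors(query, codebook, k=3):
    """Return indices and vectors of the k codebook prototypes most similar
    (by cosine) to a query representation; these few 'thinking token'
    vectors stand in for generating thousands of explicit CoT tokens."""
    q = query / np.linalg.norm(query)
    C = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    idx = np.argsort(-(C @ q))[:k]                # top-k by cosine similarity
    return idx, codebook[idx]

rng = np.random.default_rng(1)
codebook = rng.normal(size=(64, 8))               # 64 latent reasoning prototypes
query = codebook[5] + 0.01 * rng.normal(size=8)   # query near prototype 5
idx, vecs = fetch_thinking_vectors(query, codebook, k=3)
print(idx[0])  # 5: the prototype the query was built from
```

The design point is the cost asymmetry: retrieval is a single matrix-vector product over a small codebook, versus thousands of autoregressive decoding steps for an explicit CoT.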
6. Broader Implications and Future Directions
The deep-thinking token paradigm marks a shift toward reasoning-efficient LLMs, where the focus is not solely on the number of reasoning steps but on the informational content and computational necessity of each token. Open problems include:
- Scaling reference-model-based scoring to broad or low-data domains.
- Integrating deep-thinking token identification into end-to-end model objectives or as an adaptive internal control mechanism.
- Extending layerwise and information-theoretic analysis to unsupervised and multi-modal settings.
- Systematically studying how deep-thinking tokens propagate through reasoning trajectories, influence calibration, and affect robustness to adversarial or spurious patterns.
Deep-thinking tokens foundationally enable model architectures and training schemes that preserve or enhance reasoning quality while drastically increasing efficiency, and their principled discovery and exploitation remain central to next-generation LLM research (Yuan et al., 23 May 2025, Chen et al., 13 Feb 2026, Zheng et al., 28 Sep 2025, Amos et al., 9 Feb 2026).