Learned Continue-Thinking Tokens

Updated 18 April 2026

Learned continue-thinking tokens are specialized symbols integrated into LLM sequences to trigger, extend, or compress internal reasoning with both discrete and continuous representations.
They encompass various forms—such as discrete markers, information-theoretic peaks, latent planning tokens, and soft embeddings—to optimize compute allocation and gradient flow.
Empirical evidence shows that these tokens improve reasoning accuracy and efficiency, though excessive insertion may lead to diminished performance.

Learned continue-thinking tokens are specialized symbols—discrete or continuous, explicit or latent—incorporated into LLM sequences to trigger, extend, or compress the model’s internal reasoning between observed or generated content. These mechanisms are designed to allocate extra compute, enhance gradient flow, stabilize information dynamics, or adaptively focus on the most salient steps for complex reasoning tasks. Unlike naive repetition or static tokens, the learned variants are algorithmically or empirically associated with information-theoretic markers (such as mutual information peaks), optimized embedding representations, or explicit training signals, yielding direct improvements in accuracy, compute efficiency, or reasoning faithfulness compared to baseline approaches.

1. Formal Definitions and Construction of Continue-Thinking Tokens

Continue-thinking tokens manifest in multiple concrete forms:

Discrete thinking tokens: In Herel & Mikolov's “Thinking Tokens” framework, a special symbol (notation: ‹T›) is inserted after each word $w_i$ of an input sequence of $L$ words, producing a new sequence $x' = [w_1, \langle T\rangle, w_2, \langle T\rangle, \ldots, w_L, \langle T\rangle]$; more generally, $N$ copies of ‹T› can be interleaved per word (Herel et al., 2024). These tokens serve purely as “think time”: no loss is backpropagated through them.
Information-theoretic peaks: A distinct family emerges from tracking mutual information between the model's intermediate states $h_t$ and the gold answer's representation $h_y$ , flagging “MI peaks” as the locus of true reflection; the associated natural-language tokens (e.g., "Hmm", "Therefore") are then selected as "thinking tokens" (Qian et al., 3 Jun 2025).
Planning or latent reasoning tokens: In supervised or RL frameworks (e.g., SFT, DPO, GRPO), explicit markers such as <think> ... </think> designate reasoning regions of the sequence. The policy may explicitly allocate probability mass for “thinking” versus “answer” modes (Singla et al., 18 Oct 2025).
Continuous (multiplex or soft) tokens: Embeddings that represent mixtures of multiple possible next steps, either via the multiplex merge of several candidate token embeddings at each step (Tang et al., 13 Jan 2026) or via softmax-weighted linear combinations with added noise (the “soft tokens” of Butt et al., 23 Sep 2025), introduce more expressivity and stochasticity.
Learned embeddings for specialized use: Single tokens (e.g., <|continue-thinking|>) added to the vocabulary and optimized for the forward pass while the backbone remains frozen (Ringel et al., 12 Jun 2025).

Mechanistically, all continue-thinking tokens are unified by their role as either explicit or implicit triggers for extra computation, latent information processing, or controlled extension of the CoT (chain-of-thought) trajectory.
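
The discrete variant described above can be illustrated with a minimal sketch of the interleaving step. The function name `interleave_thinking_tokens` and the ASCII surface form `<T>` are assumptions made for illustration only; the source specifies just that $N$ copies of the thinking token follow each word.

```python
from typing import List

THINK = "<T>"  # hypothetical surface form of the thinking token


def interleave_thinking_tokens(words: List[str], n: int = 1,
                               token: str = THINK) -> List[str]:
    """Return a new sequence with `n` thinking tokens after each word."""
    if n < 0:
        raise ValueError("n must be non-negative")
    out: List[str] = []
    for word in words:
        out.append(word)
        out.extend([token] * n)  # n copies of the thinking token per word
    return out


augmented = interleave_thinking_tokens(["2", "+", "3", "="], n=1)
```

With `n=0` the input is returned unchanged, which gives a convenient baseline when comparing different values of $N$.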

2. Training Protocols and Optimization Objectives

The methodology for learning or leveraging continue-thinking tokens varies significantly by framework:

Supervised augmentation: Thinking tokens are masked out of the loss directly, using a mask $\delta_t$ in the objective $L(\theta) = -\sum_{t=1}^{T'} \delta_t \log P_\theta(y'_t | y'_{<t})$ that removes thinking-token positions from the sum (Herel et al., 2024).
Information-theoretic selection: Thinking tokens are mined post hoc from generative runs of a base LLM by labeling tokens at MI peaks via Hilbert–Schmidt Independence Criterion (HSIC) applied to $(h_t, h_y)$ pairs (Qian et al., 3 Jun 2025).
Conditional token selection (compression): Each reasoning token $y_i$ is assigned an importance score $r_i = \mathrm{PPL}_{\rm unc}(y_i) - \mathrm{PPL}_{\rm cond}(y_i)$ (measuring its drop in perplexity once the final answer is known), and only tokens above a quantile threshold are preserved for training, enforcing an adaptive compression ratio (Yuan et al., 23 May 2025).
Reinforcement learning of embeddings: In the RL-based approach, only the embedding for <|continue-thinking|> is updated through policy gradients; the reward is a function of answer correctness and required format, and generation is forced to substitute <|continue-thinking|> for the end-of-think token up to a preset budget (Ringel et al., 12 Jun 2025).
Latent continuous token RL: In “soft token” methods, the entire CoT phase is trained by policy gradient with softmax-weighted or multiplexed token embedding mixtures at each step; exploration is implemented via Gaussian noise injection into embedding space, and correctness signals guide reward (Butt et al., 23 Sep 2025, Tang et al., 13 Jan 2026).
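
As a rough illustration of the masked objective in the supervised-augmentation item, the sketch below evaluates $-\sum_t \delta_t \log P_\theta(y'_t | y'_{<t})$ from per-token probabilities. It assumes that $\delta_t = 0$ at thinking-token positions and $1$ elsewhere, which matches the description of loss omission but is not spelled out in the text; the function name `masked_nll` is hypothetical.

```python
import math
from typing import Sequence


def masked_nll(token_probs: Sequence[float],
               is_thinking: Sequence[bool]) -> float:
    """Negative log-likelihood summed over non-thinking positions only.

    token_probs[t] is the model probability of the target token at step t.
    is_thinking[t] marks positions occupied by a thinking token.
    """
    if len(token_probs) != len(is_thinking):
        raise ValueError("inputs must have the same length")
    total = 0.0
    for prob, thinking in zip(token_probs, is_thinking):
        if thinking:
            continue  # assumed delta_t = 0: position contributes no loss
        total -= math.log(prob)  # assumed delta_t = 1
    return total


loss = masked_nll([0.5, 0.1, 0.25], [False, True, False])
```

The low probability at the masked middle position has no effect on `loss`, which is the point of the mask: the model is not penalized for whatever it predicts at a thinking-token slot.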

The joint effect is optimization for improved chain-of-thought expressivity, test-time scaling, exploration-exploitation balance, or data efficiency.
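
The conditional token selection rule listed above can be sketched as follows. Two assumptions are made here and are not taken from the source: per-token perplexity is computed as $\exp(-\log p)$, and the quantile threshold is implemented as keeping the top `keep_ratio` fraction of tokens by score. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def importance_scores(logp_unc: Sequence[float],
                      logp_cond: Sequence[float]) -> List[float]:
    """r_i = PPL_unc(y_i) - PPL_cond(y_i), with PPL taken as exp(-log p)."""
    return [math.exp(-u) - math.exp(-c) for u, c in zip(logp_unc, logp_cond)]


def select_tokens(tokens: Sequence[str], scores: Sequence[float],
                  keep_ratio: float) -> List[str]:
    """Keep the highest-scoring fraction of tokens, preserving their order."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [tok for i, tok in enumerate(tokens) if i in keep]


kept = select_tokens(["so", "x", "=", "4"], [0.1, 2.0, 0.5, 3.0], keep_ratio=0.5)
```

A token whose perplexity falls sharply once the answer is known receives a high score and survives compression; lowering `keep_ratio` corresponds to a stronger compression ratio.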

3. Architectural Integration and Inference Workflows

No fundamental architectural change is required for incorporating continue-thinking tokens:

Vanilla RNN/Transformer: For explicit tokens (e.g., ‹T›), integration into RNN or Transformer models only requires insertion into the vocabulary and context window, and loss-masking during training (Herel et al., 2024).
Modular planning slots: Chain-of-thought markers or reasoning regions (e.g., <think>...</think>) are implemented as special tokens delimiting policy subspaces in SFT/DPO/GRPO (Singla et al., 18 Oct 2025).
Latent reasoning blocks: The “Thinking States” method augments a backbone model with lightweight, unidirectional blocks for reasoning state generation, compression, and recurrent embedding update, supporting full parallelization via teacher-forcing (Amos et al., 9 Feb 2026).
Multiplex continuous tokens: Token-wise branch-merge mechanisms, which sample several candidate tokens and merge their embeddings, function as a drop-in to standard Transformer embedding layers, requiring only sequence-length increases to accommodate multiplex tokens (Tang et al., 13 Jan 2026).
Conditional token pruning/compression: At each inference step, conditional importance thresholds determine which tokens are retained or suppressed on the fly, reducing unnecessary computation (Yuan et al., 23 May 2025).
Budget-forcing at test time: Fixed or learned continue-thinking tokens are injected explicitly during inference to extend CoT length and elicit further reasoning (Ringel et al., 12 Jun 2025, Qian et al., 3 Jun 2025).
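
Budget forcing at test time can be sketched as a thin wrapper around a decoding loop. The `next_token` callable stands in for a real model and is purely hypothetical, as are the function name and the `</think>` end-of-think string; the only behavior taken from the text is that the continue-thinking token replaces the end-of-think token until a preset budget is exhausted.

```python
from typing import Callable, List

END_THINK = "</think>"               # assumed end-of-think marker
CONTINUE = "<|continue-thinking|>"   # token name as given in the text


def generate_with_budget(next_token: Callable[[List[str]], str],
                         prompt: List[str], budget: int,
                         max_steps: int = 256) -> List[str]:
    """Decode until end-of-think, overriding it `budget` times."""
    seq = list(prompt)
    used = 0
    for _ in range(max_steps):
        tok = next_token(seq)
        if tok == END_THINK and used < budget:
            seq.append(CONTINUE)  # force the model to keep reasoning
            used += 1
            continue
        seq.append(tok)
        if tok == END_THINK:
            break  # budget exhausted: accept the end of the thinking phase
    return seq


def toy_model(seq: List[str]) -> str:
    """Stub decoder: emits one reasoning step, then tries to stop."""
    return END_THINK if seq and seq[-1] == "step" else "step"


trace = generate_with_budget(toy_model, ["<think>"], budget=2)
```

With `budget=0` the wrapper reduces to ordinary decoding, so the same loop serves as both baseline and intervention.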

This flexibility enables application across standard, distilled, and RL-optimized LLMs, with adjustments focused mainly on data augmentation, pre-processing, or embedding-table expansion.

4. Empirical Outcomes, Trade-Offs, and Theoretical Insights

Continue-thinking tokens yield diverse empirical gains:

Approach	Computation Cost	Accuracy/Metric Gains	Additional Findings
Thinking Tokens (‹T›, N=1) (Herel et al., 2024)	Mild increase (seq. length)	Maths queries: PPL reduced 16.8→13.1, 24.3→19.8	More than N=1 degrades perplexity; no core arch. change needed
Info-theoretic peaks (Qian et al., 3 Jun 2025)	No runtime overhead	Suppressing thinking tokens: accuracy drops 50%→15%	Gains: +1–3% ABS on GSM8K/MATH500, +10–15% on AIME24; extendable to new domains
Conditional Token Selection (Yuan et al., 23 May 2025)	Substantial token savings	GPQA: +9.1% accuracy at –13% tokens; –5% with –75.8% tokens	Performance robust to strong compression; proven redundancy in CoT
Learned embedding (Ringel et al., 12 Jun 2025)	No LLM retraining	GSM8K: +4.2% ABS (vs. +1.3% for fixed “Wait”)	Improved chain-of-thought quality, generalization to more continuation steps
Multiplex/Soft tokens (Butt et al., 23 Sep 2025, Tang et al., 13 Jan 2026)	Trivial overhead	Pass@32: +2–5% over hard CoT; higher OOD preservation	Training with soft tokens followed by hard inference optimal; higher diversity and stability

Theoretical analysis confirms that mutual information concentration at select steps (MI peaks) matches tokens empirically important for correct reasoning, and that the overall error rate is bounded above and below by cumulative MI contributions at these steps (Qian et al., 3 Jun 2025). Multiplex and continuous tokens realize information superposition, enabling more efficient encoding and exploration of reasoning paths (Butt et al., 23 Sep 2025, Tang et al., 13 Jan 2026).
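
To make the idea of superposition concrete, the following sketch builds one continuous token as a softmax-weighted mixture of vocabulary embeddings with optional Gaussian noise. It is a simplified illustration in plain Python, not the procedure of either cited paper; the function names, the temperature parameter, and the toy embedding table are all assumptions.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def soft_token(logits: Sequence[float],
               embedding_table: Sequence[Sequence[float]],
               noise_std: float = 0.0,
               rng: Optional[random.Random] = None) -> List[float]:
    """Probability-weighted average of embeddings, plus optional noise."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    mixed = [sum(p * row[d] for p, row in zip(probs, embedding_table))
             for d in range(dim)]
    if noise_std > 0.0:
        rng = rng or random.Random()
        # Gaussian perturbation in embedding space (exploration)
        mixed = [v + rng.gauss(0.0, noise_std) for v in mixed]
    return mixed


table = [[1.0, 0.0], [0.0, 1.0]]  # toy 2-token vocabulary, 2-dim embeddings
mixed = soft_token([0.0, 0.0], table)
noisy = soft_token([0.0, 0.0], table, noise_std=0.1, rng=random.Random(0))
```

When the logits are sharply peaked the mixture collapses toward a single embedding, which is one way to see why hard-token inference can remain compatible with soft-token training.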

A notable limitation is diminishing returns, or even a drop in performance, with excessive token insertion (overthinking), as the model can lose track of context or introduce noise (Herel et al., 2024). Likewise, MI is estimated via a proxy (HSIC), and direct causal attribution of reasoning improvements to token mechanics remains an open question (Qian et al., 3 Jun 2025).
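
For readers unfamiliar with the HSIC proxy, a minimal sketch of the standard biased estimator, $\mathrm{HSIC} = \mathrm{tr}(KHLH)/(n-1)^2$ with centering matrix $H$, is given below. The Gaussian kernel, the bandwidth `sigma`, and the way samples are paired (one hidden-state vector and one answer representation per sample) are assumptions for illustration rather than details from the cited work.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def rbf_gram(xs: Sequence[Sequence[float]], sigma: float) -> Matrix:
    """Gaussian-kernel Gram matrix over a list of vectors."""
    def k(a: Sequence[float], b: Sequence[float]) -> float:
        sq = sum((u - v) ** 2 for u, v in zip(a, b))
        return math.exp(-sq / (2.0 * sigma ** 2))
    return [[k(a, b) for b in xs] for a in xs]


def center(K: Matrix) -> Matrix:
    """Return H K H, where H = I - (1/n) 11^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    tot = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + tot for j in range(n)]
            for i in range(n)]


def hsic(xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]],
         sigma: float = 1.0) -> float:
    """Biased HSIC estimate; uses tr(K H L H) = tr((H K H) L)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired samples")
    Kc = center(rbf_gram(xs, sigma))
    L = rbf_gram(ys, sigma)
    trace = sum(Kc[i][j] * L[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


states = [[0.0], [1.0], [2.0], [3.0]]
dependent = hsic(states, states)
independent = hsic(states, [[5.0]] * 4)
```

A constant second variable yields a value of zero up to rounding, while identical inputs yield a strictly positive value; peaks in such a dependence score across reasoning steps are what the text refers to as MI peaks.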

5. Extensions: Modality, Supervision, and Domain Adaptation

Continue-thinking mechanisms have been deployed or critiqued across various domains and supervision regimes:

Latent reasoning with parallelization: The “Thinking States” approach enables full teacher-forcing and parallel computation, with natural-language “thought tokens” generated and re-compressed at fixed points in the input stream (Amos et al., 9 Feb 2026). The reported state-tracking accuracy is 100% out of distribution (vs. 64.4% for CoT), alongside a 2× speedup.
Impact on machine translation: Direct transfer of thinking tokens or synthetic CoT explanations to MT did not improve translation performance; only modular traces embedding actual translation drafts yielded notable gains, suggesting the utility of continue-thinking tokens is highly domain-dependent (Zebaze et al., 13 Oct 2025).
Planning and self-awareness: Reasoning-trace tokens arising from planning policies (“think” slots) in SFT, DPO, or GRPO are subject to evaluation for faithfulness, OOD generalization, and self-introspection; results show RL-based policies yield stronger awareness and generalization but can decouple internal traces from final answers (Singla et al., 18 Oct 2025).
Adaptive and compressive agents: The conditional token selection framework offers a pathway to fully differentiable gating of reasoning steps, extensible to multimodal or code-based tasks (Yuan et al., 23 May 2025).

Overall, there is strong evidence that continue-thinking tokens’ impact varies with the reasoning domain, the quality of annotation, and the underlying decoding policy.

6. Practical Considerations, Limitations, and Perspectives

Learned continue-thinking tokens are a lightweight, modular tool for improving LLM reasoning performance, especially on complex mathematical or multi-step tasks. They can be introduced without core architecture changes, learned via RL or supervised masking, and interpreted via information-theoretic or empirical analyses. However, they require careful tuning: too many tokens can impair efficiency, and their utility is governed by task structure—mere natural-language “thoughts” are insufficient for tasks such as machine translation unless they encode explicit, intermediate solutions (Zebaze et al., 13 Oct 2025).

Open directions include dynamic token-insertion policies (learned, not static), differentiated multi-token schemes for hierarchical reasoning, and principled exploration strategies that go beyond random perturbation. Information-theoretic measurements of reasoning are still indirect, and extending these methods to non-math, non-English, or genuinely open-ended domains remains ongoing work.

The development of learned continue-thinking tokens demonstrates a convergence of data-centric, optimization-based, and information-theoretic advances in LLM reasoning frameworks.