
Inner Thinking Transformer (ITT)

Updated 23 February 2026
  • Inner Thinking Transformer (ITT) is an architectural enhancement that adaptively refines token representations through iterative internal thinking steps.
  • It employs Adaptive Token Routing, Residual Thinking Connections, and Thinking Step Encoding to selectively deepen computation for tokens requiring complex reasoning.
  • Empirical results show ITT achieves improved accuracy with reduced computational cost and training data, offering a practical trade-off between performance and efficiency.

The Inner Thinking Transformer (ITT) is an architectural enhancement to standard Transformers that introduces an adaptive, iterative “internal thinking” mechanism, enabling token-wise dynamic allocation of computational depth during both training and inference. ITT reinterprets Transformer layers as discrete “thinking steps,” equips them with selective token routing, and accumulates multi-phase latent representations, targeting efficiency and improved reasoning capacity without expanding parameter count (Chen et al., 19 Feb 2025).

1. Motivations and Theoretical Foundations

Standard Transformers with a fixed number of layers and no autoregressive decoding have a provable expressivity ceiling: their computational power is bounded by the constant-depth circuit complexity class TC⁰. This restricts their ability to model compositional or hierarchical reasoning, because computations whose required depth grows with the input cannot be completed in a single forward pass; parity and arbitrary-depth majority are cited as examples of functions that are difficult in this setting. Architectural stress tests show that critical tokens demanding complex reasoning create abrupt layerwise gradient spikes, which points to underfitting in parameter-constrained models (Mathur et al., 17 Jul 2025, Chen et al., 19 Feb 2025).

ITT addresses this limitation by:

  • Reconceptualizing each layer (or block) as an implicit “internal thinking” step.
  • Allowing adaptive, token-specific deepening of the computational graph by employing multiple latent refinement iterations, driven by input token “difficulty.”
  • Introducing explicit control over the number of effective reasoning steps, thus offering a trade-off between accuracy and computational cost (Chen et al., 19 Feb 2025).

2. Core Mechanisms: ATR, RTC, TSE

ITT’s improvements arise from three key components:

2.1 Adaptive Token Routing (ATR):

At each thinking step $t$, a token-wise router network $\mathcal{R}^{(t)}$ computes a real-valued importance score $w^{(t)}_i$ for every token $i$. Only the top-$k$ tokens, those whose score exceeds a selected quantile threshold, undergo further refinement via the sub-layer $f(\cdot)$. This enables selective intensification:

$$Y_i^{(t)\prime} = \begin{cases} \alpha^{(t)} w^{(t)}_i f\bigl( Y^{(t-1)}_i \bigr) & \text{if } w^{(t)}_i > P_\rho(w^{(t)}) \\ Y^{(t-1)}_i & \text{otherwise} \end{cases}$$

where $P_\rho$ is the percentile threshold and $\alpha^{(t)}$ is a per-step scaling factor (Chen et al., 19 Feb 2025).
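The routing rule above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the names `route_top_tokens`, `atr_step`, and `keep_ratio` are hypothetical, and the sketch assumes that the percentile threshold amounts to keeping the highest-scoring fraction `keep_ratio` of tokens (equivalent to a threshold test when all scores are distinct).

```python
import math

def route_top_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens (assumed top-k rule)."""
    k = math.ceil(keep_ratio * len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def atr_step(tokens, scores, f, keep_ratio, alpha):
    """One routing step: selected tokens become alpha * w_i * f(Y_i);
    all other tokens pass through unchanged."""
    selected = route_top_tokens(scores, keep_ratio)
    refined = []
    for i, (y, w) in enumerate(zip(tokens, scores)):
        if i in selected:
            refined.append([alpha * w * v for v in f(y)])
        else:
            refined.append(list(y))
    return refined

def double(y):
    """Toy stand-in for the sub-layer f (an assumption for illustration)."""
    return [2.0 * v for v in y]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
scores = [0.1, 0.9, 0.4, 0.7]
atr_out = atr_step(tokens, scores, double, keep_ratio=0.5, alpha=1.0)
# Tokens 1 and 3 (scores 0.9 and 0.7) are refined; tokens 0 and 2 are copied.
```

In a real model the scores would come from the learned router and `f` would be the attention or feed-forward sub-layer; here both are fixed by hand so the selection logic is easy to follow.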

2.2 Residual Thinking Connections (RTC):

ITT accumulates all per-step outputs using learnable “thinking step encodings” $\phi^{(t)}$, creating an explicit multi-phase residual path:

$$x^{(t)} = \sum_{i=1}^{t} f(x^{(i-1)}) \odot \phi^{(i)}$$

This preserves step histories and allows deeper per-token iterative refinement while maintaining stable gradient flow, mitigating the vanishing and exploding gradient issues of deep stacks.
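Read literally, the sum gives $x^{(1)} = f(x^{(0)}) \odot \phi^{(1)}$ and, for later steps, $x^{(t)} = x^{(t-1)} + f(x^{(t-1)}) \odot \phi^{(t)}$, so it can be computed with a running accumulator. The sketch below follows that reading; the function name `residual_thinking` and the toy sub-layer are assumptions for illustration only.

```python
def residual_thinking(x0, f, encodings):
    """Return x^(T) = sum_{i=1..T} f(x^(i-1)) * phi^(i), elementwise.

    `encodings` is a list of T step-encoding vectors phi^(1..T).
    """
    x_prev = list(x0)          # x^(0), the input to the first step
    acc = [0.0] * len(x0)      # running sum of encoded step outputs
    for phi in encodings:
        step_out = f(x_prev)
        acc = [a + s * p for a, s, p in zip(acc, step_out, phi)]
        x_prev = acc           # x^(i) feeds the next thinking step
    return acc

def add_one(x):
    """Toy stand-in for the sub-layer f (an assumption for illustration)."""
    return [v + 1.0 for v in x]

rtc_out = residual_thinking([1.0, 2.0], add_one, [[1.0, 1.0], [0.5, 0.5]])
print(rtc_out)  # [3.5, 5.0]
```

Step 1 contributes `[2.0, 3.0]` with weight 1.0, and step 2 contributes `f([2.0, 3.0]) = [3.0, 4.0]` with weight 0.5, which gives the printed result.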

2.3 Thinking Step Encoding (TSE):

Each iteration is associated with a vector $\phi^{(t)} \in \mathbb{R}^{d}$ (for hidden size $d$), which both indexes the phase and weights its contribution. These encodings are learned, not fixed, enabling phase-aware attention and output blending.

3. Dynamic Depth Scaling Algorithm

ITT implements a dynamic multi-step computation loop embedded within Transformer blocks. The core forward pass (akin to Algorithm 1 in Chen et al., 19 Feb 2025) involves:

  1. Initial sub-layer transformation to obtain $Y^{(0)}$.
  2. For each thinking step $t = 1, \ldots, T$:
    • Compute routing scores $w^{(t)}$ via a small MLP.
    • Select a subset (by percentile $\rho$) of the highest-scoring tokens for further processing.
    • Update those tokens using $f(\cdot)$ and combine with their latent state via $\phi^{(t)}$ and $\alpha^{(t)}$.
    • Optionally perform early exit for tokens where loss falls below a threshold $\varepsilon$.
  3. Return the final refined representation, accumulated over all active “thinking” steps.

ITT layers may be interleaved with standard Transformer layers, and the depth/iteration parameters $\rho$ and $T$ can be tuned independently at inference to trade accuracy against efficiency.
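The loop described in this section can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not Algorithm 1 from the paper: the router is a hypothetical sigmoid-of-dot-product head, the selected tokens are the top fraction `keep_ratio`, the update is added to the token's current state, and early exit is omitted. All function and parameter names are invented for the example.

```python
import math

def linear_router(y, weights):
    """Hypothetical router: sigmoid of a dot product, a score in (0, 1)."""
    z = sum(w * v for w, v in zip(weights, y))
    return 1.0 / (1.0 + math.exp(-z))

def itt_block(x, f, router_weights, encodings, alphas, keep_ratio):
    """Run T = len(encodings) thinking steps over a list of token vectors.

    Returns the refined tokens and the number of sub-layer calls made.
    """
    state = [f(tok) for tok in x]            # Y^(0)
    calls = len(x)
    k = math.ceil(keep_ratio * len(x))       # tokens refined per step
    for phi, alpha, r in zip(encodings, alphas, router_weights):
        scores = [linear_router(y, r) for y in state]
        ranked = sorted(range(len(state)), key=lambda i: scores[i], reverse=True)
        new_state = list(state)              # unselected tokens keep their state
        for i in ranked[:k]:
            update = f(state[i])
            # Assumed combination: residual add of the scaled, encoded update.
            new_state[i] = [y + alpha * scores[i] * u * p
                            for y, u, p in zip(state[i], update, phi)]
        calls += k
        state = new_state
    return state, calls

def halve(y):
    """Toy stand-in for the sub-layer f (an assumption for illustration)."""
    return [0.5 * v for v in y]

tokens = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [-1.0, 1.0]]
T = 3
routers, phis, alphas = [[1.0, 1.0]] * T, [[1.0, 1.0]] * T, [1.0] * T

sparse_out, sparse_calls = itt_block(tokens, halve, routers, phis, alphas, keep_ratio=0.5)
dense_out, dense_calls = itt_block(tokens, halve, routers, phis, alphas, keep_ratio=1.0)
print(sparse_calls, dense_calls)  # 10 16
```

The call counts show the elasticity argument in miniature: with four tokens and three steps, refining half of the tokens per step costs 10 sub-layer calls instead of 16, and both `keep_ratio` and the number of steps can be changed at call time without touching any weights.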

4. Empirical Performance and Data Efficiency

ITT demonstrates consistent improvements across a range of parameter scales (162M–466M) and evaluation tasks including SciQ, PIQA, WinoGrande, ARC-Easy, ARC-Challenge, HellaSwag, LogiQA, BoolQ, LAMBADA, and MMLU. Key results:

| Model | Params | Avg Acc (↑) | Relative FLOPs | Data Used (%) |
| --- | --- | --- | --- | --- |
| LLaMA2-162M | 162M | 40.4 | 1.88× | 100 |
| Loop×4-162M | 162M | 40.7 | 4.70× | |
| ITT×4-162M | 162M | 42.1 | 3.29× | 56.8 |
| LLaMA2-230M | 230M | 41.8 | 2.87× | 100 |
| ITT×4-230M | 230M | 43.9 | 3.41× | |
| LLaMA2-466M | 466M | 43.6 | 4.92× | 100 |
| ITT×4-466M | 466M | 45.3 | 5.84× | |

ITT×4-162M matches 96.5% of the accuracy of a 466M dense model at only 70% of its computational cost, and requires 43.2% less training data to reach comparable perplexity (Chen et al., 19 Feb 2025).
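As a quick sanity check, the headline ratios can be recomputed from the rounded table entries. The variable names below are only for this example, and the values are copied from the ITT×4-162M and LLaMA2-466M rows. The results land near the quoted figures but not exactly on them (the FLOPs ratio comes out closer to 67% than 70%), which may reflect rounding in the table or a different baseline in the source.

```python
# Values copied from the table rows for ITT×4-162M and LLaMA2-466M.
itt_162m_acc = 42.1
dense_466m_acc = 43.6
itt_162m_flops = 3.29
dense_466m_flops = 4.92
itt_162m_data_pct = 56.8

acc_ratio = itt_162m_acc / dense_466m_acc          # about 0.966
flops_ratio = itt_162m_flops / dense_466m_flops    # about 0.669
data_saved_pct = 100.0 - itt_162m_data_pct         # about 43.2

print(f"accuracy ratio: {acc_ratio:.3f}")
print(f"FLOPs ratio:    {flops_ratio:.3f}")
print(f"data saved:     {data_saved_pct:.1f}%")
```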

5. Design Benefits, Trade-offs, and Limitations

Elasticity: ITT allows test-time adaptation of computational effort by modifying $\rho$ (the token selection ratio) and $T$ (the number of thinking steps), yielding a near-linear trade-off between compute and performance.

Parameter-Efficiency: By reusing weights for multiple latent refinement steps, ITT achieves scaling gains without parameter blowup.

Gradient Dynamics: RTC and TSE encourage efficient multi-step credit assignment, mitigating vanishing/exploding gradients and accelerating convergence, supported by theoretical analysis in the associated work.

Overhead and Computation: ATR incurs additional cost via routing and multiple sub-layer evaluations, but selective computation (with $\rho$ in $[0.5, 0.9]$) yields overall FLOP savings compared to simple looped or deep variants.

Implementation: ITT blocks are implemented by interleaving with standard layers. Routers are direct linear heads; step encodings and scaling factors are learned per block, and early-exit policies can reduce unnecessary computation.

6. Relation to Other Adaptive-Computation Models

The ITT paradigm generalizes and extends related adaptive computation models, such as SELF-Transformers, which embed fixed-point iterations for attention refinement at the layer level without explicit token-wise routing (Mathur et al., 17 Jul 2025). While SELF-Transformers iteratively update the entire latent state to a fixed point for every input, ITT adds selective routing and phase-aware residuals, offering fine-grained elastic computation. The core iterative refinement in both models points toward a broader class of “inner thinking” Transformer designs. Other recent developments, such as AsyncReasoning (Yakushev et al., 11 Dec 2025), introduce real-time interleaving of private and public reasoning streams for LLMs but do not implement ITT’s explicit token-level depth adaptivity.

7. Implications and Future Directions

ITT provides a practical and theoretically grounded mechanism for parameter-efficient adaptive reasoning in Transformers. By making internal “thinking” steps dynamic, selective, and accumulative, ITT opens new possibilities for balancing inference cost against task accuracy, for model compression, for efficient pretraining, and for robust handling of outlier or “hard” tokens (Chen et al., 19 Feb 2025). A plausible implication is that further integration of learnable halting, cross-layer thought exchange, and introspective read-out, as suggested by developments in SELF-Transformers, may drive future advances in introspective, chain-of-thought architectures and promote alignment with human-like inner reasoning processes (Mathur et al., 17 Jul 2025).
