Inner Thinking Transformer (ITT)

Updated 23 February 2026

Inner Thinking Transformer (ITT) is an architectural enhancement that adaptively refines token representations through iterative internal thinking steps.
It employs Adaptive Token Routing, Residual Thinking Connections, and Thinking Step Encoding to selectively deepen computation for tokens requiring complex reasoning.
Empirical results show ITT achieves improved accuracy with reduced computational cost and training data, offering a practical trade-off between performance and efficiency.

The Inner Thinking Transformer (ITT) is an architectural enhancement to standard Transformers that introduces an adaptive, iterative “internal thinking” mechanism, enabling token-wise dynamic allocation of computational depth during both training and inference. ITT reinterprets Transformer layers as discrete “thinking steps,” equips them with selective token routing, and accumulates multi-phase latent representations, targeting efficiency and improved reasoning capacity without expanding parameter count (Chen et al., 19 Feb 2025).

1. Motivations and Theoretical Foundations

Standard Transformers with a fixed number of layers and no autoregressive decoding have a provable expressivity ceiling: their computational power is bounded by the constant-depth threshold-circuit class TC⁰. This restricts their ability to model compositional or hierarchical reasoning, because computations whose required depth grows with input length cannot be carried out in a single fixed-depth forward pass. Architectural stress tests show that critical tokens demanding complex reasoning produce abrupt layer-wise gradient spikes, which points to underfitting in parameter-constrained models (Mathur et al., 17 Jul 2025, Chen et al., 19 Feb 2025).

ITT addresses this limitation by:

Reconceptualizing each layer (or block) as an implicit “internal thinking” step.
Allowing adaptive, token-specific deepening of the computational graph by employing multiple latent refinement iterations, driven by input token “difficulty.”
Introducing explicit control over the number of effective reasoning steps, thus offering a trade-off between accuracy and computational cost (Chen et al., 19 Feb 2025).

2. Core Mechanisms: ATR, RTC, TSE

ITT’s improvements arise from three key components:

2.1 Adaptive Token Routing (ATR):

At each thinking step $t$, a token-wise router network $\mathcal{R}^{(t)}$ computes a real-valued importance score $w^{(t)}_i$ for every token $i$. Only the top-$k$ tokens (those whose scores exceed a selected quantile threshold) undergo further refinement via the sub-layer $f(\cdot)$. This enables selective intensification:

$$Y_i^{(t)\prime} = \begin{cases} \alpha^{(t)} w^{(t)}_i f\bigl( Y^{(t-1)}_i \bigr) & \text{if } w^{(t)}_i > P_\rho(w^{(t)}) \\ Y^{(t-1)}_i & \text{otherwise} \end{cases}$$

where $P_\rho$ is the percentile threshold and $\alpha^{(t)}$ is a per-step scaling factor (Chen et al., 19 Feb 2025).
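The routing rule can be illustrated with a minimal sketch in plain Python. It assumes scalar token states for readability, a hypothetical sub-layer `f`, and a nearest-rank percentile for $P_\rho$; the function names and the percentile convention are illustrative choices, not the paper's implementation.

```python
import math

def percentile_threshold(scores, rho):
    """Nearest-rank threshold: roughly a fraction rho of scores lie at or below it."""
    ordered = sorted(scores)
    # Assumed convention: rho in [0, 1); pick the rho-quantile by nearest rank.
    idx = min(len(ordered) - 1, max(0, math.ceil(rho * len(ordered)) - 1))
    return ordered[idx]

def adaptive_token_routing(states, scores, f, alpha, rho):
    """Apply alpha * w_i * f(y_i) to tokens whose score exceeds P_rho; pass others through."""
    threshold = percentile_threshold(scores, rho)
    routed = []
    for y, w in zip(states, scores):
        if w > threshold:
            routed.append(alpha * w * f(y))   # selected token: refined and rescaled
        else:
            routed.append(y)                  # unselected token: carried over unchanged
    return routed

# Toy usage: four tokens, rho = 0.5 keeps the top half, and f doubles its input.
states = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.4, 0.7]
out = adaptive_token_routing(states, scores, f=lambda y: 2.0 * y, alpha=1.0, rho=0.5)
print(out)
```

In this toy run only the tokens with scores 0.9 and 0.7 pass through `f`; the other two keep their previous state. A real router would produce the scores from the hidden states, and the states would be vectors.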

2.2 Residual Thinking Connections (RTC):

ITT accumulates all per-step outputs using learnable “thinking step encodings” $\phi^{(t)}$, creating an explicit multi-phase residual path in which each step's output is weighted by its encoding and added to the accumulated representation. This preserves step histories and allows deeper per-token iterative refinement while maintaining stable gradient flow, avoiding the usual vanishing/exploding gradient issues.

2.3 Thinking Step Encoding (TSE):

Each iteration is associated with a vector $\phi^{(t)} \in \mathbb{R}^{d}$ (for hidden size $d$), which both indexes the phase and weights its contribution. These encodings are learned, not fixed, enabling phase-aware attention and output blending.
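As a rough illustration of how step outputs and step encodings might combine, the sketch below accumulates per-step outputs with elementwise weighting. The additive update, the function name, and the numeric encodings are assumptions made for the example; the exact update used in the paper may differ.

```python
def residual_thinking_accumulate(y0, step_outputs, encodings):
    """Accumulate per-step outputs, each weighted elementwise by its step encoding.

    y0:           initial hidden vector (list of d floats)
    step_outputs: list of T hidden vectors, one per thinking step
    encodings:    list of T vectors phi^(t), each of length d (learned in practice)
    """
    acc = list(y0)
    history = [list(acc)]                      # state after every step is kept
    for out, phi in zip(step_outputs, encodings):
        acc = [a + o * p for a, o, p in zip(acc, out, phi)]
        history.append(list(acc))
    return acc, history

y0 = [1.0, 1.0]
step_outputs = [[2.0, 0.0], [0.0, 4.0]]
encodings = [[0.5, 0.5], [1.0, 0.25]]          # hypothetical values, not learned
final, history = residual_thinking_accumulate(y0, step_outputs, encodings)
print(final, history)
```

Because each step contributes through an additive path, the gradient with respect to an earlier state always includes an identity term, which is the usual argument for why residual accumulation keeps gradients stable.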

3. Dynamic Depth Scaling Algorithm

ITT implements a dynamic multi-step computation loop embedded within Transformer blocks. The core forward pass (akin to Algorithm 1 in (Chen et al., 19 Feb 2025)) involves:

Initial sub-layer transformation to obtain $Y^{(0)}$.
For each thinking step $t = 1, \dots, T$:
- Compute routing scores $w^{(t)}$ via a small MLP.
- Select a subset (by percentile $\rho$) of highest-scoring tokens for further processing.
- Update those tokens using $f(\cdot)$ and combine with their latent state via $\alpha^{(t)}$ and $\phi^{(t)}$.
- Optionally perform early exit for tokens where loss falls below a preset threshold.
Return the final refined representation, accumulated over all active “thinking” steps.
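The steps above can be sketched as a short loop. This is a simplified illustration, not Algorithm 1 from the paper: token states are scalars, `f` and `router` are hypothetical callables, selection uses a fixed top-k count derived from an assumed `keep_ratio`, and the optional early exit is omitted.

```python
def itt_block(tokens, f, router, alphas, phis, keep_ratio):
    """Run len(alphas) thinking steps over a list of scalar token states."""
    state = [f(y) for y in tokens]             # initial sub-layer transformation
    steps_used = [0] * len(tokens)             # per-token count of extra steps
    k = max(1, round(keep_ratio * len(tokens)))
    for alpha, phi in zip(alphas, phis):
        scores = [router(y) for y in state]    # routing scores for this step
        # Indices of the k highest-scoring tokens at this step.
        chosen = sorted(range(len(state)), key=lambda i: scores[i], reverse=True)[:k]
        for i in chosen:
            # Refine the selected token and add the result to its accumulated state.
            state[i] = state[i] + phi * alpha * scores[i] * f(state[i])
            steps_used[i] += 1
    return state, steps_used

# Toy usage: f halves its input, the router scores tokens by magnitude.
state, steps_used = itt_block(
    tokens=[1.0, -2.0, 3.0, 0.5],
    f=lambda y: 0.5 * y,
    router=abs,
    alphas=[1.0, 1.0],
    phis=[1.0, 1.0],
    keep_ratio=0.5,
)
print(state, steps_used)
```

In the toy run the two largest-magnitude tokens receive both extra steps while the other two receive none, which is the token-wise variation in effective depth that the algorithm is meant to provide.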

ITT layers may be interleaved with standard Transformer layers, and the depth/iteration parameters (the number of thinking steps $T$ and the routing percentile $\rho$) can be tuned independently at inference to trade accuracy against efficiency.

4. Empirical Performance and Data Efficiency

ITT demonstrates consistent improvements across a range of parameter scales (162M–466M) and evaluation tasks including SciQ, PIQA, WinoGrande, ARC-Easy, ARC-Challenge, HellaSwag, LogiQA, BoolQ, LAMBADA, and MMLU. Key results:

Model	Params	Avg Acc (↑)	Relative FLOPs	Data Used (%)
LLaMA2-162M	162M	40.4	1.88×	100
Loop×4-162M	162M	40.7	4.70×	—
ITT×4-162M	162M	42.1	3.29×	56.8
LLaMA2-230M	230M	41.8	2.87×	100
ITT×4-230M	230M	43.9	3.41×	—
LLaMA2-466M	466M	43.6	4.92×	100
ITT×4-466M	466M	45.3	5.84×	—

ITT×4-162M matches 96.5% of the accuracy of a 466M dense model at only 70% of its computational cost, and requires 43.2% less training data to reach comparable perplexity (Chen et al., 19 Feb 2025).
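The headline ratios can be recomputed directly from the table. The short check below uses only the table values; the variable names are illustrative. Note that the FLOPs ratio from the table comes to roughly 0.67, slightly below the rounded "70%" quoted in the text.

```python
# Values copied from the results table above.
itt_162m = {"acc": 42.1, "flops": 3.29, "data_pct": 56.8}
llama_466m = {"acc": 43.6, "flops": 4.92}
baseline_data_pct = 100.0                      # dense baselines use the full data

acc_ratio = itt_162m["acc"] / llama_466m["acc"]
flops_ratio = itt_162m["flops"] / llama_466m["flops"]
data_saving = baseline_data_pct - itt_162m["data_pct"]

print(f"accuracy ratio: {acc_ratio:.3f}")
print(f"FLOPs ratio:    {flops_ratio:.3f}")
print(f"data saving:    {data_saving:.1f} percentage points")
```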

5. Design Benefits, Trade-offs, and Limitations

Elasticity: ITT allows test-time adaptation of computational effort by modifying $\rho$ (the token selection ratio) and $T$ (the number of thinking steps), yielding a near-linear trade-off between compute and performance.

Parameter-Efficiency: By reusing weights for multiple latent refinement steps, ITT achieves scaling gains without parameter blowup.

Gradient Dynamics: RTC and TSE encourage efficient multi-step credit assignment, mitigating vanishing/exploding gradients and accelerating convergence, supported by theoretical analysis in the associated work.

Overhead and Computation: ATR incurs additional cost via routing and multiple sub-layer evaluations, but selective computation (only the fraction of tokens set by $\rho$ is refined at each step) yields overall FLOP savings compared to simple looped or deep variants.
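A back-of-the-envelope cost model makes the comparison with plain looping concrete. It assumes one full pass over all tokens followed by routed passes that each process a fixed fraction of tokens, plus a small router cost; the function names, the `router_overhead` value, and the cost model itself are assumptions for illustration, not figures from the paper.

```python
def relative_block_cost(extra_steps, keep_ratio, router_overhead=0.01):
    """Cost of one ITT-style block, in units of a single full sub-layer pass.

    Assumes one full pass over all tokens, then `extra_steps` routed passes that
    each process a `keep_ratio` fraction of tokens plus a small router cost.
    """
    return 1.0 + extra_steps * (keep_ratio + router_overhead)

def looped_block_cost(extra_steps):
    """Baseline: every extra step reprocesses all tokens."""
    return 1.0 + extra_steps

for keep in (0.3, 0.5, 0.7, 1.0):
    print(keep, round(relative_block_cost(3, keep), 2), looped_block_cost(3))
```

Under this model the cost grows linearly in both the number of extra steps and the kept fraction, and the routed block is cheaper than the looped baseline whenever the kept fraction plus router overhead stays below one.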

Implementation: ITT blocks are implemented by interleaving with standard layers. Routers are direct linear heads; step encodings and scaling factors are learned per block, and early-exit policies can reduce unnecessary computation.

6. Related Architectures and Extensions

The ITT paradigm generalizes and extends previous adaptive computation models, such as SELF-Transformers, which embed fixed-point iterations for attention refinement at the layer level without explicit token-wise routing (Mathur et al., 17 Jul 2025). While SELF-Transformers iteratively update the entire latent state to a fixed point for every input, ITT further introduces selective routing and phase-aware residuals, offering fine-grained elastic computation. The core iterative refinement in both models points toward a broader class of “inner thinking” Transformer designs. Other recent developments, such as AsyncReasoning (Yakushev et al., 11 Dec 2025), introduce real-time interleaving of private and public reasoning streams for LLMs but do not instantiate ITT’s explicit token-level depth-adaptivity.

7. Implications and Future Directions

ITT provides a practical and theoretically grounded mechanism for parameter-efficient adaptive reasoning in Transformers. By making internal “thinking” steps dynamic, selective, and accumulative, ITT opens new possibilities for balancing inference cost and task accuracy, model compression, efficient pretraining, and robust handling of outlier or “hard” tokens (Chen et al., 19 Feb 2025). A plausible implication is that further integration of learnable halting, cross-layer thought exchange, and introspective read-out, as suggested by developments in SELF-Transformers, may drive future advances in introspective, chain-of-thought architectures and promote alignment with human-like inner reasoning processes (Mathur et al., 17 Jul 2025).

References (3)

Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking (2025)

Change of Thought: Adaptive Test-Time Computation (2025)

Asynchronous Reasoning: Training-Free Interactive Thinking LLMs (2025)
