
Intermediate Token Generation (ITG)

Updated 10 September 2025
  • Intermediate Token Generation (ITG) is a framework where models produce explicit intermediate tokens to enhance sample diversity, computational efficiency, and adaptive processing.
  • ITG methodologies are applied across domains such as score-based generative models, motion synthesis, large language model inference, and image super-resolution, demonstrating tailored performance improvements.
  • Empirical evaluations reveal that ITG can reduce compute requirements, improve output quality, and enable adaptive depth scaling, with metrics showing significant FLOPs savings and enhanced accuracy.

Intermediate Token Generation (ITG) refers to the mechanisms by which computational models emit, manipulate, or optimize tokens—usually latent codes or stepwise outputs—before the final solution or output is produced. ITG manifests across domains including generative modeling, motion synthesis, LLM inference, discrete diffusion models, and recommendation systems. ITG is proposed to improve sample diversity, reasoning transparency, computational efficiency, and adaptive processing depth. The following sections provide a comprehensive account of ITG methodologies, empirical behaviors, and interpretations across recent literature.

1. Foundational Principles of Intermediate Token Generation

ITG encompasses the training and deployment of models that produce explicit intermediate states as part of their forward operation. In score-based generative modeling, ITG leverages discretized forward processes (e.g., stochastic differential equations) where each step yields an intermediate iterate, which can be coupled to dedicated encoders and decoders to formulate regularizers and broaden generative capacity (Mishra et al., 2023). In autoregressive frameworks, intermediate tokens may represent reasoning traces such as chain-of-thought (CoT) sequences, derivational logs, or stepwise partial predictions.

Core mathematical formulations include:

  • For score-based models, the loss is evaluated at multiple intermediate times, not just the terminal corruption point:

R_T(x) = \sum_{t=1}^{T} L(D_t \circ s_t \circ E_t, x_t)

  • In transformer architectures for motion interpolation, latent manifolds are constructed so intermediate tokens form continuous embeddings constrained by sparse keyframes (Mo et al., 2023):

\hat{m}_t = \begin{cases} \mathrm{FFN}(\Phi^{\text{key}}(K)_t), & t \in K \\ m_t, & \text{otherwise} \end{cases}

The rationale behind ITG is to exploit intermediate representations for improved sample complexity, diversity, or adaptivity. However, these benefits are highly implementation-dependent.
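
A minimal sketch of the intermediate-step loss above, assuming per-step encoder, score, and decoder modules; the module names and the simple MSE objective are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

# Sketch (not the authors' code): a per-step reconstruction loss evaluated at
# every intermediate iterate x_t of a discretized forward process, mirroring
# R_T(x) = sum_t L(D_t o s_t o E_t, x_t).

class IntermediateStepLoss(nn.Module):
    def __init__(self, encoders, score_nets, decoders):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)      # E_t
        self.score_nets = nn.ModuleList(score_nets)  # s_t
        self.decoders = nn.ModuleList(decoders)      # D_t

    def forward(self, intermediate_iterates):
        # intermediate_iterates: list of tensors x_1, ..., x_T from the forward process
        total = 0.0
        for t, x_t in enumerate(intermediate_iterates):
            z = self.encoders[t](x_t)       # encode the intermediate iterate
            s = self.score_nets[t](z)       # score / latent transformation
            x_hat = self.decoders[t](s)     # decode back to data space
            total = total + nn.functional.mse_loss(x_hat, x_t)  # L(D_t o s_t o E_t, x_t)
        return total
```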

2. ITG in Score-based Generative Models

Intermediate Generator Optimization (IGO) (Mishra et al., 2023) demonstrates that using intermediate forward iterates in score-based models:

  • Allows the construction of an ensemble of generators parameterized by the noise schedule or SDE time, enabling application-specific tradeoffs between sample diversity and reconstruction accuracy (e.g., image extrapolation, point cloud denoising).
  • Furnishes strong sample complexity bounds for inverse problems such as Generative PCA. The Minkowski sum of outputs from distinct intermediate generators expands the expressivity.
  • Introduces negligible computational overhead, as additional intermediate decoding pathways are shallow compared to the main autoencoder backbone.

This paradigm is distinct from traditional likelihood maximization, as it regularizes mappings at multiple steps, not just the input and output.
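
The ensemble view can be sketched as follows; `generator_at_time` is a hypothetical factory standing in for the generators that IGO ties to intermediate SDE times, and the Minkowski-sum-style combination simply adds samples drawn from two of them:

```python
import torch

# Illustrative sketch: distinct generators G_t are indexed by intermediate SDE
# times, and outputs from two of them can be summed (a Minkowski-sum-style
# combination) to enlarge the reachable output set. The factory and shapes
# are assumptions; the real construction is defined by the trained score model.

def ensemble_sample(generator_at_time, times, latent_dim, n_samples=4):
    samples = {}
    for t in times:
        g_t = generator_at_time(t)              # generator tied to SDE time t
        z = torch.randn(n_samples, latent_dim)  # latent codes
        samples[t] = g_t(z)                     # samples from G_t
    # combine samples from the first two intermediate generators
    combined = samples[times[0]] + samples[times[1]]
    return samples, combined
```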

3. ITG in Transformer-based Motion Interpolation

Recent transformer frameworks for motion interpolation formalize ITG as continuous manifold traversal between sparse keyframes (Mo et al., 2023):

  • Intermediate tokens are not merely interpolants but learned continuous embeddings derived from a manifold constrained by keyframe positions and contexts.
  • Attention is restricted such that intermediate frames draw keys and values exclusively from keyframes, rendering generated poses both smooth and precise.
  • Quantitative results on LaFAN1 and CMU Mocap datasets indicate that continuous intermediate token generation yields superior interpolation accuracy (L2P, L2Q, NPSS) and visual coherence compared to LERP or BERT-based masked interpolation methods.

The competitive advantages stem from manifold modeling and architectural normalization (sequence-level recentering, continuous positional embedding concatenation), which mitigate trivial local minima and preserve temporal precision.
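
A hedged sketch of the keyframe-restricted attention pattern; the mask construction and the use of PyTorch's scaled_dot_product_attention are assumptions about one reasonable realization, not the paper's code:

```python
import torch

# Build a boolean attention mask so that intermediate-frame queries may attend
# only to keyframe positions. Allowing every position to also attend to itself
# is an additional assumption of this sketch.
def keyframe_attention_mask(seq_len, keyframe_idx):
    is_key = torch.zeros(seq_len, dtype=torch.bool)
    is_key[keyframe_idx] = True
    # mask[q, k] == True means query position q may attend to key position k
    mask = is_key.unsqueeze(0).expand(seq_len, seq_len).clone()
    mask[torch.arange(seq_len), torch.arange(seq_len)] = True  # allow self-attention
    return mask

# Usage (q, k, v: (batch, heads, seq_len, head_dim); True entries are attended):
# out = torch.nn.functional.scaled_dot_product_attention(
#     q, k, v, attn_mask=keyframe_attention_mask(seq_len, keyframe_idx))
```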

4. ITG for Efficient Inference in LLMs

Instruction-tuned early-exit methods utilizing Losses from InTermediate layErs (LITE) (Varshney et al., 2023):

  • Train LLMs with an explicit cross-entropy loss at intermediate layers using a shared language modeling head, empowering those layers to achieve high-fidelity generation.
  • Deploy dynamic confidence-based early exiting: during inference, if an intermediate layer's prediction exceeds a specified confidence threshold, subsequent layers are skipped for that token.
  • Results on LLaMA-2 show that up to 46.35% of FLOPs can be saved on the 13B-parameter model while maintaining near-identical generation quality (semantic similarity ≈ 0.90 versus the full model).

Unlike speculative decoding or static pruning, this ITG method adapts generation depth per token instance, directly exposing the alignment between layerwise predictions and final-layer outputs.
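
The per-token early-exit rule can be sketched as below, assuming a decoder whose intermediate layers share the final language modeling head (as LITE trains them to do); the function and parameter names are illustrative:

```python
import torch

# Hedged sketch of confidence-based early exit for one decoding step.
# `layers`, `lm_head`, and `exit_points` are assumed components; this is not
# the released LITE code. Assumes batch size 1 for the .item() call.
@torch.no_grad()
def early_exit_next_token(hidden, layers, lm_head, exit_points, threshold=0.9):
    for i, layer in enumerate(layers):
        hidden = layer(hidden)                      # advance one transformer layer
        if i in exit_points:
            logits = lm_head(hidden[:, -1, :])      # shared head on intermediate state
            probs = torch.softmax(logits, dim=-1)
            conf, token = probs.max(dim=-1)
            if conf.item() >= threshold:            # confident enough: skip remaining layers
                return token, i + 1                 # next token and layers actually used
    logits = lm_head(hidden[:, -1, :])              # fall back to the final layer
    return logits.argmax(dim=-1), len(layers)
```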

5. ITG in Diffusion-based Super-resolution

The Iterative Token Evaluation and Refinement (ITER) framework (Chen et al., 2023) operates in a discrete token space (VQGAN-based) for real-world super-resolution:

  • ITG is split into two sub-tasks: distortion removal and texture generation. Low-quality images are mapped to “clean” high-quality tokens, then iteratively refined using a discrete diffusion process.
  • A token evaluation network adaptively determines which parts of the token map should be refined or preserved, enabling variable-length inference and adaptive quality control.
  • Compared to GAN-based approaches (which require careful loss balancing) or continuous diffusion (which requires hundreds of steps), ITER completes inference in as few as eight iterations, offering training simplicity and substantial efficiency.

This model highlights how ITG can decouple deterministic restoration tasks from stochastic generative texture enhancement for improved perceptual metrics.
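
A sketch of the two-stage loop under stated assumptions: `restorer`, `refiner`, and `evaluator` are placeholders for the distortion-removal network, the discrete refinement model, and the token evaluation network, and the acceptance rule is illustrative:

```python
import torch

# Illustrative loop: map a low-quality image to an initial grid of discrete
# token ids, then iteratively replace only the low-confidence tokens over a
# short refinement schedule. All three networks and the threshold are stand-ins.
@torch.no_grad()
def iterative_token_refinement(lq_image, restorer, refiner, evaluator,
                               steps=8, keep_threshold=0.5):
    tokens = restorer(lq_image)                        # distortion removal -> "clean" tokens
    for _ in range(steps):                             # short discrete refinement loop
        proposals = refiner(tokens)                    # proposed token ids per position
        keep_score = evaluator(tokens)                 # per-token confidence in current ids
        keep = keep_score > keep_threshold
        tokens = torch.where(keep, tokens, proposals)  # refine only low-confidence positions
    return tokens                                      # decode with the VQGAN decoder afterwards
```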

6. ITG and Adaptive Depth Scaling

The Inner Thinking Transformer (ITT) (Chen et al., 19 Feb 2025) exemplifies adaptive ITG through:

  • Token-wise dynamic routing: tokens deemed critical undergo additional iterative “thinking” steps, focusing computational resources where reasoning complexity spikes.
  • Residual Thinking Connections: instead of one-shot mappings, iterative corrections are accumulated, which stabilize gradient flow and effect exponential error decay:

x^{(t)} = x^{(t-1)} + \alpha^{(t)} \cdot \left[ f(x^{(t-1)}) \odot \phi^{(t)} \right]

  • Thinking Step Encoding further informs the model as to which phase of reasoning it occupies, supporting differentiated and elastic computation.
  • Benchmarks show ITT achieves competitive accuracy with significantly smaller models and reduced training data, outperforming standard transformer and loop-based architectures.

ITT is thus a direct architectural application of ITG, offering parameter-efficient, problem-adaptive refinement.
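
A minimal sketch of one residual thinking step, assuming a shared transformer block and a routing mask that flags the tokens selected for extra computation; the gates and the router are stand-ins, not the ITT release:

```python
import torch
import torch.nn as nn

# Sketch of x^(t) = x^(t-1) + alpha^(t) * (f(x^(t-1)) ⊙ phi^(t)), applied only
# to routed ("thinking") tokens; non-selected tokens pass through unchanged.

class ResidualThinkingStep(nn.Module):
    def __init__(self, block, d_model):
        super().__init__()
        self.block = block                            # shared transformer block f
        self.alpha = nn.Parameter(torch.tensor(1.0))  # step size alpha^(t)
        self.phi = nn.Parameter(torch.ones(d_model))  # elementwise gate phi^(t)

    def forward(self, x, think_mask):
        # x: (batch, seq, d_model); think_mask: (batch, seq) bool, True = "think more"
        update = self.alpha * (self.block(x) * self.phi)
        return x + update * think_mask.unsqueeze(-1)  # correct only the routed tokens
```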

7. Misconceptions and Caveats: ITG Trace Length vs. Problem Complexity

Empirical analysis (Palod et al., 9 Sep 2025) critically evaluates whether intermediate token sequence length correlates with problem difficulty:

  • Models trained to output derivational traces (e.g., step logs of A* search) do not reliably produce trace lengths reflective of computational complexity.
  • Even on trivial problems, sequence length can be excessive, and solution generation may fail. When problems are similar to the training distribution, trace length weakly mimics ground-truth trace length, but correlation vanishes for out-of-distribution inputs.
  • Trace length is mediated more by distributional recall than by real-time, problem-adaptive computation.

This evidence challenges the assumption in the field that longer ITG trace lengths equate to increased "thinking effort" or adaptive computation, cautioning against anthropomorphizing such mechanisms.
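
As a simple illustration of how such a claim can be tested, one can correlate emitted trace length with a ground-truth difficulty proxy (e.g., the number of A* node expansions); the function below is a sketch, and the example arrays are hypothetical toy values, not results from the paper:

```python
import numpy as np
from scipy.stats import spearmanr

# Rank-correlate emitted intermediate-token trace lengths against a
# ground-truth difficulty proxy such as A* node expansions per instance.
def trace_length_vs_difficulty(trace_token_counts, astar_expansions):
    rho, p = spearmanr(trace_token_counts, astar_expansions)
    return rho, p

# Hypothetical toy values for illustration only; a near-zero rho would
# indicate that trace length does not track problem complexity.
rho, p = trace_length_vs_difficulty(
    np.array([812, 795, 840, 805, 820]),   # tokens emitted per instance
    np.array([12, 240, 35, 600, 90]))      # A* expansions per instance
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
```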


Summary Table: ITG Mechanisms Across Domains

| Domain | ITG Role | Key Properties / Findings |
|---|---|---|
| Score-based generative models | Loss at intermediate SDE steps | Enhanced diversity, sample complexity bounds |
| Motion synthesis | Manifold-constrained token generation | Increased accuracy, continuity, avoids trivial local minima |
| LLM inference | Layerwise decoding, dynamic early exit | Up to 46% compute savings, maintains output quality |
| Image super-resolution | Discrete iterative token refinement | Training simplicity, inference within 8 steps |
| Reasoning/trace models | Chain-of-thought sequence generation | Trace length ≠ problem complexity; tied to training distribution |
| Transformer architectures | Adaptive depth, residual refinement | Error decay, stable learning, parameter efficiency |

Intermediate Token Generation now forms a key axis of methodological advancement in generative modeling, efficient inference, and adaptive reasoning architectures. Its practical benefits depend strongly on implementation modality, integration with model architecture, and attention to distributional effects. Critically, assumptions relating ITG sequence properties to underlying problem complexity or "reasoning effort" should be empirically validated rather than presumed.
