Token-Superposition Training (TST)

Updated 2 July 2026

Token-Superposition Training is a method that fuses multiple tokens into a unified embedding to improve throughput and efficiency in language model training.
It employs a two-phase regime: a superposition phase that averages token embeddings with multi-hot cross-entropy loss, followed by a recovery phase using standard next-token prediction.
The SuperThoughts extension compresses chain-of-thought reasoning by predicting token pairs with an adaptive fallback, reducing computation without significant accuracy loss.

Token-Superposition Training (TST) is a methodology for increasing data-throughput efficiency in large-scale LLM pre-training and chain-of-thought (CoT) reasoning by representing and processing multiple tokens in superposed forms. TST, as formalized in recent works by Peng et al. (“Efficient Pre-Training with Token Superposition”) and extended in the SuperThoughts framework, operates by “fusing” groups of tokens into single embeddings or latent representations, training the model either to predict multiple subsequent tokens jointly or to compress multi-token reasoning steps into fewer forward passes. This strategy achieves substantial reductions in training and inference FLOPs, with robust empirical performance across model scales and tasks (Peng et al., 7 May 2026, Xiong et al., 11 Jun 2026).

1. Methodological Foundations

TST is implemented through two broad paradigms:

Pre-training throughput optimization (Peng et al., 7 May 2026): In this setting, TST is applied as a two-phase regimen. The model first undergoes a superposition phase in which input tokens are grouped into non-overlapping “bags” of size $s$ , each represented by the arithmetic mean of their embeddings. The LM is then tasked with predicting the next bag of $s$ tokens, using a multi-hot cross-entropy (MCE) objective that generalizes the usual one-hot cross-entropy loss to multi-target settings. After a tunable fraction $r$ of training steps, the process reverts to standard autoregressive next-token prediction for the remainder (“recovery phase”) with no modification to model, optimizer, or tokenizer.
Token-superposition in CoT reasoning (Xiong et al., 11 Jun 2026): Here, the discrete CoT token sequence $c_{1:L_c}$ is partitioned into pairs $(c_{2i-1}, c_{2i})$ , each compressed into a single latent $z_i$ by the Compressor module (either a linear projection or a shallow Transformer). The LLM backbone, acting as a sequence model over $z_i$ , alternates with a lightweight Multi-Token Prediction (MTP) head to predict two CoT tokens per inference step. An adaptive mechanism reverts to single-token decoding when the MTP’s prediction confidence is low.

2. Formal Algorithms and Training Phases

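TST Pre-Training Workflow (Peng et al., 7 May 2026)

Superposition Phase:

Given batch shape $B \times L$ and bag size $s>1$ :

Partition each input into contiguous “bags” $\mathbf t = [t_{i+1}, ..., t_{i+s}]$
Replace each bag $\mathbf t$ with the averaged embedding $\bar{\mathbf e} = \frac{1}{s}\sum_{j=1}^{s} E[t_{i+j}]$ , where $E$ is the token embedding table
Increase either sequence length or batch size by a factor of $s$ to maintain constant per-step FLOPs
For targets, form multi-hot vectors $\mathbf y \in \{0,1\}^{|V|}$ for each bag $\mathbf t' = [t_{i+s+1}, ..., t_{i+2s}]$ of $s$ next tokens, and minimize

$\mathcal{L}_{\mathrm{MCE}} = -\frac{1}{s} \sum_{j=1}^{s} \log p_\theta(t_{i+s+j} \mid \text{preceding bags})$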
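The bagging step can be illustrated with a minimal sketch in plain Python. The function names (`make_bags`, `average_embedding`, `superpose`) and the toy embedding table are hypothetical, and dropping a trailing remainder shorter than $s$ is an assumption of this sketch rather than something the source specifies.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def make_bags(token_ids: Sequence[int], s: int) -> List[List[int]]:
    """Split a sequence into contiguous, non-overlapping bags of size s.

    Assumption: a trailing remainder shorter than s is dropped.
    """
    if s < 1:
        raise ValueError("bag size s must be >= 1")
    n_bags = len(token_ids) // s
    return [list(token_ids[i * s:(i + 1) * s]) for i in range(n_bags)]


def average_embedding(bag: Sequence[int], table: Sequence[Vector]) -> Vector:
    """Arithmetic mean of the embeddings of the tokens in one bag."""
    dim = len(table[0])
    return [sum(table[t][d] for t in bag) / len(bag) for d in range(dim)]


def superpose(
    token_ids: Sequence[int], s: int, table: Sequence[Vector]
) -> Tuple[List[Vector], List[List[int]]]:
    """Return (averaged input per bag, next bag of s target tokens)."""
    bags = make_bags(token_ids, s)
    inputs = [average_embedding(b, table) for b in bags[:-1]]
    targets = bags[1:]  # bag k is trained to predict bag k + 1
    return inputs, targets


# Toy embedding table: token v maps to the vector [v, 2v].
table = [[float(v), 2.0 * v] for v in range(8)]
inputs, targets = superpose([0, 2, 4, 6, 1, 3, 5], s=2, table=table)
```

With $s = 2$, the seven-token toy sequence yields three bags, so the model sees two averaged inputs and two target bags; a sequence of raw length $L$ shrinks to roughly $L/s$ positions, which is why sequence length or batch size can grow by a factor of $s$ at constant per-step cost.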

Recovery Phase:

After a fraction $r$ of the total training steps, revert to plain next-token prediction.
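A minimal sketch of the two-phase schedule follows. The helper names are hypothetical, steps are assumed to be zero-indexed, and rounding the switch point to the nearest step is an assumption; the source states only that the switch happens after a fraction $r$ of total steps.

```python
def superposition_steps(total_steps: int, r: float) -> int:
    """Number of steps spent in the superposition phase."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return round(total_steps * r)


def phase_for_step(step: int, total_steps: int, r: float) -> str:
    """Return which objective a given (zero-indexed) step uses."""
    if step < superposition_steps(total_steps, r):
        return "superposition"  # averaged bags + multi-hot cross-entropy
    return "recovery"           # standard next-token prediction


# The dense runs reported later use 6,000 of 20,000 steps, i.e. r = 0.3.
switch_point = superposition_steps(20_000, 0.3)
```

Because only the data pipeline and loss change at the switch point, the same model, optimizer state, and tokenizer carry over into the recovery phase.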

Parameter Constraints:

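Effective settings:

Bag size $s$ and superposition fraction $r$ ; the reported runs use $s = 6$ with $r = 0.3$ (dense models) and $s = 16$ with $r \approx 0.25$ (10B A1B MoE).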

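Token Superposition in Reasoning ("SuperThoughts") (Xiong et al., 11 Jun 2026)

Compression:

For each token pair $(c_{2i-1}, c_{2i})$ , produce latent $z_i$ , with two Compressor options:

Linear: a linear projection of the pair's token embeddings to a single latent
Tiny Transformer: a shallow Transformer applied to the pair's token embeddings

Inference:

The main LLM processes the latent sequence $z_1, z_2, \dots$ ; at each step $i$ ,
- Predicts odd-indexed token $c_{2i-1}$ via the shared LM head
- MTP head predicts even-indexed token $c_{2i}$ from RMS-normed embeddings and hidden states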

Training Objectives:

Compressor distillation (teacher-student): align hidden states using smoothed L1 loss
Full cross-entropy on all CoT and answer tokens
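For reference, the smoothed L1 loss named in the distillation objective can be sketched as below. This follows the common definition (quadratic below a threshold `beta`, linear above it); the value `beta = 1.0`, the mean reduction, and the function name are assumptions, since the source does not give the exact form or say which hidden states are aligned.

```python
from typing import Sequence


def smooth_l1(student: Sequence[float], teacher: Sequence[float],
              beta: float = 1.0) -> float:
    """Mean smoothed-L1 distance between two equally sized vectors."""
    if len(student) != len(teacher) or not student:
        raise ValueError("inputs must be non-empty and the same length")
    total = 0.0
    for s_val, t_val in zip(student, teacher):
        diff = abs(s_val - t_val)
        if diff < beta:
            total += 0.5 * diff * diff / beta  # quadratic near zero
        else:
            total += diff - 0.5 * beta         # linear for large errors
    return total / len(student)


# Hypothetical student vs. teacher hidden-state slices.
example_loss = smooth_l1([0.0, 3.0], [0.5, 1.0])
```

The quadratic region keeps gradients small once the student is close to the teacher, while the linear region limits the influence of large outlier differences.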

Adaptive Decoding:

On low MTP confidence (below a threshold $\tau$ ), inference reverts to a single-token step.
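The fallback rule can be sketched as a per-step decision. Here confidence is taken to be the MTP head's maximum probability and the threshold is called `tau`; both choices, along with the function name, are assumptions of this sketch, since the source does not define the confidence measure.

```python
from typing import List, Sequence


def adaptive_step(first_token: int, mtp_probs: Sequence[float],
                  tau: float) -> List[int]:
    """Emit one or two tokens for a single decoding step.

    first_token: token predicted by the main LLM head for this step.
    mtp_probs:   MTP head's probability distribution over the vocabulary.
    tau:         confidence threshold (hypothetical name).
    """
    second_token = max(range(len(mtp_probs)), key=mtp_probs.__getitem__)
    confidence = mtp_probs[second_token]
    if confidence >= tau:
        return [first_token, second_token]  # accept the two-token step
    return [first_token]                    # fall back to a single token


confident = adaptive_step(7, [0.1, 0.8, 0.1], tau=0.5)
uncertain = adaptive_step(7, [0.4, 0.35, 0.25], tau=0.5)
```

Raising `tau` makes fallbacks more frequent, trading shorter chains for accuracy; setting it to zero recovers fixed two-token steps.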

3. Empirical Performance

Pre-Training (TST) (Peng et al., 7 May 2026)

Comprehensive evaluation up to 10B A1B scale demonstrates:

At 10B A1B MoE scale, TST reaches a slightly lower final loss than the baseline with roughly $2.6\times$ less compute (4,768 vs. 12,311 B200 GPU-hours).
For dense models (270M, 600M, 3B) at equal GPU-hours, TST matches or outperforms baselines in final validation loss and 0-shot downstream tasks.

Model	Parameters	Phase I Steps / Total	Bag size $s$	Equiv. Tokens	B200 GPU-hours	Final Loss
Dense Baseline (270M)	270M	–	–	42B	34	3.212
Dense TST (270M)	270M	6,000/20,000	6	105B	34	3.142
Dense Baseline (600M)	600M	–	–	42B	61	3.019
Dense TST (600M)	600M	6,000/20,000	6	105B	61	2.943
Dense Baseline (3B)	3B	–	–	42B	247	2.808
Dense TST (3B)	3B	6,000/20,000	6	105B	247	2.676
MoE Baseline (10BA1B)	10BA1B	–	–	1.05T	12,311	2.252
MoE TST (10BA1B)	10BA1B	12,483/49,983	16	2.00T	4,768	2.236
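The quantities implied by the table can be checked with a few lines of arithmetic. The figures below are copied from the table; the dictionary keys and helper names are hypothetical.

```python
# (phase-I steps, total steps, baseline loss, TST loss) from the table above.
runs = {
    "dense_270m": (6_000, 20_000, 3.212, 3.142),
    "dense_600m": (6_000, 20_000, 3.019, 2.943),
    "dense_3b": (6_000, 20_000, 2.808, 2.676),
    "moe_10b_a1b": (12_483, 49_983, 2.252, 2.236),
}

MOE_BASELINE_HOURS = 12_311
MOE_TST_HOURS = 4_768


def superposition_fraction(phase1_steps: int, total_steps: int) -> float:
    """Fraction r of training spent in the superposition phase."""
    return phase1_steps / total_steps


def loss_improvement(baseline_loss: float, tst_loss: float) -> float:
    """Absolute reduction in final loss (positive means TST is better)."""
    return baseline_loss - tst_loss


summary = {
    name: (superposition_fraction(p1, total), loss_improvement(base, tst))
    for name, (p1, total, base, tst) in runs.items()
}
gpu_hour_ratio = MOE_BASELINE_HOURS / MOE_TST_HOURS

for name, (r, delta) in summary.items():
    print(f"{name}: r = {r:.3f}, loss improvement = {delta:.3f}")
print(f"MoE baseline / TST GPU-hours: {gpu_hour_ratio:.2f}x")
```

The dense rows all use $r = 0.3$ at identical GPU-hours, so the loss improvement there is a like-for-like comparison; the MoE row instead compares a shorter TST run against a longer baseline run.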

On standard downstream tasks (HellaSwag, ARC, MMLU, BoolQ, PIQA), TST achieves parity or gains relative to baseline training.

Chain-of-Thought Reasoning (SuperThoughts) (Xiong et al., 11 Jun 2026)

Across Qwen2.5-Math models (1.5B, 7B, 14B):

CoT length reduction: 20–35% under adaptive decoding, up to ~50% with fixed 2-token steps.
Accuracy drop: Typically within about 1–2 percentage points on the major benchmarks (MATH500, AMC23, OlympiadBench, GPQA-Diamond).
Linear compressor matches Transformer variant—preferred for efficiency.

Example results for Qwen2.5-Math-7B-Instruct:

Baseline (MATH500): 83.0% / 538.6 tokens
SuperThoughts: 80.8% / 357.3 tokens (–34% length, –2.2pp accuracy)

Wall-clock speedups are confirmed (e.g., a 32.8% CoT reduction yields a 28.3% end-to-end time reduction). Non-adaptive use (fixed two-token steps with no confidence fallback) cuts chain length further but substantially harms accuracy (–10–20pp).
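The headline percentages for Qwen2.5-Math-7B-Instruct on MATH500 follow directly from the reported numbers, as this short check shows (the helper name is hypothetical):

```python
def relative_reduction(baseline: float, compressed: float) -> float:
    """Fractional reduction of `compressed` relative to `baseline`."""
    return (baseline - compressed) / baseline


# Reported MATH500 figures: accuracy (%) and mean CoT length (tokens).
baseline_acc, baseline_len = 83.0, 538.6
super_acc, super_len = 80.8, 357.3

length_cut = relative_reduction(baseline_len, super_len)
accuracy_drop_pp = baseline_acc - super_acc

print(f"CoT length reduction: {length_cut:.1%}")
print(f"Accuracy drop: {accuracy_drop_pp:.1f} pp")
```

The length reduction rounds to the quoted –34%, and the accuracy gap is the quoted 2.2 percentage points.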

4. Mechanistic Insights

TST’s improvements in throughput and data efficiency derive from two complementary mechanisms (Peng et al., 7 May 2026):

Input Superposition: The early-phase “averaged” embeddings provide local and global statistical priors cheaply over large corpora at reduced precision, enabling faster initial convergence.
Output Superposition: Predicting bags of tokens via MCE aggregates multiple local targets, increasing the effective data exposure per gradient step.

In the SuperThoughts formulation, latent superposition condenses reasoning steps, but joint supervision and adaptive fallback preserve the fidelity of multi-step reasoning. Critically, all such superposition is restricted to pre-training or intermediate latent space—no inference-time architectural changes are required in TST for LLMs, and SuperThoughts introduces negligible computational overhead with the lightweight MTP head.

Ablation studies demonstrate that both input and output superposition individually surpass the baseline, but their combination provides maximal benefit. Proper sharing of the token embedding and head between phases is essential; disruption by reinitialization negates the advantage.

5. Hyperparameters, Robustness, and Implementation

Default settings: For TST pre-training, moderate bag sizes $s$ and superposition fractions $r$ provide robust gains; the reported runs use $s = 6$ with $r = 0.3$ and $s = 16$ with $r \approx 0.25$ . Uniform averaging in the MCE loss is recommended for small bags, while more complex weighting may help for larger bags.
Adaptivity: In multi-token inference, an MTP confidence threshold $\tau$ controls the fallback policy, optimizing the FLOP–accuracy tradeoff.
Implementation: TST requires no modification to model architecture, optimizer, tokenizer, or parallelization. MCE is realized by repeatedly applying one-hot cross-entropy in a loop. For SuperThoughts, a linear compressor suffices.
Ablations: Varying $s$ and $r$ indicates a U-shaped loss landscape in $s$ (optimal at medium values) and an optimal intermediate band for $r$ .
Inference: All superposition code is removed after the superposition phase in TST; the model continues with the baseline recipe. In SuperThoughts, the main LLM and MTP modules alternate, with confidence-based fallbacks minimizing quality loss.
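The implementation note about realizing MCE with repeated one-hot cross-entropy can be sketched as follows, using uniform averaging over the bag. The function names are hypothetical, and a real training loop would use a framework's batched cross-entropy rather than this scalar version.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def one_hot_ce(logits: Sequence[float], target: int) -> float:
    """Standard cross-entropy against a single target token."""
    return -log_softmax(logits)[target]


def mce_loss(logits: Sequence[float], bag: Sequence[int]) -> float:
    """Multi-hot cross-entropy as a uniform average of one-hot losses."""
    if not bag:
        raise ValueError("target bag must contain at least one token")
    return sum(one_hot_ce(logits, t) for t in bag) / len(bag)


# One position, vocabulary of 4, target bag of s = 2 next tokens.
uniform_loss = mce_loss([0.0, 0.0, 0.0, 0.0], bag=[1, 3])
```

With a bag of size one, `mce_loss` reduces to ordinary next-token cross-entropy, which is why the recovery phase needs no new loss code.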

6. Impact, Limitations, and Relations to Adjacent Methods

TST achieves significant data-throughput scaling without intrusive algorithmic changes or inference penalty. It is compatible with standard dense and mixture-of-experts (MoE) architectures. A core distinction from other compression or acceleration strategies is the explicit preservation of next-token supervision and seamless reversibility: after the superposition phase, the fully expressive autoregressive model is recovered and all efficiency gains are realized in pre-training compute, with no downstream tradeoff.

SuperThoughts extends TST principles to structured reasoning, compressing discrete CoT chains for double-token throughput with minimal degradation—outperforming methods that rely purely on latent-state reasoning without token-level supervision.

A plausible implication is that token-superposition may catalyze further advances in model pre-training regimes, multi-token prediction objectives, and efficient inference strategies across large-scale NLP systems. However, strong accuracy preservation at even higher compression ratios or in settings with highly non-local token dependencies remains to be systematically explored. Constraints on maximal $s$ , optimal $r$ , and architectural compatibility are empirically delineated but could be model-specific.

7. Summary Table: Comparative Results for TST Pre-Training

Model	Scale	Bag size $s$	Superposition/Total Steps	Equiv. Tokens	GPU Time	Final Loss
Dense Baseline	3B	–	–	42B	247 h	2.808
Dense TST	3B	6	6,000/20,000	105B	247 h	2.676
MoE TST	10B A1B	16	12,483/49,983	2.00T	4,768 h	2.236

TST and its CoT variant provide a simple, robust, and backward-compatible mechanism for improving LLM efficiency without sacrificing inference quality or requiring complex architectural overhauls (Peng et al., 7 May 2026, Xiong et al., 11 Jun 2026).

References

Peng et al. Efficient Pre-Training with Token Superposition (2026)

Xiong et al. SuperThoughts: Reasoning Tokens in Superposition (2026)
