Efficient Pre-Training with Token Superposition

Published 7 May 2026 in cs.CL | (2605.06546v1)

Abstract: Pre-training of LLMs is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.

Summary

  • The paper introduces Token Superposition Training (TST) to reduce pre-training time by up to 2.5x without altering the model architecture.
  • TST employs a two-phase approach using input superposition and multi-hot cross-entropy for output prediction to boost tokens-per-FLOP.
  • Experimental results across scales show robust improvements in loss and downstream task performance while preserving inference behavior.

Efficient Pre-Training of LLMs via Token Superposition Training

Introduction and Motivation

Efficient scaling of LLMs is critically bottlenecked by data throughput and compute efficiency during pre-training, especially as state-of-the-art models increasingly leverage overtraining far beyond compute-optimal regimes to maximize inference-time performance. Existing methods to improve efficiency modulate input representations (tokenization advances, auxiliary losses), decrease per-token compute (sparse MoEs, attention), or compress intermediate representations (compressive architectures). However, none decouple training-time efficiency from architectural and inference changes without adding system or model complexity.

This paper introduces Token Superposition Training (TST), a self-contained training paradigm that achieves up to a 2.5x reduction in pre-training time at the 10B A1B mixture-of-experts scale, purely by increasing data throughput per FLOP, with no changes to the optimizer, parallelism, tokenizer, data pipeline, or model architecture (2605.06546). TST runs in two phases: (i) a "superposition" phase in which $s$ contiguous tokens are embedded and predicted jointly as bags via a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase that reverts to standard autoregressive (AR) training, starting from the weights learned during superposition. Crucially, inference-time behavior and architecture are unaffected.

Method: Token Superposition Training

TST comprises two orthogonal mechanisms:

Input Superposition: During phase (i), token sequences are folded into non-overlapping bags of size $s$. These bags are superposed by averaging their embeddings prior to model input, reducing effective sequence length to $L/s$ per training example, but increasing raw data consumption and thus tokens-per-FLOP by $s\times$.
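
As a rough illustration of the folding and averaging described above, the following pure-Python sketch maps $L$ raw tokens to $L/s$ superposed input vectors. The function names, the toy embedding table, and the choice to drop a trailing remainder shorter than $s$ are assumptions made for this example, not details taken from the paper.

```python
def fold_into_bags(token_ids, s):
    """Split a token sequence into non-overlapping bags of s contiguous tokens.

    Assumption: a trailing remainder shorter than s is dropped.
    """
    usable = len(token_ids) - len(token_ids) % s
    return [token_ids[i:i + s] for i in range(0, usable, s)]


def superpose_bag(bag, embedding_table):
    """Average the embedding vectors of the tokens in one bag."""
    dim = len(embedding_table[0])
    return [
        sum(embedding_table[tok][d] for tok in bag) / len(bag)
        for d in range(dim)
    ]


def superposed_inputs(token_ids, embedding_table, s):
    """Map L raw tokens to L // s superposed input vectors."""
    return [superpose_bag(bag, embedding_table)
            for bag in fold_into_bags(token_ids, s)]


# Toy example: vocabulary of 4 tokens, embedding dimension 2 (hypothetical values).
embedding_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens = [1, 2, 3, 3, 0, 1, 2]  # L = 7; the last token is dropped for s = 2
inputs = superposed_inputs(tokens, embedding_table, s=2)
```

Here seven raw tokens become three model positions, so the same number of positions covers roughly twice as much raw text.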

Output Superposition: Rather than predicting a single next token, the model predicts the next bag of $s$ tokens (the next $s$ contiguous tokens, unordered) using a multi-hot cross-entropy (MCE) loss. The target distribution is uniform over the correct bag. This modifies training to encourage latent representations informative of multiple continuation possibilities.
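
One way to read the MCE objective is as an ordinary cross-entropy against a target that spreads its mass evenly over the tokens of the next bag. The sketch below follows that reading in pure Python. The helper names are hypothetical, and the handling of repeated tokens (a token occurring $k$ times receives mass $k/s$) is an assumption, since the text above does not say how duplicates are treated.

```python
import math
from collections import Counter


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bag_target(bag, vocab_size):
    """Target distribution spread evenly over the tokens of the bag.

    Assumption: a token occurring k times in the bag gets mass k / len(bag).
    """
    counts = Counter(bag)
    return [counts.get(v, 0) / len(bag) for v in range(vocab_size)]


def multi_hot_cross_entropy(logits, bag):
    """Cross-entropy between the bag target and softmax(logits)."""
    log_probs = log_softmax(logits)
    target = bag_target(bag, len(logits))
    return -sum(t * lp for t, lp in zip(target, log_probs) if t > 0.0)


# Toy example: vocabulary of 4, next bag contains tokens 1 and 3.
uniform_logits = [0.0, 0.0, 0.0, 0.0]
loss = multi_hot_cross_entropy(uniform_logits, bag=[1, 3])
```

With a bag of size one, this reduces to the usual next-token cross-entropy, which is consistent with the recovery phase reusing the same output layer.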

After a fraction $r$ of the training steps has been spent in TST, the model resumes standard AR next-token prediction from the updated weights, with no architectural, optimizer, or data changes.
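
The two-phase schedule and its effect on data consumption can be sketched as simple arithmetic. The example assumes that the superposition phase occupies the first `round(r * total_steps)` steps and that every step fills the same number of model positions, so that a superposition position covers $s$ raw tokens at roughly equal cost; both are simplifying assumptions for illustration, and the run sizes are hypothetical.

```python
def tst_phase(step, total_steps, r):
    """Return which phase a (0-indexed) training step belongs to.

    Assumption: the first round(r * total_steps) steps use superposition.
    """
    switch_step = round(r * total_steps)
    return "superposition" if step < switch_step else "recovery"


def raw_tokens_consumed(total_steps, positions_per_step, s, r):
    """Raw tokens seen over the run if every step fills the same number of
    model positions and each superposition position covers s raw tokens."""
    switch_step = round(r * total_steps)
    superposed = switch_step * positions_per_step * s
    standard = (total_steps - switch_step) * positions_per_step
    return superposed + standard


# Hypothetical run: 1000 steps, 4096 positions per step, s = 4, r = 0.25.
baseline = raw_tokens_consumed(1000, 4096, s=1, r=0.0)
with_tst = raw_tokens_consumed(1000, 4096, s=4, r=0.25)
multiplier = with_tst / baseline  # equals r * s + (1 - r) under these assumptions
```

Under these assumptions the run consumes $r s + (1 - r)$ times as many raw tokens as the baseline, which also makes explicit why the method needs more data for the same step budget.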

Figure 2: Superposition results with respect to loss at varying bag sizes and superposition step ratios, showing strong improvement as TST is applied.

Experimental Results

Extensive experiments are reported at 270M, 600M, and 3B parameters and on a Qwen3-derived 10B A1B mixture-of-experts (MoE) model:

  • IsoFLOPs / IsoLoss Regimes: Under equal FLOPs, TST consistently reaches lower loss and better downstream evaluation results than compute-matched AR baselines; under equal final loss, it requires less total training time.
  • Speedup: For the 10B A1B model, TST achieves up to a 2.5x reduction in total pre-training time to reach the same loss as the baseline.
  • Evaluation: 0-shot performance on a diverse suite of tasks (e.g., ARC-Challenge, HellaSwag, Winogrande, MMLU) demonstrates robust gains for TST-trained models, with maximal improvements at bag sizes $s=4$–$8$ and moderate step ratios starting at $r=0.2$.

    Figure 4: Downstream evaluation metrics at varying superposition bag sizes and TST ratios, aggregated over major LM benchmarks, illustrating maximal benefits for moderate bag sizes and step ratios.

(TST is found to be robust to the bag-size and step-ratio hyperparameters; performance degrades only for excessive $s$ or $r$.)

Analysis and Ablations

Input vs. Output Superposition

Ablation experiments demonstrate that both input and output superposition independently improve over standard AR training, but their combination realizes synergistic gains. Input superposition alone can be interpreted as coarse-to-fine curriculum, echoing recent findings from ViT patch resizing and subword-to-byte training schedules.

Loss Weighting

The MCE loss can be weighted across tokens in the output bag. Empirically, uniform weighting is optimal at smaller bag sizes; at larger $s$, a power-law weighting (reflecting the long-range decay of token mutual information) gives slight improvements.

Figure 6: Comparison between uniform and power-law output loss weighting at various superposition settings; power-law improves stability for large $s$.
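
To make the two weighting schemes concrete, the sketch below builds normalized per-position weights for a bag and applies them to the negative log-likelihood of the bag tokens. The functional form (weights proportional to $k^{-\alpha}$ for the $k$-th token ahead), the exponent value, and the function names are illustrative assumptions; the summary does not give the exact parameterization. Note that position-dependent weights presume the order of tokens within the target bag is still available when the loss is computed.

```python
def bag_weights(s, alpha=0.0):
    """Normalized weights over the s positions of the target bag.

    alpha = 0 gives uniform weighting; alpha > 0 gives weights proportional
    to k ** -alpha for the k-th token ahead (k = 1..s). Both the functional
    form and the exponent are illustrative assumptions.
    """
    raw = [k ** -alpha for k in range(1, s + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_bag_loss(log_probs, bag, alpha=0.0):
    """Weighted negative log-likelihood of the bag tokens.

    log_probs[v] is the model's log-probability for vocabulary entry v;
    bag lists the next s tokens in their original order.
    """
    weights = bag_weights(len(bag), alpha)
    return -sum(w * log_probs[tok] for w, tok in zip(weights, bag))


uniform = bag_weights(4)             # equal mass on each position
decayed = bag_weights(4, alpha=1.0)  # nearer tokens weigh more
```

Setting `alpha=0.0` recovers the uniform MCE target, so the two schemes can be compared by changing a single parameter.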

Phase Alignment

If the embedding and output layers are not aligned across the two TST phases (e.g., because they are randomly re-initialized before recovery), all gains are lost. This highlights how critical representation alignment is when transferring from the superposition regime to AR training, and explains why previous compressive methods required explicit adapters.

Comparison to Prior Art

TST differs fundamentally from Multi-Token Prediction (MTP) or future summary prediction [gloeckle_better_2024; mahajan_beyond_2025]: these add auxiliary loss terms or heads, but do not increase tokens-per-FLOP nor yield robust improvements without extra parameters or tuning. TST is orthogonal, as it directly increases data efficiency per compute.

Unlike compressive or byte-level architectures [minixhofer_bolmo_2025; pagnoni_byte_2025], TST does not modify autoregressive inference at all, evading post-training adaptation or complexity.

Limitations and Future Work

TST is most advantageous in the compute-bound regime; if data is scarce, its increased data appetite may be a liability. Output-only superposition, however, provides moderate improvements without increased data consumption, and future work should combine TST with auxiliary losses or alternate tokenization (e.g., byte-level or concept-level). The mechanistic basis (curricular vs. geometric regularization) for TST’s effect remains open; interpretability and scaling law analyses are warranted.

Theoretical and Practical Implications

TST demonstrates that token-level input/output granularity and training curriculum can be decoupled from inference architecture, increasing training efficiency without architectural reforms. This supports the hypothesis that much of subword tokenization’s advantage is via induced training throughput, rather than subword priors per se [gigant_decoupling_2026]. Compressing training inputs or outputs—without increasing inference cost or sacrificing expressivity—constitutes a reusable, domain-agnostic optimization applicable to LLMs and potentially other sequence models.

Conclusion

Token Superposition Training provides a robust, easily integrable framework for accelerating LLM pre-training by maximizing sample throughput under fixed compute. It yields consistently lower loss and improved evaluation scores with no inference cost or system complexity, and outpaces alternative methods that require auxiliary architectural changes. The framework opens new directions for hybrid curricula, multi-granularity LMs, and principled data-compute co-optimization. The alignment of representation space across learning phases emerges as the critical element for successful compressive curriculum, resolving limitations of prior approaches.

Figure 1: Learning rate sweeps validate optimal hyperparameters for all model sizes; TST performance is robust given correct scheduling.

Figure 3: Downstream evals (HellaSwag, ARC-Easy) across TST settings confirm systematic, stable improvement compared to baseline AR models.

Reference:

Efficient Pre-Training with Token Superposition (2605.06546)
