Tokens Per Parameter: Scaling Language Models
- Tokens Per Parameter (TPP) is a metric that quantifies the ratio of training tokens to model parameters, reflecting compute efficiency and informing model scaling strategies.
- It plays a crucial role in optimizing both dense and sparse architectures, with the optimal token-to-parameter ratio varying with task characteristics such as reasoning versus memorization.
- Advanced techniques such as pause tokens, latent tokens, and tailored training schedules further enhance TPP efficiency, driving improvements in model performance and robustness.
Tokens Per Parameter (TPP) is a foundational metric for analyzing, scaling, and interpreting the efficiency and capability of LLMs, particularly within the context of different model architectures, training regimens, and downstream task requirements. TPP quantifies the ratio of the total number of tokens used to train a model to the total number of its parameters, and it further serves as a diagnostic and optimization tool in both dense and sparsely activated architectures, including decoder-only Transformers and Mixture-of-Experts (MoE) models. Recent research demonstrates that TPP is not a universal constant but a variable with critical dependence on both model sparsity and the nature of the task—for example, memorization versus reasoning—requiring careful calibration to achieve optimal compute scaling, generalization, and robustness.
1. Definition and Formalization of Tokens Per Parameter (TPP)
Tokens Per Parameter (TPP) is formally expressed as

$$\mathrm{TPP} = \frac{D}{N},$$

where $D$ is the total number of training tokens and $N$ is the total number of model parameters.
In dense models, TPP primarily reflects the proportion of data seen relative to model capacity during training; scaling laws such as the Chinchilla recipe posit an optimal TPP (e.g., $20$) for efficient training of dense Transformers (Nakamura et al., 26 Aug 2025). In MoE architectures, TPP must be interpreted in conjunction with active sparsity—namely, how many parameters are activated per token—as total parameter count alone does not reflect the effective compute.
A key principle: memorization tasks benefit from lower TPP (more parameters per unit data), while reasoning tasks find maximal accuracy at a specific “sweet spot” of TPP, typically near $20$ (Nakamura et al., 26 Aug 2025). This non-monotonicity underscores the need for task- and architecture-aware compute allocation.
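As a concrete illustration of the definition, the following minimal sketch computes TPP for a given training run and the token budget implied by a Chinchilla-style target of roughly $20$ tokens per parameter; the function names and example figures are illustrative rather than taken from the cited papers.

```python
def tokens_per_parameter(num_tokens: float, num_params: float) -> float:
    """TPP = total training tokens / total model parameters."""
    return num_tokens / num_params

def chinchilla_token_budget(num_params: float, target_tpp: float = 20.0) -> float:
    """Token budget implied by a Chinchilla-style target of ~20 tokens per parameter."""
    return target_tpp * num_params

# Illustrative example: a 7B-parameter dense model trained on 2T tokens.
n_params, n_tokens = 7e9, 2e12
print(f"TPP = {tokens_per_parameter(n_tokens, n_params):.1f}")               # ~285.7
print(f"Chinchilla-style budget = {chinchilla_token_budget(n_params):.2e}")  # 1.40e+11 tokens
```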
2. TPP in Dense and Sparse Model Architectures
Dense Transformers
In standard decoder-only models, TPP emerges directly from the total parameter count and the size of the training corpus. Scaling laws suggest compute- and loss-efficient training when the token count is set at approximately $20$ times the parameter count. Optimal scaling is achieved when both model and dataset sizes are tuned to this regime (Nakamura et al., 26 Aug 2025).
Mixture-of-Experts Models
MoE models dramatically increase total (inactive) parameter count while keeping the per-token active compute constant via top-$k$ expert selection (Nakamura et al., 26 Aug 2025). The relevant sparsity is given by $s = 1 - N_{\text{active}} / N_{\text{total}}$, the fraction of parameters left unused when processing each token.
Reasoning performance increases with TPP up to the optimal level and with active FLOPs, i.e., the actual compute expended per token through routing. Memorization benefits monotonically from decreased TPP (i.e., larger model sizes).
Practical Table: TPP and Active Compute
| Task Type | TPP Trend | Active Compute Influence |
|---|---|---|
| Reasoning | Accuracy peaks near TPP $\approx 20$ | Higher top-$k$ increases accuracy |
| Memorization | Improves monotonically as TPP decreases | Less sensitive |
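To make the notions of total versus active parameters concrete, here is a minimal sketch, using made-up configuration numbers, that computes the active parameter count, the sparsity fraction defined above, and TPP measured against both totals for a hypothetical top-$k$ MoE model.

```python
def moe_total_params(shared: float, per_expert: float, num_experts: int) -> float:
    """Total parameters: shared (attention, embeddings) plus all experts."""
    return shared + num_experts * per_expert

def moe_active_params(shared: float, per_expert: float, top_k: int) -> float:
    """Parameters activated per token: shared plus the top-k routed experts."""
    return shared + top_k * per_expert

# Hypothetical configuration (illustrative only).
shared, per_expert, n_experts, k = 2e9, 1e9, 64, 2
total = moe_total_params(shared, per_expert, n_experts)
active = moe_active_params(shared, per_expert, k)
sparsity = 1.0 - active / total           # fraction of parameters inactive per token

tokens = 1.3e12                           # training tokens (illustrative)
print(f"total={total:.2e}  active={active:.2e}  sparsity={sparsity:.1%}")
print(f"TPP vs. total params:  {tokens / total:.1f}")
print(f"TPP vs. active params: {tokens / active:.1f}")
```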
3. TPP Augmentation via Pause Tokens and Latent Tokens
Pause tokens, as introduced in decoder-only Transformers, increase internal computation per output token by appending learnable dummy tokens ("<pause>") to the input sequence (Goyal et al., 2023). This allows the model to compute extra hidden vectors before committing to the next output token. For an input prefix $x_1, \dots, x_N$ augmented with $M$ pause tokens, the model processes the sequence

$$(x_1, \dots, x_N, \langle\text{pause}\rangle_1, \dots, \langle\text{pause}\rangle_M),$$

and the prediction for the $(N{+}1)$-th token is read from the hidden state at the final pause position, effectively "expanding" compute per token and raising effective TPP without substantially increasing model parameters.
Empirical evidence shows this results in notable EM score improvements on SQuAD (+18%), CommonSenseQA (+8%), and GSM8K (+1%) (Goyal et al., 2023).
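A minimal sketch of pause-token inference under two assumptions: a generic autoregressive `model` callable that returns per-position logits, and a reserved pause-token id; neither is the actual interface of the cited work.

```python
PAUSE_ID = 50257  # hypothetical id reserved for the learnable "<pause>" token

def next_token_logits_with_pauses(model, prefix_ids, num_pauses: int = 10):
    """Append M learnable <pause> tokens to the prefix and read the next-token
    prediction from the position of the last pause, so the model spends extra
    forward computation before committing to an output token.
    (During pause-training, no loss is applied at the pause positions.)"""
    augmented = list(prefix_ids) + [PAUSE_ID] * num_pauses
    logits = model(augmented)   # assumed: returns one logit vector per position
    return logits[-1]           # prediction conditioned on prefix + pauses
```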
Latent tokens similarly act as non-verbal, learnable tokens inserted at strategic positions (e.g., before commas) to provide auxiliary computation and improve TPP for tasks requiring long generation, information retrieval, or instruction adherence (Sun et al., 19 May 2025). Latent tokens share the positional encoding of the subsequent verbal token, so attention mechanisms treat them as co-located computational blocks.
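The position-sharing behavior can be sketched as follows; the token ids and insertion rule are placeholders, and the key point is only that each inserted latent token copies the position index of the verbal token that follows it.

```python
LATENT_ID = 50258  # hypothetical id for a learnable latent token

def insert_latent_tokens(token_ids, insert_before, num_latent: int = 1):
    """Insert latent tokens before selected positions (e.g., before commas),
    assigning each latent token the same position id as the verbal token it precedes."""
    ids, pos, next_pos = [], [], 0
    for i, tok in enumerate(token_ids):
        if i in insert_before:
            ids.extend([LATENT_ID] * num_latent)
            pos.extend([next_pos] * num_latent)   # share position with the next verbal token
        ids.append(tok)
        pos.append(next_pos)
        next_pos += 1
    return ids, pos

# Example: insert one latent token before the token at index 3.
ids, pos = insert_latent_tokens([11, 12, 13, 14, 15], insert_before={3})
print(ids)  # [11, 12, 13, 50258, 14, 15]
print(pos)  # [0, 1, 2, 3, 3, 4]
```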
4. Impact of Temporal Compression and Token Redundancy Reduction
In multimodal models such as xGen-MM-Vid (BLIP-3-Video), temporal encoders and Token Turing Machines are used to compress large sets of frame-level tokens into a compact representation of as few as 32 tokens, drastically increasing TPP and overall efficiency (Ryoo et al., 21 Oct 2024). Token redundancy reduction modules (FPET) merge semantically similar tokens in self-attention layers using differentiable bipartite matching strategies supported by straight-through estimators (Kim et al., 26 Mar 2025). These processes allow for faster inference, reduced memory consumption, and competitive accuracy by maximizing per-parameter computational effectiveness.
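As a rough illustration of redundancy reduction by token merging, the sketch below performs a simplified bipartite matching in plain numpy: tokens are split into two alternating sets and the most similar cross-set pairs are averaged. The differentiable matching and straight-through estimator of the cited FPET module are omitted, so this is only a conceptual stand-in.

```python
import numpy as np

def bipartite_merge(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most similar cross-set token pairs by averaging.
    tokens: (N, D) token embeddings; returns r fewer tokens (assumes r <= N // 2)."""
    a, b = tokens[0::2].copy(), tokens[1::2].copy()    # alternate split into two sets
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = a_n @ b_n.T                                  # cosine similarity, shape (|a|, |b|)
    best_b = sim.argmax(axis=1)                        # best partner in b for each a-token
    best_sim = sim[np.arange(len(a)), best_b]
    merge_from_a = set(np.argsort(-best_sim)[:r])      # a-tokens with the strongest matches
    kept_a = []
    for i in range(len(a)):
        if i in merge_from_a:
            b[best_b[i]] = (a[i] + b[best_b[i]]) / 2.0  # fold the a-token into its partner
        else:
            kept_a.append(a[i])
    return np.concatenate([np.stack(kept_a), b]) if kept_a else b

# Example: reduce 8 tokens to 6.
print(bipartite_merge(np.random.randn(8, 16), r=2).shape)  # (6, 16)
```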
5. TPP Interplay with Training Schedules and Optimization
Learning rate schedules directly affect the utility of high TPP regimes. Linearly decaying the learning rate to zero (D2Z) balances early bias reduction and late variance reduction, outperforming cosine decay (which only decays to a fraction) at high TPP (Bergsma et al., 21 Feb 2025). The exponential moving average (EMA) perspective on AdamW illustrates how D2Z leads to tighter averaging of recent updates as TPP increases, resulting in lower training and validation loss with up to 60% compute savings.
Key formulae:
- Generalization error bound: the excess risk is bounded by the sum of a bias term and a variance term, with large early learning rates reducing the bias and small late learning rates suppressing the variance.
- EMA weight: reading AdamW with decoupled weight decay $\lambda$ as an exponential moving average, the update at step $t$ carries effective weight $c_t = \eta_t \lambda \prod_{s>t} (1 - \eta_s \lambda)$ in the final parameters.
Under D2Z, late-stage updates are fine-grained, matching the needs of high-TPP, data-rich regimes.
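The schedules and the EMA view can be illustrated numerically. The sketch below compares a linear-to-zero schedule with a cosine schedule decaying to 10% of the peak rate and computes, for each, the share of the final parameters' EMA weight contributed by the last 10% of steps using the coefficient $c_t$ above; the peak rate, weight decay, and step counts are illustrative, not those of the cited paper.

```python
import math

def lr_d2z(step, total_steps, peak=3e-4, warmup=100):
    """Linear warmup, then linear decay to exactly zero (D2Z)."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, 1.0 - (step - warmup) / (total_steps - warmup))

def lr_cosine(step, total_steps, peak=3e-4, warmup=100, floor_frac=0.1):
    """Linear warmup, then cosine decay to a fraction of the peak rate."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return peak * (floor_frac + (1 - floor_frac) * 0.5 * (1 + math.cos(math.pi * progress)))

def ema_coefficients(lrs, weight_decay=0.1):
    """c_t = eta_t * lambda * prod_{s > t} (1 - eta_s * lambda): the weight each
    step's update carries when AdamW is read as an exponential moving average."""
    coeffs, tail_prod = [0.0] * len(lrs), 1.0
    for t in range(len(lrs) - 1, -1, -1):           # accumulate the product from the end
        coeffs[t] = lrs[t] * weight_decay * tail_prod
        tail_prod *= (1.0 - lrs[t] * weight_decay)
    return coeffs

T = 1000
for name, sched in [("D2Z", lr_d2z), ("cosine-to-10%", lr_cosine)]:
    lrs = [sched(s, T) for s in range(T)]
    c = ema_coefficients(lrs)
    share_last_decile = sum(c[int(0.9 * T):]) / sum(c)
    print(f"{name}: last 10% of steps hold {share_last_decile:.1%} of the EMA weight")
```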
6. Conceptual Implications and Task-Optimal TPP Calibration
TPP is not task-invariant. For MoE models, reasoning requires an optimal TPP (empirically near 20) and increased active FLOPs (via higher top-$k$), while memorization benefits from maximal parameterization and can tolerate very low TPP (Nakamura et al., 26 Aug 2025). Architectures incorporating pause tokens, latent tokens, temporal compression, or redundancy reduction further decouple computational width/depth from parameter count, enabling novel scaling and adaptation strategies.
This suggests that compute-optimal scaling must be jointly determined by the interplay between total learned parameters, the number of training tokens, task requirements, and the realized active compute per token. Empirical findings show that neither RL post-training nor test-time compute adjustment modifies these fundamental trends, reinforcing that TPP tuning is key in both pre-training and architecture selection.
7. Future Directions
Contemporary research points to several open avenues:
- Adaptive TPP scheduling per downstream task, especially in data-limited or compute-constrained scenarios.
- Dynamic adjustment of pause/latent token placement based on model confidence or input content.
- Deeper theoretical analysis of how TPP interacts with the latent implementation capacity of Transformers and MoE models.
- Generalizing TPP optimization to encoder-decoder architectures and multi-modal learning frameworks.
- Investigating the correlation between TPP and model robustness, OOD generalization, and load balancing considerations.
The evolving view of TPP positions it as a central axis for diagnostic, scaling, and architectural innovation in large-scale language modeling, with significant implications for task-specific performance, resource efficiency, and generalization.
In summary, TPP measurement and optimization represent an essential aspect of both dense and sparse model scaling, deeply intertwined with architectural choices (e.g., MoE sparsity, pause tokens, latent computation), training schedules, and downstream task alignment. Achieving optimal model performance and efficiency requires careful calibration of tokens per parameter in concert with active compute provisions, supporting nuanced approaches to computational resource allocation and model development for advanced reasoning and memorization tasks.