
Tokenizer Regression for Optimal Data Mixture (TREX)

Updated 27 January 2026
  • TREX is a regression-based framework that optimizes data mixtures for efficient and equitable multilingual tokenizer performance.
  • It utilizes proxy experiments and LightGBM regression to predict compression outcomes, significantly reducing computational expense.
  • Results demonstrate notable improvements in compression efficiency and performance over traditional heuristic mixing in diverse language settings.

Tokenizer Regression for Optimal Data Mixture (TREX) encompasses a family of methodologies for optimizing or inferring mixture ratios—typically in multilingual or multi-domain contexts—where the mixture proportions critically affect downstream model compression or inference. In modern LLMs, especially those with a multilingual focus, the efficiency and equity of subword tokenization are directly determined by the mixture of training data across languages or domains. The "TREX" framework provides a regression-based approach to efficiently identify the optimal data mixture for tokenizer training, achieving compression gains over traditional heuristics while drastically reducing computational expense. Independent work has exploited properties of learned tokenizers to reverse-engineer (infer) their underlying data mixtures, establishing the breadth of TREX's applicability from design to forensic analysis.

1. Multilingual Tokenizer Optimization: Problem Definition and Motivation

LLMs trained on data spanning many languages or domains require tokenizers whose segmentation is equitable and efficient across constituents. The tokenizer's compression performance, quantified by the average number of tokens required to represent reference text (Normalized Sequence Length, NSL), strongly impacts both LLM training cost (FLOPs) and inference speed. Notably, compression efficiency is a function of both vocabulary size and the mixture ratios $w=(w_1,\dots,w_L)$ of the language/domain data used in tokenizer construction. Imbalanced mixtures typically result in overfitting to high-resource constituents and underperformance for low-resource languages, manifesting as poor coverage and inflated sequence lengths.
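To make the NSL metric concrete, the following minimal sketch computes it as a mean per-text ratio of token counts against a reference tokenizer, which is the ratio form given in Section 2. The two toy tokenizers (`word_tokenizer`, `chunk_tokenizer`) and the sample texts are hypothetical stand-ins chosen for illustration; they are not the tokenizers used in the paper.

```python
from typing import Callable, List, Sequence

Tokenizer = Callable[[str], List[str]]


def normalized_sequence_length(
    texts: Sequence[str], candidate: Tokenizer, reference: Tokenizer
) -> float:
    """Mean per-text ratio of candidate token count to reference token count."""
    ratios = []
    for text in texts:
        ref_len = len(reference(text))
        if ref_len == 0:
            continue  # skip texts that the reference maps to zero tokens
        ratios.append(len(candidate(text)) / ref_len)
    if not ratios:
        raise ValueError("no text produced reference tokens")
    return sum(ratios) / len(ratios)


# Hypothetical toy tokenizers, for illustration only.
def word_tokenizer(text: str) -> List[str]:
    return text.split()


def chunk_tokenizer(text: str) -> List[str]:
    # Fixed two-character chunks, acting as the "reference" here.
    return [text[i:i + 2] for i in range(0, len(text), 2)]


texts = ["ab cd", "abcd ef"]
nsl = normalized_sequence_length(texts, word_tokenizer, chunk_tokenizer)
print(f"NSL = {nsl:.4f}")  # values below 1 mean fewer tokens than the reference
```

An NSL below 1 indicates that the candidate tokenizer needs fewer tokens than the reference on the evaluated texts.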

Prevailing baselines for determining language-domain ratios include uniform mixing, heuristics derived from large pre-trained model vocabularies (e.g., LLaMA3, GPT-4o), or manual bucketed groupings. Large-scale searches, as exemplified by AdaptMix, are computationally intensive and generally infeasible for practitioners. Prior to TREX, no scalable, predictive method was available that could optimize the training mixture for the downstream objective of compression (Won et al., 20 Jan 2026).

2. The TREX Pipeline: Regression-based Data Mixture Optimization

The core of the TREX framework ("Tokenizer Regression for Optimal Data MiXture") is an efficient pipeline employing proxy experimentation and supervised regression to map mixtures to compression outcomes, thereby enabling cheap mixture optimization before committing to expensive large-scale tokenizer training.

The pipeline consists of four phases:

  1. Sampling Proxy Mixtures: Define the mixture simplex

$$\mathcal{W} = \bigl\{\,w=(w_1,\dots,w_L)\in\mathbb{R}^L_{\ge 0} \;\big|\; \sum_{i=1}^L w_i=1 \bigr\}$$

and sample $N$ candidate mixtures $\{w^{(i)}\}_{i=1}^N$ using a Dirichlet prior aligned with corpus sizes.

  2. Training Proxy Tokenizers and Measuring Compression: For each $w^{(i)}$, draw a small corpus ($S_s$ bytes), train a tokenizer (vocabulary size $V_s$), and evaluate NSL on a held-out test set $D_{\text{test}}$:

$$C(w^{(i)};S_s,V_s) = \frac{1}{N_{\text{test}}}\sum_{j=1}^{N_{\text{test}}} \frac{\mathrm{Len}(T_{w^{(i)}}(x_j))}{\mathrm{Len}(T_{\text{ref}}(x_j))}$$

where $T_{w^{(i)}}$ is the proxy tokenizer, $T_{\text{ref}}$ is a reference tokenizer (e.g., GPT-4o), and $x_j$ ranges over the $N_{\text{test}}$ texts in $D_{\text{test}}$.

  3. Regression Model Fitting: Train a LightGBM regressor $f:\mathcal{W}\rightarrow \mathbb{R}$ with $f(w)\approx C(w;S_s,V_s)$, using loss functions such as squared error or Mean Absolute Percentage Error (MAPE).
  4. Mixture Optimization: Search over $\mathcal{W}$ for

$$w^* = \arg\min_{w\in\mathcal{W}} f(w)$$

using gradient-free techniques. The final tokenizer is then trained at ($S_f$, $V_f$) using $w^*$.

This pipeline replaces costly large-scale mixture sweeps with a regression-driven search that takes on the order of minutes to hours (Won et al., 20 Jan 2026).
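The four phases can be sketched end to end in a few lines. This is only an illustration under loud assumptions: `proxy_compression` is a synthetic stand-in for "train a proxy tokenizer and measure NSL", a small k-nearest-neighbour regressor replaces LightGBM so that the example needs only the standard library, and random search over the simplex is one possible gradient-free technique. The names `TARGET`, `fit_knn`, and the sample counts are hypothetical.

```python
import random

TARGET = [0.5, 0.3, 0.2]  # hypothetical "best" mixture of the synthetic objective


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex by normalizing Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def proxy_compression(w):
    """Synthetic stand-in for training a proxy tokenizer and measuring NSL."""
    return 0.85 + sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))


def fit_knn(X, y, k=5):
    """k-nearest-neighbour regressor (a stand-in for LightGBM, not the paper's model)."""
    def predict(w):
        scored = sorted(
            (sum((a - b) ** 2 for a, b in zip(w, x)), yi) for x, yi in zip(X, y)
        )
        nearest = scored[:k]
        return sum(yi for _, yi in nearest) / len(nearest)
    return predict


rng = random.Random(0)
alpha = [1.0, 1.0, 1.0]  # a flat prior; the text describes one aligned with corpus sizes

# Phase 1: sample proxy mixtures.
mixtures = [sample_dirichlet(alpha, rng) for _ in range(300)]
# Phase 2: measure compression for each proxy mixture (synthetic here).
scores = [proxy_compression(w) for w in mixtures]
# Phase 3: fit the surrogate regressor f(w).
f = fit_knn(mixtures, scores)
# Phase 4: gradient-free search, here plain random search over the simplex.
candidates = [sample_dirichlet(alpha, rng) for _ in range(2000)]
w_star = min(candidates, key=f)

print("selected mixture:", [round(x, 3) for x in w_star])
```

In a real run, phase 2 dominates the cost, which is why the proxies are kept small; the surrogate and the search are cheap by comparison.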

3. Algorithmic Formulation and Loss Structure

The TREX optimization adheres to a supervised regression paradigm:

  • Inputs: Small training corpus size $S_s$, vocabulary size $V_s$, number of proxy mixtures $N$.
  • Sampling: $\{w^{(i)}\}_{i=1}^N\sim\mathrm{Dirichlet}(\alpha)$.
  • For each $i$: Build tokenizer $T_{w^{(i)}}$, compute $y^{(i)}=C(w^{(i)};S_s,V_s)$.
  • Regression: Train LightGBM on $\{(w^{(i)},y^{(i)})\}$ to minimize

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \bigl[f(w^{(i)})-y^{(i)}\bigr]^2 \quad\text{or}\quad \mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^N \left|\frac{f(w^{(i)})-y^{(i)}}{y^{(i)}}\right|$$

  • Search: $w^*=\arg\min_{w\in\mathcal{W}}f(w)$.
  • Finalization: Train tokenizer at $(S_f, V_f)$ with $w^*$.

TREX achieves sub-2% MAPE and Spearman $\rho\ge 0.97$ on held-out mixtures, demonstrating high predictive fidelity and maintaining mixture ranking ("rank invariance") from proxy to full scale (Won et al., 20 Jan 2026).
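The evaluation quantities named above (squared error, MAPE, and Spearman rank correlation) can be computed as in the sketch below. The four predicted and measured NSL values are made-up illustrative numbers, not results from the paper, and the helper names are hypothetical.

```python
def mse(pred, true):
    """Mean squared error between predictions and measurements."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def mape(pred, true):
    """Mean absolute percentage error, in percent."""
    return 100.0 / len(true) * sum(abs((p - t) / t) for p, t in zip(pred, true))


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Illustrative numbers only (not from the paper).
measured = [0.90, 0.88, 0.87, 0.92]
predicted = [0.91, 0.87, 0.88, 0.93]

print(f"MSE      = {mse(predicted, measured):.6f}")
print(f"MAPE     = {mape(predicted, measured):.3f}%")
print(f"Spearman = {spearman(predicted, measured):.3f}")
```

Spearman correlation is the relevant check for rank invariance, since mixture selection only needs the surrogate to order mixtures correctly, not to match absolute NSL values.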

4. Experimental Framework and Results

TREX was benchmarked using:

  • Datasets: FineWeb2-HQ (19 languages), FLORES-200 (out-of-distribution, OOD), Pile (medical domain), Global MMLU (16 languages).
  • Scales: From $(S_s=1\,\mathrm{GB},\,V_s=64\mathrm{k})$ proxies to large scale $(S_f=30\,\mathrm{GB},\,V_f=200\mathrm{k})$.
  • Baselines: Uniform mixture, language-bucket heuristic (Abagyan et al.), GPT-4o and LLaMA3 vocabulary proportions.

Key outcomes:

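| Mixture | In-Distribution NSL | OOD (FLORES) NSL | Non-Latin NSL |
|---|---|---|---|
| TREX ($w^*$) | 0.871 | 0.877 | 0.814 |
| Uniform | 0.888 | 0.904 | 0.848 |

LLaMA3: 0.907, 0.863 (only two of the three values are available; the columns they belong to are not specified).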

Relative to uniform mixing, TREX lowers NSL by 1.7 points in-distribution (0.888 to 0.871; reported as a 12.2% relative improvement), by 2.7 points OOD (0.904 to 0.877), and by 3.4 points on non-Latin scripts (0.848 to 0.814), the largest of the three gains. Ablations confirm that mixture rankings are preserved from proxies to full scale (Spearman $\rho\ge 0.96$) and that the approach transfers across domains (medical domain: MAPE 0.921, $\rho=0.981$) (Won et al., 20 Jan 2026).
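The point differences quoted from the table follow from simple subtraction, as the short check below shows. It uses only the TREX and Uniform rows of the table; the dictionary keys are hypothetical labels, and the snippet does not attempt to reproduce the reported relative figure.

```python
# NSL values copied from the results table (lower is better).
nsl = {
    "TREX": {"in_dist": 0.871, "ood": 0.877, "non_latin": 0.814},
    "Uniform": {"in_dist": 0.888, "ood": 0.904, "non_latin": 0.848},
}

# Difference in NSL points (NSL x 100); negative means TREX uses fewer tokens.
delta_points = {
    split: round((nsl["TREX"][split] - nsl["Uniform"][split]) * 100, 1)
    for split in nsl["TREX"]
}

for split, delta in delta_points.items():
    print(f"{split:>10}: {delta:+.1f} points")
```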

5. Data Mixture Inference via Tokenizer Structure

A complementary direction uses TREX-style regression to infer the data mixture of an already trained tokenizer, leveraging the ordered BPE merge list as a signal (Hayase et al., 2024). For $n$ categories (languages, domains), the observed BPE merge sequence $m(1),\dots,m(M)$ encodes implicit mixture information:

  • For held-out samples from each category $D_i$, simulate the first $T$ merges and track feature vectors $f_i^{(t)}(p)$ (the frequency of pair $p$ after $t-1$ merges).
  • For unknown mixture weights $w\in\mathbb{R}^n_{\ge 0}$, formulate the LP:

$$\min_{w,v,u} \; \sum_{t=1}^T v_t + \sum_{t=1}^T \sum_{p\neq m(t)} u_{t,p}$$

subject to $w_i\ge 0$, $\sum_i w_i=1$, $v_t,u_{t,p}\ge 0$, and

$$F_{m(t)}(w,t) + v_t \ge F_p(w,t) - u_{t,p}$$

where $F_p(w,t)=\sum_{i=1}^n w_i f_i^{(t)}(p)$.
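The constraint structure can be illustrated without an LP solver by measuring how badly a candidate $w$ violates the merge-order constraints. The sketch below fixes $v_t=0$ and sets $u_{t,p}=\max(0, F_p - F_{m(t)})$, which is a feasible slack assignment and therefore gives an upper bound on the LP objective at that $w$, not the LP optimum itself. The pair counts, the two categories, and the merge list are invented toy data, and a coarse grid scan stands in for the solver.

```python
# Toy pair frequencies f_i^{(t)}(p): COUNTS[t][i][p] for step t and category i.
# All numbers are hypothetical and chosen only to make the example small.
COUNTS = [
    [  # before merge 1
        {"ab": 10, "cd": 2, "ef": 1},  # category 0
        {"ab": 1, "cd": 8, "ef": 9},   # category 1
    ],
    [  # before merge 2 (pair "ab" already merged)
        {"cd": 2, "ef": 1},
        {"cd": 8, "ef": 9},
    ],
]
MERGES = ["ab", "cd"]  # observed merge order m(1), m(2)


def mixed_frequency(w, t, pair):
    """F_p(w, t) = sum_i w_i * f_i^{(t)}(p)."""
    return sum(wi * COUNTS[t][i].get(pair, 0) for i, wi in enumerate(w))


def violation(w):
    """Total hinge violation with v_t = 0 (upper bound on the LP objective at w)."""
    total = 0.0
    for t, chosen in enumerate(MERGES):
        f_chosen = mixed_frequency(w, t, chosen)
        pairs = set().union(*(c.keys() for c in COUNTS[t]))
        for pair in pairs:
            if pair != chosen:
                total += max(0.0, mixed_frequency(w, t, pair) - f_chosen)
    return total


# Scan the one-dimensional simplex for two categories on a coarse grid.
grid = [i / 10 for i in range(11)]
consistent = [w0 for w0 in grid if violation((w0, 1 - w0)) <= 1e-12]

print("violation at (0.7, 0.3):", violation((0.7, 0.3)))
print("violation at (0.2, 0.8):", round(violation((0.2, 0.8)), 3))
print("weights of category 0 consistent with the merge order:", consistent)
```

With only two merges the consistent region is wide; each additional observed merge adds constraints, which is how a long merge list can narrow the feasible mixtures.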

Empirically, this approach recovers mixture weights with accuracy 3 to 6 orders of magnitude better than random baselines. For instance, LLaMA3's tokenizer is inferred to be 48.5% non-English and 30.2% code; GPT-4o's is inferred to be 39% non-English (Hayase et al., 2024).

6. Discussion, Critique, and Theoretical Considerations

Predictive Power: TREX's regression mapping is robust, achieving $\rho\approx 0.98$ correlation with true compression rankings and outperforming all tested mixture heuristics empirically.

Efficiency: The method incurs a one-time cost of approximately 512 proxy tokenizers (∼1GB each), saving 41–60 GPU-hours and reducing LLM training duration by approximately 200 hours by lowering sequence lengths. In contrast, AdaptMix requires roughly 20 full-scale tokenizer trainings (Won et al., 20 Jan 2026).

Limitations: Rank invariance is validated from 1GB/64K to 30GB/200K scales, but not beyond 500K vocabs or 100GB corpora. Results are reported for 19 Indo-European languages; extreme typological variation may not obey the same mapping. TREX is optimized for NSL-based compression, not direct downstream task performance.

Potential Extensions: The regression framework can be extended jointly over corpus size, vocabulary size, and mixture ($f(S,V,w)$), applied to domain-specialized mixtures, or analyzed for transferability across script and typological groups. For inference, higher-order feature extraction or adaptation to other subword tokenizers (e.g., SentencePiece) is suggested (Won et al., 20 Jan 2026, Hayase et al., 2024).

7. Applications and Impact

TREX is directly applicable to building efficient and equitable multilingual LLM tokenizers, providing a principled mechanism for mixture selection and observable gains in both in-distribution and OOD settings. Inference applications reveal the composition of production-scale tokenizers, elucidating proprietary design decisions and the distributional makeup of pretraining corpora.

The framework establishes a scalable, regression-based paradigm for data mixture optimization and analysis, with demonstrated precision, robustness, and broad generalizability across linguistic and domain boundaries (Won et al., 20 Jan 2026, Hayase et al., 2024).
