
Proportion-Dependent Multilingual Scaling Law

Updated 22 December 2025
  • Proportion-Dependent Multilingual Scaling Law is a framework that extends classic power-law scaling to model how token allocation proportions, model size, and data scale affect performance across languages.
  • It uses empirically fitted exponents and transfer coefficients to optimize data mixtures, enabling precise prediction of per-language loss and aggregate performance.
  • The law highlights emergent phenomena like the 'First-Parallel Leap' and transfer asymmetries, guiding optimal allocation even from small-scale pilots.

A Proportion-Dependent Multilingual Scaling Law (PDMSL) characterizes how the performance of large models trained on data spanning multiple languages (or language families) systematically varies as a function of the allocation proportions, model/data scale, and, in advanced formulations, the transfer/synergy between languages. These laws enable precise prediction of per-language or aggregate performance at arbitrary scales and corpus compositions, provide guidance for optimal data mixture design, and illuminate the limits and emergent properties of cross-lingual generalization.

1. Formal Definitions and Universal Formulations

Proportion-dependent scaling laws extend classic power-law scaling (e.g., Chinchilla) by introducing explicit dependence on the sampling proportions allocated to each language or group. A generic law for the test loss $L_f$ of language (or family) $f$, with non-embedding model size $N$, total data $D$, and $p_f$ the proportion of tokens drawn from $f$, is given by:

$$L_f(N, D, p_f) = \left(E_f + \frac{A_f}{N^{\alpha_f}} + \frac{B_f}{D^{\beta_f}}\right) p_f^{-\gamma_f}$$

where $E_f$, $A_f$, $B_f$, $\alpha_f$, $\beta_f$, and $\gamma_f$ are empirically fitted per family or language. The exponent $\gamma_f > 0$ controls the sensitivity of loss to the allocation proportion $p_f$; $p_f$ is estimated as the fraction of tokens sampled for family $f$ out of the total $D$.
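
As a concrete illustration, the following sketch evaluates this law for one family, assuming the constants $E_f$, $A_f$, $B_f$, $\alpha_f$, $\beta_f$, $\gamma_f$ have already been fitted; the numerical values below are placeholders, not published fits.

```python
def family_loss(N, D, p_f, E_f, A_f, B_f, alpha_f, beta_f, gamma_f):
    """Proportion-dependent loss for one language family:
    L_f(N, D, p_f) = (E_f + A_f / N**alpha_f + B_f / D**beta_f) * p_f**(-gamma_f)."""
    scale_term = E_f + A_f / N**alpha_f + B_f / D**beta_f
    return scale_term * p_f**(-gamma_f)

# Placeholder constants for a hypothetical language family (not fitted values).
params = dict(E_f=1.7, A_f=400.0, B_f=1200.0, alpha_f=0.34, beta_f=0.28, gamma_f=0.07)

# Predicted loss at N = 1e9 non-embedding parameters and D = 1e11 tokens,
# when the family receives 20% of the sampled tokens.
print(family_loss(N=1e9, D=1e11, p_f=0.2, **params))
```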

Multiple studies (multilingual language modeling (He et al., 2024), code LLMs (Yang et al., 15 Dec 2025), NMT (Fernandes et al., 2023), reasoning models (Yang et al., 2 Oct 2025)) empirically validate this structure, with variations in details reflecting differing degrees of cross-lingual transfer and model domain.

2. Family- and Language-Wise Laws and Optimal Allocation

A key empirical observation is that, when languages are coherently grouped (e.g., by linguistic family), the loss for each group often depends only on its own proportion, not that of unrelated groups. Under this minimal-transfer regime, each $L_f$ decouples, allowing for straightforward prediction and optimization:

$$L_\text{tot} = \sum_{f=1}^n w_f L_f(N, D, p_f)$$

The optimal allocation problem (given weights $w_f$ for group importance) is:

$$\min_{p \in \Delta_n}\; L_\text{tot} = \sum_{f=1}^{n} w_f L_f^\star(N,D)\, p_f^{-\gamma_f} \quad \text{s.t.}\quad \sum_{f} p_f = 1$$

For small $\gamma_f$, the proportions that minimize the loss subject to the simplex constraint are approximately:

$$p_f^* \approx \frac{w_f L_f^\star(N,D)\, \gamma_f}{\sum_{i} w_i L_i^\star(N,D)\, \gamma_i}$$

This formula is robust across model scales, enabling the use of optimal sampling ratios derived from small models in large-scale pretraining (He et al., 2024). If $w_f = 1/L_f^\star$, then $p_f^* \propto \gamma_f$ and the allocation is scale-invariant.
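
A minimal sketch of this closed-form allocation, using hypothetical fitted values for three families; only the functional form comes from the law above, the numbers are illustrative.

```python
def optimal_proportions(gammas, L_stars, weights):
    """Closed-form approximation p_f* ∝ w_f * L_f^*(N, D) * gamma_f,
    valid for small gamma_f (He et al., 2024)."""
    raw = [w * L * g for w, L, g in zip(weights, L_stars, gammas)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical fitted values for three language families.
gammas  = [0.05, 0.08, 0.10]   # proportion exponents gamma_f
L_stars = [2.1, 2.6, 3.4]      # scale-dependent prefactors L_f^*(N, D)
weights = [1.0, 1.0, 1.0]      # importance weights w_f

print(optimal_proportions(gammas, L_stars, weights))
# With w_f = 1 / L_f^*, the allocation reduces to p_f* ∝ gamma_f (scale-invariant).
```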

3. Empirical Results and Implications

Table: Empirical Exponents and Impact Across Domains

| Domain | Law Structure | Key Exponents/Synergy | Notable Findings |
|---|---|---|---|
| Multilingual LMs (He et al., 2024) | $L_f = L_f^\star \, p_f^{-\gamma_f}$ | $\gamma_f \sim 0.05$–$0.1$ | Proportion-independent exponents; optimal $p_f^*$ from small models |
| Multilingual Code LLMs (Yang et al., 15 Dec 2025) | $\mathcal{L}(N,D;p) = A N^{-\alpha_N(p)} + B D_x^{-\alpha_D(p)} + L_\infty(p)$ | $\alpha_N^k$, $\alpha_D^k$, $\tau_{ij}$ (synergy) | Token allocation guided by per-language exponents and pairwise synergy |
| NMT (Fernandes et al., 2023) | $L_i(N;w_i) = C_i(w_i) N^{-\alpha_i} + L_\infty^{(i)}$ | $C_i(w_i) \propto w_i^{-\gamma_i}$ | Same $\alpha_i$ for all mixes; $f_i(w_i)$ gives effective capacity split |
| Reasoning LRM (Yang et al., 2 Oct 2025) | $P(X) = \alpha X^{\beta}$ | $\beta = 0.29$ (MTI) | Marked "First-Parallel Leap" with diminishing marginal returns |
| Speech (ASR/ST) (Chen et al., 14 Feb 2025) | $Y(N,D,C) = c + A_N N^{-\alpha} + A_D D^{-\beta} + A_C C^{-\gamma}$ | $\alpha \sim 0.19$ | Scaling aids low-resource languages; no explicit $p$-dependence modeled |

Empirical studies confirm that allocating a higher $p_f$ enhances loss reduction for that family, but with strongly sublinear effects due to small $\gamma_f$. Crucially, allocating tokens using the computed $p_f^*$ (rather than uniform or raw-data-proportional allocation) yields measurable aggregate gains and fairness, especially since family-level $\gamma_f$ is easy to estimate in small-scale pilots (He et al., 2024).

In code LLMs, cross-lingual synergy is explicitly modeled by augmenting the data budget $D_x$:

$$D_{x} = D_{\rm all}\left(1 + \gamma \sum_{i \neq j} p_i p_j \tau_{ij}\right)$$

Positive $\tau_{ij}$ (e.g., Java/C# or JavaScript/TypeScript) amplifies the benefit of proportionally increasing both languages. Such second-order corrections are validated by direct experiment (Yang et al., 15 Dec 2025) and outperform uniform allocations under fixed compute.
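
A short sketch of the synergy-augmented budget; the Java/C# pairing mirrors the example above, but the proportions, $\gamma$, and $\tau_{ij}$ values are assumptions.

```python
def effective_code_tokens(D_all, proportions, tau, gamma=1.0):
    """Synergy-augmented data budget from the code-LLM law:
    D_x = D_all * (1 + gamma * sum_{i != j} p_i * p_j * tau_ij)."""
    langs = list(proportions)
    synergy = sum(
        proportions[i] * proportions[j] * tau.get((i, j), 0.0)
        for i in langs for j in langs if i != j
    )
    return D_all * (1.0 + gamma * synergy)

# Hypothetical proportions and pairwise synergy coefficients.
p = {"java": 0.4, "csharp": 0.3, "python": 0.3}
tau = {("java", "csharp"): 0.6, ("csharp", "java"): 0.6}  # positive-synergy pair

print(effective_code_tokens(D_all=5e10, proportions=p, tau=tau))
```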

In NMT (Fernandes et al., 2023), mixture effects manifest almost purely in the coefficient, with the same power-law exponent $\alpha_i$ regardless of $w_i$. Effective capacity splits are accurately predicted via $f_i(w_i) = N_\text{eff}^{(i)} / N$. Directional effects persist, with significant positive synergy for many-to-English setups and neutrality for English-to-many.
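
Under the stated functional form, an effective capacity fraction can be read off analytically. The sketch below assumes the monolingual baseline corresponds to $w_i = 1$, so that equating loss terms gives $f_i(w_i) = w_i^{\gamma_i/\alpha_i}$; this is an illustrative derivation under those assumptions, not the paper's exact estimator, and the parameter values are hypothetical.

```python
def nmt_pair_loss(N, w_i, C0_i, gamma_i, alpha_i, L_inf_i):
    """L_i(N; w_i) = C_i(w_i) * N**(-alpha_i) + L_inf_i, where the mixture weight
    enters only through the coefficient C_i(w_i) = C0_i * w_i**(-gamma_i)."""
    return C0_i * w_i**(-gamma_i) * N**(-alpha_i) + L_inf_i

def effective_capacity_fraction(w_i, gamma_i, alpha_i):
    """f_i(w_i) = N_eff^(i) / N, assuming C_i(1) = C0_i for the monolingual baseline:
    C0_i * w_i**(-gamma_i) * N**(-alpha_i) = C0_i * N_eff**(-alpha_i)
    =>  N_eff = N * w_i**(gamma_i / alpha_i)."""
    return w_i ** (gamma_i / alpha_i)

# Hypothetical parameters: a language pair receiving 30% of the mixture.
print(effective_capacity_fraction(w_i=0.3, gamma_i=0.2, alpha_i=0.35))
```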

4. The "First-Parallel Leap" and Monolingual Generalization Gap

In LRM cross-lingual reasoning (Yang et al., 2 Oct 2025), the transition from monolingual to even a single additional parallel language produces a disproportionate gain ("First-Parallel Leap"). For example, transferability (MTI) jumps from 1.16 (X=1) to 2.50 (X=2), greatly exceeding the incremental benefits per added language thereafter.

Simultaneously, the monolingual generalization gap is defined as the shortfall between the observed monolingual metric and its power-law extrapolation from the multilingual regime:

$$\text{Gap}_t = P_t(1)_{\mathrm{predicted}} - \mathrm{MTI}_{\mathrm{actual}}$$

This gap quantifies suboptimal transfer from monolingual-only training and persists across both accuracy and generalization metrics. The phenomenon is robust to task and model, indicating a structural limitation of monolingual pretraining with respect to cross-lingual reasoning.
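
The gap can be computed by fitting the power law $P(X) = \alpha X^{\beta}$ on the multilingual regime ($X \geq 2$) and extrapolating back to $X = 1$. In the sketch below, the $X = 2$ and monolingual MTI values come from the text above, while the remaining points are hypothetical.

```python
import numpy as np

def fit_power_law(X, P):
    """Fit P(X) = alpha * X**beta by least squares in log-log space."""
    beta, log_alpha = np.polyfit(np.log(X), np.log(P), 1)
    return np.exp(log_alpha), beta

# MTI (transferability) at X >= 2 parallel languages; X = 2 value from the text,
# the rest are illustrative placeholders.
X = np.array([2, 4, 8, 16])
mti = np.array([2.50, 3.05, 3.72, 4.55])

alpha, beta = fit_power_law(X, mti)
predicted_monolingual = alpha * 1**beta   # power-law extrapolation to X = 1
gap = predicted_monolingual - 1.16        # observed monolingual MTI from the text
print(f"beta={beta:.2f}, predicted P(1)={predicted_monolingual:.2f}, gap={gap:.2f}")
```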

5. Incorporating Cross-Lingual Transfer and Synergy

Advanced scaling laws such as ATLAS (Longpre et al., 24 Oct 2025) and the code LLM law (Yang et al., 15 Dec 2025) explicitly inject language-pair transfer through empirical transfer matrices or synergy coefficients. In ATLAS, the effective data budget is constructed as a sum over the target language, the top-$k$ transfer languages, and the remaining languages, weighted by learned $\tau_\ell$ and saturated for repeated tokens:

$$\mathcal{D}_{\rm eff} = \sum_{\ell} \tau_\ell\, \mathcal{S}_\lambda(D_\ell; U_\ell)$$

Transfer matrices, empirically measured (e.g., via BTS, the Bilingual Transfer Score, in ATLAS), guide both the initial $\tau_\ell$ assignment and the optimal selection of $\mathcal{K}_t$ (high-benefit transfer groups). These refinements yield greatly improved and robust held-out $R^2$ (e.g., $R^2 = 0.89$ for model size, $R^2 = 0.96$ for data, $R^2 = 0.82$ for unseen mixtures), far surpassing classic Chinchilla-style monolingual or uniform-multilingual laws (Longpre et al., 24 Oct 2025).
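
A schematic of the effective-data computation follows. The exact saturating transform $\mathcal{S}_\lambda$ used by ATLAS is not reproduced here, so the sketch substitutes a simple exponential saturation, and all token counts and $\tau_\ell$ values are assumptions.

```python
import math

def saturate(D, U, lam=1.0):
    """Stand-in for S_lambda(D; U): approaches a ceiling tied to the unique-token
    count U as repeated tokens D grow. Exponential saturation is an illustrative
    assumption, not the form fitted in ATLAS."""
    return U * (1.0 - math.exp(-lam * D / U))

def effective_data(token_counts, unique_counts, transfer):
    """D_eff = sum_l tau_l * S_lambda(D_l; U_l), with tau_l the learned transfer
    coefficient of language l toward the target language."""
    return sum(
        transfer[lang] * saturate(token_counts[lang], unique_counts[lang])
        for lang in token_counts
    )

# Hypothetical corpus sizes (tokens), unique-token counts, and transfer coefficients.
D   = {"de": 4e10, "fr": 3e10, "target": 6e10}
U   = {"de": 2e10, "fr": 2e10, "target": 3e10}
tau = {"de": 0.35, "fr": 0.25, "target": 1.0}

print(f"{effective_data(D, U, tau):.3e}")
```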

6. Practical Guidelines and Limitations

Practical design of pretraining mixtures involves:

  • Estimating per-family or per-language $\gamma_f$ (or Chinchilla exponents for code), e.g., from small-scale pilot runs as in the sketch after this list;
  • Computing $p_f^*$ for target resource and model scales;
  • Adjusting raw allocations to encourage synergy between high-transfer pairs;
  • Validating performance under constraints (e.g., held-out loss, fairness objectives).
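
A minimal sketch of the first step: fitting $\gamma_f$ and $L_f^\star(N,D)$ from a pilot sweep at fixed $(N, D)$, using the log-linear form $\log L_f = \log L_f^\star - \gamma_f \log p_f$. The pilot losses below are illustrative, not measured values.

```python
import numpy as np

def fit_gamma(proportions, losses):
    """Estimate gamma_f and L_f^*(N, D) from pilot runs at a fixed (N, D):
    log L_f = log L_f^* - gamma_f * log p_f, i.e. a line in log-log space."""
    slope, intercept = np.polyfit(np.log(proportions), np.log(losses), 1)
    return -slope, float(np.exp(intercept))  # (gamma_f, L_f_star)

# Hypothetical pilot sweep: losses of one family at several sampling proportions.
p_f    = np.array([0.05, 0.10, 0.20, 0.40])
loss_f = np.array([3.10, 2.96, 2.82, 2.69])

gamma_f, L_f_star = fit_gamma(p_f, loss_f)
print(f"gamma_f ≈ {gamma_f:.3f}, L_f* ≈ {L_f_star:.2f}")
```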

Scaling laws fitted on small models generalize with high fidelity to much larger scales (He et al., 2024), allowing rapid prototyping and efficient resource investment. However, limitations include:

  • Assumptions of negligible cross-family transfer (violated for linguistically incoherent groupings) (He et al., 2024);
  • Static corpus composition—adaptive, curriculum, or staged sampling is not addressed;
  • Synergy and transfer coefficients (Yang et al., 15 Dec 2025; Longpre et al., 24 Oct 2025) are corpus- and architecture-dependent and may not extrapolate to domain-specific or low-resource languages without recalibration;
  • For vision-language or speech settings, results demonstrate classic scaling but lack finely resolved $p$-dependent experiments (Spravil et al., 12 Mar 2025; Chen et al., 14 Feb 2025).

7. Broader Impact, Generalization, and Frontier Directions

Proportion-dependent multilingual scaling laws establish a robust theoretical and empirical foundation for principled multilingual model construction. They unify the optimization of model/data scale with data mixing, span domains (text, code, speech, and reasoning), and imbue the model design process with predictable trade-offs:

  • Raising representation for high-utility or underrepresented groups while avoiding over-investment in redundant, fast-saturating languages;
  • Harnessing and quantifying cross-lingual transfer for maximum resource efficiency;
  • Quantitatively illuminating the phenomenon and magnitude of transfer asymmetries, fairness gains for low-resource contexts, and emerging capabilities at scale.

A plausible implication is that future extensions will integrate dynamic or task-adaptive data allocation, further refine transfer/synergy modeling, and generalize to multi-modal or multi-task settings with proportionally resolved scaling, catalyzing continued advances in equitable and efficient multilingual AI.


Key References:

(Yang et al., 2 Oct 2025) (Parallel Scaling Law for LRMs), (He et al., 2024) (Multilingual LM Scaling), (Yang et al., 15 Dec 2025) (Multilingual Code LLMs), (Fernandes et al., 2023) (Multilingual NMT Scaling), (Longpre et al., 24 Oct 2025) (ATLAS cross-lingual scaling), (Chen et al., 14 Feb 2025) (Multilingual Speech Scaling).
