
Multilingual Scaling Law

Updated 9 June 2026
  • Multilingual Scaling Law is a framework that quantifies model performance as a function of model size, data volume, and language mixture ratios across diverse domains.
  • It generalizes monolingual power-law models by incorporating mixture-dependent terms, cross-lingual transfer coefficients, and game-theoretic valuations to capture capacity and data challenges.
  • Empirical findings reveal that adaptive sampling and targeted parameter increases can mitigate the multilingual capacity tax, optimizing performance across language and domain variations.

A multilingual scaling law defines how performance, loss, or error metrics of models—whether LLMs, translation systems, speech recognizers, code LLMs, or multimodal models—depend on the interplay of model size, data size, and the linguistic composition of the training data in settings where multiple languages (or language families, programming languages, etc.) are present. Multilingual scaling laws generalize the monolingual “Chinchilla” or power-law scaling laws by explicitly modeling how language mixture ratios, inter-language transfer, and the curse of multilinguality modulate scaling exponents, irreducible losses, and effective gains from additional parameters or tokens.

1. Mathematical Forms of Multilingual Scaling Laws

Multilingual scaling laws typically extend the monolingual law:

$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

by introducing mixture-dependent terms, cross-lingual transfer coefficients, and allocation strategies. Representative formulations include:

  • Mixture ratio-based law (language families):

$$L_i(N, D, p_i) = \left( E_i + \frac{A_i}{N^{\alpha_i}} + \frac{B_i}{D^{\beta_i}} \right) \cdot p_i^{-\gamma_i}$$

where $p_i$ is the sampling ratio of family $i$ and $\gamma_i$ is the power-law decay rate with respect to that sampling ratio (He et al., 2024).

  • Effective data via transfer matrix (ATLAS):

$$\mathcal{L}_t(N, \mathcal{D}_\mathrm{eff}) = E + \frac{A}{N^\alpha} + \frac{B}{(\mathcal{D}_\mathrm{eff})^\beta}$$

with

$$\mathcal{D}_\mathrm{eff} = \mathcal{S}_\lambda(D_t; U_t) + \sum_{i \in \mathcal{K}_t} \tau_i \, \mathcal{S}_\lambda(D_i; U_i) + \tau_\mathrm{other} \, \mathcal{S}_\lambda(D_\mathrm{other}; U_\mathrm{other})$$

where $\mathcal{S}_\lambda$ is a saturation function reflecting epoch-driven diminishing returns, and the $\tau_i$ are cross-lingual transfer coefficients (Longpre et al., 24 Oct 2025).

  • Game-theoretic law (ShapleyLaw):

$$L_j(N, D, p) = E_j + \left( \frac{A_j}{N^{\alpha_j}} + \frac{B_j}{D^{\beta_j}} \right) \cdot (\Theta_j)^{-\gamma_j}$$

with $\Theta_j$ defined through normalized Shapley values that quantify transfer from one language to another (Cao et al., 18 Mar 2026).

  • Law for code LLMs with pairwise synergies:

[equation not recoverable from the extracted source]

where the exponent and intercept terms are weighted by the language mixture (Yang et al., 15 Dec 2025).

  • Capacity and data taxes as a function of the number of languages (ATLAS):

[equation not recoverable from the extracted source]

with a capacity penalty exponent and a mild data-efficiency gain on the per-language data term (Longpre et al., 24 Oct 2025).

These forms enable practitioners to predict per-language/family loss as a function of language allocation, model/data scale, and cross-lingual transfer structure.
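As a minimal sketch of how the family-level mixture law can be evaluated, the snippet below assumes the additive Chinchilla form $E_i + A_i/N^{\alpha_i} + B_i/D^{\beta_i}$ for the bracketed term, scaled by $p_i^{-\gamma_i}$. The class name `FamilyLaw`, the function `family_loss`, and every numeric constant are hypothetical and chosen only for illustration; they are not fitted values from He et al. (2024).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyLaw:
    """Constants for one language family (all values here are hypothetical)."""
    E: float      # irreducible loss
    A: float      # model-size coefficient
    alpha: float  # model-size exponent
    B: float      # data-size coefficient
    beta: float   # data-size exponent
    gamma: float  # decay rate with respect to the sampling ratio


def family_loss(law: FamilyLaw, n_params: float, n_tokens: float, p: float) -> float:
    """L_i(N, D, p_i) = (E_i + A_i / N^alpha_i + B_i / D^beta_i) * p_i^(-gamma_i)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("sampling ratio must lie in (0, 1]")
    base = law.E + law.A / n_params ** law.alpha + law.B / n_tokens ** law.beta
    return base * p ** (-law.gamma)


# Illustrative constants only; they are not taken from any paper.
romance = FamilyLaw(E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28, gamma=0.10)

for ratio in (0.1, 0.2, 0.4):
    loss = family_loss(romance, n_params=1.2e9, n_tokens=5e10, p=ratio)
    print(f"p = {ratio:.1f} -> predicted loss {loss:.3f}")
```

Because $\gamma_i > 0$, the predicted loss falls as the family's sampling ratio rises, and the ratio between two predictions at the same $N$ and $D$ depends only on the two sampling ratios and $\gamma_i$.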

2. Cross-Lingual Transfer: Matrices, Synergies, and Taxonomies

Empirical work establishes that cross-lingual transfer can be quantitatively modeled and is highly non-uniform:

  • Transfer matrices: The ATLAS law defines a cross-lingual transfer matrix of empirical benefit scores (e.g., Bilingual Transfer Score, Finetuning Adaptation Score) measuring how co-training on a source language helps or hinders convergence for a target language. Script similarity and language family are the dominant predictors of positive transfer, with English being the highest-utility donor for 19 of 30 targets (Longpre et al., 24 Oct 2025).
  • Pairwise synergy: In code LLMs, pairwise transfer coefficients capture how the presence of one programming language boosts or diminishes effective data for another; balanced allocation among high-synergy pairs (e.g., Java–C#, JavaScript–TypeScript) is crucial (Yang et al., 15 Dec 2025).
  • Game-theoretic Shapley values: ShapleyLaw uses Shapley-value decomposition to attribute observed loss reduction to individual languages, correcting traditional scaling laws for the omitted variable bias introduced by cross-lingual transfer effects (Cao et al., 18 Mar 2026).
  • Scaling law independence and family granularity: In some cases (e.g., He et al., 2024), the validated hypothesis is that each family’s scaling curve is determined solely by its own sampling ratio, which justifies mixture optimization at the family level.
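The effective-data construction of the ATLAS law in Section 1 combines the target language's own tokens with transfer-weighted tokens from helper languages. The sketch below shows one way to compute it. The source does not give the form of the saturation function, so `saturate` uses an assumed stand-in (full credit for the first pass, exponentially diminishing credit for repeated epochs); the default `lam` and all transfer coefficients are likewise hypothetical.

```python
import math


def saturate(tokens_seen: float, unique_tokens: float, lam: float) -> float:
    """Assumed stand-in for S_lambda(D; U): repeated epochs earn diminishing credit."""
    if tokens_seen <= unique_tokens:
        return tokens_seen
    repeats = tokens_seen / unique_tokens - 1.0
    return unique_tokens * (1.0 + lam * (1.0 - math.exp(-repeats / lam)))


def effective_data(target, helpers, other, lam=15.0):
    """D_eff = S(D_t; U_t) + sum_i tau_i * S(D_i; U_i) + tau_other * S(D_other; U_other).

    target:  (tokens_seen, unique_tokens) for the target language
    helpers: list of (tau, tokens_seen, unique_tokens) for languages in K_t
    other:   (tau_other, tokens_seen, unique_tokens) for all remaining languages
    """
    d_eff = saturate(target[0], target[1], lam)
    for tau, seen, unique in helpers:
        d_eff += tau * saturate(seen, unique, lam)
    tau_other, seen, unique = other
    return d_eff + tau_other * saturate(seen, unique, lam)


# Hypothetical setup: target data repeated for two epochs, one strong helper language.
d_eff = effective_data(
    target=(2e9, 1e9),
    helpers=[(0.3, 5e9, 5e9)],
    other=(0.05, 2e10, 2e10),
)
print(f"effective tokens for the target language: {d_eff:.3e}")
```

The resulting value is what would be substituted for the data term in the loss formula; a negative `tau` would model a helper language that hinders the target.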

3. The Curse and Mitigation of Multilinguality

A consistent finding across multilingual scaling studies is that increasing the number of languages imposes a capacity “tax,” which shows up as degraded performance at fixed model size and fixed data per language:

  • In the ATLAS language-agnostic scaling law, the per-language loss picks up a penalty in the capacity term and a mild gain in the per-language data term as languages are added. Doubling the number of languages therefore calls for more parameters and somewhat fewer tokens per language, but substantially more total compute (Longpre et al., 24 Oct 2025).
  • Models smaller than 100M parameters can see a performance degradation of up to 30–50% when moving from monolingual to 50-language mixtures; for 2B–8B models, this shrinks to 5–10% (Longpre et al., 24 Oct 2025).
  • Practically, adding parameters is more effective than increasing tokens for mitigating the curse. Selective mixing and adaptive sampling based on empirically positive transfer relationships further mitigate capacity loss.
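To make the trade-off concrete, the sketch below works through the arithmetic under an assumed functional form, $L = E + A K^{\phi} / N^{\alpha} + B K^{\psi} / D^{\beta}$, with $K$ the number of languages and $D$ the tokens per language. The symbols $K$, $\phi$, $\psi$ and all constants are assumptions for illustration (a positive $\phi$ for the capacity penalty, a slightly negative $\psi$ for the data-efficiency gain); they are not the fitted ATLAS values.

```python
def multilingual_loss(k, n, d, *, E, A, B, alpha, beta, phi, psi):
    """Assumed form: E + A * K^phi / N^alpha + B * K^psi / D^beta (D = tokens per language)."""
    return E + A * k ** phi / n ** alpha + B * k ** psi / d ** beta


def iso_loss_multipliers(r, *, alpha, beta, phi, psi):
    """Multipliers that hold both reducible terms fixed when K becomes r * K."""
    n_mult = r ** (phi / alpha)      # parameters
    d_mult = r ** (psi / beta)       # tokens per language
    total_tokens_mult = r * d_mult   # r times as many languages
    compute_mult = n_mult * total_tokens_mult  # compute taken as proportional to N * total tokens
    return n_mult, d_mult, total_tokens_mult, compute_mult


# Hypothetical constants chosen for illustration, not fitted values.
consts = dict(E=1.5, A=300.0, B=500.0, alpha=0.3, beta=0.3, phi=0.10, psi=-0.04)
exps = {name: consts[name] for name in ("alpha", "beta", "phi", "psi")}

n_mult, d_mult, tok_mult, c_mult = iso_loss_multipliers(2.0, **exps)
before = multilingual_loss(10, 1e9, 1e10, **consts)
after = multilingual_loss(20, 1e9 * n_mult, 1e10 * d_mult, **consts)
print(f"params x{n_mult:.2f}, tokens/language x{d_mult:.2f}, "
      f"total tokens x{tok_mult:.2f}, compute x{c_mult:.2f}")
print(f"loss before {before:.4f}, after {after:.4f}")
```

Under these assumed exponents, doubling the language count needs more parameters and slightly fewer tokens per language, yet compute more than doubles, which matches the qualitative pattern described above.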

4. Optimal Data Allocation and Scaling Policy

Multilingual scaling laws provide actionable algorithms for mixture selection:

  • The optimal sampling ratio $p_i^*$ for language (or family) $i$ is given either by an explicit expression in the fitted constants for language-family laws (He et al., 2024), or by gradient-based optimization over the mixture simplex with cross-lingual transfer matrices (Yang et al., 15 Dec 2025, Cao et al., 18 Mar 2026).

  • Proportion-dependent code LLM laws recommend upweighting high-exponent, high-utility (rapidly scaling) languages (e.g., Python), maintaining strong synergy pairs, and allocating less to fast-saturating or low-utility languages (e.g., Rust, Go). Percentage-level adjustments away from the uniform mixture capture most of the attainable gain (Yang et al., 15 Dec 2025).
  • ATLAS delivers isoperformance curves: to keep loss unchanged when the number of languages is multiplied by a given factor, parameters and tokens per language are each rescaled by a power of that factor, with total token budget and compute increasing superlinearly in it (Longpre et al., 24 Oct 2025).
  • In practice, mixture ratios and transfer coefficients can be estimated at small scale, and reused at much larger scales with minimal loss of optimality (He et al., 2024, Cao et al., 18 Mar 2026).
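For the family-level law, where each family's loss depends only on its own sampling ratio, mixture selection reduces to a separable convex problem that can be solved directly. The sketch below minimizes a weighted sum $\sum_i w_i c_i p_i^{-\gamma_i}$ over the simplex by bisection on the Lagrange multiplier. This is a generic Lagrangian solution, not the closed form from He et al. (2024); the importance weights `weights`, the base losses `base_losses` (each family's loss at $p_i = 1$ for fixed $N$ and $D$), and all numbers are hypothetical.

```python
import math


def weighted_loss(p, weights, base_losses, gammas):
    """Objective: sum_i w_i * c_i * p_i^(-gamma_i)."""
    return sum(w * c * x ** (-g) for w, c, g, x in zip(weights, base_losses, gammas, p))


def optimal_ratios(weights, base_losses, gammas, iters=200):
    """Solve the stationarity condition w_i c_i gamma_i p_i^-(gamma_i + 1) = lambda
    for every family, choosing lambda by bisection so that the ratios sum to 1."""
    coeffs = [w * c * g for w, c, g in zip(weights, base_losses, gammas)]

    def ratios(log_lam):
        return [math.exp((math.log(k) - log_lam) / (1.0 + g))
                for k, g in zip(coeffs, gammas)]

    lo, hi = -50.0, 50.0  # bracket for log(lambda); the ratio sum decreases in lambda
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(ratios(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    p = ratios(0.5 * (lo + hi))
    total = sum(p)
    return [x / total for x in p]  # absorb the remaining rounding error


# Hypothetical three-family example.
weights = [1.0, 1.0, 1.0]
base_losses = [2.0, 2.5, 3.0]
gammas = [0.05, 0.10, 0.15]

p_star = optimal_ratios(weights, base_losses, gammas)
uniform = [1.0 / 3.0] * 3
print("optimal ratios:", [round(x, 3) for x in p_star])
print("objective at optimum:", weighted_loss(p_star, weights, base_losses, gammas))
print("objective at uniform:", weighted_loss(uniform, weights, base_losses, gammas))
```

When transfer between languages is modeled explicitly, the objective is no longer separable and a gradient-based method over the simplex is the more natural choice.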

5. Empirical Findings Across Language, Code, Speech, and Multimodal Domains

Multilingual scaling laws display broad empirical validity:

  • Language models: Family-level scaling laws ($\gamma_i$ typically below 0.15) fit the observed loss for 23 languages across five families, with predictive accuracy maintained from 85M to 1.2B parameters (He et al., 2024). Chinchilla-style scaling with mixture-adaptive exponent and intercept terms is validated in large studies covering more than 400 languages (Longpre et al., 24 Oct 2025).
  • Neural machine translation: Decoder-only models follow size and data Chinchilla laws, but the scaling exponent and irreducible loss differ substantially by language direction and domain. Capacity allocation among language pairs is well-modeled by effective-parameter laws, with mixture weights modulating only the multiplicative factor of the loss curve (Caillaut et al., 2024, Fernandes et al., 2023).
  • Code LLMs: Interpreted languages exhibit higher scaling exponents than compiled languages, reflecting greater scalability with model and data size. The fitted laws enable mixture optimization that yields 1–2 point absolute gains in Pass@1 and BLEU over a uniform mixture (Yang et al., 15 Dec 2025).
  • Speech: OWLS shows that WER and BLEU follow power laws in the scaling variables studied, with model size being the dominant axis, especially for low-resource languages and dialects; each doubling of scale typically buys a further WER reduction (Chen et al., 14 Feb 2025).
  • Vision-language models: The Florenz study demonstrates that cross-task, cross-language generalization, including zero-shot emergence in unseen languages, follows predictable power laws in which model size is the dominant factor when direct caption data is unavailable (Spravil et al., 12 Mar 2025).
  • Reasoning models: The parallel scaling law quantifies how cross-lingual generalization of reasoning abilities grows as a power law in the number of training languages, with rapidly diminishing returns beyond the “first-parallel leap” (Yang et al., 2 Oct 2025).
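Several of the findings above come down to estimating a power-law exponent from a handful of (scale, error) measurements. A minimal sketch, assuming a pure power law $y = a \, x^{-b}$ with no irreducible term, is an ordinary least-squares fit in log-log space. The function name `fit_power_law` and the synthetic data are hypothetical; a real fit would use measured losses or error rates and would usually include an irreducible-loss term as well.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope


# Synthetic, noise-free data generated from a known law (a = 50, b = 0.2).
sizes = [1e8, 2e8, 4e8, 8e8, 1.6e9]
errors = [50.0 * s ** (-0.2) for s in sizes]

a_hat, b_hat = fit_power_law(sizes, errors)
print(f"fitted a = {a_hat:.3f}, fitted exponent b = {b_hat:.3f}")
```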

6. Optimization, Practical Design, and Frontier Issues

State-of-the-art scaling laws support practical design choices and open new research directions:

  • Optimization: Multilingual data allocation is a convex optimization over the simplex, tractable even with 10–20 languages via Monte-Carlo or analytic gradient descent (Cao et al., 18 Mar 2026, Yang et al., 15 Dec 2025).
  • Transfer estimation: Cross-lingual transfer strengths (e.g., Shapley values, transfer matrices) are empirically stable across model and data scales, requiring only small-scale estimation.
  • Pretrain vs finetune: For given compute, ATLAS predicts when it is optimal to finetune from a multilingual checkpoint versus pretraining from scratch, with crossover thresholds depending on model size and available data (Longpre et al., 24 Oct 2025).
  • Limits: Validity of power-law/Chinchilla forms is empirically confirmed only within a factor of 10 around the largest model/data considered; reliable extrapolation beyond this range or to substantially different data distributions is unproven (Caillaut et al., 2024, Longpre et al., 24 Oct 2025).
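The Shapley-value attribution mentioned under transfer estimation can be illustrated on a toy coalition game. In the sketch below, the "value" of a set of training languages is the loss reduction it yields on a fixed target; the table of values and the language codes are hypothetical, and the code is the textbook Shapley computation rather than the specific estimator used in ShapleyLaw.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical loss reductions on one target language for each training subset.
table = {
    frozenset(): 0.00,
    frozenset({"en"}): 0.30,
    frozenset({"de"}): 0.20,
    frozenset({"sw"}): 0.05,
    frozenset({"en", "de"}): 0.40,
    frozenset({"en", "sw"}): 0.36,
    frozenset({"de", "sw"}): 0.26,
    frozenset({"en", "de", "sw"}): 0.45,
}
languages = ["en", "de", "sw"]

phi = shapley_values(languages, lambda subset: table[frozenset(subset)])
for lang in languages:
    print(f"{lang}: {phi[lang]:.4f}")
```

The attributions sum to the value of the full language set, which is the efficiency property that makes Shapley values suitable for dividing a joint gain. Exact enumeration grows factorially with the number of languages, so larger mixtures are normally handled by sampling permutations at random.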

7. Theoretical and Linguistic Underpinnings

Underlying these empirical multilingual scaling laws are foundational observations about vocabulary finiteness, statistical universality, and preferential attachment:

  • In natural language, the behavior of frequency-rank distributions and the growth in distinct units follow rich-get-richer dynamics, modulated by the effective size and novelty rate of the vocabulary (e.g., divergent Zipf exponent and three-stage Heaps’ law for finite-inventory writing systems) (Lu et al., 2012).
  • For word-rich languages, scaling exponents remain constant and power-laws hold across mixtures; for low-novelty or character-based systems, scaling can exhibit sharp regime changes (power-law to exponential decay in rank-frequency, saturation in new unit growth) (Lu et al., 2012).
  • These statistical phenomena mirror, at a higher level of abstraction, the emergence of cross-lingual transfer “taxes,” synergy, and mixture-dependent scaling in the deep learning era.

In summary, multilingual scaling laws provide a precise mathematical and empirical framework to forecast, optimize, and interpret performance trade-offs in models trained over multiple languages or language-like domains. They enable systematic exploration of mixture strategies, expose the computational cost of broadening language coverage, and clarify the capacity, transfer, and saturation effects central to next-generation universal modeling and AI democratization (Longpre et al., 24 Oct 2025, He et al., 2024, Yang et al., 15 Dec 2025, Cao et al., 18 Mar 2026, Fernandes et al., 2023, Caillaut et al., 2024).
