Proportion-Dependent Multilingual Scaling Law
- Proportion-Dependent Multilingual Scaling Law is a framework that extends classic power-law scaling to model how token allocation proportions, model size, and data scale affect performance across languages.
- It uses empirically fitted exponents and transfer coefficients to optimize data mixtures, enabling precise prediction of per-language loss and aggregate performance.
- The law highlights emergent phenomena like the 'First-Parallel Leap' and transfer asymmetries, guiding optimal allocation even from small-scale pilots.
A Proportion-Dependent Multilingual Scaling Law (PDMSL) characterizes how the performance of large models trained on data spanning multiple languages (or language families) systematically varies as a function of the allocation proportions, model/data scale, and, in advanced formulations, the transfer/synergy between languages. These laws enable precise prediction of per-language or aggregate performance at arbitrary scales and corpus compositions, provide guidance for optimal data mixture design, and illuminate the limits and emergent properties of cross-lingual generalization.
1. Formal Definitions and Universal Formulations
Proportion-dependent scaling laws extend classic power-law scaling (e.g., Chinchilla) by introducing explicit dependence on the sampling proportions allocated to each language or group. A generic law for the test loss $L_i$ of language (or family) $i$, for non-embedding model size $N$, total training data $D$ (tokens), and $p_i$ the proportion of tokens drawn from $i$, is given by:

$$L_i(N, D, p_i) \;=\; E_i \;+\; \frac{A_i}{N^{\alpha_i}} \;+\; \frac{B_i}{D^{\beta_i}\, p_i^{\gamma_i}},$$

where $E_i$, $A_i$, $B_i$, $\alpha_i$, $\beta_i$, and $\gamma_i$ are empirically fitted per family or language. The exponent $\gamma_i$ controls the sensitivity of loss to the allocation proportion $p_i$; $p_i$ is estimated as the fraction of tokens sampled for family $i$ out of the total $D$.
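To make the functional form concrete, the following minimal sketch evaluates the generic per-family law in Python; the parameter values are illustrative placeholders, not fitted constants from any of the cited studies.

```python
import numpy as np

def family_loss(N, D, p_i, E_i, A_i, B_i, alpha_i, beta_i, gamma_i):
    """Generic proportion-dependent loss for one language family:
    L_i = E_i + A_i / N^alpha_i + B_i / (D^beta_i * p_i^gamma_i)."""
    return E_i + A_i / N**alpha_i + B_i / (D**beta_i * p_i**gamma_i)

# Illustrative (not fitted) parameters for one hypothetical family.
params = dict(E_i=1.7, A_i=2.0e2, B_i=3.0e3, alpha_i=0.34, beta_i=0.28, gamma_i=0.25)

# Loss at 1B non-embedding parameters, 100B training tokens, 20% of tokens in this family.
print(family_loss(N=1e9, D=1e11, p_i=0.2, **params))
```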
Multiple studies (multilingual language modeling (He et al., 2024), code LLMs (Yang et al., 15 Dec 2025), NMT (Fernandes et al., 2023), reasoning models (Yang et al., 2 Oct 2025)) empirically validate this structure, with variations in details reflecting differing degrees of cross-lingual transfer and model domain.
2. Family- and Language-Wise Laws and Optimal Allocation
A key empirical observation is that, when languages are coherently grouped (e.g., by linguistic family), the loss for each group often depends only on its own proportion, not that of unrelated groups. Under this minimal transfer regime, each $L_i$ decouples, allowing for straightforward prediction and optimization:

$$L_i(N, D, p_1, \dots, p_K) \;=\; L_i(N, D, p_i), \qquad i = 1, \dots, K.$$

The optimal allocation problem (given weights $w_i$ for group importance) is:

$$\min_{p_1, \dots, p_K} \; \sum_{i=1}^{K} w_i\, L_i(N, D, p_i) \quad \text{s.t.} \quad \sum_{i=1}^{K} p_i = 1, \; p_i \ge 0.$$

Solving with a Lagrange multiplier for the simplex constraint, the proportions that minimize the loss under resource constraints satisfy:

$$p_i^* \;\propto\; \left( w_i\, \gamma_i\, B_i\, D^{-\beta_i} \right)^{\frac{1}{1+\gamma_i}}, \qquad \sum_i p_i^* = 1.$$

This formula is robust across model scales, enabling the use of optimal sampling ratios derived from small models in large-scale pretraining (He et al., 2024). If the exponents are shared across families ($\beta_i = \beta$, $\gamma_i = \gamma$), then $p_i^* \propto (w_i B_i)^{1/(1+\gamma)}$ and the optimal mixture is scale-invariant, i.e., independent of $N$ and $D$.
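A minimal sketch of the resulting allocation rule, assuming shared exponents $\beta_i = \beta$ and $\gamma_i = \gamma$ across families (so the scale-dependent factor cancels in the normalization); the importance weights and coefficients below are illustrative:

```python
import numpy as np

def optimal_proportions(w, B, gamma):
    """Closed-form optimal mixture under the decoupled law with shared exponents:
    p_i* ∝ (w_i * B_i)^(1 / (1 + gamma)), normalized to sum to 1.
    With shared exponents the result is independent of N and D (scale-invariant)."""
    w, B = np.asarray(w, dtype=float), np.asarray(B, dtype=float)
    raw = (w * B) ** (1.0 / (1.0 + gamma))
    return raw / raw.sum()

# Illustrative importance weights and fitted B_i for four hypothetical language families.
w = [1.0, 1.0, 2.0, 0.5]          # downstream importance per family
B = [3.0e3, 1.2e3, 8.0e2, 2.5e3]  # fitted data coefficients (hypothetical)
print(optimal_proportions(w, B, gamma=0.25))  # proportions summing to 1
```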
3. Empirical Results and Implications
Table: Empirical Exponents and Impact Across Domains
| Domain | Law Structure | Key Exponents/Synergy | Notable Findings |
|---|---|---|---|
| Multilingual LMs (He et al., 2024) | Per-family power law in $N$ and the family's own token budget $p_i D$ | Proportion-independent exponents per family | Optimal $p_i^*$ from small models transfers to large-scale pretraining |
| Multilingual Code LLMs (Yang et al., 15 Dec 2025) | Chinchilla-style law with synergy-augmented effective data budget | Per-language exponents plus pairwise synergy coefficients $\gamma_{lm}$ | Token allocation guided by per-language exponents and pairwise synergy |
| NMT (Fernandes et al., 2023) | Same power-law exponent for all mixes; mixture enters the coefficient | Effective capacity split as a function of mixture weights | Positive synergy for many-to-English; near-neutral for English-to-many |
| Reasoning LRM (Yang et al., 2 Oct 2025) | Power law in the number of parallel languages $X$ | Multilingual Transferability Index (MTI) | Marked "First-Parallel Leap" with diminishing marginal returns |
| Speech (ASR/ST) (Chen et al., 14 Feb 2025) | Classic scaling; no explicit $p$-dependence modeled | — | Scaling aids low-resource languages |
Empirical studies confirm that allocating a higher $p_i$ enhances loss reduction for that family, but with strongly sublinear effects due to the small exponent $\gamma_i$. Crucially, allocating tokens using the computed $p_i^*$ (rather than uniform or raw-data-proportional weights) yields measurable aggregate gains and fairness, especially since the family-level parameters are easy to estimate in small-scale pilots (He et al., 2024).
In code LLMs, cross-lingual synergy is explicitly modeled by augmenting the data budget $D_l$ of language $l$ with contributions from related languages:

$$D_l^{\mathrm{eff}} \;=\; D_l \;+\; \sum_{m \neq l} \gamma_{lm}\, D_m,$$

where $\gamma_{lm}$ is the pairwise synergy coefficient. Positive $\gamma_{lm}$ (e.g., Java/C# or JavaScript/TypeScript) amplifies the benefit of proportionally increasing both languages. Such second-order corrections are validated by direct experiment (Yang et al., 15 Dec 2025) and outperform uniform allocations under fixed compute.
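The bookkeeping implied by this augmentation can be sketched as follows; the language pairs, token counts, and synergy values are illustrative assumptions, not coefficients reported by the cited law:

```python
import numpy as np

langs = ["python", "java", "csharp", "javascript", "typescript"]
tokens = np.array([80e9, 40e9, 20e9, 50e9, 30e9])  # raw tokens per language (hypothetical)

# Pairwise synergy coefficients gamma[l, m]; positive entries model transfer
# between closely related languages (values are illustrative).
gamma = np.zeros((5, 5))
gamma[1, 2] = gamma[2, 1] = 0.15   # Java <-> C#
gamma[3, 4] = gamma[4, 3] = 0.20   # JavaScript <-> TypeScript

# Effective data budget: D_l_eff = D_l + sum_m gamma[l, m] * D_m
effective = tokens + gamma @ tokens
for lang, d_raw, d_eff in zip(langs, tokens, effective):
    print(f"{lang:>10}: {d_raw/1e9:6.1f}B raw -> {d_eff/1e9:6.1f}B effective")
```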
In NMT (Fernandes et al., 2023), mixture effects manifest almost purely in the multiplicative coefficient, with the same power-law exponent regardless of the mixture weight $p$. Effective capacity splits are accurately predicted via an effective-parameter count $N^{\mathrm{eff}}_i = f_i(p)\,N$, where $f_i(p)$ is the fraction of model capacity allotted to language pair $i$. Directional effects persist, with significant positive synergy for many-to-English setups and neutrality for English-to-many.
4. The "First-Parallel Leap" and Monolingual Generalization Gap
In LRM cross-lingual reasoning (Yang et al., 2 Oct 2025), the transition from monolingual training to even a single additional parallel language produces a disproportionate gain (the "First-Parallel Leap"). For example, the Multilingual Transferability Index (MTI) jumps from 1.16 with one language ($X = 1$) to 2.50 with two ($X = 2$), greatly exceeding the incremental benefit of each language added thereafter.
Simultaneously, the monolingual generalization gap is defined as the shortfall between the observed monolingual metric and its power-law extrapolation from the multilingual regime:

$$\Delta_{\mathrm{mono}} \;=\; \widehat{M}(X{=}1) \;-\; M_{\mathrm{obs}}(X{=}1),$$

where $\widehat{M}(X{=}1)$ is the value predicted by the power law fitted on the multilingual regime ($X \ge 2$) and $M_{\mathrm{obs}}(X{=}1)$ is the measured monolingual value.
This gap quantifies suboptimal transfer from monolingual-only training and persists across both accuracy and generalization metrics. The phenomenon is robust to task and model, indicating a structural limitation of monolingual pretraining with respect to cross-lingual reasoning.
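As a worked illustration of the gap computation, one can fit the power law on the multilingual regime ($X \ge 2$) and extrapolate back to $X = 1$; aside from the two MTI values quoted above, the data points below are hypothetical:

```python
import numpy as np

# Reported points from the text: MTI = 1.16 at X=1 and 2.50 at X=2.
# Values for X >= 3 are hypothetical, chosen only to illustrate the computation.
X = np.array([2, 3, 4, 5, 6])
mti = np.array([2.50, 2.95, 3.25, 3.48, 3.66])

# Fit M(X) ≈ a * X^b on the multilingual regime via log-log least squares.
b, log_a = np.polyfit(np.log(X), np.log(mti), 1)
a = np.exp(log_a)

extrapolated_mono = a * 1.0**b      # power-law prediction at X = 1 (equals a)
observed_mono = 1.16                # reported monolingual value
gap = extrapolated_mono - observed_mono
print(f"extrapolated M(1) = {extrapolated_mono:.2f}, gap = {gap:.2f}")
```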
5. Incorporating Cross-Lingual Transfer and Synergy
Advanced scaling laws such as ATLAS (Longpre et al., 24 Oct 2025) and the code LLM law (Yang et al., 15 Dec 2025) explicitly inject language-pair transfer through empirical transfer matrices or synergy coefficients. In ATLAS, the effective data budget for a target language is constructed as a sum over the target, the top-$k$ transfer languages, and the remaining languages, weighted by learned transfer coefficients and saturated for repeated tokens:

$$D^{\mathrm{eff}}_{t} \;=\; \tilde{D}_t \;+\; \sum_{\ell \in \mathcal{T}_k(t)} c_{\ell \to t}\, \tilde{D}_\ell \;+\; \sum_{\ell \notin \mathcal{T}_k(t)} c_{\mathrm{rest}}\, \tilde{D}_\ell,$$

where $\mathcal{T}_k(t)$ is the set of top-$k$ transfer languages for target $t$, $c_{\ell \to t}$ are learned transfer weights, and $\tilde{D}_\ell$ denotes a saturating transform of $D_\ell$ that discounts repeated tokens.
Transfer matrices—empirically measured, e.g., via the Bilingual Transfer Score (BTS) in ATLAS—guide both the initial proportion assignment and the selection of the top-$k$ high-benefit transfer groups. These refinements yield greatly improved and robust held-out prediction accuracy across model sizes, data scales, and unseen mixtures, far surpassing classic Chinchilla-style monolingual or uniform-multilingual laws (Longpre et al., 24 Oct 2025).
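The general shape of such a construction can be sketched as below; the transfer coefficients, residual weight, top-$k$ rule, and saturation function are assumptions for illustration rather than the exact ATLAS parameterization:

```python
import numpy as np

def effective_data(target, tokens, transfer, k=2, u=50e9):
    """Assumed-form effective data budget for `target`: own tokens plus top-k
    transfer languages weighted by transfer coefficients, with every contribution
    passed through a saturating transform that discounts repeated tokens."""
    saturate = lambda d: u * (1.0 - np.exp(-d / u))   # diminishing returns on repeats
    others = [l for l in tokens if l != target]
    # Rank donor languages by their transfer coefficient into the target.
    topk = sorted(others, key=lambda l: transfer[(l, target)], reverse=True)[:k]
    d_eff = saturate(tokens[target])
    for l in others:
        weight = transfer[(l, target)] if l in topk else 0.05  # small residual weight
        d_eff += weight * saturate(tokens[l])
    return d_eff

# Hypothetical token counts and transfer coefficients, for illustration only.
tokens = {"en": 500e9, "de": 60e9, "hi": 8e9, "sw": 1e9}
transfer = {("en", "sw"): 0.30, ("de", "sw"): 0.10, ("hi", "sw"): 0.20,
            ("en", "hi"): 0.25, ("de", "hi"): 0.08, ("sw", "hi"): 0.02,
            ("en", "de"): 0.35, ("hi", "de"): 0.03, ("sw", "de"): 0.01,
            ("de", "en"): 0.15, ("hi", "en"): 0.05, ("sw", "en"): 0.02}
print(f"effective data for Swahili: {effective_data('sw', tokens, transfer)/1e9:.1f}B tokens")
```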
6. Practical Guidelines and Limitations
Practical design of pretraining mixtures involves:
- Estimating the per-family or per-language parameters $(E_i, A_i, B_i, \alpha_i, \beta_i, \gamma_i)$ (or Chinchilla-style exponents for code);
- Computing the optimal proportions $p_i^*$ for the target data and model scales (see the fitting sketch after this list);
- Adjusting raw allocations to encourage synergy between high-transfer pairs;
- Validating performance under constraints (e.g., held-out loss, fairness objectives).
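A condensed sketch of this workflow, assuming the six-parameter generic law from Section 1; the pilot measurements are synthetic and all fitted values are illustrative:

```python
import numpy as np
from scipy.optimize import curve_fit

def law(X, E, A, B, alpha, beta, gamma):
    """Generic per-family law: L = E + A/N^alpha + B/(D^beta * p^gamma)."""
    N, D, p = X
    return E + A / N**alpha + B / (D**beta * p**gamma)

# Step 1: small-scale pilot grid over (N, D, p). In practice the losses come from
# real pilot runs; here they are generated from hypothetical "true" parameters.
Ns = np.array([5e7, 5e7, 5e7, 1e8, 1e8, 1e8, 2e8, 2e8])
Ds = np.array([5e9, 5e9, 1e10, 5e9, 1e10, 1e10, 1e10, 2e10])
ps = np.array([0.5, 0.25, 0.25, 0.5, 0.5, 0.25, 0.25, 0.125])
X = np.vstack([Ns, Ds, ps])

true = {"germanic": (1.8, 150.0, 900.0, 0.32, 0.27, 0.22),
        "indic":    (2.1, 180.0, 1500.0, 0.30, 0.25, 0.30)}
rng = np.random.default_rng(0)

fitted = {}
for fam, theta in true.items():
    losses = law(X, *theta) + rng.normal(0, 0.005, size=X.shape[1])
    params, _ = curve_fit(law, X, losses,
                          p0=[2.0, 150.0, 1000.0, 0.3, 0.25, 0.25], maxfev=50000)
    fitted[fam] = params

# Step 2: closed-form allocation with equal importance weights w_i = 1,
# using the shared-exponent approximation gamma_i ≈ gamma.
gamma_shared = np.mean([fitted[f][5] for f in fitted])
raw = np.array([fitted[f][2] ** (1.0 / (1.0 + gamma_shared)) for f in fitted])
p_star = raw / raw.sum()
print({fam: round(float(p), 3) for fam, p in zip(fitted, p_star)})
```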
Scaling laws derived from small models generalize with high fidelity to much larger scales (He et al., 2024), allowing rapid prototyping and efficient resource investment. However, limitations include:
- Assumptions of negligible cross-family transfer (violated for linguistically incoherent groupings) (He et al., 2024);
- Static corpus composition—adaptive, curriculum, or staged sampling is not addressed;
- Synergy and transfer coefficients (Yang et al., 15 Dec 2025, Longpre et al., 24 Oct 2025) are corpus- and architecture-dependent and may not extrapolate to domain-specific or low-resource languages without recalibration;
- For vision-language or speech settings, results demonstrate classic scaling but lack finely resolved $p$-dependent experiments (Spravil et al., 12 Mar 2025, Chen et al., 14 Feb 2025).
7. Broader Impact, Generalization, and Frontier Directions
Proportion-dependent multilingual scaling laws establish a robust theoretical and empirical foundation for principled multilingual model construction. They unify the optimization of model/data scale with data mixing, span domains (text, code, speech, and reasoning), and imbue the model design process with predictable trade-offs:
- Raising representation for high-utility or underrepresented groups while avoiding over-investment in redundant, fast-saturating languages;
- Harnessing and quantifying cross-lingual transfer for maximum resource efficiency;
- Quantitatively illuminating the phenomenon and magnitude of transfer asymmetries, fairness gains for low-resource contexts, and emerging capabilities at scale.
A plausible implication is that future extensions will integrate dynamic or task-adaptive data allocation, further refine transfer/synergy modeling, and generalize to multi-modal or multi-task settings with proportionally resolved scaling, catalyzing continued advances in equitable and efficient multilingual AI.
Key References:
(Yang et al., 2 Oct 2025) (Parallel Scaling Law for LRMs), (He et al., 2024) (Multilingual LM Scaling), (Yang et al., 15 Dec 2025) (Multilingual Code LLMs), (Fernandes et al., 2023) (Multilingual NMT Scaling), (Longpre et al., 24 Oct 2025) (ATLAS cross-lingual scaling), (Chen et al., 14 Feb 2025) (Multilingual Speech Scaling).