Language-Agnostic Scaling Law

Updated 28 October 2025

Language-agnostic scaling law is a mathematical framework that describes how model performance (e.g., test loss, generalization) scales with parameters like model size, dataset size, and number of languages.
It is grounded in empirical and theoretical analyses, leveraging power-law spectral decay and covariance properties that hold universally across languages and data mixtures.
The framework informs efficient resource allocation and model design by predicting performance trade-offs in both monolingual and multilingual setups.

A language-agnostic scaling law refers to a mathematical framework or empirical regularity that quantitatively describes how the performance (e.g., test loss, generalization, accuracy) of LLMs or linguistic phenomena scales with key variables such as model size, dataset size, number of languages, sparsity, or compositionality, in a way that holds independently of specific languages or scripts. Modern research demonstrates that such laws are not merely convenient extrapolations, but are deeply grounded in statistical properties—such as covariance spectra, cross-lingual transfer structure, or token distributions—that apply universally, or at least predictably, across languages, data mixtures, and architectures.

1. Fundamental Formulations and Universality

Empirical and theoretical research across several domains establishes that language modeling performance scales as precise power-law functions of the key resources involved, most commonly model size ( $N$ ), dataset size ( $D$ ), number of training languages ( $K$ ), and sometimes other architectural variables or data-specific features. The canonical form is:

$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$

where $L$ is measured cross-entropy loss, $N$ is (non-embedding) parameter count, $D$ is training data token count, $E$ is the irreducible loss floor, and $A, B, \alpha, \beta$ are fitted constants or exponents. This form is observed to hold across English, multilingual recipes, diverse scripts, and even outside human language, provided the underlying statistical properties (e.g., power-law data spectrum) are met (Kaplan et al., 2020, Maloney et al., 2022, Chen et al., 3 Mar 2025).

Language-agnosticism in this context means the same law holds regardless of whether the model is exposed to English, Chinese, Spanish, or a balanced multilingual mixture, as long as the data exhibits the necessary spectral properties (see §3). The exponents $\alpha, \beta$ , and in some cases the functional form of $L(N, D)$ , are robust to changes in language identity, family, or data mixture (He et al., 15 Oct 2024, Li et al., 12 Jun 2025, Longpre et al., 24 Oct 2025).

2. Theoretical Foundations and Mechanisms

The statistical origin of language-agnostic scaling laws has been formalized in both spectral analyses and solvable random feature models (Maloney et al., 2022, Chen et al., 3 Mar 2025). The key conditions are:

Power-law spectral decay in the data covariance or kernel: For a scaling law to persist, the eigenvalues $\lambda_i$ of the data/feature covariance should obey $\lambda_i \sim i^{-1-\alpha}$ , with $i$ the eigenvalue rank.
Nonlinear feature extension: Neural architectures provide nonlinear mappings that extend the effective power-law regime and thus scaling applicability beyond just the raw input dimension.
Equiparameterization: Near-optimal scaling is achieved by increasing model and dataset size together, $N \sim D$ , to avoid bottlenecking and plateaus.
Universality: These requirements do not depend on the target language or script, but on the large-scale statistical makeup of the data—enabling transfer of observed laws across languages and tasks (Maloney et al., 2022, Chen et al., 3 Mar 2025).

Scaling breaks down, manifesting as a plateau in loss, when $N$ or $D$ saturates the latent "information dimension" of the data $M$ , or when the power-law spectrum exhibits a cutoff.

3. Extensions to Multilingual and Multitask Regimes

3.1 Multilingual Scaling Laws

With the proliferation of multilingual LLMs, multiple studies have generalized scaling laws to multilingual pretraining and multitask settings. Crucial refinements include:

Family-wise independence: The test loss for each language family depends only on its own sampling ratio in the data mixture, independent of others (He et al., 15 Oct 2024):

$L_i(N, D, p_i) = \left(E_i + \frac{A_i}{N^{\alpha_i}} + \frac{B_i}{D^{\beta_i}}\right) p_i^{-\gamma_i}$

Here $p_i$ is the sampling fraction for family $i$ . This form enables global optimization and prediction for mixtures of arbitrary language composition.

Curse of multilinguality and transfer: The Adaptive Transfer Scaling Law (ATLAS) models how per-language loss behaves as languages are added, accounting for positive cross-lingual transfer via a data-dependent transfer matrix. ATLAS introduces a compositional law:

$L(K, N, D_t) = L_\infty + A \frac{K^\phi}{N^\alpha} + B \frac{K^\psi}{D_t^\beta}$

where $K$ is the number of languages, $D_t$ is tokens per language, and $\phi, \psi$ capture increased parameter and data demands from additional language coverage, with $\psi < 0$ reflecting positive transfer (Longpre et al., 24 Oct 2025).

Parameter efficiency and effective allocation: Effective parameter allocation per language or task in a multilingual setting can be precisely predicted via multiplicative constants learned empirically, independent of language similarity (Fernandes et al., 2023).

3.2 Cross-Lingual Reasoning and Parallel Scaling

Empirical law for cross-lingual reasoning generalization:

$f(X) = \alpha \cdot X^\beta$

where $X$ is the number of parallel languages used for joint training. The "First-Parallel Leap" denotes a large initial cross-lingual gain when moving from monolingual to bilingual training, and the "Monolingual Generalization Gap" quantifies the underperformance of monolingual models relative to scaling predictions, indicating that language-agnostic reasoning capability depends on both parallel, aligned data and model diversity (Yang et al., 2 Oct 2025).

4. Practical Implications and Resource Allocation

Model and data scaling: Empirical laws provide explicit recipes for allocating compute between model size and dataset size at any target scale (Kaplan et al., 2020, Li et al., 12 Jun 2025). Recent advances, such as the Farseer law, demonstrate that the optimal data-to-parameter ratio increases with overall compute budget, diverging from previous fixed-ratio heuristics:

$L(N, D) = e^{a_3 N^\gamma + b_3} + e^{a_2 N^\beta + b_2} \cdot D^{-e^{a_1 N^\alpha + b_1}}$

Here, the exponent of $D$ itself depends on $N$ , capturing interaction effects vital for reliable extrapolation (Li et al., 12 Jun 2025).

Sparse and dense architectures: A unified, architecture-agnostic law enables principled prediction of loss for both dense and sparse models, e.g., pruned and Mixture-of-Experts (MoE), by adding sparsity as a formal parameter:

$L(N, D, S) = e (1-S)^\gamma + [a (1-S)^\alpha + c S] \frac{1}{N^\alpha} + \frac{b}{D^\beta}$

with $S$ denoting sparsity, recovering all prior major laws as special cases (Hossain et al., 8 Aug 2025).

Data repetition and transfer: Laws account for data repetition via explicit saturation terms, and compute-efficient scaling by quantifying how additional languages, via cross-lingual transfer, affect optimal model and data scaling (Longpre et al., 24 Oct 2025).
Predictive model selection and fine-tuning: Rectified scaling laws, incorporating pre-learned data size, allow resource-efficient model selection for fine-tuning by capturing both pre-power and power-loss decay phases:

$\hat{L}(D) = \frac{B}{D_l + D^\beta} + E$

where $D_l$ is the effective pre-learned data exposure from pre-training (Lin et al., 4 Feb 2024).

5. Empirical Validation, Algorithmic Discovery, and Limitations

Empirical validation: Large-scale experiments (up to 1,000 LLM trainings, >400 languages, and 3 million GPU-hours) show that language-agnostic scaling laws hold with high predictive accuracy, even in extrapolation, e.g., Farseer achieves <1% relative error for 25B parameter models using fits from much smaller grids (Li et al., 12 Jun 2025, Longpre et al., 24 Oct 2025).
Automated law discovery: EvoSLD demonstrates that such language-agnostic scaling laws can be discovered automatically, without prior domain knowledge, by co-evolving symbolic expressions and optimizers across grouped experimental regimes, and verifying generalization by stratified test splits (Lin et al., 27 Jul 2025).
Limits to language-agnosticism: Certain linguistic systems violate classic scaling laws. For example, character-based languages (Chinese, Japanese, Korean) exhibit exponential decay in Zipf’s plots and three-stage vocabulary growth, due to finite vocabulary effects—canonical Zipf’s and Heaps’ laws do not hold universally (Lu et al., 2012).

| Language System | Zipf/Heaps Laws | Scaling Law Validity | |----------------------------|-----------------|--------------------------| | Alphabetic (English, etc.) | Yes | Power-law scaling holds | | Logographic (Chinese, etc.)| Deviate | Saturation/exponential |

Therefore, scaling laws are language-agnostic only within the range where the underlying statistical assumptions (e.g., large, effectively unbounded vocabulary, power-law data spectrum) are satisfied.

6. Synthesis: Principles and Future Directions

Language-agnostic scaling laws are a unifying principle in both the scientific paper of language and the engineering of modern LLMs:

They provide a quantitative foundation for extrapolating empirical results across languages, model sizes, data regimes, and architectures, with robust theoretical underpinnings in random matrix theory and statistical mechanics.
They enable principled, resource-efficient design of multilingual, multitask, and multimodal AI systems by making trade-offs computable and predictable.
They clarify both the scope and boundaries of universality: such laws require certain spectral and combinatorial conditions in the data and may break in systems with hard vocabulary caps or fundamentally distinct statistical structure.

The current trajectory suggests expansion of language-agnostic scaling laws into more granular regimes (e.g., per-language family, per-task), finer resource modeling (including sparsity and compositionality), and automated, interpretive discovery frameworks that further democratize linguistic AI research and application.