Adaptive Transfer Scaling Law (ATLAS)
- Adaptive Transfer Scaling Law (ATLAS) is an empirical and theoretical framework that predicts and optimizes neural network performance by jointly modeling model size, data scale, and transfer interactions within and across domains.
- It integrates explicit transfer terms, repetition-aware data saturation, and heterogeneous formulations to capture both positive cross-domain transfer and negative interference effects.
- Empirical validation in multilingual LLMs and vision tasks underscores ATLAS's ability to provide actionable compute-optimal scaling recipes and resource-planning strategies.
The Adaptive Transfer Scaling Law (ATLAS) is an empirical and theoretical framework for predicting, modeling, and optimizing neural network performance as a function of model scale, data scale, and transfer interactions within and across domains—most notably in multilingual and data-constrained regimes. ATLAS refines earlier power-law scaling approaches by introducing explicit transfer terms, repetition-aware data saturation, and quantitative metrics for cross-domain transfer, achieving markedly more robust extrapolation and resource planning across diverse data mixtures and target tasks.
1. Conceptual Foundation and Scope
ATLAS was developed to fill critical methodological gaps in how scaling laws reflect transfer learning efficiency, especially given the complexities of multilingual training, cross-domain adaptation, and sparse downstream data. Whereas classic scaling laws modeled performance with simple power-laws in data or model size, ATLAS introduces heterogeneous, transfer-sensitive formulations that measure both positive and negative effects when scaling across multiple domains or languages.
The primary innovation is the modeling of "effective data exposure" for each target (e.g., language, domain, or downstream task), integrating baseline data, explicit transfer from correlated sources, and interference from unrelated mixtures. This enables ATLAS to account for both cross-lingual benefit and the curse of multilinguality, as well as boundaries in distillation vs. direct adaptation for data-limited vision tasks.
2. Mathematical Formulation
The ATLAS scaling law expresses expected validation loss for a target domain or language as:
$$
L(N, D_{\mathrm{eff}}) = E + \frac{A}{N^{\alpha}} + \frac{B}{D_{\mathrm{eff}}^{\beta}}
$$
where:
- $E$ is the irreducible (Bayes) error.
- $A$, $B$, $\alpha$, $\beta$ are fitted coefficients.
- $N$ is the model parameter count.
- $D_{\mathrm{eff}}$ is the effective data exposure for the target domain or language.
Effective data integrates contributions as:
$$
D_{\mathrm{eff}} = f(D_t, U_t) + \sum_{s \in S} w_s \, f(D_s, U_s)
$$
- $D_t$ and $U_t$: total and unique tokens in the target.
- $S$: transfer source set.
- $D_s$, $U_s$: data and uniqueness for source $s$; $w_s$: empirical transfer weight.
- $f(\cdot,\cdot)$: data saturation function accounting for diminishing returns with multi-epoch repetition:
  - $D \le U$: $f(D, U) = D$ (linear regime)
  - $D > U$: $f(D, U) = U + \tfrac{U}{k}\bigl(1 - e^{-k(D-U)/U}\bigr)$, with saturation rate $k$.
Additional language-agnostic scaling formulas introduce a "curse of multilinguality" via per-language scaling exponents:
$$
L_{\ell}(N, D, K) = E + \frac{A}{\left(N K^{-\gamma}\right)^{\alpha}} + \frac{B}{\left(D_{\mathrm{eff}} K^{-\delta}\right)^{\beta}}
$$
Here, $K$ is the number of supported languages, and $\gamma$, $\delta$ encode parameter capacity per language and cross-lingual data benefit/interference, respectively.
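The pieces above can be combined directly in code. The following is a minimal Python sketch, assuming the Chinchilla-style loss form and the saturation function given above; every numeric coefficient, the example token counts, and the transfer weights are illustrative placeholders rather than fitted values from the ATLAS papers.

```python
import math

# Minimal sketch of the ATLAS loss form described above. All coefficient
# values (E, A, B, alpha, beta, k) and the transfer weights are illustrative
# placeholders, not fitted values from the paper.

def saturated_tokens(total: float, unique: float, k: float = 0.25) -> float:
    """Repetition-aware saturation f(D, U): linear up to one epoch of unique
    tokens, then exponentially diminishing returns with saturation rate k."""
    if total <= unique:
        return total
    return unique + (unique / k) * (1.0 - math.exp(-k * (total - unique) / unique))

def effective_data(target: tuple[float, float],
                   sources: dict[str, tuple[float, float]],
                   weights: dict[str, float]) -> float:
    """D_eff = f(D_t, U_t) + sum_s w_s * f(D_s, U_s).
    Weights may be negative to model interference."""
    d_t, u_t = target
    d_eff = saturated_tokens(d_t, u_t)
    for name, (d_s, u_s) in sources.items():
        d_eff += weights.get(name, 0.0) * saturated_tokens(d_s, u_s)
    return d_eff

def atlas_loss(n_params: float, d_eff: float,
               E: float = 1.7, A: float = 4.0e2, B: float = 1.1e4,
               alpha: float = 0.34, beta: float = 0.28) -> float:
    """L(N, D_eff) = E + A / N^alpha + B / D_eff^beta."""
    return E + A / n_params**alpha + B / d_eff**beta

# Example: a target language with 2B unique tokens seen for 3 epochs,
# plus weighted transfer from two larger source corpora.
d_eff = effective_data(
    target=(6e9, 2e9),
    sources={"en": (2e11, 2e11), "fr": (5e10, 5e10)},
    weights={"en": 0.05, "fr": 0.03},   # hypothetical transfer weights w_s
)
print(atlas_loss(n_params=2e9, d_eff=d_eff))
```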
3. Empirical Validation and Transfer Matrices
ATLAS was validated in an extensive multilingual LLM paper (Longpre et al., 24 Oct 2025), spanning 774 training experiments, 10M–8B parameter models, 400+ training languages, and 48 evaluation languages. The key advances include:
- Out-of-sample improvements: ATLAS achieves $R^2$ of up to $0.98$ on pooled loss prediction and $0.82$ for extrapolation across novel language mixes, consistently outperforming leading alternatives such as Chinchilla or MSL by 0.1–0.3 in $R^2$.
- Empirical cross-lingual transfer matrices: Bilingual Transfer Scores (BTS) directly measure how introducing a source language $s$ improves or interferes with a target language $t$, via normalized convergence-rate comparisons (see the sketch after this list).
- Language family and script correlations: Transfer is strongest among related languages/scripts; English is generally a positive donor. Transfer asymmetry and interference are empirically quantified.
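As a concrete illustration of the idea behind BTS, the sketch below compares how quickly the target-language loss reaches a threshold with and without the source language in the mixture. The normalization and the toy loss curves are assumptions for illustration, not the exact metric definition from Longpre et al.

```python
import numpy as np

# Hedged sketch of a Bilingual Transfer Score (BTS): compare how fast the
# target-language loss converges with and without the source language in the
# mixture, normalized by the target-only convergence point.

def tokens_to_loss(tokens: np.ndarray, losses: np.ndarray, threshold: float) -> float:
    """First token count at which the loss curve drops below `threshold`
    (np.inf if it never does)."""
    below = np.nonzero(losses <= threshold)[0]
    return float(tokens[below[0]]) if below.size else float("inf")

def bilingual_transfer_score(tokens, loss_target_only, loss_with_source, threshold):
    """BTS > 0: the source language accelerates target convergence (positive transfer);
    BTS < 0: it slows convergence (interference)."""
    t_alone = tokens_to_loss(tokens, loss_target_only, threshold)
    t_mixed = tokens_to_loss(tokens, loss_with_source, threshold)
    return (t_alone - t_mixed) / t_alone   # normalized speed-up

# Toy curves: mixing in the source reaches the loss threshold sooner.
tokens = np.linspace(1e9, 1e11, 200)
loss_alone = 2.0 + 80.0 * tokens ** -0.3
loss_mixed = 2.0 + 65.0 * tokens ** -0.3
print(bilingual_transfer_score(tokens, loss_alone, loss_mixed, threshold=2.05))
```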
| Scaling Law | $R^2$ (all) | $R^2$ (held-out axis 1) | $R^2$ (held-out axis 2) | $R^2$ (held-out axis 3) | $R^2$ (novel language mixes) |
|---|---|---|---|---|---|
| Chinchilla (Multi) | 0.64 | -0.99 | 0.72 | 0.66 | 0.61 |
| He et al. (MSL) | 0.67 | -0.65 | 0.73 | 0.67 | 0.70 |
| ATLAS (target data only) | 0.70 | -0.75 | 0.80 | 0.72 | 0.64 |
| ATLAS (full transfer) | 0.98 | 0.89 | 0.96 | 0.98 | 0.82 |
Table: ATLAS outperforms prior scaling laws across held-out evaluation axes (from Longpre et al., 24 Oct 2025)
4. Transfer, Crossover, and Resource Optimization
ATLAS provides actionable thresholds for practitioners on when to prefer multilingual finetuning or monolingual pretraining:
- Compute crossover points: For limited budgets (e.g., 144B tokens for 2B-parameter models), finetuning from a multilingual checkpoint delivers superior efficiency; for larger budgets (283B tokens), monolingual pretraining surpasses it.
- Model scaling recipes: To maintain per-language performance when growing language coverage to $K$ languages, practitioners should scale model and data as $N \propto K^{\gamma}$ and $D \propto K^{\delta}$, and compute as $C \propto N D \propto K^{\gamma+\delta}$ (see the sketch after this list).
- Distillation boundary theory (Yang et al., 17 Apr 2025): In data-limited transfer for vision, distilled models are superior below a critical upstream data threshold, above which direct pretrain-finetune overtakes; the boundary is explicitly quantified via fitted scaling-law parameters.
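A small sketch of the language-coverage recipe above, under the assumption that per-language capacity and data requirements grow as $K^{\gamma}$ and $K^{\delta}$; the exponent values and the $C \approx 6ND$ compute rule of thumb are illustrative assumptions, not fitted quantities.

```python
# Sketch of the language-coverage scaling recipe implied by the multilingual
# form above: to hold per-language loss fixed while growing the number of
# languages K, grow parameters as N ~ K^gamma and data as D ~ K^delta, so
# compute C ~ N*D grows as K^(gamma + delta). Exponent values are hypothetical.

def scaled_budget(n0: float, d0: float, k0: int, k1: int,
                  gamma: float = 0.6, delta: float = 0.4):
    """Return (N, D, C) needed at k1 languages to match the per-language loss
    obtained with (n0, d0) at k0 languages, under the K^gamma / K^delta recipe."""
    ratio = k1 / k0
    n1 = n0 * ratio**gamma
    d1 = d0 * ratio**delta
    return n1, d1, 6.0 * n1 * d1        # C ~ 6*N*D training-FLOPs rule of thumb

# Going from 10 to 100 supported languages, starting from a 2B-param, 100B-token run:
print(scaled_budget(n0=2e9, d0=1e11, k0=10, k1=100))
```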
5. Applications Across Vision, Language, and Scientific Domains
ATLAS generalizes across modalities:
- Vision transfer laws (Yang et al., 17 Apr 2025): Error and cross-entropy losses scale predictably with pretraining data, model size, and fine-tuning data; pretraining data volume is the single most decisive factor.
- Few-shot classification (Prato et al., 2021): Error rates for new classes (out-of-distribution) following pretraining obey steeper power-law convergence than in-distribution classes; scaling pretraining data or classes delivers clear gains.
- Sim2Real in materials science (Minami et al., 7 Aug 2024): Prediction error decays as a power law in computational (simulated) dataset size, $\varepsilon(D_{\mathrm{sim}}) \propto D_{\mathrm{sim}}^{-\beta}$; this underpins database planning for real-world property prediction.
- Synthetic-to-real transfer (Mikami et al., 2021): The transfer gap quantifies the irreducible error due to domain difference; practitioners can empirically estimate whether collecting more synthetic data will bear fruit or whether enhanced diversity/realism is necessary (see the sketch after this list).
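The planning logic in the last two items can be made concrete by fitting a saturating power law to pilot measurements. The sketch below assumes the form $\varepsilon(D) = C\,D^{-\beta} + \mathrm{gap}$; the pilot numbers are fabricated purely for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hedged sketch of synthetic-to-real planning: fit error(D) = c * D^(-beta) + gap
# to pilot measurements, where `gap` estimates the irreducible transfer gap.
# If the fitted gap dominates the remaining decaying term, more synthetic data
# will not help and diversity/realism must improve instead.

def error_curve(d, c, beta, gap):
    return c * d**(-beta) + gap

d_pilot = np.array([1e3, 3e3, 1e4, 3e4, 1e5])            # simulated-sample counts
err_pilot = np.array([0.41, 0.31, 0.24, 0.205, 0.19])    # measured real-world error

(c, beta, gap), _ = curve_fit(error_curve, d_pilot, err_pilot,
                              p0=[1.0, 0.3, 0.1], maxfev=10_000)
print(f"fitted exponent beta={beta:.2f}, transfer gap={gap:.3f}")
print("error predicted at 1e6 samples:", error_curve(1e6, c, beta, gap))
```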
6. Methodological Extensions and Practical Guidance
- Hyperparameter transfer (Bjorck et al., 30 Sep 2024): Optimal learning rates for LLM training scale as a power law in the token horizon, $\eta^{*}(D) \propto D^{-b}$: longer horizons necessitate smaller learning rates; practical transfer is achieved with nearly zero overhead via empirical scaling (see the sketch after this list).
- Model shape optimization (Anagnostidis et al., 2023): Training protocols dynamically adapt the architecture (patch size, context length, width) during training, traversing scaling laws for maximal compute efficiency and yielding up to a 40–60% reduction in the FLOPs required to reach a target performance.
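For the learning-rate transfer item above, a minimal sketch: fit a power law to the optimal learning rates found at short token horizons and extrapolate to the full horizon. The (horizon, LR) pairs and the fitted form are assumptions for illustration, not values from Bjorck et al.

```python
import numpy as np

# Minimal sketch of horizon-aware LR transfer: fit lr_opt(D) = a * D^slope
# (slope < 0) to optimal learning rates found in cheap short-horizon pilot
# runs, then extrapolate to the full training horizon.

horizons = np.array([1e9, 4e9, 1.6e10])        # tokens used in pilot runs
best_lrs = np.array([3.2e-3, 1.9e-3, 1.1e-3])  # tuned optimal LR at each horizon

# Linear fit in log-log space: log lr = intercept + slope * log D.
slope, intercept = np.polyfit(np.log(horizons), np.log(best_lrs), 1)

target_horizon = 1e12                          # full-scale training run
print("extrapolated LR:", np.exp(intercept) * target_horizon**slope)
```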
| Domain | Core ATLAS Utility |
|---|---|
| Multilingual LLMs | Predicting/optimizing loss with transfer/interference |
| Vision downstream tasks | Data-efficient transfer, distillation boundaries |
| Few-shot/out-of-domain learning | Scaling law for error, convergence rate differences |
| Materials science (Sim2Real) | Database planning for error minimization |
| Hyperparameter optimization | Scaling LR with data horizon, efficient transfer |
| Adaptive model architecture | Compute-optimal schedules with dynamic shape |
7. Limitations and Interpretability
ATLAS models are parameterized with empirical transfer weights, exponents, and saturation rates, which require substantial cross-domain experimentation to estimate. Transfer is not universally positive; mixture composition, domain similarity, and phase transitions (e.g., distillation boundaries) must be determined for each application via curve fitting and iso-loss extrapolation. The law is agnostic to the source of transfer, accommodating both positive and negative interactions.
A plausible implication is that future work may refine ATLAS with mechanistic or representation-level transfer metrics, potentially improving its interpretability and extrapolation fidelity in new regimes or modalities.
8. Summary
The Adaptive Transfer Scaling Law (ATLAS) provides a unified, empirically validated, and transfer-aware framework for predicting and optimizing transfer learning dynamics across multilingual, vision, scientific, and model-scaling settings. By formalizing the interplay of model size, data, cross-domain transfer, and saturation effects, ATLAS establishes actionable recipes for model scaling, resource allocation, and compute-optimal performance in real-world tasks where data composition and transfer efficiency are critical (Longpre et al., 24 Oct 2025, Yang et al., 17 Apr 2025, Prato et al., 2021, Minami et al., 7 Aug 2024, Bjorck et al., 30 Sep 2024, Anagnostidis et al., 2023, Mikami et al., 2021, Barnett, 30 Aug 2024, Hernandez et al., 2021).