Farseer Law: Scaling for Code LLMs
- Farseer Law is a refined scaling framework for code LLMs that models validation loss as a function of model size (N) and training tokens (D) with N-dependent exponents.
- It employs a combination of stretched exponential and power-law coupling to accurately extrapolate loss behaviors from small experiments to full-scale, compute-intensive settings.
- Farseer Law outperforms classical scaling laws such as Chinchilla, revealing super-linear data-to-parameter requirements and predicting near-zero loss in high-resource regimes.
Farseer Law is a refined, empirically driven scaling law for LLM training, developed to address limitations of prior formulations such as Chinchilla in accurately predicting loss as a function of model size and data size across scales and modalities. It provides substantially better predictive accuracy when extrapolating performance from tractable small-scale experiments to highly resource-intensive production LLM settings, with explicit coupling between parameter count (N) and training-token count (D). Distinctively, Farseer law's exponents are themselves parameterized as functions of N, yielding a flexible model of how loss decays with additional data and scale. It is especially valuable in code modeling, where the data-to-parameter ratio must grow super-linearly and previous scaling heuristics consistently underestimate data requirements.
1. Mathematical Formulation and Expressive Structure
Farseer law specifies the expected validation loss of an LLM in terms of N (number of parameters) and D (number of unique training tokens) via a stretched-exponential and power-law coupling:
$L(N, D) = \exp(a \cdot N^{b} + c) + \exp(d \cdot N^{e} + f) \cdot D^{-\exp(g \cdot N^{h} + i)}$
Here, a, b, c, d, e, f, g, h, and i are empirically determined real coefficients. For code LLMs, the fitted law is:
$\begin{split} L(N,D) &= \exp(-0.0047 \cdot N^{0.239} - 0.8188) \\ &+ \exp(62.8936 \cdot N^{-0.0614} - 14.0414) \cdot D^{-\exp(-0.0209 \cdot N^{0.1943} - 0.1826)} \end{split}$
Farseer law's critically expressive feature is that the exponent of D is itself an exponential function of N, allowing the rate at which loss decreases with additional data to vary with model size and capturing nontrivial interactions between N and D. This enables precise modeling of loss landscapes over compute budgets and dataset sizes that fixed-exponent laws cannot represent.
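As a concrete illustration, the following minimal Python sketch evaluates the fitted code formula above; the 7B-parameter / 1T-token query values are illustrative choices, not values from the source.

```python
import math

def farseer_code_loss(N: float, D: float) -> float:
    """Fitted Farseer law for code LLMs, using the coefficients quoted above.

    N: number of parameters; D: number of unique training tokens.
    """
    # Term that depends only on model size N.
    size_term = math.exp(-0.0047 * N**0.239 - 0.8188)
    # Data term: both its prefactor and the exponent of D vary with N.
    prefactor = math.exp(62.8936 * N**-0.0614 - 14.0414)
    d_exponent = math.exp(-0.0209 * N**0.1943 - 0.1826)
    return size_term + prefactor * D**(-d_exponent)

# Illustrative query: predicted validation loss for a 7B-parameter model
# trained on 1T unique code tokens (values chosen for demonstration only).
print(farseer_code_loss(7e9, 1e12))
```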
2. Comparison with Classical Scaling Laws
Chinchilla law, originally developed for natural language, is a simpler model:
$L(N, D) = E + A \cdot N^{-\alpha} + B \cdot D^{-\beta}$
with constant coefficients E, A, B and constant exponents $\alpha$ and $\beta$; the best Chinchilla fit for code retains this fixed-exponent form.
Farseer law improves upon Chinchilla in key respects:
- Expressiveness: Chinchilla's constant exponents fail to capture how larger models leverage data more efficiently; Farseer law's N-dependent exponents model this directly (see the sketch after the table below).
- Loss Limit: Chinchilla implies a fixed irreducible loss (an "entropy limit") as N, D → ∞. Farseer law allows the loss to approach zero asymptotically for sufficiently large N and D, aligning with the near-deterministic behavior of code (where long prefixes almost always determine the next tokens).
- Optimal Data-to-Parameter Ratio: Chinchilla predicts a roughly flat optimal ratio; Farseer reveals that for code tasks this ratio must grow rapidly with N and total compute C.
| Aspect | Chinchilla Law | Farseer Law (for code) |
|---|---|---|
| Formula | $L = E + A \cdot N^{-\alpha} + B \cdot D^{-\beta}$ | Stretched exponential plus power law with N-dependent exponents |
| Exponents | Constant | Vary smoothly with N |
| Fit Accuracy | Decent but suboptimal | Excellent, mean relative error < 1‰ |
| D/N Scaling | Flat | Super-linear in N and compute |
| Loss Limit | Nonzero (fixed entropy) | Zero possible with large N, D |
| Data Hunger | Moderate | Substantially higher for code |
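To make the expressiveness contrast concrete, the short sketch below (reusing the fitted code coefficients from Section 1) prints the effective exponent of D at several model sizes; a Chinchilla-style fit would assign one constant β to all of them. The listed model sizes are illustrative.

```python
import math

def farseer_d_exponent(N: float) -> float:
    """Effective exponent of D in the fitted Farseer law for code.

    Unlike Chinchilla's constant beta, this exponent is itself a smooth
    function of the model size N.
    """
    return math.exp(-0.0209 * N**0.1943 - 0.1826)

# Illustrative model sizes; a Chinchilla fit would report the same beta for each.
for N in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {N:.0e}: effective D-exponent = {farseer_d_exponent(N):.4f}")
```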
3. Empirical Validation and Predictive Power
Farseer law was validated on 117 code LLMs spanning a wide range of model sizes and training-token counts, with direct fits showing:
- Mean Relative Error: 0.82‰ for Farseer vs. 1.03‰ for Chinchilla.
- Loss Prediction: Farseer accurately predicts loss across a wide space, including at code-optimal and Chinchilla-optimal points.
- Scaling Trends: No loss saturation across explored scales, with increases in both N and D yielding robust improvements.
- Data Hunger: For code, the optimal D is up to several times higher than Chinchilla's prediction; at high compute budgets, even more data is needed.
- Super-linear Scaling: Larger models require disproportionately more data for loss minimization, contradicting the flat scaling assumption used for NL models.
Empirical analyses also demonstrate that, as model and data sizes grow, the entropy rate of code approaches zero with increasing context length, theoretically justifying Farseer law's prediction of a vanishing irreducible loss.
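For reference, the fit metric quoted above (mean relative error, reported in per-mille) can be computed as follows; the observed and predicted loss values in the example are hypothetical placeholders, not the paper's measurements.

```python
import numpy as np

def mean_relative_error_permille(observed: np.ndarray, predicted: np.ndarray) -> float:
    """Mean relative error between predicted and observed losses, in per-mille (‰)."""
    return float(np.mean(np.abs(predicted - observed) / observed) * 1000.0)

# Hypothetical example values, purely to illustrate the metric; the 0.82‰ and
# 1.03‰ figures in the text come from the paper's fit over 117 code LLMs.
observed  = np.array([2.10, 1.85, 1.62, 1.41])
predicted = np.array([2.101, 1.849, 1.621, 1.412])
print(f"{mean_relative_error_permille(observed, predicted):.2f} permille")
```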
4. Implications for LLM Training and Compute Allocation
Using Farseer law for code LLM training mandates a departure from NL-driven strategies:
- Compute-Optimal Allocation: Solve for the optimal (N, D) split under a fixed compute budget via Farseer's parameterization (see the sketch after this list).
- Data Acquisition: Aggressively source large, high-quality code corpora; rare/complex examples are especially valuable.
- Frontier Model Scaling: As N increases, D must be scaled super-linearly to maintain optimal performance; expect to scale data 5–10× higher (or more) than NL practice suggests.
- NL Data Mixing: While NL data helps small-scale or data-constrained training (acts as regularization), it degrades performance when code data and compute are sufficient: at large scales, pure code data is always optimal.
- Practical Planning: Farseer law supports precise extrapolation. For instance, ablation studies at small scales can guide planning for very large models by reliably predicting loss improvements and data requirements.
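As a planning aid, the sketch below grid-searches the compute-optimal (N, D) split under the fitted code formula. It assumes the common C ≈ 6·N·D FLOPs approximation and restricts N to a range near the fitted regime; the compute budget and search bounds are illustrative assumptions, not values from the source.

```python
import math
import numpy as np

def farseer_code_loss(N: float, D: float) -> float:
    """Fitted Farseer law for code LLMs (same coefficients as in Section 1)."""
    return (math.exp(-0.0047 * N**0.239 - 0.8188)
            + math.exp(62.8936 * N**-0.0614 - 14.0414)
            * D ** (-math.exp(-0.0209 * N**0.1943 - 0.1826)))

def compute_optimal_split(C: float, n_min: float = 1e8, n_max: float = 1e10,
                          num_points: int = 500):
    """Grid-search the (N, D) split minimizing predicted loss at fixed compute C.

    Assumes C ≈ 6*N*D training FLOPs (a standard approximation, not from the
    source) and keeps N within an assumed range near the fitted regime.
    """
    best_loss, best_N, best_D = float("inf"), None, None
    for N in np.logspace(math.log10(n_min), math.log10(n_max), num_points):
        D = C / (6.0 * N)  # tokens affordable at this model size
        loss = farseer_code_loss(N, D)
        if loss < best_loss:
            best_loss, best_N, best_D = loss, N, D
    return best_N, best_D, best_loss

# Illustrative budget of 1e21 FLOPs; note the large optimal D/N ratio for code.
N_opt, D_opt, loss = compute_optimal_split(C=1e21)
print(f"N* = {N_opt:.2e}, D* = {D_opt:.2e}, D*/N* = {D_opt / N_opt:.0f}, "
      f"predicted loss = {loss:.3f}")
```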
5. Application in Code vs. Natural Language Modeling
Code LLMs fundamentally differ from NL LLMs in their scaling regime:
- Code is “more data-hungry”: loss minima are achieved only at D/N ratios much greater than those typical for NL tasks.
- No Saturation Observed: Unlike NL, code models continue to benefit from increased parameter and data sizes at all explored scales.
- Uniqueness of Prefixes: Most code prefixes are unique, so next-token prediction becomes nearly deterministic as context and data grow, justifying Farseer law’s loss→0 behavior.
- Size Limitation Rationale: Code models are typically smaller in practice not due to inefficiency, but to the scarcity of sufficiently diverse code datasets.
6. Guidance for Code LLM Engineering
For practitioners designing or training code LLMs:
- Always use Farseer law (not Chinchilla or NL guidelines) for planning, budget allocation, and scaling.
- Target very high code data collection, especially rare patterns, since data quantity is often the ultimate bottleneck.
- Expect and plan for high D/N ratios; standard NL LLM ratios are inadmissible for code modeling.
- Maintain parallel scaling of N and D, following the Farseer-predicted scaling curve, to avoid suboptimal loss plateaus.
- Mix NL only under strict data scarcity, and revert to pure code as resources allow.
7. Summary Table: Farseer vs. Chinchilla for Code LLMs
| Property | Chinchilla | Farseer |
|---|---|---|
| Formula | Constant exponents | N-dependent exponents |
| Fit to Code LLM | Adequate | Superior |
| D/N Scaling | Nearly constant | Super-linear |
| Loss Limit | Nonzero | Asymptotically zero |
| Practicality | Insufficient | Accurate/Optimal |
8. Conclusion
Farseer law establishes a scalable, expressive, and empirically valid paradigm for planning, training, and extrapolating performance of code LLMs. It decisively supersedes natural language-derived scaling heuristics, showing that effective code modeling requires orders of magnitude more data and super-linear scaling regimes. By reconciling loss landscapes with code-specific structural properties and providing actionable guidance across compute budgets, Farseer law is the definitive tool for researchers and practitioners advancing the frontier of LLM-based software engineering (Luo et al., 9 Oct 2025).