Farseer Law: Scaling for Code LLMs
- Farseer Law is a refined scaling framework for code LLMs that models validation loss as a function of model size (N) and training tokens (D) with N-dependent exponents.
- It employs a combination of stretched exponential and power-law coupling to accurately extrapolate loss behaviors from small experiments to full-scale, compute-intensive settings.
- Farseer Law outperforms classical scaling laws such as Chinchilla, revealing super-linear data-to-parameter requirements and predicting near-zero loss in high-resource regimes.
Farseer Law is a refined, empirically driven scaling law for LLM training, developed to address limitations of prior formulations such as Chinchilla in accurately predicting loss as a function of model size and data size across scales and modalities. It provides substantially better predictive accuracy when extrapolating performance from tractable small-scale experiments to highly resource-intensive production LLM settings, with explicit coupling between parameter count (N) and training-token count (D). Distinctively, Farseer law's exponents are themselves parameterized as functions of N, yielding a flexible model of how loss decays with additional data and scale. It is especially valuable in code modeling, where the data-to-parameter ratio must grow super-linearly and previous scaling heuristics consistently underestimate data requirements.
1. Mathematical Formulation and Expressive Structure
Farseer law specifies the expected validation loss of an LLM in terms of N (number of parameters) and D (number of unique training tokens) via a stretched-exponential and power-law coupling:
$L(N, D) = \exp(a \cdot N^{b} + c) + \exp(d \cdot N^{e} + f) \cdot D^{-\exp(g \cdot N^{h} + i)}$
Here, a, b, c, d, e, f, g, h, and i are empirically determined real coefficients. For code LLMs, the fitted law is:
$\begin{split} L(N,D) &= \exp(-0.0047 \cdot N^{0.239} - 0.8188) \\ &+ \exp(62.8936 \cdot N^{-0.0614} - 14.0414) \cdot D^{-\exp(-0.0209 \cdot N^{0.1943} - 0.1826)} \end{split}$
Farseer law's critically expressive feature is that the exponent of D is itself an exponential function of N, allowing the rate at which loss decreases with additional data to vary with model size and capturing nontrivial interactions between N and D. This enables precise modeling of loss landscapes over compute budgets and dataset sizes that fixed-exponent laws cannot represent.
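As a concrete illustration, the following minimal Python sketch evaluates the fitted code formula above; the 7B-parameter / 1T-token query values are illustrative choices, not values from the source.

```python
import math

def farseer_code_loss(N: float, D: float) -> float:
    """Fitted Farseer law for code LLMs, using the coefficients quoted above.

    N: number of parameters; D: number of unique training tokens.
    """
    # Term that depends only on model size N.
    size_term = math.exp(-0.0047 * N**0.239 - 0.8188)
    # Data term: both its prefactor and the exponent of D vary with N.
    prefactor = math.exp(62.8936 * N**-0.0614 - 14.0414)
    d_exponent = math.exp(-0.0209 * N**0.1943 - 0.1826)
    return size_term + prefactor * D**(-d_exponent)

# Illustrative query: predicted validation loss for a 7B-parameter model
# trained on 1T unique code tokens (values chosen for demonstration only).
print(farseer_code_loss(7e9, 1e12))
```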
2. Comparison with Classical Scaling Laws
Chinchilla law, originally developed for natural language, is a simpler model:
$L(N, D) = E + A \cdot N^{-\alpha} + B \cdot D^{-\beta}$
with constant coefficients E, A, B and constant exponents $\alpha$ and $\beta$; the best Chinchilla fit for code retains this fixed-exponent form.
Farseer law improves upon Chinchilla in key respects:
- Expressiveness: Chinchilla's constant exponents fail to capture how larger models leverage data more efficiently; Farseer law's N-dependent exponents model this directly (see the sketch after the table below).
- Loss Limit: Chinchilla implies a fixed irreducible loss (an "entropy limit") as N, D → ∞. Farseer law allows the loss to approach zero asymptotically for sufficiently large N and D, aligning with the near-deterministic behavior of code (where long prefixes almost always determine the next tokens).
- Optimal Data-to-Parameter Ratio: Chinchilla predicts a roughly flat optimal ratio; Farseer reveals that for code tasks this ratio must grow rapidly with N and total compute C.
| Aspect | Chinchilla Law | Farseer Law (for code) |
|---|---|---|
| Formula | $L = E + A \cdot N^{-\alpha} + B \cdot D^{-\beta}$ | Stretched exponential plus power law with N-dependent exponents |
| Exponents | Constant | Vary smoothly with N |
| Fit Accuracy | Decent but suboptimal | Excellent, mean relative error < 1‰ |
| D/N Scaling | Flat | Super-linear in N and compute |
| Loss Limit | Nonzero (fixed entropy) | Zero possible with large N, D |
| Data Hunger | Moderate | Substantially higher for code |
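To make the expressiveness contrast concrete, the short sketch below (reusing the fitted code coefficients from Section 1) prints the effective exponent of D at several model sizes; a Chinchilla-style fit would assign one constant β to all of them. The listed model sizes are illustrative.

```python
import math

def farseer_d_exponent(N: float) -> float:
    """Effective exponent of D in the fitted Farseer law for code.

    Unlike Chinchilla's constant beta, this exponent is itself a smooth
    function of the model size N.
    """
    return math.exp(-0.0209 * N**0.1943 - 0.1826)

# Illustrative model sizes; a Chinchilla fit would report the same beta for each.
for N in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {N:.0e}: effective D-exponent = {farseer_d_exponent(N):.4f}")
```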
3. Empirical Validation and Predictive Power
Farseer law was validated on 117 code LLMs spanning a wide range of model sizes and training-token counts, with direct fits showing:
- Mean Relative Error: 0.82‰ for Farseer vs. 1.03‰ for Chinchilla.
- Loss Prediction: Farseer accurately predicts loss across a wide space, including at code-optimal and Chinchilla-optimal points.
- Scaling Trends: No loss saturation across explored scales, with increases in both N and D yielding robust improvements.
- Data Hunger: For code, the optimal D is up to several times higher than Chinchilla's prediction; at high compute budgets, even more data is needed.
- Super-linear Scaling: Larger models require disproportionately more data for loss minimization, contradicting the flat scaling assumption used for NL models.
Empirical analyses also demonstrate that, as model and data sizes grow, the entropy rate of code approaches zero with increasing context length, theoretically justifying Farseer law's prediction of a vanishing irreducible loss.
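For reference, the fit metric quoted above (mean relative error, reported in per-mille) can be computed as follows; the observed and predicted loss values in the example are hypothetical placeholders, not the paper's measurements.

```python
import numpy as np

def mean_relative_error_permille(observed: np.ndarray, predicted: np.ndarray) -> float:
    """Mean relative error between predicted and observed losses, in per-mille (‰)."""
    return float(np.mean(np.abs(predicted - observed) / observed) * 1000.0)

# Hypothetical example values, purely to illustrate the metric; the 0.82‰ and
# 1.03‰ figures in the text come from the paper's fit over 117 code LLMs.
observed  = np.array([2.10, 1.85, 1.62, 1.41])
predicted = np.array([2.101, 1.849, 1.621, 1.412])
print(f"{mean_relative_error_permille(observed, predicted):.2f} permille")
```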
4. Implications for LLM Training and Compute Allocation
Using Farseer law for code LLM training mandates a departure from NL-driven strategies:
- Compute-Optimal Allocation: Solve for the optimal (N, D) split under a fixed compute budget via Farseer's parameterization (see the sketch after this list).
- Data Acquisition: Aggressively source large, high-quality code corpora; rare/complex examples are especially valuable.
- Frontier Model Scaling: As N increases, D must be scaled super-linearly to maintain optimal performance; expect to scale data 5–10× higher (or more) than NL practice suggests.
- NL Data Mixing: While NL data helps small-scale or data-constrained training (acts as regularization), it degrades performance when code data and compute are sufficient: at large scales, pure code data is always optimal.
- Practical Planning: Farseer law supports precise extrapolation. For instance, ablation studies at small scales can guide planning for very large models by reliably predicting loss improvements and data requirements.
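As a planning aid, the sketch below grid-searches the compute-optimal (N, D) split under the fitted code formula. It assumes the common C ≈ 6·N·D FLOPs approximation and restricts N to a range near the fitted regime; the compute budget and search bounds are illustrative assumptions, not values from the source.

```python
import math
import numpy as np

def farseer_code_loss(N: float, D: float) -> float:
    """Fitted Farseer law for code LLMs (same coefficients as in Section 1)."""
    return (math.exp(-0.0047 * N**0.239 - 0.8188)
            + math.exp(62.8936 * N**-0.0614 - 14.0414)
            * D ** (-math.exp(-0.0209 * N**0.1943 - 0.1826)))

def compute_optimal_split(C: float, n_min: float = 1e8, n_max: float = 1e10,
                          num_points: int = 500):
    """Grid-search the (N, D) split minimizing predicted loss at fixed compute C.

    Assumes C ≈ 6*N*D training FLOPs (a standard approximation, not from the
    source) and keeps N within an assumed range near the fitted regime.
    """
    best_loss, best_N, best_D = float("inf"), None, None
    for N in np.logspace(math.log10(n_min), math.log10(n_max), num_points):
        D = C / (6.0 * N)  # tokens affordable at this model size
        loss = farseer_code_loss(N, D)
        if loss < best_loss:
            best_loss, best_N, best_D = loss, N, D
    return best_N, best_D, best_loss

# Illustrative budget of 1e21 FLOPs; note the large optimal D/N ratio for code.
N_opt, D_opt, loss = compute_optimal_split(C=1e21)
print(f"N* = {N_opt:.2e}, D* = {D_opt:.2e}, D*/N* = {D_opt / N_opt:.0f}, "
      f"predicted loss = {loss:.3f}")
```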
5. Application in Code vs. Natural Language Modeling
Code LLMs fundamentally differ from NL LLMs in their scaling regime:
- Code is “more data-hungry”: loss minima are achieved only at D/N ratios much greater than those typical for NL tasks.
- No Saturation Observed: Unlike NL, code models continue to benefit from increased parameter and data sizes at all explored scales.
- Uniqueness of Prefixes: Most code prefixes are unique, so next-token prediction becomes nearly deterministic as context and data grow, justifying Farseer law’s loss→0 behavior.
- Size Limitation Rationale: Code models are typically smaller in practice not due to inefficiency, but to the scarcity of sufficiently diverse code datasets.
6. Guidance for Code LLM Engineering
For practitioners designing or training code LLMs:
- Always use Farseer law (not Chinchilla or NL guidelines) for planning, budget allocation, and scaling.
- Target very high code data collection, especially rare patterns, since data quantity is often the ultimate bottleneck.
- Expect and plan for high D/N ratios; standard NL LLM ratios are inadmissible for code modeling.
- Maintain parallel scaling of N and D, following the Farseer-predicted scaling curve, to avoid suboptimal loss plateaus.
- Mix NL only under strict data scarcity, and revert to pure code as resources allow.
7. Summary Table: Farseer vs. Chinchilla for Code LLMs
| Property | Chinchilla | Farseer |
|---|---|---|
| Formula | Constant exponents | N-dependent exponents |
| Fit to Code LLM | Adequate | Superior |
| D/N Scaling | Nearly constant | Super-linear |
| Loss Limit | Nonzero | Asymptotically zero |
| Practicality | Insufficient | Accurate/Optimal |
8. Conclusion
Farseer law establishes a scalable, expressive, and empirically valid paradigm for planning, training, and extrapolating performance of code LLMs. It decisively supersedes natural language-derived scaling heuristics, showing that effective code modeling requires orders of magnitude more data and super-linear scaling regimes. By reconciling loss landscapes with code-specific structural properties and providing actionable guidance across compute budgets, Farseer law is the definitive tool for researchers and practitioners advancing the frontier of LLM-based software engineering (Luo et al., 9 Oct 2025).