Scaling-Law Guided Search
- Scaling-Law Guided Search is a methodology that leverages power-law relationships between model size, data volume, and performance to predict optimal configurations.
- It employs pilot experiments and log–log regression to extrapolate scaling curves, reducing computational cost by up to 100× compared to exhaustive grid searches.
- The approach applies across domains—sequential recommendation, LLM fine-tuning, and test-time inference—delivering measurable gains in efficiency and predictive accuracy.
Scaling-Law Guided (SLG) Search is a class of algorithmic methodologies that exploit empirical scaling laws—typically power-law relations between model size, dataset size, and performance metrics—to efficiently and systematically allocate resources in neural network training, model selection, and test-time inference. SLG Search has emerged independently across distinct domains, such as sequential recommendation models, resource-constrained LLM selection, and test-time reward optimization. These techniques obviate brute-force grid search by leveraging fitted scaling-law curves to predict optimal or near-optimal configurations under various compute, data, or evaluation budgets.
1. Foundational Principles and Motivation
The inception of SLG Search is rooted in the observation that many neural model classes (Transformers, LLMs, recommender models) exhibit smooth, empirically measurable scaling laws: the dependence of loss or reward on model size, dataset size, or sampling budget follows parametric power-law relationships. By quantifying these dependencies in small pilot regimes, practitioners can infer the marginal utility of scaling—and allocate resources beyond the pilot regime—to maximize target metrics subject to constraints.
This paradigm addresses two persistent challenges:
- Resource allocation: Given fixed training compute or inference budget, how should practitioners split between model size and data volume, or among multiple candidate models?
- Selection and extrapolation: How can one efficiently select models or states likely to provide optimal downstream results, without exhaustive trial and error?
Early methodologies relied on monolithic interpretation of power-law scaling, finding that train/test loss falls as a power law in model size $N$ ($L \propto N^{-\alpha}$) or data size $D$ ($L \propto D^{-\beta}$), with slow, predictable diminishing returns. The current generation of SLG Search builds upon these forms, introducing mechanisms for accurate extrapolation, phase-transition detection, and optimal resource allocation (Zhang et al., 2023, Lin et al., 2024, Li et al., 1 Feb 2026).
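As a minimal illustration of the extrapolation principle, the sketch below fits a single power law $L = A\,N^{-\alpha}$ to synthetic pilot losses by least squares in log–log space and extrapolates one order of magnitude beyond the pilot regime (all constants are assumed for illustration):

```python
import numpy as np

# Synthetic pilot losses following an assumed power law L(N) = A * N**-alpha
alpha_true, A_true = 0.3, 50.0
N_pilot = np.array([1e5, 3e5, 1e6, 3e6])
L_pilot = A_true * N_pilot ** -alpha_true

# Least-squares fit in log-log space: log L = log A - alpha * log N
slope, intercept = np.polyfit(np.log(N_pilot), np.log(L_pilot), deg=1)
alpha_hat = -slope

# Extrapolate one order of magnitude beyond the pilot regime
N_big = 3e7
L_pred = np.exp(intercept + slope * np.log(N_big))
L_true = A_true * N_big ** -alpha_true
```

On noiseless synthetic data the fitted exponent and the extrapolated loss match the generating law exactly; in practice the quality of such extrapolation is what the pilot-sweep validation step is meant to check.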
2. Scaling-Law Modeling in Sequential Recommendation
Zhang et al. (Zhang et al., 2023) provide a detailed and validated framework for SLG Search in large sequential recommender models. The core insight is that cross-entropy test loss in decoder-only ID-based Transformers can be represented as the sum of two power-law terms:

$$L(N, D) = \left(\frac{N_c}{N}\right)^{\alpha} + \left(\frac{D_c}{D}\right)^{\beta} + L_{\infty}$$

where $N$ is the non-embedding parameter count, $D$ is the total interaction (data) count, $\alpha$ and $\beta$ are empirical exponents, $N_c$ and $D_c$ are characteristic scales, and $L_{\infty}$ is the irreducible error floor. The fitting procedure entails:
- Training a small number (3–5) of pilot models at varying $N$ and $D$.
- Log-transforming $N$ and/or $D$ and applying least-squares regression in log–log space.
- Extracting the exponents $\alpha$ and $\beta$, which in this setting are notably larger than those typically reported for LLMs, implying faster returns to scaling in this context.
Empirical validation is provided by extrapolating fitted curves from “small-to-medium” models (up to 9M parameters) to previously untested scales (up to 829M parameters), with predicted and observed loss matching within 1–2%. This demonstrates that the scaling regime is robust across multiple orders of magnitude and can be exploited for resource allocation (Zhang et al., 2023).
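The pilot-fit-and-extrapolate procedure might be sketched as follows on synthetic data. All ground-truth constants and the pilot grid are assumed for illustration, and SciPy's `curve_fit` stands in for whatever regression machinery the authors used:

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_law(ND, alpha, beta, log_Nc, log_Dc, E):
    """Two-term power law: L = (Nc/N)^alpha + (Dc/D)^beta + E."""
    N, D = ND
    return (np.exp(log_Nc) / N) ** alpha + (np.exp(log_Dc) / D) ** beta + E

# Synthetic "pilot" grid with assumed ground-truth parameters
true = dict(alpha=0.35, beta=0.30, log_Nc=np.log(5e4), log_Dc=np.log(2e6), E=0.5)
N, D = np.meshgrid([1e5, 3e5, 1e6, 3e6], [1e6, 3e6, 1e7])
N, D = N.ravel(), D.ravel()
L = loss_law((N, D), **true)

# Fit the law to the pilot measurements
p0 = [0.3, 0.3, np.log(1e4), np.log(1e6), 0.4]
popt, _ = curve_fit(loss_law, (N, D), L, p0=p0, maxfev=20000)

# Extrapolate well beyond the pilot regime and compare to the generating law
L_pred = loss_law((8e8, 1e8), *popt)
L_true = loss_law((8e8, 1e8), **true)
```

Parameterizing the characteristic scales in log space keeps the optimization well-conditioned across the many orders of magnitude spanned by $N$ and $D$.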
3. SLG Search in Model Selection and Fine-Tuning
The proliferation of pre-trained LLMs presents the challenge of efficiently identifying which model to fine-tune, especially when brute-force tuning is prohibitive. "Selecting LLM to Fine-tune via Rectified Scaling Law" (Lin et al., 2024) formalizes this as a prediction task: using limited fine-tuning on small data subsets, estimate a model's potential full-data performance, then select the model with minimum predicted loss.
A central observation is that, unlike pre-training, the fine-tuning loss curve in log–log space exhibits a two-phase structure: a "pre-power" regime (initial, with a shallow, still-changing slope) and a "power phase" (linear in log–log space, i.e., power-law). The authors show that standard single-phase power laws fail to capture this regime transition. The Rectified Scaling Law is introduced:

$$L(D) = \frac{B}{(D_l + D)^{\beta}} + E$$

where $D$ is the subset size, $D_l$ captures the equivalent pre-learned downstream data from pre-training, $B$ and $E$ are scalars, and $\beta$ is the fine-tuning exponent. This law enables accurate prediction of extrapolated fine-tuning loss, with low mean squared error in the log–log fit.
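The two-phase shape can be checked numerically: under the rectified form, the local log–log slope of the loss curve is near zero for $D \ll D_l$ (pre-power regime) and approaches $-\beta$ for $D \gg D_l$ (power phase). A small sketch with assumed constants:

```python
import numpy as np

B, D_l, beta, E = 10.0, 1e4, 0.3, 0.0  # assumed constants for illustration

def rectified_loss(D):
    """Rectified scaling law L(D) = B / (D_l + D)^beta + E."""
    return B / (D_l + D) ** beta + E

def local_slope(D, eps=1e-4):
    """Numerical d(log L)/d(log D) at subset size D."""
    logD = np.log(D)
    up = np.log(rectified_loss(np.exp(logD + eps)))
    dn = np.log(rectified_loss(np.exp(logD - eps)))
    return (up - dn) / (2 * eps)

slope_pre = local_slope(10.0)   # deep in the pre-power regime: ~0
slope_pow = local_slope(1e8)    # deep in the power phase: ~ -beta
```

Analytically the slope is $-\beta D / (D_l + D)$, which interpolates smoothly between the two regimes; the knee around $D \approx D_l$ is exactly what single-phase power laws fail to capture.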
The SLG Search ("Accept then Stop", AtS) algorithm operates as follows:
- Fine-tune each candidate model on progressively smaller subsets $D_1 > D_2 > \cdots$, recording the loss $L(D_i)$ at each.
- Fit a linear model in log–log space once enough points are gathered and the curve enters the power-law regime.
- Stop iterating when the latest point deviates by more than a set threshold from the fitted trend (exit into the pre-power regime).
- Use the fitted curve to predict full-data performance at $D_{\text{full}}$.
- Select the model with the lowest predicted loss.
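A toy sketch of this loop on a simulated loss curve (the rectified-law constants, subset schedule, and deviation threshold are all assumed; real usage would record measured fine-tuning losses instead of calling `loss`):

```python
import numpy as np

B, D_l, beta = 10.0, 1e3, 0.3           # assumed ground-truth constants
loss = lambda D: B / (D_l + D) ** beta  # stands in for measured fine-tuning loss

subsets = [1e5, 5e4, 2e4, 1e4, 5e3, 2e3, 1e3]  # progressively smaller pilots
tau = 0.05                                      # log-space deviation threshold

# Seed the fit with the first few (largest, power-phase) points
logD = [np.log(D) for D in subsets[:3]]
logL = [np.log(loss(D)) for D in subsets[:3]]
fit = np.polyfit(logD, logL, 1)

for D in subsets[3:]:
    pred = np.polyval(fit, np.log(D))
    if abs(np.log(loss(D)) - pred) > tau:   # left the power phase: stop
        break
    logD.append(np.log(D))                  # accept the point and refit
    logL.append(np.log(loss(D)))
    fit = np.polyfit(logD, logL, 1)

# Extrapolate the accepted power-phase fit to the full dataset size
D_full = 1e7
L_pred = np.exp(np.polyval(fit, np.log(D_full)))
L_true = loss(D_full)
```

Running this per candidate model and ranking by `L_pred` gives the selection step; the early stop is what keeps the pilot compute a small fraction of full fine-tuning.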
SLG Search achieves a substantial reduction in compute over naïve methods (the pilot budget amounts to only a small fraction of full-tuning compute per model), and maintains high selection quality (relative accuracy near 95%, with high Pearson correlation against the true orderings) (Lin et al., 2024).
4. Scaling-Law Guided Search for Test-Time Inference
For stochastic LLMs, test-time strategies such as "best-of-$n$" (BoN) sample multiple completions and select the highest-rewarded one. "Predicting and improving test-time scaling laws via reward tail-guided search" (Li et al., 1 Feb 2026) extends SLG Search to this setting, departing from uniform resource allocation in favor of allocation based on tail-extrapolated estimates.
Given a prompt $x$ and model $\pi$, generating an intermediate state $s$, the reward of the terminal response (after rolling out from $s$) follows an empirical distribution $R_s$. SLG Search leverages the following steps:
- Tail-extrapolation: Model the upper tail of $R_s$ as Gaussian; collect pilot completions, extract the tail (the top fraction of rewards), compute its sample mean and variance, and invert truncated-normal moment formulas to estimate the tail parameters $(\mu_s, \sigma_s)$.
- Predict scaling law: For large $n$,

$$\mathbb{E}\Big[\max_{1 \le i \le n} r_i\Big] \approx \mu_s + \sigma_s \sqrt{2 \ln n}$$

predicts the maximum expected reward attainable from $n$ samples.
- Two-stage resource allocation:
  - Exploration: For candidate intermediate states $s_1, \dots, s_k$, sample pilot rollouts from each and compute the predicted best reward $\hat{V}(s_i) = \hat{\mu}_{s_i} + \hat{\sigma}_{s_i}\sqrt{2 \ln n}$.
  - Exploitation: Allocate all remaining budget to $s^{*} = \arg\max_i \hat{V}(s_i)$, sample rollouts from it, and return the best reward seen.
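The exploration/exploitation loop can be illustrated with a toy simulation. The two states and their reward parameters are hypothetical, and for simplicity the truncated-normal tail inversion is replaced by plain mean/std estimates over the pilot rollouts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical intermediate states: B has the higher mean reward,
# but A's heavier spread gives it the better best-of-n tail.
states = {"A": (0.5, 0.30), "B": (0.6, 0.05)}  # assumed (mu, sigma) per state
n_pilot, n_budget = 64, 256

# Exploration: pilot rollouts per state, scored by mu + sigma * sqrt(2 ln n)
scores = {}
for name, (mu, sigma) in states.items():
    pilot = rng.normal(mu, sigma, n_pilot)
    scores[name] = pilot.mean() + pilot.std() * np.sqrt(2 * np.log(n_budget))

# Exploitation: spend the whole remaining budget on the best-scoring state
best = max(scores, key=scores.get)
mu, sigma = states[best]
best_reward = rng.normal(mu, sigma, n_budget).max()
```

Note that the tail-guided score correctly prefers the high-variance state even though its mean reward is lower: under best-of-$n$, what matters is the upper tail, not the average.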
This approach, under mild conditions, guarantees that SLG Search not only achieves vanishing regret versus the perfect-information oracle as the total sampling budget grows, but also delivers polynomial compute amplification over flat BoN, i.e., it matches the reward BoN achieves with $n$ samples while using polynomially fewer. Empirical validation on math reasoning (AMC, AIME) with contemporary LLMs shows consistent and significant gains (e.g., a 29% total-reward gain on AIME2024+1B over BoN at a matched budget) (Li et al., 1 Feb 2026).
5. Comparative Methodology and Implementation
The essential workflow in SLG Search—across domains—follows this pattern:
- Pilot Fitting:
- Train or evaluate on a set of small models, subsets, or states.
- Record relevant target metrics (loss, reward) at each scale.
- Scaling Law Inference:
- Fit parametric forms (single- or two-phase power laws, rectified, or tail models) to the empirical data.
- Validate goodness-of-fit (high $R^2$ is typical for loss scaling; RMS-error checks for reward).
- Predictive Extrapolation:
- Using the fitted law, infer the optimal resource allocation (model size $N$, data size $D$, or state selection) under budgetary constraints.
- For sequential recommendation under a fixed compute budget $C$, compute:

$$(N^{*}, D^{*}) = \arg\min_{\mathrm{Compute}(N, D) \le C} L(N, D)$$

- Resource Application and Validation:
- Allocate resources per the prediction.
- Train or evaluate, validate achieved metric against either the law (for loss, within 2%) or actual ranking (for reward/model selection).
- Practical Enhancements:
- For training large models, implement stability advances (e.g., layer-wise adaptive dropout, Adam→SGD switching).
- For low-data or cold-start regimes, interpret $D$ as effective unique interactions × epochs and monitor diminishing returns from data repetition.
- For multiple model candidates, parallelize the SLG Search process (Zhang et al., 2023, Lin et al., 2024, Li et al., 1 Feb 2026).
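The resource-solve step reduces to a one-dimensional minimization once fitted constants are in hand. In the sketch below, all constants are assumed, and the standard Transformer estimate $C \approx 6ND$ (a common approximation, not specific to the cited papers) is used to couple $N$ and $D$ along the budget constraint:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Assumed fitted scaling-law constants
alpha, beta, Nc, Dc, L_inf = 0.35, 0.30, 5e4, 2e6, 0.5
C = 1e18  # total training compute budget (FLOPs), assumed

def loss_at(logN):
    """Loss along the budget constraint D = C / (6N)."""
    N = np.exp(logN)
    D = C / (6 * N)
    return (Nc / N) ** alpha + (Dc / D) ** beta + L_inf

res = minimize_scalar(loss_at, bounds=(np.log(1e5), np.log(1e10)), method="bounded")
N_opt = np.exp(res.x)
D_opt = C / (6 * N_opt)
```

At the optimum the marginal returns of the two terms balance, $\alpha (N_c/N)^{\alpha} = \beta (D_c/D)^{\beta}$, which serves as a quick sanity check on the solver output.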
6. Empirical Outcomes and Limitations
Research has established several robust empirical findings:
- Prediction reliability: SLG-fitted curves predict loss/reward at previously untested large scales to within 1–2% (recommendation), or recover rank order with roughly 95% accuracy (LLM selection).
- Resource efficiency: Compute savings of one to two orders of magnitude are typical versus exhaustive methods.
- Task-specific gains: In sequential recommendation, larger models disproportionally improve outcomes for cold-start, long-tail, adversarial, and cross-domain settings.
- Regret guarantees: In test-time inference, SLG Search achieves vanishing regret relative to perfect-information oracles, and outpaces "flat" best-of-$n$ selection by polynomial factors in the sampling budget.
However, several caveats remain:
- Regime validity: Extrapolation is reliable only within the range validated by pilot sweeps; for $N$ or $D$ much larger than the fitted range, embedding collapse or exponent drift may occur.
- Phase identification: Accurate two-phase modeling is critical in fine-tuning selection; single-phase laws underperform in regimes exhibiting pre-power knees.
- Data sparsity edge effects: For extremely small datasets, additional heuristics (e.g., larger pilot sets or more sensitive deviation thresholds) may be necessary to avoid underfitting or premature stopping (Zhang et al., 2023, Lin et al., 2024).
7. Practical Guidelines and Cross-Domain Implications
Implementation of SLG Search is straightforward under the provided recipes:
| Application domain | Pilot phase | Scaling law fit | Resource solve | Empirical gain |
|---|---|---|---|---|
| Sequential reco. (Zhang et al., 2023) | 3–5 models, varying $N$/$D$ | Power-law sum | $(N^{*}, D^{*})$ under budget $C$ | 1–2% fit error; no grid search |
| LLM selection (Lin et al., 2024) | Subset fine-tunes per model | Rectified two-phase | AtS, select by predicted loss | Large compute cut |
| Test-time LLM inference (Li et al., 1 Feb 2026) | Pilot rollouts per state | Gaussian-tail extrapolation | Stagewise allocation | Up to 29% reward lift |
In all cases, SLG Search provides a principled, statistically grounded mechanism for converting initial pilot regime measurements into actionable resource allocation at scale, drastically reducing experimental cost while maintaining predictivity and control over scaling behavior.
A plausible implication is that, as scaling laws are further generalized and refined, SLG Search and its variants will become a foundational ingredient in neural model development pipelines across domains.