
Scale, Parsimony & Precision Hyper-Parameters

Updated 23 November 2025
  • Scale, parsimony, and precision hyper-parameters are critical model attributes that define dimensionality, redundancy, and accuracy in machine learning systems.
  • They guide tuning strategies by employing scaling laws, loss landscape geometry, and Bayesian optimization to balance model complexity and resource constraints.
  • Practical applications span from LLM training to scientific ML, emphasizing trade-offs between performance, compute efficiency, and numerical precision.

Scale, parsimony, and precision hyper-parameters are a triad of model attributes that jointly determine the capacity, efficiency, and accuracy of machine learning systems. These axes—quantifying data or model dimensionality (scale), redundancy or economical representation (parsimony), and target approximation fidelity (precision)—govern theoretical bounds on performance, practical deployment, and optimization complexity. Modern analysis incorporates precise scaling laws, loss landscape geometry, optimizer schedules, and resource-aware trade-offs to select hyper-parameter regimes appropriate to task, modality, and computational constraints.

1. Formal Definitions and Notation

The interplay between scale, parsimony, and precision is formalized through distinct hyper-parameters capturing dimensional, architectural, and accuracy aspects (Michaud et al., 2022).

Scale hyper-parameters:

  • $d$: input (ambient) dimension
  • $N$: number of training examples
  • $W$: per-layer width in an MLP or similar architecture
  • $L$: network depth (number of layers)
  • $P$: total number of trainable parameters (typically $P \sim O(W^2 L)$ for dense MLPs)
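
As a quick check of the $P \sim O(W^2 L)$ relation above, a minimal sketch in plain Python; counting $L$ as the number of hidden layers is an illustrative assumption, not a convention from the cited work:

```python
def mlp_param_count(d: int, W: int, L: int, out_dim: int = 1) -> int:
    """Count trainable parameters of a dense MLP with L hidden layers of width W.

    Input layer: d*W weights + W biases; each of the (L-1) hidden-to-hidden
    layers: W*W + W; output layer: W*out_dim + out_dim. For large W and L the
    total is dominated by the (L-1)*W^2 term, i.e. P ~ O(W^2 L).
    """
    p = d * W + W                   # input -> first hidden layer
    p += (L - 1) * (W * W + W)      # hidden -> hidden layers
    p += W * out_dim + out_dim      # last hidden -> output layer
    return p

# Example: d=4, W=256, L=3 gives P on the order of W^2 * L.
print(mlp_param_count(d=4, W=256, L=3))
```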

Parsimony hyper-parameters:

  • $P_{\min}(\epsilon)$: minimal parameter count required to achieve RMS loss $\le \epsilon$
  • Modularity structures (e.g., block-diagonal weight matrices) that reduce $P_{\min}$

Precision hyper-parameters:

  • Target RMS loss: $\epsilon = \ell_{\mathrm{rms}}(\theta) = \sqrt{\sum_i [f_\theta(x_i) - y_i]^2 / \sum_i y_i^2}$
  • $\epsilon_0$: machine precision floor (for 64-bit floats, $\epsilon_0 \approx 2^{-52} \approx 10^{-16}$)
  • MSE metrics: $\ell_{\mathrm{mse}} \equiv \ell_{\mathrm{rms}}^2$, $L_{\mathrm{mse}}(\theta) \equiv (1/|D|)\sum_i (f_\theta(x_i)-y_i)^2$

Task-dependent criteria specify hyper-parameter targets, e.g., achieving $\epsilon \to \epsilon_0$ in scientific ML versus Pareto-optimal compute/accuracy in LLM pre-training (Bergsma et al., 19 May 2025).
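
A minimal sketch in NumPy of the precision metrics defined above; the function name and return layout are illustrative, not from the cited work:

```python
import numpy as np

def precision_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    """Compute the precision metrics of Section 1.

    l_rms : sqrt(sum_i (f(x_i)-y_i)^2 / sum_i y_i^2)   (normalized RMS loss)
    l_mse : l_rms^2
    L_mse : (1/|D|) * sum_i (f(x_i)-y_i)^2             (plain mean squared error)
    """
    residual_sq = np.sum((pred - target) ** 2)
    l_rms = float(np.sqrt(residual_sq / np.sum(target ** 2)))
    return {"l_rms": l_rms, "l_mse": l_rms ** 2, "L_mse": float(residual_sq / pred.size)}

eps0 = 2.0 ** -52  # 64-bit machine-precision floor referenced above (~1e-16)
```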

2. Scaling Laws and Their Regimes

Scaling laws provide predictive, task-agnostic relations between hyper-parameters and metric outcomes.

  • Piecewise-polynomial/spline interpolation in $d$ dimensions, order $n$: error scales as $\epsilon \propto P^{-(n+1)/d}$; for linear splines ($n=1$), $\epsilon \propto P^{-2/d}$
  • ReLU networks (worst case): same as $n=1$ splines; empirically, structured inputs enable scaling with effective arity $d^* \ll d$, so $\epsilon \propto P^{-2/d^*}$
  • Training loss: $L(N,D,P) = A\, N_{\text{eff}}^{-\alpha} + B\, D^{-\beta} + E$ with $\alpha \approx \beta \approx 0.5$ (here $N$ denotes model parameters and $D$ training tokens)
  • Effective parameter count incorporates the loss penalty of reduced precision: $N_{\text{eff}}(N,P) = N\,(1-\exp(-P/\bar\gamma))^3$ ($P$: bits per parameter; $\bar\gamma$: fitted constant)
  • AdamW timescale: $t^* = B/(\eta\lambda D) = C_t\,(D/N)^{\alpha_t}$
  • Batch-size scaling: $B_{\text{opt}}(D) \propto D^{0.38}$, $B_{\text{crit}}(D) \propto D^{0.51}$
  • Marginal variance: $\sigma(u_i) \approx \sigma_{\mathrm{ref}}/\sqrt{\tau}$; the precision parameter $\tau$ is set from a target standard deviation

Scaling law selection and analysis are context- and regime-specific.
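
A minimal sketch of the precision-aware loss form above; the constants $A$, $B$, $E$, $\alpha$, $\beta$, $\bar\gamma$ below are placeholders for illustration, not fitted values from the cited papers:

```python
import math

def effective_params(N: float, P_bits: float, gamma_bar: float) -> float:
    """N_eff(N, P) = N * (1 - exp(-P / gamma_bar))^3, with P in bits per parameter."""
    return N * (1.0 - math.exp(-P_bits / gamma_bar)) ** 3

def predicted_loss(N: float, D: float, P_bits: float,
                   A: float, B: float, E: float,
                   alpha: float = 0.5, beta: float = 0.5,
                   gamma_bar: float = 2.0) -> float:
    """L(N, D, P) = A * N_eff^-alpha + B * D^-beta + E (illustrative constants only)."""
    N_eff = effective_params(N, P_bits, gamma_bar)
    return A * N_eff ** (-alpha) + B * D ** (-beta) + E

# Example: compare a 7B-parameter model trained at 16-bit vs 4-bit precision.
for bits in (16, 4):
    print(bits, predicted_loss(N=7e9, D=1.4e11, P_bits=bits, A=400.0, B=400.0, E=1.7))
```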

3. Loss Landscape Geometry and Optimization Implications

Loss minima at high-precision approximation generate highly degenerate Hessians; only a small subset of parameter directions exhibit large curvature (Michaud et al., 2022).

  • Hessian spectrum: a few large eigenvalues ("steep walls") and many near-zero eigenvalues ("flat canyon floor")
  • The gradient aligns with the high-curvature subspace, so training stagnates along the flat directions
  • Boosting procedures and subspace-projected line-search methods can overcome optimizer-induced plateaus

Optimization for scale and precision demands secondary tricks:

  • Switch from Adam to BFGS at low MSE
  • Explicit gradient projection onto low-curvature subspaces
  • Residual-fitting and block-diagonal fusion for empirical precision gains ($\epsilon \lesssim \epsilon_0$ achievable in low-dimensional cases)
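
A toy sketch of the subspace-projection idea above on a small quadratic problem; the eigenvalue cutoff and helper name are illustrative assumptions:

```python
import numpy as np

def project_gradient_to_flat_subspace(hessian: np.ndarray, grad: np.ndarray,
                                      curvature_cut: float = 1e-3) -> np.ndarray:
    """Project the gradient onto the low-curvature ("canyon floor") subspace.

    Directions whose Hessian eigenvalue exceeds `curvature_cut` (the steep walls)
    are removed, leaving only the component along nearly flat directions, where
    plain gradient descent tends to stagnate.
    """
    eigvals, eigvecs = np.linalg.eigh(hessian)          # symmetric eigendecomposition
    flat = eigvecs[:, np.abs(eigvals) < curvature_cut]  # low-curvature eigenvectors
    return flat @ (flat.T @ grad)                       # orthogonal projection

# Toy degenerate quadratic: one steep direction, two nearly flat ones.
H = np.diag([100.0, 1e-6, 1e-8])
g = np.array([1.0, 1.0, 1.0])
print(project_gradient_to_flat_subspace(H, g))  # component along the flat directions
```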

4. Prescriptive Hyper-Parameter Selection and Transfer

Scaling rules for robust hyper-parameter transfer are derived from steady-state properties and dynamic invariants (Fan et al., 17 Oct 2025, Li et al., 29 Sep 2025).

AdamW width-robust scaling:

  • "Matrix-like" parameters: ηd1\eta \propto d^{-1}, λd\lambda \propto \sqrt{d}
  • "Vector-like" parameters: η=Θ(1)\eta=\Theta(1), λ=0\lambda=0
  • Zero-shot transfer: scale base learning rate and weight decay from proxy width to target width as ηtarget=ηbase/m\eta_\text{target} = \eta_\text{base}/m, λtarget=λbasem\lambda_\text{target} = \lambda_\text{base}\sqrt{m} (m=dtarget/dproxym = d_\text{target}/d_\text{proxy})
  • Diagnostic: match top singular values and sublayer gains
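
A minimal sketch of the zero-shot width-transfer rule for matrix-like AdamW parameters; the example widths and base values are illustrative:

```python
def transfer_adamw_hparams(eta_base: float, lam_base: float,
                           d_proxy: int, d_target: int) -> tuple[float, float]:
    """Zero-shot transfer of (learning rate, weight decay) for matrix-like
    AdamW parameters from a proxy width to a target width.

    With m = d_target / d_proxy:  eta_target = eta_base / m,
                                  lambda_target = lambda_base * sqrt(m).
    Vector-like parameters keep eta = Theta(1) and lambda = 0.
    """
    m = d_target / d_proxy
    return eta_base / m, lam_base * m ** 0.5

# Example: tuned at proxy width 512, deployed at target width 4096 (m = 8).
eta, lam = transfer_adamw_hparams(eta_base=3e-3, lam_base=0.1,
                                  d_proxy=512, d_target=4096)
print(eta, lam)  # 3.75e-4, ~0.283
```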

Trajectory invariance principle:

  • Training-trajectory curves collapse along the invariant direction $\gamma = \eta\lambda$; tune only one of $(\eta, \lambda)$ while fixing the other
  • Scaling law: optimal $\gamma^*(D) \propto D^{\gamma_2}$; use batch-size warmup to preserve invariance at large $B$
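
A minimal sketch of using the invariant $\gamma = \eta\lambda$ to set weight decay once the learning rate is fixed; the constant and exponent $\gamma_2$ below are placeholders, not fitted values from the cited work:

```python
def weight_decay_from_invariant(eta: float, D_tokens: float,
                                c_gamma: float = 1.0e-2, gamma2: float = -0.1) -> float:
    """Given a fixed learning rate, choose weight decay so the product
    gamma = eta * lambda sits on an assumed optimum gamma*(D) = c * D^gamma2.
    """
    gamma_star = c_gamma * D_tokens ** gamma2
    return gamma_star / eta

# Example: fix eta, derive lambda for a 300B-token run (illustrative numbers).
print(weight_decay_from_invariant(eta=3e-4, D_tokens=3e11))
```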

5. Model Parsimony versus Expressivity: Depth, Width, and Precision

Recent studies interpolate between the neural-tangent (NTK, "lazy") and maximal-update (mean-field, "feature learning") regimes via a hyper-parameter $p \in [0,1]$ (Yaida, 2022).

  • NTK scaling ($p=0$): weak representation learning, stable kernels, $O(1)$ learning rates
  • Mean-field scaling ($p=1$): strong representation learning, aggressive kernel evolution, learning rate grows with width
  • Emergent coupling scale: $\gamma = L/n^{1-p}$; stability requires $L \sim n^{1-p}$
  • Adjusting $p$ enables parsimonious ($p \to 0$) or expressive ($p \to 1$) models, with a depth-precision balance

This continuum enables explicit control of numerical stability (no vanishing/exploding gradients), memory constraints, and trainability across regimes.
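
A minimal sketch of the depth-to-width coupling scale $\gamma = L/n^{1-p}$ across the interpolation parameter $p$; the depth and width values are illustrative:

```python
def coupling_scale(depth: int, width: int, p: float) -> float:
    """Emergent coupling gamma = L / n^(1-p) for the interpolating
    hyper-parameter p in [0, 1] (p=0: NTK/lazy, p=1: mean-field).
    Stability corresponds to keeping gamma = O(1), i.e. L ~ n^(1-p).
    """
    return depth / width ** (1.0 - p)

# Same architecture, different regimes: gamma grows as p moves toward mean-field.
for p in (0.0, 0.5, 1.0):
    print(p, coupling_scale(depth=32, width=1024, p=p))
```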

6. Real-World Trade-offs: Memory, Compute, and Task Sensitivities

Empirical evaluations demonstrate that required parameter count, achievable precision, and resource usage must be dynamically balanced (Badshah et al., 6 May 2024, Kumar et al., 7 Nov 2024, Li et al., 15 May 2025):

  • LLM performance at fixed memory: deploy the largest quantized model that fits the budget; e.g., $N=70$B at 4 bits ($M=35$ GB) outperforms $N=7$B at 32 bits at similar memory
  • Quantization threshold: 4-bit weights remain viable for reasoning/NLU tasks above $N \gtrsim 13$B; maintain $b \ge 8$ bits for high-fidelity/factuality tasks
  • For time-series forecasting, "k-level" hyper-parameters (e.g., $k \sim$ forecast horizon) allow parsimonious models to outperform M-level ones (Li et al., 15 May 2025), with adaptive component weighting and parameter-aware evaluation metrics
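
A minimal sketch of the "largest quantized model that fits" heuristic above; the memory model counts weights only and ignores activations, KV cache, and runtime overheads:

```python
def model_memory_gb(n_params_billion: float, bits: int) -> float:
    """Approximate weight memory (GB) of an N-billion-parameter model at b bits/weight."""
    return n_params_billion * 1e9 * bits / 8 / 1e9

def pick_model(candidates: list[tuple[float, int]], budget_gb: float) -> tuple[float, int]:
    """Return the largest-parameter (N_billion, bits) pair that fits the memory budget."""
    feasible = [c for c in candidates if model_memory_gb(*c) <= budget_gb]
    return max(feasible, key=lambda c: c[0])

# 70B at 4 bits (~35 GB) fits a 40 GB budget and is preferred over 7B at 32 bits (~28 GB).
print(pick_model([(7, 32), (13, 8), (70, 4)], budget_gb=40.0))
```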

Trade-off strategies formalize objectives:

  • Minimize $\epsilon(W,L,c) + \lambda\, P(W,L)$ subject to $\epsilon(W,L,c) \le \epsilon_{\text{target}}$, where $\lambda$ tunes the scale-parsimony balance
  • Under resource constraints, optimize precision for compute efficiency (e.g., train at $\sim$7-8 bits, quantize to 4-6 bits for inference)
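
A minimal sketch of the penalized selection objective above; `eps_of` is a stand-in for a measured or modeled error function, and the toy error model assumes an effective arity $d^* = 2$ so that $\epsilon \propto P^{-1}$:

```python
def select_architecture(configs, eps_of, eps_target: float, lam: float):
    """Pick (W, L) minimizing eps(W, L) + lam * P(W, L) subject to
    eps(W, L) <= eps_target. `eps_of(W, L)` is a user-supplied error model."""
    def params(W, L):               # P ~ W^2 * L for a dense MLP (cf. Section 1)
        return W * W * L

    feasible = [(W, L) for (W, L) in configs if eps_of(W, L) <= eps_target]
    return min(feasible, key=lambda c: eps_of(*c) + lam * params(*c))

# Toy error model: error decays as P^(-2/d*) with an assumed effective arity d* = 2.
toy_eps = lambda W, L: (W * W * L) ** -1.0
print(select_architecture([(32, 2), (64, 2), (128, 4)], toy_eps,
                          eps_target=1e-3, lam=1e-9))
```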

7. Bayesian Optimization and Statistical Priors for Scale/Precision Selection

Efficient hyper-parameter selection for scale/precision in noisy or stochastic settings employs Bayesian surrogate modeling and closed-form optimization (Yadav et al., 7 Oct 2025, Spyropoulou et al., 2021):

  • Statistical surrogate: a log-linear GLM models the empirical scaling of summary statistics as a function of the precision parameter
  • Closed-form optimal setting: $\theta^* = \left(s_0 / (b\, e^{3\varepsilon^2/2})\right)^{1/a}$ for a power-law surrogate
  • Data efficiency: Bayesian GLM-based optimization requires roughly 40× fewer samples than brute-force Monte Carlo
  • For Gaussian Markov random fields, prior calibration on the precision $\tau$ ensures interpretable marginal standard deviations, with PC and Gaussian priors adapted to an empirical reference variance

This approach enables principled, interpretable, and resource-constrained tuning in both deterministic and stochastic modeling domains.
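
A minimal sketch of the closed-form power-law-surrogate optimum above; the coefficients $s_0$, $a$, $b$ and tolerance value are illustrative placeholders, not values from the cited work:

```python
import math

def optimal_precision_setting(s0: float, a: float, b: float, eps: float) -> float:
    """theta* = (s0 / (b * exp(3 * eps^2 / 2)))^(1/a), the closed-form optimum
    for a fitted power-law surrogate with tolerance eps."""
    return (s0 / (b * math.exp(1.5 * eps ** 2))) ** (1.0 / a)

# Example with illustrative surrogate-fit coefficients.
print(optimal_precision_setting(s0=2.0, a=1.5, b=0.1, eps=0.05))
```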


The scale/parsimony/precision triad allows for targeted, theoretically supported, and empirically validated choices of hyper-parameters, tailored to architecture, task, resource, and optimization regime. Scaling laws, optimization geometry, statistical priors, and diagnostic procedures jointly comprise a rigorous toolkit for research and practical model engineering across scientific and industrial domains.
