
Scale, Parsimony & Precision Hyper-Parameters

Updated 23 November 2025
  • Scale, parsimony, and precision hyper-parameters are critical model attributes that define dimensionality, redundancy, and accuracy in machine learning systems.
  • They guide tuning strategies by employing scaling laws, loss landscape geometry, and Bayesian optimization to balance model complexity and resource constraints.
  • Practical applications span from LLM training to scientific ML, emphasizing trade-offs between performance, compute efficiency, and numerical precision.

Scale, parsimony, and precision hyper-parameters are a triad of model attributes that jointly determine the capacity, efficiency, and accuracy of machine learning systems. These axes—quantifying data or model dimensionality (scale), redundancy or economical representation (parsimony), and target approximation fidelity (precision)—govern theoretical bounds on performance, practical deployment, and optimization complexity. Modern analysis incorporates precise scaling laws, loss landscape geometry, optimizer schedules, and resource-aware trade-offs to select hyper-parameter regimes appropriate to task, modality, and computational constraints.

1. Formal Definitions and Notation

The interplay between scale, parsimony, and precision is formalized through distinct hyper-parameters capturing dimensional, architectural, and accuracy aspects (Michaud et al., 2022).

Scale hyper-parameters:

  • $d$: input (ambient) dimension
  • $N$: number of training examples
  • $W$: per-layer width in an MLP or similar architecture
  • $L$: network depth (number of layers)
  • $P$: total number of trainable parameters (typically $P \sim O(W^2 L)$ for dense MLPs)
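
As a quick check of the $P \sim O(W^2 L)$ relation above, a minimal sketch in plain Python; counting $L$ as the number of hidden layers is an illustrative assumption, not a convention from the cited work:

```python
def mlp_param_count(d: int, W: int, L: int, out_dim: int = 1) -> int:
    """Count trainable parameters of a dense MLP with L hidden layers of width W.

    Input layer: d*W weights + W biases; each of the (L-1) hidden-to-hidden
    layers: W*W + W; output layer: W*out_dim + out_dim. For large W and L the
    total is dominated by the (L-1)*W^2 term, i.e. P ~ O(W^2 L).
    """
    p = d * W + W                   # input -> first hidden layer
    p += (L - 1) * (W * W + W)      # hidden -> hidden layers
    p += W * out_dim + out_dim      # last hidden -> output layer
    return p

# Example: d=4, W=256, L=3 gives P on the order of W^2 * L.
print(mlp_param_count(d=4, W=256, L=3))
```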

Parsimony hyper-parameters:

  • $P_{\min}(\epsilon)$: minimal parameter count required to achieve RMS loss $\le \epsilon$
  • Modularity structures (e.g., block-diagonal weight matrices) that reduce $P_{\min}$

Precision hyper-parameters:

  • Target RMS loss: $\epsilon = \ell_{\mathrm{rms}}(\theta) = \sqrt{\sum_i [f_\theta(x_i) - y_i]^2 / \sum_i y_i^2}$
  • $\epsilon_0$: machine precision floor (for 64-bit floats, $\epsilon_0 \approx 2^{-52} \approx 10^{-16}$)
  • MSE metrics: $\ell_{\mathrm{mse}} \equiv \ell_{\mathrm{rms}}^2$, $L_{\mathrm{mse}}(\theta) \equiv (1/|D|)\sum_i (f_\theta(x_i)-y_i)^2$

Task-dependent criteria specify hyper-parameter targets, e.g., achieving $\epsilon \to \epsilon_0$ in scientific ML versus Pareto-optimal compute/accuracy in LLM pre-training (Bergsma et al., 19 May 2025).
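
A minimal sketch in NumPy of the precision metrics defined above; the function name and return layout are illustrative, not from the cited work:

```python
import numpy as np

def precision_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    """Compute the precision metrics of Section 1.

    l_rms : sqrt(sum_i (f(x_i)-y_i)^2 / sum_i y_i^2)   (normalized RMS loss)
    l_mse : l_rms^2
    L_mse : (1/|D|) * sum_i (f(x_i)-y_i)^2             (plain mean squared error)
    """
    residual_sq = np.sum((pred - target) ** 2)
    l_rms = float(np.sqrt(residual_sq / np.sum(target ** 2)))
    return {"l_rms": l_rms, "l_mse": l_rms ** 2, "L_mse": float(residual_sq / pred.size)}

eps0 = 2.0 ** -52  # 64-bit machine-precision floor referenced above (~1e-16)
```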

2. Scaling Laws and Their Regimes

Scaling laws provide predictive, task-agnostic relations between hyper-parameters and metric outcomes.

  • Piecewise-polynomial/spline interpolation in $d$ dimensions, order $n$: error scales as $\epsilon \propto P^{-(n+1)/d}$; for linear splines ($n=1$), $\epsilon \propto P^{-2/d}$
  • ReLU networks (worst case): same as $n=1$ splines; empirically, structured inputs enable scaling with effective arity $d^* \ll d$, so $\epsilon \propto P^{-2/d^*}$
  • Training loss: $L(N,D,P) = A\, N_{\text{eff}}^{-\alpha} + B\, D^{-\beta} + E$ with $\alpha \approx \beta \approx 0.5$ (here $N$ denotes model parameters and $D$ training tokens)
  • Effective parameter count incorporates the loss penalty of reduced precision: $N_{\text{eff}}(N,P) = N\,(1-\exp(-P/\bar\gamma))^3$ ($P$: bits per parameter; $\bar\gamma$: fitted constant)
  • AdamW timescale: $t^* = B/(\eta\lambda D) = C_t\,(D/N)^{\alpha_t}$
  • Batch-size scaling: $B_{\text{opt}}(D) \propto D^{0.38}$, $B_{\text{crit}}(D) \propto D^{0.51}$
  • Marginal variance: $\sigma(u_i) \approx \sigma_{\mathrm{ref}}/\sqrt{\tau}$; the precision parameter $\tau$ is set from a target standard deviation

Scaling law selection and analysis are context- and regime-specific.
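
A minimal sketch of the precision-aware loss form above; the constants $A$, $B$, $E$, $\alpha$, $\beta$, $\bar\gamma$ below are placeholders for illustration, not fitted values from the cited papers:

```python
import math

def effective_params(N: float, P_bits: float, gamma_bar: float) -> float:
    """N_eff(N, P) = N * (1 - exp(-P / gamma_bar))^3, with P in bits per parameter."""
    return N * (1.0 - math.exp(-P_bits / gamma_bar)) ** 3

def predicted_loss(N: float, D: float, P_bits: float,
                   A: float, B: float, E: float,
                   alpha: float = 0.5, beta: float = 0.5,
                   gamma_bar: float = 2.0) -> float:
    """L(N, D, P) = A * N_eff^-alpha + B * D^-beta + E (illustrative constants only)."""
    N_eff = effective_params(N, P_bits, gamma_bar)
    return A * N_eff ** (-alpha) + B * D ** (-beta) + E

# Example: compare a 7B-parameter model trained at 16-bit vs 4-bit precision.
for bits in (16, 4):
    print(bits, predicted_loss(N=7e9, D=1.4e11, P_bits=bits, A=400.0, B=400.0, E=1.7))
```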

3. Loss Landscape Geometry and Optimization Implications

Loss minima at high-precision approximation generate highly degenerate Hessians; only a small subset of parameter directions exhibit large curvature (Michaud et al., 2022).

  • Hessian spectrum: a few large eigenvalues ("steep walls") and many near-zero eigenvalues ("flat canyon floor")
  • The gradient aligns with the high-curvature subspace, so training stagnates along the flat directions
  • Boosting procedures and subspace-projected line-search methods can overcome optimizer-induced plateaus

Optimization for scale and precision demands secondary tricks:

  • Switch from Adam to BFGS at low MSE
  • Explicit gradient projection onto low-curvature subspaces
  • Residual-fitting and block-diagonal fusion for empirical precision gains ($\epsilon \lesssim \epsilon_0$ achievable in low-dimensional cases)
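
A toy sketch of the subspace-projection idea above on a small quadratic problem; the eigenvalue cutoff and helper name are illustrative assumptions:

```python
import numpy as np

def project_gradient_to_flat_subspace(hessian: np.ndarray, grad: np.ndarray,
                                      curvature_cut: float = 1e-3) -> np.ndarray:
    """Project the gradient onto the low-curvature ("canyon floor") subspace.

    Directions whose Hessian eigenvalue exceeds `curvature_cut` (the steep walls)
    are removed, leaving only the component along nearly flat directions, where
    plain gradient descent tends to stagnate.
    """
    eigvals, eigvecs = np.linalg.eigh(hessian)          # symmetric eigendecomposition
    flat = eigvecs[:, np.abs(eigvals) < curvature_cut]  # low-curvature eigenvectors
    return flat @ (flat.T @ grad)                       # orthogonal projection

# Toy degenerate quadratic: one steep direction, two nearly flat ones.
H = np.diag([100.0, 1e-6, 1e-8])
g = np.array([1.0, 1.0, 1.0])
print(project_gradient_to_flat_subspace(H, g))  # component along the flat directions
```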

4. Prescriptive Hyper-Parameter Selection and Transfer

Scaling rules for robust hyper-parameter transfer are derived from steady-state properties and dynamic invariants (Fan et al., 17 Oct 2025, Li et al., 29 Sep 2025).

AdamW width-robust scaling:

  • "Matrix-like" parameters: ηd1\eta \propto d^{-1}, λd\lambda \propto \sqrt{d}
  • "Vector-like" parameters: η=Θ(1)\eta=\Theta(1), λ=0\lambda=0
  • Zero-shot transfer: scale base learning rate and weight decay from proxy width to target width as ηtarget=ηbase/m\eta_\text{target} = \eta_\text{base}/m, λtarget=λbasem\lambda_\text{target} = \lambda_\text{base}\sqrt{m} (m=dtarget/dproxym = d_\text{target}/d_\text{proxy})
  • Diagnostic: match top singular values and sublayer gains
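
A minimal sketch of the zero-shot width-transfer rule for matrix-like AdamW parameters; the example widths and base values are illustrative:

```python
def transfer_adamw_hparams(eta_base: float, lam_base: float,
                           d_proxy: int, d_target: int) -> tuple[float, float]:
    """Zero-shot transfer of (learning rate, weight decay) for matrix-like
    AdamW parameters from a proxy width to a target width.

    With m = d_target / d_proxy:  eta_target = eta_base / m,
                                  lambda_target = lambda_base * sqrt(m).
    Vector-like parameters keep eta = Theta(1) and lambda = 0.
    """
    m = d_target / d_proxy
    return eta_base / m, lam_base * m ** 0.5

# Example: tuned at proxy width 512, deployed at target width 4096 (m = 8).
eta, lam = transfer_adamw_hparams(eta_base=3e-3, lam_base=0.1,
                                  d_proxy=512, d_target=4096)
print(eta, lam)  # 3.75e-4, ~0.283
```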

Trajectory invariance principle:

  • Training-trajectory curves collapse along the invariant direction $\gamma = \eta\lambda$; tune only one of $(\eta, \lambda)$ while fixing the other
  • Scaling law: optimal $\gamma^*(D) \propto D^{\gamma_2}$; use batch-size warmup to preserve invariance at large $B$
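
A minimal sketch of using the invariant $\gamma = \eta\lambda$ to set weight decay once the learning rate is fixed; the constant and exponent $\gamma_2$ below are placeholders, not fitted values from the cited work:

```python
def weight_decay_from_invariant(eta: float, D_tokens: float,
                                c_gamma: float = 1.0e-2, gamma2: float = -0.1) -> float:
    """Given a fixed learning rate, choose weight decay so the product
    gamma = eta * lambda sits on an assumed optimum gamma*(D) = c * D^gamma2.
    """
    gamma_star = c_gamma * D_tokens ** gamma2
    return gamma_star / eta

# Example: fix eta, derive lambda for a 300B-token run (illustrative numbers).
print(weight_decay_from_invariant(eta=3e-4, D_tokens=3e11))
```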

5. Model Parsimony versus Expressivity: Depth, Width, and Precision

Recent studies interpolate between the neural-tangent (NTK, "lazy") and maximal-update (mean-field, "feature learning") regimes via a hyper-parameter $p \in [0,1]$ (Yaida, 2022).

  • NTK scaling ($p=0$): weak representation learning, stable kernels, $O(1)$ learning rates
  • Mean-field scaling ($p=1$): strong representation learning, aggressive kernel evolution, learning rate grows with width
  • Emergent coupling scale: $\gamma = L/n^{1-p}$; stability requires $L \sim n^{1-p}$
  • Adjusting $p$ enables parsimonious ($p \to 0$) or expressive ($p \to 1$) models, with a depth-precision balance

This continuum enables explicit control of numerical stability (no vanishing/exploding gradients), memory constraints, and trainability across regimes.
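
A minimal sketch of the depth-to-width coupling scale $\gamma = L/n^{1-p}$ across the interpolation parameter $p$; the depth and width values are illustrative:

```python
def coupling_scale(depth: int, width: int, p: float) -> float:
    """Emergent coupling gamma = L / n^(1-p) for the interpolating
    hyper-parameter p in [0, 1] (p=0: NTK/lazy, p=1: mean-field).
    Stability corresponds to keeping gamma = O(1), i.e. L ~ n^(1-p).
    """
    return depth / width ** (1.0 - p)

# Same architecture, different regimes: gamma grows as p moves toward mean-field.
for p in (0.0, 0.5, 1.0):
    print(p, coupling_scale(depth=32, width=1024, p=p))
```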

6. Real-World Trade-offs: Memory, Compute, and Task Sensitivities

Empirical evaluations demonstrate that required parameter count, achievable precision, and resource usage must be dynamically balanced (Badshah et al., 6 May 2024, Kumar et al., 7 Nov 2024, Li et al., 15 May 2025):

  • LLM performance at fixed memory: deploy the largest quantized model that fits the budget; e.g., $N=70$B at 4 bits ($M=35$ GB) outperforms $N=7$B at 32 bits at similar memory
  • Quantization threshold: 4-bit weights remain viable for reasoning/NLU tasks above $N \gtrsim 13$B; maintain $b \ge 8$ bits for high-fidelity/factuality tasks
  • For time-series forecasting, "k-level" hyper-parameters (e.g., $k \sim$ forecast horizon) allow parsimonious models to outperform M-level ones (Li et al., 15 May 2025), with adaptive component weighting and parameter-aware evaluation metrics
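
A minimal sketch of the "largest quantized model that fits" heuristic above; the memory model counts weights only and ignores activations, KV cache, and runtime overheads:

```python
def model_memory_gb(n_params_billion: float, bits: int) -> float:
    """Approximate weight memory (GB) of an N-billion-parameter model at b bits/weight."""
    return n_params_billion * 1e9 * bits / 8 / 1e9

def pick_model(candidates: list[tuple[float, int]], budget_gb: float) -> tuple[float, int]:
    """Return the largest-parameter (N_billion, bits) pair that fits the memory budget."""
    feasible = [c for c in candidates if model_memory_gb(*c) <= budget_gb]
    return max(feasible, key=lambda c: c[0])

# 70B at 4 bits (~35 GB) fits a 40 GB budget and is preferred over 7B at 32 bits (~28 GB).
print(pick_model([(7, 32), (13, 8), (70, 4)], budget_gb=40.0))
```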

Trade-off strategies formalize objectives:

  • Minimize $\epsilon(W,L,c) + \lambda\, P(W,L)$ subject to $\epsilon(W,L,c) \le \epsilon_{\text{target}}$, where $\lambda$ tunes the scale-parsimony balance
  • Under resource constraints, optimize precision for compute efficiency (e.g., train at $\sim$7-8 bits, quantize to 4-6 bits for inference)
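
A minimal sketch of the penalized selection objective above; `eps_of` is a stand-in for a measured or modeled error function, and the toy error model assumes an effective arity $d^* = 2$ so that $\epsilon \propto P^{-1}$:

```python
def select_architecture(configs, eps_of, eps_target: float, lam: float):
    """Pick (W, L) minimizing eps(W, L) + lam * P(W, L) subject to
    eps(W, L) <= eps_target. `eps_of(W, L)` is a user-supplied error model."""
    def params(W, L):               # P ~ W^2 * L for a dense MLP (cf. Section 1)
        return W * W * L

    feasible = [(W, L) for (W, L) in configs if eps_of(W, L) <= eps_target]
    return min(feasible, key=lambda c: eps_of(*c) + lam * params(*c))

# Toy error model: error decays as P^(-2/d*) with an assumed effective arity d* = 2.
toy_eps = lambda W, L: (W * W * L) ** -1.0
print(select_architecture([(32, 2), (64, 2), (128, 4)], toy_eps,
                          eps_target=1e-3, lam=1e-9))
```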

7. Bayesian Optimization and Statistical Priors for Scale/Precision Selection

Efficient hyper-parameter selection for scale/precision in noisy or stochastic settings employs Bayesian surrogate modeling and closed-form optimization (Yadav et al., 7 Oct 2025, Spyropoulou et al., 2021):

  • Statistical surrogate: a log-linear GLM models the empirical scaling of summary statistics as a function of the precision parameter
  • Closed-form optimal setting: $\theta^* = \left(s_0 / (b\, e^{3\varepsilon^2/2})\right)^{1/a}$ for a power-law surrogate
  • Data efficiency: Bayesian GLM-based optimization requires roughly 40× fewer samples than brute-force Monte Carlo
  • For Gaussian Markov random fields, prior calibration on the precision $\tau$ ensures interpretable marginal standard deviations, with PC and Gaussian priors adapted to an empirical reference variance

This approach enables principled, interpretable, and resource-constrained tuning in both deterministic and stochastic modeling domains.
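
A minimal sketch of the closed-form power-law-surrogate optimum above; the coefficients $s_0$, $a$, $b$ and tolerance value are illustrative placeholders, not values from the cited work:

```python
import math

def optimal_precision_setting(s0: float, a: float, b: float, eps: float) -> float:
    """theta* = (s0 / (b * exp(3 * eps^2 / 2)))^(1/a), the closed-form optimum
    for a fitted power-law surrogate with tolerance eps."""
    return (s0 / (b * math.exp(1.5 * eps ** 2))) ** (1.0 / a)

# Example with illustrative surrogate-fit coefficients.
print(optimal_precision_setting(s0=2.0, a=1.5, b=0.1, eps=0.05))
```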


The scale/parsimony/precision triad allows for targeted, theoretically supported, and empirically validated choices of hyper-parameters, tailored to architecture, task, resource, and optimization regime. Scaling laws, optimization geometry, statistical priors, and diagnostic procedures jointly comprise a rigorous toolkit for research and practical model engineering across scientific and industrial domains.
