Scale, Parsimony & Precision Hyper-Parameters
- Scale, parsimony, and precision hyper-parameters are critical model attributes that define dimensionality, redundancy, and accuracy in machine learning systems.
- They guide tuning strategies by employing scaling laws, loss landscape geometry, and Bayesian optimization to balance model complexity and resource constraints.
- Practical applications span from LLM training to scientific ML, emphasizing trade-offs between performance, compute efficiency, and numerical precision.
Scale, parsimony, and precision hyper-parameters are a triad of model attributes that jointly determine the capacity, efficiency, and accuracy of machine learning systems. These axes—quantifying data or model dimensionality (scale), redundancy or economical representation (parsimony), and target approximation fidelity (precision)—govern theoretical bounds on performance, practical deployment, and optimization complexity. Modern analysis incorporates precise scaling laws, loss landscape geometry, optimizer schedules, and resource-aware trade-offs to select hyper-parameter regimes appropriate to task, modality, and computational constraints.
1. Formal Definitions and Notation
The interplay between scale, parsimony, and precision is formalized through distinct hyper-parameters capturing dimensional, architectural, and accuracy aspects (Michaud et al., 2022).
Scale hyper-parameters:
- $d$: input (ambient) dimension
- $n$: number of training examples
- $w$: per-layer width in an MLP or similar architecture
- $L$: network depth (number of layers)
- $P$: total number of trainable parameters (typically $P \approx L w^2$ for dense MLPs)
Parsimony hyper-parameters:
- $P_{\min}(\epsilon)$: minimal parameter count required to achieve RMS loss $\epsilon$
- Modularity structures (e.g., block-diagonal weight matrices) that reduce $P_{\min}$
Precision hyper-parameters:
- Target RMS loss: $\epsilon$
- $\epsilon_0$: machine precision floor (for 64-bit floats, $\epsilon_0 \approx 10^{-16}$)
- MSE metrics: $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\bigl(f(x_i) - y_i\bigr)^2$, $\epsilon = \sqrt{\mathcal{L}_{\mathrm{MSE}}}$
Task-dependent criteria specify hyper-parameter targets, e.g., driving $\epsilon$ toward $\epsilon_0$ in scientific ML versus Pareto-optimal compute/accuracy in LLM pre-training (Bergsma et al., 19 May 2025).
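As a minimal sketch of these definitions (function names and the dense-MLP parameter-count convention are illustrative, not from the cited papers):

```python
import math

# Machine-precision floor for 64-bit floats: RMS loss cannot drop much below this.
EPS_MACHINE = 2.0 ** -52  # ~2.2e-16

def dense_mlp_params(width: int, depth: int, d_in: int, d_out: int = 1) -> int:
    """Total trainable parameters P (weights + biases) of a dense MLP,
    scaling roughly as L * w^2 for deep, wide networks."""
    dims = [d_in] + [width] * depth + [d_out]
    return sum(dims[i] * dims[i + 1] + dims[i + 1] for i in range(len(dims) - 1))

def rms_loss(preds, targets) -> float:
    """RMS loss = sqrt(MSE), the precision metric used throughout."""
    n = len(preds)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / n)
```

For example, a width-4, depth-2 MLP on 3-dimensional input has 41 trainable parameters under this counting convention.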
2. Scaling Laws and Their Regimes
Scaling laws provide predictive, task-agnostic relations between hyper-parameters and metric outcomes.
Function approximation scaling (Michaud et al., 2022):
- Piecewise-polynomial/spline interpolation in $d$ dimensions, order $k$: error scales as $\epsilon \propto P^{-(k+1)/d}$; for linear interpolation ($k = 1$), $\epsilon \propto P^{-2/d}$
- ReLU networks (worst case): same as splines; empirically, structured inputs enable scaling with effective arity $d^* \le d$, so $\epsilon \propto P^{-(k+1)/d^*}$
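The spline scaling relation can be sketched directly (the prefactor `C` is a task-dependent constant, assumed here for illustration):

```python
def spline_error_exponent(order_k: int, dim: int) -> float:
    """Predicted power-law exponent: RMS error ~ P^{-(k+1)/d}."""
    return (order_k + 1) / dim

def predicted_rms_error(P: int, order_k: int, dim: int, C: float = 1.0) -> float:
    """Error scaling epsilon ~ C * P^{-(k+1)/d}; C is an unknown task constant.
    For structured inputs, pass the smaller effective arity d* as `dim`."""
    return C * P ** (-spline_error_exponent(order_k, dim))
```

Note how the curse of dimensionality appears: at fixed parameter budget, doubling `dim` halves the exponent and dramatically worsens the predicted error.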
LLM loss scaling (Bergsma et al., 19 May 2025, Kumar et al., 7 Nov 2024):
- Training loss: $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ with fitted constants $E, A, B, \alpha, \beta$
- Effective parameter count incorporates the loss penalty for reduced precision: $N_{\mathrm{eff}} = N\,(1 - e^{-P_w/\gamma})$ ($P_w$: weight precision in bits; $\gamma$: fitted constant)
- AdamW timescale: $\tau_{\mathrm{ema}} = B/(\eta \lambda)$ tokens; the optimal timescale tracks a fixed fraction of the dataset size $D$
- Batch size scaling: the optimal batch size $B_{\mathrm{opt}}$ grows as a power law in $D$, with a critical batch size marking diminishing data-parallel returns
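A sketch of the precision-aware loss law, combining the Chinchilla-style form with the effective-parameter penalty. The fitted constants below are the published Chinchilla values, used purely for illustration; `gamma` is a placeholder for the fitted precision constant:

```python
import math

def effective_params(N: float, bits: float, gamma: float = 3.0) -> float:
    """N_eff = N * (1 - exp(-bits/gamma)): low-precision weights behave like
    fewer parameters. `gamma` here is illustrative, not a fitted value."""
    return N * (1.0 - math.exp(-bits / gamma))

def precision_aware_loss(N: float, D: float, bits: float = 16.0,
                         E: float = 1.69, A: float = 406.4, B: float = 410.7,
                         alpha: float = 0.34, beta: float = 0.28,
                         gamma: float = 3.0) -> float:
    """L = E + A * N_eff^{-alpha} + B * D^{-beta}, with N_eff discounted by precision."""
    n_eff = effective_params(N, bits, gamma)
    return E + A * n_eff ** (-alpha) + B * D ** (-beta)
```

Under this form, loss decreases monotonically in parameters, data, and bits, but with saturating returns from precision.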
Bayesian field priors (Spyropoulou et al., 2021):
- Marginal variance: $\sigma^2 \propto 1/\tau$; the precision parameter $\tau$ is determined by matching a target marginal standard deviation $\sigma_0$
Scaling law selection and analysis are context- and regime-specific.
3. Loss Landscape Geometry and Optimization Implications
Loss minima at high-precision approximation generate highly degenerate Hessians; only a small subset of parameter directions exhibit large curvature (Michaud et al., 2022).
- Hessian spectrum: few large eigenvalues ("steep walls"), many near-zero ("flat canyon floor")
- Gradient aligns with high-curvature subspace, leading to training stagnation in flat subspace
- Boosting procedures and subspace-projected line-search methods can overcome optimizer-induced plateaus
Optimizing jointly for scale and precision demands secondary techniques:
- Switch from Adam to BFGS at low MSE
- Explicit gradient projection onto low-curvature subspaces
- Residual-fitting and block-diagonal fusion for empirical precision gains (losses approaching the machine-precision floor are achievable in low-dimensional cases)
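The canyon geometry and the projection remedy can be illustrated on a toy quadratic loss (dimensions and eigenvalue magnitudes are arbitrary choices for the sketch):

```python
import numpy as np

# Toy quadratic loss 0.5 * x^T H x whose Hessian has a few steep directions
# ("canyon walls") and many nearly flat ones ("canyon floor").
rng = np.random.default_rng(0)
eigvals = np.concatenate([np.full(3, 1e4), np.full(27, 1e-6)])  # 3 steep, 27 flat
H = np.diag(eigvals)

x = rng.standard_normal(30)
grad = H @ x  # gradient of the quadratic at x

# The gradient is utterly dominated by the steep subspace, so a single learning
# rate stalls progress along the flat directions.
steep = np.zeros(30, dtype=bool)
steep[:3] = True
steep_norm = np.linalg.norm(grad[steep])
flat_norm = np.linalg.norm(grad[~steep])

# Remedy sketched in the text: project the gradient onto the low-curvature
# subspace and take a separate, much larger step there.
flat_grad = grad.copy()
flat_grad[steep] = 0.0
```

With these eigenvalues the steep-subspace gradient norm exceeds the flat one by many orders of magnitude, which is exactly why plateau-breaking tricks such as subspace-projected line search are needed at high precision.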
4. Prescriptive Hyper-Parameter Selection and Transfer
Scaling rules for robust hyper-parameter transfer are derived from steady-state properties and dynamic invariants (Fan et al., 17 Oct 2025, Li et al., 29 Sep 2025).
AdamW width-robust scaling:
- "Matrix-like" parameters: ,
- "Vector-like" parameters: ,
- Zero-shot transfer: scale base learning rate and weight decay from proxy width to target width as , ()
- Diagnostic: match top singular values and sublayer gains
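A minimal sketch of the zero-shot transfer rule for matrix-like parameters, assuming the $\eta \propto 1/w$, $\lambda \propto w$ scaling above (function name and defaults are illustrative):

```python
def transfer_adamw_hparams(eta_base: float, wd_base: float,
                           width_base: int, width_target: int):
    """Zero-shot transfer for 'matrix-like' parameters: eta scales as 1/width,
    lambda scales as width, so the AdamW timescale eta*lambda stays fixed."""
    s = width_target / width_base
    return eta_base / s, wd_base * s

# Tune on a narrow proxy model, then transfer to the 4x-wider target:
eta, wd = transfer_adamw_hparams(1e-3, 0.1, width_base=256, width_target=1024)
```

The product $\eta\lambda$ is preserved by construction, which is the invariant the width-robust scaling aims to hold constant.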
Trajectory invariance principle:
- Training trajectory curves collapse onto an invariant direction in the $(\eta, \lambda)$ plane; tune only one of $\eta, \lambda$ by fixing the other
- Scaling law: the optimum moves predictably along the invariant direction as scale grows; use batch-size warmup to preserve invariance under large $B$
5. Model Parsimony versus Expressivity: Depth, Width, and Precision
Recent studies interpolate between the neural-tangent (NTK, "lazy") and maximal-update (mean-field, "feature learning") regimes via an interpolation hyper-parameter $s$ (Yaida, 2022).
- NTK scaling ($s = 0$): weak representation learning, stable kernels, $O(1)$ learning rates
- Mean-field scaling ($s = 1$): strong representation learning, aggressive kernel evolution, learning rate grows with width
- An emergent coupling scale governs inter-layer interactions; stability requires it to remain $O(1)$ as width grows
- Adjusting the interpolation parameter enables parsimonious (NTK-like) or expressive (mean-field-like) models, with depth-precision balance
This continuum enables explicit control of numerical stability (no vanishing/exploding gradients), memory constraints, and trainability across regimes.
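One common way to parameterize this continuum is sketched below. The exponents follow the standard NTK-to-mean-field interpolation (output multiplier $w^{-(1+s)/2}$, learning rate growing as $w^{s}$); this is an assumed convention for illustration, not necessarily the exact parameterization of Yaida (2022):

```python
def interp_scalings(width: int, s: float):
    """Interpolate between NTK (s=0) and mean-field (s=1) regimes (sketch).
    Returns (output-layer multiplier, learning-rate scale relative to NTK)."""
    out_mult = width ** (-(1.0 + s) / 2.0)   # 1/sqrt(w) at s=0, 1/w at s=1
    lr_scale = width ** s                    # O(1) at s=0, grows with width at s=1
    return out_mult, lr_scale
```

At $s=0$ the multiplier is the familiar $1/\sqrt{w}$ NTK normalization with an $O(1)$ learning rate; at $s=1$ the $1/w$ multiplier forces feature learning, compensated by a learning rate that grows with width.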
6. Real-World Trade-offs: Memory, Compute, and Task Sensitivities
Empirical evaluations demonstrate that required parameter count, achievable precision, and resource usage must be dynamically balanced (Badshah et al., 6 May 2024, Kumar et al., 7 Nov 2024, Li et al., 15 May 2025):
- LLM performance at fixed memory: deploy the largest quantized model that fits the budget; a larger model at 4 bits can outperform a much smaller model at 32 bits at a comparable memory footprint
- Quantization threshold: 4-bit quantization is viable for reasoning/NLU tasks above a model-size threshold; maintain higher precision for high-fidelity/factuality tasks
- For time-series forecasting, "k-level" hyper-parameter budgets (thousands of parameters) allow parsimonious models to outperform M-level (millions-of-parameters) ones (Li et al., 15 May 2025), with adaptive component weighting and parameter-aware evaluation metrics
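The "largest quantized model that fits" heuristic reduces to a simple feasibility check (helper names and candidate sizes below are illustrative):

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    """Weight memory in GB for n_params parameters stored at `bits` bits each."""
    return n_params * bits / 8 / 1e9

def best_model_for_budget(candidates, budget_gb: float):
    """Pick the largest model (by parameter count) whose quantized weights fit
    the memory budget; `candidates` is a list of (n_params, bits) pairs."""
    feasible = [(n, b) for n, b in candidates if model_memory_gb(n, b) <= budget_gb]
    return max(feasible, key=lambda nb: nb[0]) if feasible else None
```

For instance, under a 40 GB budget a hypothetical 70B model at 4 bits (35 GB) is selected over a 13B model at 32 bits (52 GB), which does not even fit.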
Trade-off strategies formalize objectives:
- Minimize $P$ subject to $\epsilon \le \epsilon_{\mathrm{target}}$, which tunes the scale-parsimony balance
- Under resource constraints, optimize precision for compute efficiency (train at higher precision, quantize to 4-6 bits for inference)
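The constrained objective has a closed form under the spline scaling law from Section 2 ($\epsilon \approx C\,P^{-(k+1)/d}$, with `C` an assumed task constant):

```python
import math

def min_params_for_target(eps_target: float, order_k: int, dim: int,
                          C: float = 1.0) -> int:
    """Smallest P satisfying C * P^{-(k+1)/d} <= eps_target,
    i.e. P >= (C / eps_target)^{d / (k+1)}."""
    return math.ceil((C / eps_target) ** (dim / (order_k + 1)))
```

The exponential dependence on `dim` makes the scale-parsimony tension concrete: the same precision target in twice the dimension squares the required parameter count.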
7. Bayesian Optimization and Statistical Priors for Scale/Precision Selection
Efficient hyper-parameter selection for scale/precision in noisy or stochastic settings employs Bayesian surrogate modeling and closed-form optimization (Yadav et al., 7 Oct 2025, Spyropoulou et al., 2021):
- Statistical surrogate: a log-linear GLM models the empirical scaling of summary statistics in terms of a precision parameter $\theta$
- Closed-form optimal setting: for a power-law surrogate, the optimal $\theta$ follows in closed form from the fitted coefficients
- Data efficiency: Bayesian GLM-based optimization requires 40× fewer samples than brute-force Monte Carlo
- For Gaussian Markov random fields, prior calibration on $\tau$ (precision) ensures interpretable marginal standard deviations, with PC and Gaussian priors adapted to an empirical reference variance
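A minimal sketch of the surrogate approach: fit a power law $y = a\,\theta^{b}$ to noisy summary statistics by least squares in log-log space, then invert it in closed form for a target value (function names and the inversion target are illustrative, not the cited papers' exact procedure):

```python
import math

def fit_power_law(thetas, ys):
    """Least-squares fit of y = a * theta^b via linear regression in log-log
    space (the log-linear GLM surrogate, sketched without uncertainty)."""
    n = len(thetas)
    lx = [math.log(t) for t in thetas]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / n, sum(ly) / n
    b = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / \
        sum((x - mx) ** 2 for x in lx)
    a = math.exp(my - b * mx)
    return a, b

def optimal_theta(a: float, b: float, y_target: float) -> float:
    """Closed-form setting achieving the target statistic: theta* = (y_target/a)^(1/b)."""
    return (y_target / a) ** (1.0 / b)
```

Because the optimum comes from two fitted coefficients rather than a dense grid search, only a handful of evaluations are needed, which is the source of the sample-efficiency gain over brute-force Monte Carlo.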
This approach enables principled, interpretable, and resource-constrained tuning in both deterministic and stochastic modeling domains.
The scale/parsimony/precision triad allows for targeted, theoretically supported, and empirically validated choices of hyper-parameters, tailored to architecture, task, resource, and optimization regime. Scaling laws, optimization geometry, statistical priors, and diagnostic procedures jointly comprise a rigorous toolkit for research and practical model engineering across scientific and industrial domains.