
Scaling Behavior of Miscalibration

Updated 19 November 2025
  • Scaling behavior of miscalibration is defined by how calibration error changes with model capacity, sample size, and training interventions.
  • It quantifies miscalibration using diverse metrics such as ECE, cumulative difference norms, and CDL to capture predictive uncertainty and overconfidence.
  • Empirical scaling laws and mitigation techniques offer actionable insights for optimizing calibration in deep networks and scientific data pipelines.

Miscalibration quantifies the discrepancy between a model’s predicted probabilities and the true distribution of outcomes. The scaling behavior of miscalibration—how calibration error changes as key design or training parameters vary—has become central to the understanding and improvement of probabilistic models, deep networks, and scientific data pipelines. Scaling laws expose both the limits to calibration achievable in large-scale systems and the efficacy of modern post-hoc recalibration techniques.

1. Formal Definitions and Metrics

Miscalibration is typically assessed via metrics comparing predicted confidence and empirical accuracy. The canonical metrics include:

  • Expected Calibration Error (ECE): For a classifier $f$ predicting confidence $\hat{p} \in [0,1]$ for each input, ECE computes the average absolute difference between predicted confidence and empirical accuracy across bins of $\hat{p}$ values, usually as

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|B_b|}{N}\, \left|\operatorname{acc}(B_b) - \operatorname{conf}(B_b)\right|$$

where the $B_b$ are bins, and $\operatorname{acc}$ and $\operatorname{conf}$ are the empirical accuracy and average confidence in each bin (Wang et al., 2022, Carrell et al., 2022).

  • Cumulative Difference Norms: For probabilistic predictions, the cumulative difference function

$$C_N(p) = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\hat{p}_i \le p\}\,(y_i - \hat{p}_i)$$

leads to Kolmogorov–Smirnov-type ($M_\infty$) and $L^2$-norm ($M_2$) scalar metrics. Under perfect calibration, both $M_\infty$ and $M_2$ scale as $N^{-1/2}$ in the sample size $N$ (Arrieta-Ibarra et al., 2022).

  • Calibration-Decision Loss (CDL): CDL measures the maximal swap regret over all payoff-bounded downstream tasks and provides a decision-theoretic guarantee: vanishing CDL implies negligible downstream utility loss due to miscalibration. CDL can scale asymptotically as $O((\log T)/\sqrt{T})$ over $T$ online time steps (Hu et al., 21 Apr 2024).
  • Entropy Calibration Error: For generative models, especially LLMs, the gap $M = \left|\frac{1}{T}\left( H(\hat{p}) - L(p^* \,\|\, \hat{p}) \right)\right|$ (the mean per-step difference between the entropy of model-sampled generations and the log loss on reference text) is used to quantify error accumulation in predictive uncertainty (Cao et al., 15 Nov 2025).

Other variants such as classwise ECE, maximum calibration error (MCE), and instance-level metrics are used depending on context (Zhang et al., 19 Dec 2024).
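The first two metrics above are simple to compute from held-out predictions. Below is a minimal NumPy sketch of the binned ECE estimator and the cumulative-difference norms; the bin count and the choice to evaluate $C_N$ at the sorted predicted probabilities are implementation choices, not prescribed by the cited papers.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: |B_b|/N-weighted mean of |acc(B_b) - conf(B_b)|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def cumulative_difference_norms(p_hat, y):
    """M_inf and M_2 norms of C_N(p), with C_N evaluated at the sorted p_hat_i."""
    order = np.argsort(p_hat)
    c = np.cumsum((y[order] - p_hat[order]) / len(p_hat))  # C_N at each sorted p_hat_i
    return np.abs(c).max(), np.sqrt(np.mean(c ** 2))
```

For a perfectly calibrated predictor, both returned norms shrink at the $N^{-1/2}$ rate noted above (a simulation sketch appears in Section 6).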

2. Scaling Laws in Deep Learning and Classification

Scaling laws in miscalibration have been empirically and theoretically linked to the same structural variables as classical generalization: model capacity, sample size, regularization, and data diversity.

  • Model Capacity: Increasing the number of parameters or depth (e.g., SegFormer B0 to B5) increases both accuracy and ECE; absent further correction, larger models are systematically more overconfident (Wang et al., 2022, Carrell et al., 2022).
  • Sample Size: In the high-data regime ($n \gg$ model size), both test ECE and generalization error obey $O(1/\sqrt{n})$ scaling; in overparameterized or interpolating regimes, calibration deteriorates in lockstep with the generalization gap (Carrell et al., 2022).
  • Architectural and Training Interventions: Heavy augmentation, regularization, or a smaller model size yields a smaller generalization gap, which directly limits ECE via the empirical bound

$$|\mathrm{TestECE} - \mathrm{TrainECE}| \le |\mathrm{TestError} - \mathrm{TrainError}|$$

suggesting calibration is not intrinsically distinct from generalization (Carrell et al., 2022).

  • Multiclass Scaling: In multi-class settings, the bias–variance tradeoff in recalibration becomes acute. For temperature scaling, miscalibration trends as $O(K/N)$ in the number of classes $K$ and calibration set size $N$; richer parameterizations (vector, matrix, or structured matrix scaling) can lower calibration error, provided $N$ grows proportionally to $K^2$ to avoid overfitting (Berta et al., 5 Nov 2025).
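For concreteness, here is a minimal sketch of the lowest-capacity end of this spectrum: temperature scaling fits a single scalar $T$ by minimizing NLL on a held-out calibration set, so its estimation variance shrinks quickly, while matrix-family calibrators replace the scalar with up to $K^2$ parameters, which is why they demand $N \gg K^2$. Function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(logits, labels, T):
    """Mean negative log-likelihood of softmax(logits / T)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(cal_logits, cal_labels):
    """Fit the single scalar T on a held-out calibration set; one parameter,
    so variance is small, but bias can dominate as the class count K grows."""
    res = minimize_scalar(lambda T: nll(cal_logits, cal_labels, T),
                          bounds=(0.05, 20.0), method="bounded")
    return res.x
```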

3. Scaling of Calibration with Model and Data Properties

Empirical and theoretical studies have mapped out several robust scaling behaviors:

| Source | Scaling Variable(s) | Scaling Law or Trend |
|---|---|---|
| (Carrell et al., 2022) | Data size $n$ | $\mathrm{TestECE} = O(1/\sqrt{n})$ in the high-data regime (well-generalized models) |
| (Berta et al., 5 Nov 2025) | Classes $K$, data $N$ | Temperature scaling: $O(K/N)$; matrix models need $N \gg K^2$ to avoid variance blowup |
| (Wang et al., 2022) | Model capacity, crop size | ECE increases with model size; decreases with crop size and multi-scale test ensembles |
| (Cao et al., 15 Nov 2025) | Model scale, data tail | For heavy-tailed data ($\alpha \approx 1$), the decay exponent of miscalibration with scale is near 0; rapid decay only for light tails |
| (Hu et al., 21 Apr 2024) | Online time $T$ | CDL scales as $O((\log T)/\sqrt{T})$, while ECE cannot decay faster than $O(T^{-0.472})$ |

Further, new post-hoc calibrators such as $\rho$-Norm scaling introduce norm parameters that induce a clear U-shaped ECE curve, with a minimum at intermediate values on practical computer-vision benchmarks (Zhang et al., 19 Dec 2024).

4. Domain-Specific Scaling: Generative Models and Science Pipelines

  • LLMs and Entropy Calibration: In LLMs, the per-step entropy calibration error exhibits almost no decay with model scale or data size when the underlying data is heavy-tailed ($\alpha \to 1$ in natural text). Empirically, fitted scaling exponents of $\hat{\beta} \approx 0.05$ for datasets like WikiText-103 and WritingPrompts (against the theoretical $\beta = 1 - 1/\alpha$) indicate marginal returns to increasing model size for calibration (Cao et al., 15 Nov 2025).
  • Radio Interferometric Calibration: In scientific pipelines, the smallest eigenvalue of the calibration Jacobian dictates how strongly miscalibration is amplified. With $N$ antennas, bandwidth $P$, and model completeness $c$, the dangerous scaling mode is $1 + \lambda_{\min} \approx 1 - \frac{\sigma_{C,\max}}{\sigma_{H,\min} + \rho\,\mu_{F,\min}}$, where reducing $c$ (missing sky power) or insufficient $N$ and $P$ can lead to catastrophic amplification. Doubling $N$ improves robustness quadratically, while increasing $P$ or the consensus regularization $\rho$ suppresses error amplification linearly (Yatawatta, 2019).
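As a purely illustrative reading of this expression (the values below are placeholders, not numbers from Yatawatta, 2019), one can tabulate how the stability factor responds to the consensus weight $\rho$:

```python
import numpy as np

# Illustrative placeholder values; in practice these are singular values of
# blocks of the calibration Jacobian, measured from the pipeline itself.
sigma_C_max = 0.9   # largest singular value tied to unmodeled sky power
sigma_H_min = 0.5   # smallest singular value of the modeled-signal block
mu_F_min    = 0.2   # smallest eigenvalue of the consensus/regularization term

for rho in [0.0, 1.0, 5.0, 20.0]:
    stability = 1.0 - sigma_C_max / (sigma_H_min + rho * mu_F_min)
    print(f"rho = {rho:5.1f}  ->  1 + lambda_min ~ {stability:+.3f}")
# As rho grows, 1 + lambda_min -> 1: consensus regularization linearly
# suppresses amplification of miscalibration by the unmodeled component.
```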

5. Mitigation Techniques: Structured Scaling and Selective Calibration

  • Structured Matrix Scaling (SMS): By combining hierarchical regularization over various parameter blocks, SMS interpolates between bias (underspecified model) and variance (overfitting), matching calibration error to the available data and class dimension. SMS attains lower test negative log-likelihoods and ECE on benchmarks with large $K$ (Berta et al., 5 Nov 2025).
  • $\rho$-Norm Scaling: Post-hoc $\rho$-Norm scaling achieves significant ECE reduction without sacrificing accuracy. The optimal $\rho$ sits near 1.7–1.8 for standard vision tasks; too small a value fails to correct overconfidence, while too large a value induces underconfidence (Zhang et al., 19 Dec 2024).
  • Selective Scaling in Segmentation: Targeting overconfident mispredictions via learned selectors and class-specific temperature smoothing reduces ECE by 20–30% versus classical scaling, especially for high-capacity models or small input crops (Wang et al., 2022).
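A minimal sketch of the selective idea follows, with a fixed confidence threshold standing in for the learned selector of Wang et al. (2022) and a single shared temperature standing in for their class-specific smoothing:

```python
import numpy as np

def selective_scaling(probs, T=1.5, conf_threshold=0.95):
    """Smooth only the predictions flagged as likely overconfident.
    The threshold selector and shared T are simplifications of the learned
    selector and class-wise temperatures in the original method."""
    probs = probs.copy()
    flagged = probs.max(axis=1) > conf_threshold
    logits = np.log(np.clip(probs[flagged], 1e-12, None))  # logits up to a per-row shift
    scaled = np.exp(logits / T)                             # the shift cancels on renormalization
    probs[flagged] = scaled / scaled.sum(axis=1, keepdims=True)
    return probs
```

Because the scaling is monotone per prediction, the argmax class is preserved, so accuracy is unchanged while flagged confidences are pulled down.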

6. Theoretical Advances in Calibration Error Rates

Recent work on decision-theoretic calibration (Hu et al., 21 Apr 2024) separates CDL from ECE by considering the worst-case swap regret over all payoff-normalized decision tasks. While ECE is bottlenecked by a provable $\Omega(T^{-0.472})$ lower bound in online settings, CDL achieves $O((\log T)/\sqrt{T})$, matching optimal learning rates for economic utility loss under miscalibration. These results highlight a fundamental distinction: some calibration metrics are too stringent for online guarantees, while task-centered regrets permit faster decay.

New cumulative-difference-norm calibration metrics achieve order-optimal $O(N^{-1/2})$ scaling, compared with binned ECE estimators, which face a bias–variance tradeoff in the choice of bin count and can be noise-limited (Arrieta-Ibarra et al., 2022).
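The $N^{-1/2}$ rate is easy to reproduce in simulation: draw outcomes from the predicted probabilities themselves (perfect calibration by construction) and check that $\sqrt{N}\,M_\infty$ stays roughly constant. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [1_000, 10_000, 100_000]:
    m_inf = []
    for _ in range(200):
        p = rng.uniform(size=n)                       # predicted probabilities
        y = (rng.uniform(size=n) < p).astype(float)   # perfectly calibrated outcomes
        order = np.argsort(p)
        c = np.cumsum((y[order] - p[order]) / n)      # C_N at each sorted p_i
        m_inf.append(np.abs(c).max())
    print(f"N = {n:>7}  mean M_inf = {np.mean(m_inf):.4f}  "
          f"sqrt(N)*M_inf = {np.sqrt(n) * np.mean(m_inf):.3f}")
# sqrt(N) * M_inf is roughly constant across N, consistent with N^{-1/2} scaling.
```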

7. Practical Guidelines and Interventions

Empirical results and analytical laws support several practical recommendations:

  • Increase the calibration set size, and use data augmentation and regularization, whenever possible to suppress both generalization error and ECE.
  • For tasks with many classes (e.g., $K > 10$), prefer structured or regularized scaling methods (e.g., SMS) over vanilla temperature or vector scaling (Berta et al., 5 Nov 2025).
  • Use larger crop sizes and multi-scale test ensembles in segmentation to counteract model-size-induced overconfidence (Wang et al., 2022).
  • Monitor per-bucket ECE and the smallest Jacobian eigenvalue in scientific pipelines; apply regularization or reduce model complexity where dangerous scaling is detected (Yatawatta, 2019).
  • For autoregressive text generation, anticipate persistent entropy calibration error with heavy-tailed data distributions; truncation is often necessary independent of scale (Cao et al., 15 Nov 2025).
  • Calibrators with tunable smoothness/regularization enable finding the ECE–NLL–accuracy Pareto front (e.g., tune $\rho$ in $\rho$-Norm scaling) (Zhang et al., 19 Dec 2024); a sweep sketch follows this list.
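Tracing such a front is a one-dimensional sweep. The sketch below uses scalar temperature as a stand-in for $\rho$ (the $\rho$-Norm parameterization itself is not reproduced here); for each grid value it records ECE and NLL, while accuracy is untouched because monotone scaling preserves the argmax:

```python
import numpy as np

def binned_ece(conf, correct, n_bins=15):
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    return sum((idx == b).mean() * abs(correct[idx == b].mean() - conf[idx == b].mean())
               for b in range(n_bins) if (idx == b).any())

def sweep_calibrator(logits, labels, grid=np.linspace(0.5, 3.0, 26)):
    """Record (parameter, ECE, NLL) along a one-parameter calibrator family."""
    out = []
    for T in grid:
        z = logits / T
        z = z - z.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        probs = np.exp(logp)
        nll = -logp[np.arange(len(labels)), labels].mean()
        out.append((T, binned_ece(probs.max(axis=1),
                                  probs.argmax(axis=1) == labels), nll))
    return out  # pick the point on the ECE-NLL front that suits the application
```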

Overall, miscalibration exhibits qualitatively distinct scaling regimes depending on statistical regime (classical, high-dimensional, generative, or domain-specific), task structure, and recalibration method. Progress in model design and calibration methodology increasingly depends on principled, quantitative characterization of these scaling laws.
