Empirical Proper Scoring Rules
- Empirical proper scoring rules are quantitative tools that assign numerical scores to probabilistic forecasts, ensuring that the expected score is minimized when forecasts align with the true distribution.
- They decompose scores into entropy and divergence terms, supporting classical metrics like log-loss, Brier score, and CRPS for robust model validation.
- Practical implementations use Monte Carlo, resampling, and kernel-based methods to calibrate forecasts, discriminate between models, and align with decision-making tasks.
Empirical proper scoring rules are quantitative methods for evaluating probabilistic forecasts by assigning numerical scores based on both observed outcomes and forecasted probability distributions. These rules are termed "proper" if, in expectation, they incentivize honest reporting of the forecaster's true beliefs about the underlying distribution. The empirical aspect refers to assessment using finite data samples, resampled ensembles, or simulation-based techniques in practical settings.
1. Mathematical Foundations of Proper Scoring Rules
Let $\mathcal{P}$ denote a convex class of probability distributions on an outcome space $\Omega$. A proper scoring rule is a function $S : \mathcal{P} \times \Omega \to \mathbb{R} \cup \{\infty\}$ such that, for any true distribution $Q \in \mathcal{P}$, the expected score is minimized by reporting $Q$ itself,
$$\mathbb{E}_{Y \sim Q}[S(Q, Y)] \le \mathbb{E}_{Y \sim Q}[S(P, Y)] \quad \text{for all } P \in \mathcal{P},$$
with strict inequality whenever $P \ne Q$ for strictly proper rules (Waghmare et al., 2 Apr 2025).
Every proper scoring rule admits a canonical entropy–divergence decomposition (Hofman et al., 28 May 2025):
$$\mathbb{E}_{Y \sim Q}[S(P, Y)] = H(Q) + D(Q, P),$$
with $H(Q) = \mathbb{E}_{Y \sim Q}[S(Q, Y)]$ as the (generalized) entropy term and $D(Q, P) \ge 0$ as the divergence, typically a Bregman divergence. This establishes the representation
$$S(P, y) = H(P) + \langle H'(P), \delta_y - P \rangle,$$
where $H'(P)$ is a supergradient (an element of the superdifferential) of the concave entropy $H$ (Waghmare et al., 2 Apr 2025).
Classical strictly proper rules include the log-loss, Brier score, continuous ranked probability score (CRPS), and energy/variogram scores for multivariate/ensemble forecasts (Alexander et al., 2021, Machete, 2011).
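As an illustration, a minimal Python sketch of three of these rules is given below. The sample-based CRPS estimator uses the kernel form $\mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ applied to an ensemble; function names and the negatively oriented (lower-is-better) convention are illustrative choices, not taken from the cited papers.

```python
# Sketches of three classical strictly proper scores (negatively oriented:
# lower is better). Names and conventions here are illustrative.
import numpy as np

def log_loss(p: np.ndarray, y: int) -> float:
    """Log score for a categorical forecast p (probability vector) and observed class y."""
    return float(-np.log(p[y]))

def brier_score(p: np.ndarray, y: int) -> float:
    """Brier score: squared distance between p and the one-hot encoding of y."""
    e = np.zeros_like(p)
    e[y] = 1.0
    return float(np.sum((p - e) ** 2))

def crps_from_samples(x: np.ndarray, y: float) -> float:
    """Sample-based CRPS estimate E|X - y| - 0.5 E|X - X'| over a 1-D ensemble x."""
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return float(term1 - term2)
```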
2. Empirical Implementation and Estimation
In practice, probabilistic forecasts $P_1, \dots, P_n$ are paired with observed outcomes $y_1, \dots, y_n$ to compute the average empirical score
$$\bar{S}_n = \frac{1}{n} \sum_{i=1}^{n} S(P_i, y_i)$$
for validation or comparative purposes (Waghmare et al., 2 Apr 2025, Bolin et al., 2019). For parametric estimation, the empirical risk minimization principle applies:
$$\hat{\theta}_n = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} S(P_\theta, y_i),$$
yielding minimum-score estimators with asymptotic normality and consistency properties when the model is well-specified (Dawid et al., 2014).
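The following is a minimal sketch of minimum-score estimation: fitting a Gaussian $\mathcal{N}(\mu, \sigma^2)$ by minimizing the average CRPS, using the standard closed-form Gaussian CRPS. The synthetic data, optimizer settings, and parameterization are illustrative assumptions.

```python
# Minimum-score estimation sketch: fit (mu, sigma) by minimizing mean CRPS.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def crps_gaussian(mu: float, sigma: float, y: np.ndarray) -> np.ndarray:
    """Closed-form CRPS of N(mu, sigma^2) at observations y (Gneiting-Raftery formula)."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def minimum_crps_fit(y: np.ndarray) -> tuple[float, float]:
    """Minimize the empirical mean CRPS over (mu, log_sigma)."""
    def objective(params):
        mu, log_sigma = params
        return np.mean(crps_gaussian(mu, np.exp(log_sigma), y))
    res = minimize(objective, x0=[np.mean(y), np.log(np.std(y) + 1e-8)])
    return float(res.x[0]), float(np.exp(res.x[1]))

# Illustrative usage on synthetic data.
y_obs = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=500)
mu_hat, sigma_hat = minimum_crps_fit(y_obs)
```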
Kernel-based scores (CRPS, energy, variogram) can be efficiently computed through Monte Carlo simulation, resampling, or closed-form expressions for parametric families, e.g. the energy score
$$\mathrm{ES}(P, y) = \mathbb{E}_P\|X - y\| - \tfrac{1}{2}\,\mathbb{E}_P\|X - X'\|,$$
and can be extended to forecast ensembles, uncertain observations, and summary statistics for point processes (Bolin et al., 2019, Heinrich-Mertsching et al., 2021).
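A Monte Carlo sketch of two multivariate kernel scores computed from an ensemble of shape (members, dimensions) is shown below; the variogram score follows the Scheuerer–Hamill form with unit weights, and the array conventions are assumptions of this sketch.

```python
# Monte Carlo estimators for two multivariate kernel scores from an ensemble
# x of shape (m, d); a sketch, not a reference implementation.
import numpy as np

def energy_score(x: np.ndarray, y: np.ndarray) -> float:
    """ES(P, y) ~ mean ||X_i - y|| - 0.5 * mean ||X_i - X_j||."""
    term1 = np.mean(np.linalg.norm(x - y, axis=1))
    term2 = 0.5 * np.mean(np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2))
    return float(term1 - term2)

def variogram_score(x: np.ndarray, y: np.ndarray, p: float = 0.5) -> float:
    """Variogram score of order p with unit weights over all coordinate pairs."""
    d = y.shape[0]
    vs = 0.0
    for i in range(d):
        for j in range(d):
            emp = np.abs(y[i] - y[j]) ** p
            model = np.mean(np.abs(x[:, i] - x[:, j]) ** p)
            vs += (emp - model) ** 2
    return float(vs)
```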
For categorical models and neural networks, empirical estimation of calibration error and refinement using Bregman decompositions is achieved with kernel or binning approaches (Popordanoska et al., 2023).
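A simple binning-based calibration-error estimate of the kind referenced above can be sketched as follows; the bin count and the use of binary top-label confidences are assumptions of this sketch rather than the estimator of Popordanoska et al. (2023).

```python
# Binned calibration-error sketch (ECE-style) for a binary classifier.
import numpy as np

def binned_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Weighted average of |mean confidence - empirical frequency| over probability bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()   # average forecast probability in the bin
            freq = labels[mask].mean()  # empirical event frequency in the bin
            ece += mask.mean() * abs(conf - freq)
    return float(ece)
```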
3. Discrimination and Decision Alignment
A central aspect of empirical scoring rules is their ability to discriminate between competing forecasts and align with downstream decision tasks. Discrimination metrics include the mean relative score, the error rate, and a generalized discrimination heuristic quantifying how well a proper score distinguishes the true model from alternatives in finite samples (Alexander et al., 2021); a small computational sketch follows the list below:
- Mean relative score: averages scores relative to the benchmark
- Error rate: frequency with which a misspecified forecast outperforms the benchmark
- Discrimination heuristic: normalized separation among models
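The first two metrics can be computed directly from per-observation score arrays, as in the sketch below; the exact definitions in Alexander et al. (2021) may differ in detail, so this is illustrative only.

```python
# Illustrative discrimination metrics from per-observation score arrays
# (negatively oriented scores: lower is better).
import numpy as np

def mean_relative_score(candidate: np.ndarray, benchmark: np.ndarray) -> float:
    """Average candidate score relative to the benchmark's average score."""
    return float(np.mean(candidate) / np.mean(benchmark))

def error_rate(candidate: np.ndarray, benchmark: np.ndarray) -> float:
    """Fraction of observations where the (misspecified) candidate beats the benchmark."""
    return float(np.mean(candidate < benchmark))
```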
Empirical studies show variogram scores with order $p = 0.5$ deliver optimal discrimination and robustness, outperforming the energy score and higher-order variogram scores in high-dimensional financial datasets (Alexander et al., 2021).
Score optimality is context-sensitive; the best empirical score depends on the application (the uncertainty measures involved are sketched after this list):
- For selective prediction, total uncertainty under the task loss yields minimal area-under-loss curves.
- Out-of-distribution detection is best served by epistemic uncertainty under log-loss (mutual information).
- Active learning gains are maximized by epistemic uncertainty under zero-one loss (Hofman et al., 28 May 2025).
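A minimal sketch of the log-loss-based uncertainty measures referenced above is shown here: total uncertainty as the entropy of the mean ensemble prediction, aleatoric uncertainty as the mean member entropy, and epistemic uncertainty as their difference (mutual information). The ensemble array shape (members, classes) is an assumption of this sketch.

```python
# Entropy-based uncertainty decomposition from an ensemble of categorical predictions.
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = np.clip(p, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def uncertainty_decomposition(member_probs: np.ndarray) -> dict:
    """member_probs: array of shape (n_members, n_classes) of probability vectors."""
    mean_pred = member_probs.mean(axis=0)
    total = entropy(mean_pred)                                    # predictive entropy
    aleatoric = float(np.mean([entropy(p) for p in member_probs]))
    epistemic = total - aleatoric                                 # mutual information
    return {"total": total, "aleatoric": aleatoric, "epistemic": epistemic}
```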
4. Robustness, Scale Invariance, and Adaptivity
Empirical proper scoring rules vary in their sensitivity to outliers, scale, and model misspecification:
- Local Scale Invariance: Only certain generalized kernel scores (e.g., scaled CRPS) are exactly adaptive to varying uncertainty, preventing undue influence from extreme or heterogeneous observations (Bolin et al., 2019).
- Robustness: Truncated kernels and bounded score variants achieve a bounded sensitivity index, ensuring stability under heavy-tailed noise, via the kernel score
$$S_{g_c}(P, y) = \mathbb{E}_P[g_c(X, y)] - \tfrac{1}{2}\,\mathbb{E}_P[g_c(X, X')],$$
where $g_c = \min\{g, c\}$ is a truncated kernel.
- Practical Adaptations: Weighted and tailored scores, including threshold-weighting, allow targeting specific aspects of the distribution (e.g., tails, extremes, summary statistics); see the sketch after this list (Waghmare et al., 2 Apr 2025, Barczy, 2019, Bolin et al., 2019).
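As a sketch of threshold-weighting, the following computes a threshold-weighted CRPS from ensemble samples via a chaining function $v(z) = \max(z, t)$, which places all weight on the upper tail above a threshold $t$; the threshold value and function name are illustrative assumptions.

```python
# Threshold-weighted CRPS from ensemble samples using the chaining function
# v(z) = max(z, t), i.e. weight on the region above threshold t.
import numpy as np

def tw_crps_from_samples(x: np.ndarray, y: float, t: float) -> float:
    """twCRPS estimate E|v(X) - v(y)| - 0.5 E|v(X) - v(X')| with v(z) = max(z, t)."""
    vx = np.maximum(x, t)
    vy = max(y, t)
    term1 = np.mean(np.abs(vx - vy))
    term2 = 0.5 * np.mean(np.abs(vx[:, None] - vx[None, :]))
    return float(term1 - term2)
```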
Applications validate these properties: scaled CRPS and its robust variants outperform standard CRPS and log-score in spatial cross-validation, stochastic volatility estimation, and regression contexts (Bolin et al., 2019). Anderson–Darling-type proper scores extend tail sensitivity by weighting CDF regions (Barczy, 2019).
5. Calibration, Refinement, and Information-Theoretic Aspects
The calibration–refinement decomposition quantifies model performance into a calibration error (a Bregman divergence between forecast probabilities and empirical frequencies) and refinement/sharpness (an entropy or f-divergence among class-conditional distributions) (Popordanoska et al., 2023):
$$\mathbb{E}[S(P, Y)] = \underbrace{\mathbb{E}\big[D\big(\mathbb{E}[Y \mid P],\, P\big)\big]}_{\text{calibration error}} + \underbrace{\mathbb{E}\big[H\big(\mathbb{E}[Y \mid P]\big)\big]}_{\text{refinement}}.$$
Consistent, asymptotically unbiased estimators exist for both quantities under general conditions. KL-calibration error and squared calibration error require distinct post-hoc calibration strategies (temperature scaling vs. isotonic regression, respectively), with empirical evidence demonstrating trade-offs and optimality depending on the chosen score (Popordanoska et al., 2023).
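For concreteness, a minimal sketch of temperature scaling, one of the post-hoc strategies mentioned above, is given below: a single temperature is fit on held-out logits by minimizing the log score. Array shapes and optimizer settings are assumptions of this sketch.

```python
# Temperature-scaling sketch: fit one temperature T by minimizing mean log-loss.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits: np.ndarray, labels: np.ndarray) -> float:
    """Find T > 0 minimizing the mean log-loss of softmax(logits / T) on held-out data."""
    def nll(log_t):
        probs = softmax(logits / np.exp(log_t))
        return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    res = minimize_scalar(nll, bounds=(-3, 3), method="bounded")
    return float(np.exp(res.x))
```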
Refinement equals a multi-distribution $f$-divergence and exhibits monotonicity under neural network mapping layers, a property that generalizes classical information bottleneck principles to arbitrary proper scoring rules (Popordanoska et al., 2023).
6. Limitations, Controversy, and Practical Guidance
No single empirical proper scoring rule is universally optimal. The selection should be matched to the metric most relevant to the decision or inference task, as explicit comparisons show that Brier, log, spherical, and CRPS have differing biases—e.g., log-loss penalizes over-confidence, spherical penalizes over-dispersion; Brier and CRPS are entropy-neutral (Machete, 2011, Martin et al., 2020). The classical theory breaks down for second-order scoring rules targeting epistemic uncertainty over distributions of distributions—no strictly proper second-order scoring rule exists under reasonable regularity (Bengs et al., 2023).
Empirical AUC fails propriety generically for probabilistic vector forecasts unless the rank-sum function is modified—fixing the denominator or using unstandardized Wilcoxon–Mann–Whitney restores strict propriety in ranking contexts (Byrne, 2015).
Practical recommendations:
- Always use strictly proper scoring rules to ensure incentivized truth-telling.
- In empirical assessment, report at least two scores (e.g., energy + variogram(0.5) for multivariate forecasts) for robustness (Alexander et al., 2021).
- Tailor the score to the end-user's risk profile (e.g., log-loss for rare-event caution, spherical for concentrated forecasts).
- When calibration is critical, select the score aligned with the loss of the downstream task and the calibration error of interest (Popordanoska et al., 2023, Hofman et al., 28 May 2025).
- Compute empirical scores using sampling-based (Monte Carlo, bootstrap, ensemble) or closed-form techniques, adapting for sample size and computational constraints; a bootstrap comparison sketch follows this list.
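The following sketch illustrates a bootstrap assessment of the mean score difference between two forecasters from per-observation scores; the resample count and interval level are illustrative assumptions.

```python
# Bootstrap interval for the mean score difference between two forecasters.
import numpy as np

def bootstrap_score_difference(scores_a: np.ndarray, scores_b: np.ndarray,
                               n_boot: int = 2000, seed: int = 0) -> tuple[float, float]:
    """95% bootstrap interval for mean(scores_a - scores_b); negative values favour A."""
    rng = np.random.default_rng(seed)
    diff = scores_a - scores_b
    boots = np.array([rng.choice(diff, size=len(diff), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return float(lo), float(hi)
```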
7. Applications and Impact
Empirical proper scoring rules are central to:
- Uncertainty quantification (total/aleatoric/epistemic; selective prediction, OoD detection) (Hofman et al., 28 May 2025).
- Model validation and selection in high-dimensional, nonparametric, or simulation-based settings (multivariate financial forecasting, point process prediction, spatial statistics, deep learning calibration) (Alexander et al., 2021, Heinrich-Mertsching et al., 2021, Popordanoska et al., 2023).
- M-estimation, robust inference, and confidence region construction via minimum-score estimation and adjusted test statistics (Dawid et al., 2014).
- Adapting and properizing improper scores for specialized contexts (e.g., median scoring, Anderson-Darling-type rules, spread-error) (Brehmer et al., 2018, Barczy, 2019).
Future research includes extending empirical scoring methods to high-dimensional functional data, developing scalable algorithms for kernel-based and summary-statistic scores, and exploring weaker forms of properness for complex uncertainty quantification settings (Bengs et al., 2023).