
Proper Scoring Rules: Foundations & Applications

Updated 30 January 2026
  • Proper scoring rules are functions that evaluate probability forecasts by ensuring that the true distribution uniquely minimizes the expected loss.
  • They induce entropy-based divergences linked to convex analysis, with examples like the log score and Brier score providing concrete evaluation metrics.
  • Applications range from likelihood estimation and forecast evaluation to modern machine learning, offering robust tools for both univariate and multivariate data analysis.

A proper scoring rule is a function that quantifies the quality of a quoted probability distribution for a random variable, penalizing deviations from the true (realized) outcome such that truth-telling becomes an optimal strategy in expectation. Strict propriety means that the expected score is uniquely minimized when the forecast matches the true distribution. Proper scoring rules underpin statistical calibration, forecast evaluation, minimum-risk inference, and economic mechanism design, with deep connections to convex analysis, Bregman divergences, and information theory (Waghmare et al., 2 Apr 2025, Ovcharov, 2015, Dawid et al., 2014). Their mathematical structure and operational implications are both rich and subtle, particularly in high-dimensional, nonparametric, or structured-data regimes.

1. Foundations: Definition and Characterization

Let $Y$ be an outcome in a measurable space $\mathcal Y$, let $\mathcal P$ be a convex class of probability measures on $\mathcal Y$, and let $S : \mathcal P \times \mathcal Y \to \mathbb R \cup \{\pm\infty\}$ be a scoring rule. The expected score of a forecast $P$ under the reference (true) distribution $Q$ is

$$S(P, Q) = \mathbb E_{Y \sim Q}[S(P, Y)].$$

$S$ is proper if, for all $P, Q \in \mathcal P$,

$$S(Q, Q) \leq S(P, Q),$$

and strictly proper if equality holds only for $P = Q$ (Waghmare et al., 2 Apr 2025).
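The definition can be checked numerically. Below is a minimal Python sketch (assuming NumPy is available) that verifies propriety of the Brier score on a three-outcome space by scanning candidate forecasts on a simplex grid:

```python
import numpy as np

def brier(p, y):
    """Brier score of forecast vector p at outcome y (lower is better)."""
    e = np.zeros(len(p)); e[y] = 1.0
    return float(np.sum((p - e) ** 2))

def expected_score(score, p, q):
    """S(P, Q) = E_{Y~Q}[S(P, Y)] on a finite outcome space."""
    return sum(q[y] * score(p, y) for y in range(len(q)))

q = np.array([0.5, 0.3, 0.2])          # the "true" distribution
s_truth = expected_score(brier, q, q)  # the entropy H(Q) of the Brier rule

# Propriety: every forecast on a simplex grid scores at least as badly
# in expectation as the truth itself.
grid = [np.array([a, b, 1.0 - a - b])
        for a in np.linspace(0, 1, 21)
        for b in np.linspace(0, 1, 21) if a + b <= 1.0]
assert all(expected_score(brier, p, q) >= s_truth - 1e-12 for p in grid)
```

For the Brier score the expected score $\sum_j p_j^2 - 2\sum_j p_j q_j + 1$ is a convex quadratic in $p$, so the truth is its unique minimizer, which the grid scan confirms.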

Every proper scoring rule induces an entropy functional $H(P) = S(P, P)$ and a divergence $d(P, Q) = S(P, Q) - H(Q) \geq 0$, often (but not always) a Bregman divergence. In finite spaces, McCarthy–Savage theory asserts that all proper scoring rules correspond, via subgradient constructions, to concave “entropies” $H$ on the probability simplex, and the general case (Gneiting–Raftery, Dawid) extends this with functional supergradients. If $H$ is Gâteaux-differentiable, strictly proper scoring rules are in bijection with strictly concave $H$ (Waghmare et al., 2 Apr 2025, Ovcharov, 2015).

2. Canonical Examples and Taxonomy

Logarithmic score: $S_{\log}(P, y) = -\log p(y)$ (for a density $p$). Its entropy is the Shannon entropy and its divergence the Kullback–Leibler divergence; it is local of order zero, strictly proper, and unbounded on events assigned zero probability. It is the foundation of maximum-likelihood estimation and information-theoretic risk (Waghmare et al., 2 Apr 2025, Dawid et al., 2014).

Brier (quadratic) score: $S_{\mathrm{Brier}}(P, y) = \sum_j (p_j - \mathbf{1}_{j=y})^2$. Its divergence is the squared $L^2$-distance; it is strictly proper, less punitive than the log score when $p(y)$ is small, and defined only for discrete outcomes (Waghmare et al., 2 Apr 2025, Machete, 2011, Dawid et al., 2014).

CRPS: $S_{\mathrm{CRPS}}(P, y) = \int_{-\infty}^{\infty} [F_P(x) - \mathbf{1}_{y \leq x}]^2\, dx$. It can be written as a kernel score and is strictly proper on all distributions with finite first moment. It is more robust in the tails than the log score and penalizes over- and under-dispersion symmetrically (Waghmare et al., 2 Apr 2025, Machete, 2011, Bolin et al., 2019).
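The kernel form of the CRPS, $\mathrm{CRPS}(P, y) = \mathbb E|X - y| - \tfrac{1}{2}\mathbb E|X - X'|$ with $X, X' \sim P$ independent, makes it directly estimable from a forecast ensemble. A hedged sketch (assuming NumPy; this uses the plain pairwise average, not the bias-corrected "fair" variant):

```python
import numpy as np

def crps_samples(x, y):
    """Sample-based CRPS via the kernel identity
    CRPS(P, y) = E|X - y| - 0.5 * E|X - X'|, with X, X' ~ P independent."""
    x = np.asarray(x, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

rng = np.random.default_rng(0)
ens = rng.normal(0.0, 1.0, size=2000)  # ensemble drawn from N(0, 1)
# For P = N(0, 1) and y = 0 the exact value is
# sqrt(2/pi) - 1/sqrt(pi) ~= 0.2337; the estimate should land nearby.
print(crps_samples(ens, 0.0))
```

The $O(m^2)$ pairwise term is the cost of this estimator; for large ensembles a sorted-sample formula is commonly used instead.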

Kernel and energy scores: $S_h(P, y) = \int h(y, x)\, dP(x) - \frac{1}{2} \iint h(x, x')\, dP(x)\, dP(x')$ for a negative-definite kernel $h$. The energy score, with $h(x, x') = \|x - x'\|^\beta$ for $\beta \in (0, 2)$, is strictly proper and can be estimated from samples; it is commonly used in multivariate forecast evaluation (Waghmare et al., 2 Apr 2025, Alexander et al., 2021, Pic et al., 2024).

Hyvärinen score: for densities $p$ on $\mathbb R^d$, $S_{\mathrm{Hyv}}(P, y) = \Delta \log p(y) + \frac{1}{2} \|\nabla \log p(y)\|^2$. It is local of order 2, strictly proper, and free of the normalization constant, making it suited to unnormalized or graphical models (Parry et al., 2011, Waghmare et al., 2 Apr 2025).
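Because the Hyvärinen score depends only on $\nabla \log p$ and $\Delta \log p$, the normalizing constant never enters. A sketch for an (unnormalized) Gaussian, in which minimizing the average score over a grid of locations recovers the sample mean (NumPy assumed; `hyvarinen_gaussian` is an illustrative helper, not a library function):

```python
import numpy as np

def hyvarinen_gaussian(y, mu, sigma2):
    """Hyvarinen score of a (possibly unnormalized) N(mu, sigma2) at y:
    grad log p = -(y - mu)/sigma2 and laplacian log p = -1/sigma2, so
    S = -1/sigma2 + 0.5 * (y - mu)^2 / sigma2^2."""
    return -1.0 / sigma2 + 0.5 * (y - mu) ** 2 / sigma2 ** 2

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.5, size=20000)

# Minimizing the average score over a location grid recovers the sample
# mean -- the normalizing constant of the density never appears.
grid_mu = np.linspace(0.0, 4.0, 401)
avg = [float(np.mean(hyvarinen_gaussian(data, m, 1.5 ** 2))) for m in grid_mu]
mu_hat = grid_mu[int(np.argmin(avg))]
print(mu_hat)  # close to the true location 2.0
```

This is exactly the score-matching estimating equation in its simplest one-dimensional form.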

General separable scores: Given a convex function $\psi : [0, \infty) \to \mathbb R$,

$$S(x, Q) = -\psi'(q(x)) - \int [\psi(q(y)) - q(y)\,\psi'(q(y))]\, d\mu(y).$$

This family includes the log, Brier, and Tsallis scores (Dawid et al., 2014, Ovcharov, 2015).
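As a concrete check of the separable family: taking $\psi(t) = t^2$ with counting measure $\mu$ on a finite outcome space reproduces the Brier score up to an affine shift, which leaves (strict) propriety unchanged. A small Python sketch (NumPy assumed):

```python
import numpy as np

psi = lambda t: t ** 2    # convex generator
dpsi = lambda t: 2 * t    # its derivative psi'

def separable_score(q, x):
    """S(x, Q) = -psi'(q(x)) - sum_y [psi(q(y)) - q(y) psi'(q(y))]
    with mu the counting measure on a finite outcome space."""
    return float(-dpsi(q[x]) - np.sum(psi(q) - q * dpsi(q)))

def brier(q, x):
    e = np.zeros(len(q)); e[x] = 1.0
    return float(np.sum((q - e) ** 2))

q = np.array([0.5, 0.3, 0.2])
# With psi(t) = t^2 the separable score equals the Brier score minus 1.
for x in range(3):
    assert abs(separable_score(q, x) - (brier(q, x) - 1.0)) < 1e-12
```

Algebraically, $-2q(x) + \sum_j q_j^2 = \big(\sum_j q_j^2 - 2q(x) + 1\big) - 1$, which is the Brier score shifted by a constant.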

3. Geometric and Convex-Analytic Structure

All proper scoring rules on the simplex are subgradients (via 0-homogeneous extensions) of convex entropy functionals $\Phi$ (strictly convex for strict propriety). The associated Bregman divergence

$$D_\Phi(p \,\Vert\, q) = \Phi(p) - \Phi(q) - \langle \nabla \Phi(q),\, p - q \rangle$$

coincides with the divergence of $S$: $D_\Phi(p \Vert q) = \Phi(p) - \mathbb E_p[S(q)]$, so propriety of $S$ is equivalent to non-negativity of $D_\Phi$ (Ovcharov, 2015, Waghmare et al., 2 Apr 2025). For general (non-smooth) entropies in infinite dimensions, the scoring rule is uniquely determined almost everywhere on the quasi-interior, wherever the directional derivative is linear (Ovcharov, 2015).
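For example, taking $\Phi$ to be the negative Shannon entropy recovers the Kullback–Leibler divergence, the divergence of the log score. A quick numerical check (NumPy assumed):

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """D_Phi(p || q) = Phi(p) - Phi(q) - <grad Phi(q), p - q>."""
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

neg_entropy = lambda p: float(np.sum(p * np.log(p)))   # convex Phi
grad_neg_entropy = lambda p: np.log(p) + 1.0           # its gradient

kl = lambda p, q: float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
# Phi = negative Shannon entropy  =>  D_Phi is exactly KL divergence.
assert abs(bregman(neg_entropy, grad_neg_entropy, p, q) - kl(p, q)) < 1e-12
```

The linear term $\langle \log q + 1,\, p - q \rangle$ collapses because both vectors sum to one, leaving $\sum_i p_i \log(p_i/q_i)$.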

Sublinear (1-homogeneous) extensions yield scale-invariant scoring rules: on the positive cone $C$, the extension satisfies $\Phi(\lambda p) = \lambda \Phi(p)$, and hence $S(\lambda q) = S(q)$ (Dawid et al., 2011). This underpins many local scoring rules and normalization-constant-free inference in Markov random fields and exponential families (Dawid et al., 2011, Parry et al., 2011).

4. Domination, Coherence, and Extensions

Classical results show that, for additive, continuous, strictly proper rules on finite sample spaces, any incoherent (non-probabilistic) forecast $c$ is strictly dominated by some coherent $p$: $s(p)(\omega) < s(c)(\omega)$ for every $\omega$ (Pruss, 2021, 0710.3183). Recent work shows that this domination holds for all strictly proper rules satisfying weak continuity and denseness of finite scores on the positive-facing boundary (“no positive-facing gaps”). Additivity is not required, and finite continuity can be weakened to this geometric density condition, which is both necessary and sufficient (Pruss, 2021). Violations occur only for pathological rules with, e.g., positive-facing holes in the convex hull of attainable score vectors.

Properization theory shows that any (potentially improper) scoring rule can be “properized” by replacing the forecast with its Bayes act under the rule, provided such minimizers exist, thereby making truth-telling an optimal strategy in expectation (Brehmer et al., 2018). Existence may require compactness or coercivity conditions on $\mathcal P$.
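A toy illustration of properization (Python, NumPy assumed): the "linear" score $S(P, y) = -p(y)$ is improper, since its Bayes act is a point mass at the mode; substituting that Bayes act yields the zero-one score, which is proper (though not strictly).

```python
import numpy as np

def linear_score(p, y):
    """The 'linear' score S(P, y) = -p(y): NOT proper."""
    return -p[y]

def expected(score, p, q):
    return sum(q[y] * score(p, y) for y in range(len(q)))

q = np.array([0.6, 0.4])
point_mass = np.array([1.0, 0.0])

# Impropriety: a point mass at the mode beats truthful reporting.
assert expected(linear_score, point_mass, q) < expected(linear_score, q, q)

def properized(p, y):
    """Replace the forecast by its Bayes act under the linear score
    (a point mass at the mode), yielding the zero-one score."""
    return linear_score(np.eye(len(p))[int(np.argmax(p))], y)

# The properized rule is proper: the truth is (weakly) optimal.
assert expected(properized, q, q) <= expected(properized, np.array([0.4, 0.6]), q)
```

The example also shows why properization need not deliver strict propriety: all forecasts with the same mode receive the same expected properized score.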

5. Locality, Homogeneity, and Computational Considerations

On continuous spaces, local scoring rules (depending on $q(x)$ and finitely many derivatives at $x$) are fully classified: nontrivial local strictly proper rules exist only at even order (the log score at $m = 0$, the Hyvärinen score at $m = 2$, and higher-order generalizations at $m = 2t$). All such rules are homogeneous in $q$, eliminating dependence on normalization constants and facilitating inference in unnormalized models (Parry et al., 2011, Dawid et al., 2011). Discrete local rules are characterized by the structure of a graph over the outcome space: every differentiable proper discrete local rule is the gradient of a concave, homogeneous, clique-additive entropy (Dawid et al., 2011).

In high-dimensional, multivariate settings, aggregation and transformation frameworks extend proper scoring to feature-based and interpretable rules (e.g., pairwise variogram scores, patch-based energy scores), targeting specific forecast attributes (margins, dependence, extremes) via composition of transformations and proper univariate scores. The combined sum is again proper; strict propriety holds when at least one transformation is injective and paired with a strictly proper rule (Pic et al., 2024).
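A minimal sketch of one such feature-based rule, the pairwise variogram score estimated from a forecast ensemble (NumPy assumed; unit weights, order $p = 0.5$, and the ensemble sizes below are illustrative choices):

```python
import numpy as np

def variogram_score(ens, y, p=0.5):
    """Variogram score of order p from a forecast ensemble (unit weights).
    ens: (m, d) array of sampled forecast vectors; y: observed d-vector."""
    m, d = ens.shape
    score = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            obs = abs(y[i] - y[j]) ** p
            fcs = np.mean(np.abs(ens[:, i] - ens[:, j]) ** p)
            score += (obs - fcs) ** 2
    return score

rng = np.random.default_rng(2)
truth_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
y_obs = rng.multivariate_normal([0.0, 0.0], truth_cov)
good = rng.multivariate_normal([0.0, 0.0], truth_cov, size=2000)
bad = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=2000)  # wrong dependence
# Averaged over many observations, the ensemble with the correct
# dependence structure scores lower than the independent one.
print(variogram_score(good, y_obs), variogram_score(bad, y_obs))
```

Because the rule depends only on pairwise absolute differences, it targets the dependence structure while ignoring any bias common to all margins.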

6. Applications: Estimation, Forecast Evaluation, and Modern Domains

Minimum scoring-rule estimation generalizes maximum likelihood: one minimizes the average score over the data, yielding an $M$-estimator that is consistent and asymptotically normal when $S$ is strictly proper (Waghmare et al., 2 Apr 2025, Dawid et al., 2014). The unbiased estimating equations extend to parametric and nonparametric models, including survival analysis (with censoring-adjusted scores) (Yanagisawa, 2023), copula estimation (using conditional KL-type scores) (Chen et al., 2022), and machine-learning settings (score matching, kernel generative models) (Waghmare et al., 2 Apr 2025). Forecast evaluation employs proper scores for comparative testing, including Diebold–Mariano, Giacomini–White, and martingale-based significance frameworks.
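A small example of minimum scoring-rule estimation (Python, standard library plus NumPy): fitting the location of a Gaussian forecast by minimizing the average CRPS, using the closed-form CRPS of a normal distribution from Gneiting and Raftery; the grid search below stands in for a proper optimizer.

```python
import math
import numpy as np

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast at observation y."""
    z = (y - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return sigma * (z * (2.0 * Phi - 1.0) + 2.0 * phi - 1.0 / math.sqrt(math.pi))

rng = np.random.default_rng(3)
data = rng.normal(1.0, 2.0, size=2000)

# Minimum-CRPS estimation of the location parameter (scale held at truth).
grid = np.linspace(-1.0, 3.0, 81)
risk = [np.mean([crps_gaussian(m, 2.0, y) for y in data]) for m in grid]
mu_hat = grid[int(np.argmin(risk))]
print(mu_hat)  # a consistent estimate of the true location, here 1.0
```

Replacing `crps_gaussian` with the negative log density would recover ordinary maximum likelihood, illustrating how the choice of proper score defines the estimator.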

Murphy’s decomposition interprets proper scores in terms of uncertainty, resolution, and miscalibration, paralleling bias-variance tradeoffs (Waghmare et al., 2 Apr 2025). Practitioners in meteorology, finance, and economics use distinct scores (Brier, log, CRPS, energy/variogram) depending on whether sharpness (penalizing over-confidence), spread (penalizing under-confidence), or calibrated accuracy in the extremes is paramount (Machete, 2011, Waghmare et al., 2 Apr 2025).

7. Comparative Properties and Selection Criteria

While all proper scoring rules incentivize truthful reporting, they differ in their penalization patterns. The log score is maximally sensitive to over-confident (under-dispersed) forecasts, penalizing low-entropy, sharply concentrated deviations (Machete, 2011). The spherical score favors concentrated forecasts, rewarding lower entropy. Brier and CRPS-type scores are neutral to symmetric miscalibration but are less sensitive in the tails.

Selection among proper rules is context-dependent:

  • For penalizing over-confidence (e.g., rare catastrophic events): prefer log score.
  • For penalizing unnecessary under-confidence: prefer spherical score.
  • For neutral, symmetric evaluation: prefer Brier or CRPS (Machete, 2011).
  • For robustness and local scale-invariance (e.g., variable-uncertainty settings): use scale-invariant generalizations such as scaled CRPS (SCRPS), or robustified kernel scores (Bolin et al., 2019).

In high-dimensional or structured problems, combine proper scoring rules via transformation and aggregation to build interpretable, feature-sensitive verifications (e.g., marginals, spatial patterns, extremes, dependence) (Pic et al., 2024, Alexander et al., 2021). For calibration of probabilistic classifiers, non-parametric transformations such as the PAV algorithm are optimal for all regular binary proper scoring rules (Brummer et al., 2013).
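A self-contained sketch of the PAV (pool-adjacent-violators) algorithm for monotone calibration of binary labels against classifier scores; the function name and toy data are illustrative:

```python
def pav_calibrate(scores, labels):
    """Pool-Adjacent-Violators: fit a monotone (isotonic) map from
    classifier scores to probabilities; returns one calibrated
    probability per item, in the original order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = []                       # each block: [label_sum, count]
    for i in order:
        blocks.append([float(labels[i]), 1])
        # Merge adjacent blocks while their means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    probs = [0.0] * len(scores)
    it = iter(order)
    for s, n in blocks:
        for _ in range(n):
            probs[next(it)] = s / n   # block mean = calibrated probability
    return probs

scores = [0.1, 0.4, 0.35, 0.8, 0.9]
labels = [0, 0, 1, 1, 1]
print(pav_calibrate(scores, labels))  # -> [0.0, 0.5, 0.5, 1.0, 1.0]
```

The two middle items are pooled because their labels violate monotonicity in score order; pooling to the block mean is what makes the result optimal simultaneously for all regular binary proper scoring rules.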


References:

Key sources providing the above results include (Waghmare et al., 2 Apr 2025, Ovcharov, 2015, Dawid et al., 2014, Pruss, 2021, Machete, 2011, Bolin et al., 2019, Pic et al., 2024, Parry et al., 2011, Dawid et al., 2011, Yanagisawa, 2023, Chen et al., 2022, Brummer et al., 2013, 0710.3183), and (Brehmer et al., 2018).
