
Global Scoring Rule: Theory & Applications

Updated 26 February 2026
  • A global scoring rule evaluates a probabilistic forecast using its entire predictive distribution, with applications in weather prediction, language generation, and risk management.
  • Canonical examples include the CRPS, the Brier score, and multivariate extensions, each of which integrates or sums over the full forecast rather than reading off a single probability.
  • Despite ensuring proper scoring, global rules can yield ambiguous rankings and are sensitive to transformations, which often necessitates the complementary use of local scores like the logarithmic score.

A global scoring rule (also referred to as a nonlocal scoring rule) assigns a numerical score to a probabilistic forecast based on the entire distribution, not solely on the predicted probability at the observed outcome. Such rules are central to the theory and practice of probabilistic forecast evaluation, encompassing both general frameworks (e.g., $L^2$-kernel-based constructions) and canonical metrics like the Continuous Ranked Probability Score (CRPS), Brier score, and multivariate extensions. The global rules discussed here are strictly proper, but unlike the logarithmic score (the canonical local rule) they may rank forecasts in unintuitive ways, lack a direct probabilistic interpretation, and are sensitive to transformations of the outcome variable. Their structure and implications are critical to applications in statistics, machine learning, and practical domains such as weather prediction, language generation, and risk management.

1. Formal Definitions and Locality Principle

Let $Y$ be a random variable taking values in $\mathbb{R}$ or a finite set, and suppose a probabilistic forecast is made in the form of a distribution $P$ (with density or mass function $p(\cdot)$). A scoring rule is a mapping

$$S(P, y)$$

where $y$ is the realized outcome. The score is typically interpreted as a loss, with lower values preferable. For probability vectors $p = (p_1,\dots,p_m)$ over finite $X = \{1,\dots,m\}$, the rule $S(p,i)$ evaluates the forecast $p$ on realization $i$.

A scoring rule $S$ is strictly proper if, for any distributions $P$ and $Q$, it holds that

$$\mathbb{E}_{Y \sim Q}[S(Q, Y)] \le \mathbb{E}_{Y \sim Q}[S(P, Y)]$$

with equality only when $P = Q$. Thus, strictly proper rules uniquely incentivize reporting the true distribution.
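The propriety inequality can be checked numerically on a small finite example. The sketch below is illustrative only: the distribution `q` and the candidate forecasts are made-up values, and the Brier and logarithmic losses are used as two familiar strictly proper rules.

```python
import math

def brier(p, i):
    """Brier (quadratic) loss of forecast vector p when outcome i occurs."""
    return sum((pj - (1.0 if j == i else 0.0)) ** 2 for j, pj in enumerate(p))

def log_score(p, i):
    """Logarithmic loss: negative log-probability of the realized outcome."""
    return -math.log(p[i])

def expected_score(score, p, q):
    """Expected loss of reporting p when outcomes are drawn from q."""
    return sum(qi * score(p, i) for i, qi in enumerate(q) if qi > 0)

q = [0.5, 0.3, 0.2]  # hypothetical true distribution
candidates = [[0.6, 0.3, 0.1], [0.4, 0.4, 0.2], [1 / 3, 1 / 3, 1 / 3]]

for score in (brier, log_score):
    honest = expected_score(score, q, q)
    # Every gap should be strictly positive: misreporting costs expected loss.
    gaps = [expected_score(score, p, q) - honest for p in candidates]
    print(score.__name__, [round(g, 4) for g in gaps])
```

For the Brier loss the gap equals the squared Euclidean distance between `p` and `q`, and for the logarithmic loss it equals the Kullback-Leibler divergence from `q` to `p`, which is why both are positive whenever `p` differs from `q`.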

Locality distinguishes between rules based solely on the value $p(y)$ and those depending on the entire forecast. Specifically, $S$ is local if $S(P, y) = f(p(y))$ for some function $f$, and nonlocal (global) otherwise. For finite $X$, $S(p, i)$ is local if it depends only on $p_i$; otherwise, it is global. Bernardo's theorem states that the logarithmic score is, up to affine transformation, the only strictly proper local scoring rule (Du, 2020, Shao et al., 2024).
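A quick way to see the distinction is to compare two forecasts that assign the same probability to the realized outcome but distribute the remaining mass differently. In the sketch below (hypothetical numbers), the logarithmic loss cannot tell them apart, while the Brier loss can.

```python
import math

def log_score(p, i):
    """Local: uses only p[i], the probability of the realized outcome."""
    return -math.log(p[i])

def brier(p, i):
    """Global: uses every entry of the forecast vector p."""
    return sum((pj - (1.0 if j == i else 0.0)) ** 2 for j, pj in enumerate(p))

# Two hypothetical forecasts that agree on outcome 0 but not on the rest.
p_a = [0.5, 0.25, 0.25]
p_b = [0.5, 0.49, 0.01]
observed = 0

print(log_score(p_a, observed) == log_score(p_b, observed))  # identical local scores
print(brier(p_a, observed), brier(p_b, observed))            # different global scores
```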

2. Canonical Global Scoring Rules

Several archetypal global (nonlocal) strictly proper scoring rules are widely used:

  • Continuous Ranked Probability Score (CRPS): For a CDF $F$,

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \big(F(x) - \mathbf{1}\{x \ge y\}\big)^2 \, dx$$

This score requires integration over the full predictive distribution, not just its value at $y$ (Du, 2020).

  • Ranked Probability Score (RPS): For categorical variables with $m$ ordered categories:

$$\mathrm{RPS}(p, i) = \sum_{k=1}^{m} \big(F_k - \mathbf{1}\{i \le k\}\big)^2$$

where $F_k = \sum_{j=1}^{k} p_j$ is the cumulative forecast probability. This aggregates information over all categories (Du, 2020).

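  • Brier Score: For binary event prediction,

$$\mathrm{BS}(p, y) = (p - y)^2$$

where $p$ is the forecast probability of the event and $y \in \{0, 1\}$ indicates whether it occurred. While simple, the Brier score in its multi-category form, $\sum_{j=1}^{m} (p_j - \mathbf{1}\{j = i\})^2$, depends on the full probability vector $p$ rather than only on $p_i$, and is therefore not local (Du, 2020, Shao et al., 2024).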

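  • Multivariate Generalizations: The $L^2$-kernel (or mixture) framework constructs strictly proper global rules for multivariate distributions, such as the quadratic score and multivariate CRPS:

$$\mathrm{QS}(P, y) = \int_{\mathbb{R}^d} p(x)^2 \, dx - 2\,p(y)$$

$$\mathrm{CRPS}_w(P, y) = \int_{\mathbb{R}^d} \big(F(x) - \mathbf{1}\{y \le x\}\big)^2 \, w(x) \, dx$$

Here, $p$ and $F$ are the density and CDF of $P$, the inequality $y \le x$ is understood componentwise, and $w \ge 0$ is a weight function chosen so that the integral is finite (Meng et al., 2020).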
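The univariate scores in this list can be computed directly for small discrete forecasts. The following is a minimal sketch, assuming the standard textbook forms of the CRPS (squared difference between the forecast CDF and the step function at the outcome), the RPS (the same idea on cumulative category probabilities), and the binary Brier score; the function names and example numbers are illustrative.

```python
def crps_by_integration(points, weights, y):
    """CRPS of a discrete forecast (support points with weights summing to 1).

    The integrand (F(x) - 1{x >= y})^2 is piecewise constant between
    consecutive knots, so the integral reduces to a finite sum.
    """
    knots = sorted(set(points) | {y})
    total = 0.0
    for left, right in zip(knots, knots[1:]):
        cdf = sum(w for x, w in zip(points, weights) if x <= left)
        step = 1.0 if left >= y else 0.0
        total += (cdf - step) ** 2 * (right - left)
    return total

def rps(p, i):
    """Ranked probability score for ordered categories (0-based outcome i)."""
    total, cdf = 0.0, 0.0
    for k, pk in enumerate(p):
        cdf += pk
        step = 1.0 if k >= i else 0.0
        total += (cdf - step) ** 2
    return total

def brier_binary(p, y):
    """Binary Brier score: p is the event probability, y is 0 or 1."""
    return (p - y) ** 2

print(crps_by_integration([-1.0, 0.0, 1.0], [0.45, 0.10, 0.45], 0.0))
print(rps([0.2, 0.5, 0.3], 1))
print(brier_binary(0.7, 1))
```

Each function reads the whole forecast (all support points, all cumulative probabilities), which is what makes these rules global.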

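Further generalizations include the $\alpha$-power and pseudo-spherical scores for finite spaces, written here as losses: $S^{\mathrm{pow}}_\alpha(p, i) = (\alpha - 1)\sum_{j} p_j^{\alpha} - \alpha\, p_i^{\alpha - 1}$ and $S^{\mathrm{ps}}_\alpha(p, i) = -\,p_i^{\alpha-1} / \big(\sum_j p_j^{\alpha}\big)^{(\alpha-1)/\alpha}$ for $\alpha > 1$, with $\alpha = 2$ recovering the Brier score (up to an additive constant) and the spherical score, respectively (Shao et al., 2024).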
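The reduction to the Brier and spherical scores at $\alpha = 2$ can be verified numerically. This sketch assumes the standard loss-oriented forms of the two families, $(\alpha-1)\sum_j p_j^\alpha - \alpha p_i^{\alpha-1}$ for the power score and $-p_i^{\alpha-1}/\|p\|_\alpha^{\alpha-1}$ for the pseudo-spherical score; the function names are illustrative, not taken from any cited implementation.

```python
import math

def power_loss(p, i, alpha):
    """alpha-power score as a loss (assumed standard form, alpha > 1)."""
    return (alpha - 1) * sum(pj ** alpha for pj in p) - alpha * p[i] ** (alpha - 1)

def pseudo_spherical_loss(p, i, alpha):
    """Pseudo-spherical score as a loss (assumed standard form, alpha > 1)."""
    norm = sum(pj ** alpha for pj in p) ** (1.0 / alpha)
    return -((p[i] / norm) ** (alpha - 1))

def brier(p, i):
    return sum((pj - (1.0 if j == i else 0.0)) ** 2 for j, pj in enumerate(p))

def spherical_loss(p, i):
    return -p[i] / math.sqrt(sum(pj * pj for pj in p))

p, i = [0.2, 0.5, 0.3], 1
# At alpha = 2 the power loss equals the Brier loss minus the constant 1.
print(power_loss(p, i, 2.0) + 1.0, brier(p, i))
# At alpha = 2 the pseudo-spherical loss equals the spherical loss.
print(pseudo_spherical_loss(p, i, 2.0), spherical_loss(p, i))
```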

3. Theoretical Properties and Limitations

Global scoring rules, while strictly proper, exhibit several distinctive theoretical features and limitations:

  • Ranking Ambiguity: Different global strictly proper rules can rank imperfect forecasts differently. For two forecasts $P_1$, $P_2$ (neither equal to the true distribution $Q$), it can occur that $\mathbb{E}_{Y \sim Q}[S(P_1, Y)] < \mathbb{E}_{Y \sim Q}[S(P_2, Y)]$ for one score, and the reverse for another, making unambiguous performance comparison impossible in the absence of the true distribution (Du, 2020).
  • Transformation Sensitivity: Nonlocal scores are generally not invariant under smooth bijective transformations $g$ of the outcome variable. The rule’s value and the induced order of forecasters can change under reparameterization:

$$S(P_g, g(y)) \ne S(P, y) \quad \text{in general},$$

where $P_g$ denotes the distribution of $g(Y)$ when $Y \sim P$. This leads to potential inconsistencies across units or coordinate systems (Du, 2020).

  • Unintuitive (“Unfortunate”) Evaluations: Certain global scores, notably CRPS, may prefer forecasts that assign low probability mass at the realized outcome if other aspects of the distribution (e.g., the median) are favored. For example, for a fixed forecast, CRPS as a function of the outcome is minimized when the outcome coincides with the predictive median, regardless of how much probability the forecast assigns there (Du, 2020).
  • Boundedness and Smoothing: Global rules like Brier and spherical are bounded, which affects their sensitivity to rare events and motivates the use of masked log-score penalties to enforce strict calibration and regularization (Shao et al., 2024).
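Two of the limitations above can be reproduced with small hand-picked examples. The sketch below uses hypothetical numbers: a three-outcome truth `q` under which the Brier and spherical losses disagree about which of two imperfect forecasts is better, and a three-point forecast whose median carries little mass, so that CRPS rewards the low-probability outcome more than a high-probability one.

```python
import math

def expected_brier(p, q):
    """Expected Brier loss of forecast p when outcomes follow q."""
    norm_sq = sum(pj * pj for pj in p)
    return sum(qi * (norm_sq - 2.0 * p[i] + 1.0) for i, qi in enumerate(q))

def expected_spherical(p, q):
    """Expected spherical loss (negated spherical reward) of p under q."""
    norm = math.sqrt(sum(pj * pj for pj in p))
    return -sum(qi * p[i] / norm for i, qi in enumerate(q))

def crps_discrete(points, weights, y):
    """CRPS via the identity E|X - y| - 0.5 * E|X - X'| for a discrete forecast."""
    mean_abs_err = sum(w * abs(x - y) for x, w in zip(points, weights))
    spread = sum(wi * wj * abs(xi - xj)
                 for xi, wi in zip(points, weights)
                 for xj, wj in zip(points, weights))
    return mean_abs_err - 0.5 * spread

# Ranking ambiguity: neither forecast equals the (hypothetical) truth q.
q = [0.9, 0.05, 0.05]
forecast_a = [1.0, 0.0, 0.0]
forecast_b = [0.805, 0.0975, 0.0975]
print("Brier prefers B:", expected_brier(forecast_b, q) < expected_brier(forecast_a, q))
print("Spherical prefers A:", expected_spherical(forecast_a, q) < expected_spherical(forecast_b, q))

# CRPS and the median: the forecast puts mass 0.10 at 0 (its median), 0.45 at 1.
points, weights = [-1.0, 0.0, 1.0], [0.45, 0.10, 0.45]
for y in (0.0, 1.0):
    mass = weights[points.index(y)]
    print(y, "CRPS:", round(crps_discrete(points, weights, y), 3),
          "log loss:", round(-math.log(mass), 3))
```

In the second example the logarithmic loss orders the two outcomes the opposite way from CRPS, since it looks only at the mass assigned to the realized value.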

4. Global Scoring Rules in Modern Machine Learning

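Global strictly proper scoring rules have been adapted for high-dimensional predictive modeling, such as language generation. In these settings, the sample space is exponentially large (e.g., token sequences for LLMs). The challenge of intractable sequence-level evaluation is addressed by decomposing the global score into token-level components using the autoregressive factorization $p(y_{1:T} \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, x)$ and scoring each next-token distribution separately: $\mathcal{L}(p, y_{1:T}) = \sum_{t=1}^{T} S\big(p(\cdot \mid y_{<t}, x),\, y_t\big)$, where $y_{<t} = (y_1, \dots, y_{t-1})$ and $x$ is the input context (Shao et al., 2024).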
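The token-level decomposition can be sketched as a sum of per-step scores over the next-token distributions. The example below is a toy illustration with a three-token vocabulary and made-up probabilities; `sequence_loss` is a hypothetical helper, not the training objective of any particular library or paper.

```python
import math

def brier(p, i):
    """Brier loss of a next-token distribution p when token i is observed."""
    return sum((pj - (1.0 if j == i else 0.0)) ** 2 for j, pj in enumerate(p))

def log_score(p, i):
    return -math.log(p[i])

def sequence_loss(next_token_dists, tokens, score):
    """Sum a token-level score over the steps of one sequence.

    next_token_dists[t] is the model's distribution over the vocabulary
    at step t, conditioned on the tokens before t.
    """
    return sum(score(p, t) for p, t in zip(next_token_dists, tokens))

# Toy sequence of length 2 over a vocabulary of size 3 (hypothetical values).
dists = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
tokens = [0, 1]

print(sequence_loss(dists, tokens, brier))
print(sequence_loss(dists, tokens, log_score), -math.log(0.7 * 0.8))
```

For the logarithmic score the token-level sum coincides with the sequence-level score, because the log of a product is a sum of logs. For the Brier score the token-level sum is a tractable surrogate rather than the score of the full sequence distribution.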

Empirical studies show that replacing the standard log-likelihood (local score) with global scores (Brier, spherical) in language generation—particularly during fine-tuning—can yield improved BLEU and ROUGE scores in machine translation and summarization. The effect is present across both Transformers and LLMs (LLaMA-7B, -13B) (Shao et al., 2024). Score smoothing techniques (convex combinations with uniform or log penalties) are employed to address the specific boundedness of global rules. The findings suggest that models trained with global rules can exhibit greater calibration or desirable tail behavior, though possibly at the cost of slower convergence or different early dynamics.

5. The $L^2$-Kernel Framework and Level Set Decomposition

For multivariate distributions, the $L^2$-kernel mixture framework provides a systematic method for constructing global scoring rules:

$$S_{K,w}(P, y) = \int_{\mathbb{R}^d} \big((K * p)(x) - K(x - y)\big)^2 \, w(x) \, dx$$

where $K$ is a smoothing kernel and $w$ a weight. The divergence between two forecasts $P, Q$ is the $L^2$-distance between their convolved densities $K * p$ and $K * q$, ensuring strict propriety when $K$ and $w$ satisfy certain conditions (Meng et al., 2020).
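The link between the score and the $L^2$ divergence can be made explicit with a short calculation. The derivation below assumes the kernel score has the form $S(P, y) = \int \big((K * p)(x) - K(x - y)\big)^2 w(x)\,dx$, which is one natural way to write such a score and is an assumption here rather than a quotation from the cited work. It uses $\mathbb{E}_{Y \sim Q}[K(x - Y)] = (K * q)(x)$, and the term $\mathbb{E}_{Y \sim Q}[K(x - Y)^2]$ cancels because it does not depend on the forecast.

```latex
\begin{aligned}
\mathbb{E}_{Y \sim Q}[S(P, Y)] - \mathbb{E}_{Y \sim Q}[S(Q, Y)]
&= \int \Big[ (K * p)^2 - 2\,(K * p)(K * q) \Big]\, w \, dx
 \;-\; \int \Big[ (K * q)^2 - 2\,(K * q)^2 \Big]\, w \, dx \\
&= \int \big( (K * p)(x) - (K * q)(x) \big)^2 \, w(x) \, dx \;\ge\; 0 .
\end{aligned}
```

Under this form the score is proper for any kernel and nonnegative weight, and it is strictly proper exactly when the weighted distance vanishes only for $P = Q$, which is where conditions on $K$ and $w$ enter.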

A salient feature is that such global scores admit decomposition, via the layer-cake theorem, into integrals over level-set scores. For any threshold $t > 0$, the density's $t$-upper level set $L_t(p) = \{x : p(x) \ge t\}$ can be individually scored:

$$S(P, y) = \int_0^\infty S_t\big(L_t(p), y\big) \, dt$$

This decomposition enables targeted evaluation for tasks such as anomaly detection, risk estimation (e.g., CoVaR), and combining forecasts by minimizing convex mixtures of proper scores (Meng et al., 2020).
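As a worked instance, the decomposition can be carried out by hand for the quadratic score $\|p\|_2^2 - 2p(y)$. The calculation below is a sketch using the layer-cake identities $p(x)^2 = \int_0^{p(x)} 2t\,dt$ and $p(y) = \int_0^\infty \mathbf{1}\{p(y) \ge t\}\,dt$; the notation $L_t = \{x : p(x) \ge t\}$ and $|L_t|$ for its Lebesgue measure is introduced here for illustration.

```latex
\begin{aligned}
\int_{\mathbb{R}^d} p(x)^2 \, dx - 2\,p(y)
&= \int_0^\infty 2t \, |L_t| \, dt \;-\; 2 \int_0^\infty \mathbf{1}\{y \in L_t\} \, dt \\
&= \int_0^\infty \underbrace{2\big( t\,|L_t| - \mathbf{1}\{y \in L_t\} \big)}_{S_t(L_t,\, y)} \, dt .
\end{aligned}
```

Each integrand scores a single set: for a candidate set $A$, the expected value of $2\big(t\,|A| - \mathbf{1}\{Y \in A\}\big)$ under $Y \sim Q$ equals $2\int_A (t - q(x))\,dx$, which is minimized by taking $A$ to be the $t$-upper level set of the true density $q$.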

6. Comparative Table of Scoring Rule Properties

| Scoring Rule | Local or Global | Strictly Proper | Invariant under Reparametrization | Direct Probability Interpretation |
|---|---|---|---|---|
| Logarithmic (Ignorance) | Local | Yes | Yes | Yes (bits/info) |
| Brier | Global | Yes | No | No |
| CRPS | Global | Yes | No | No |
| Quadratic ($L^2$) | Global | Yes | No | No |

The only local, strictly proper rule is the logarithmic score. The other rules in the table are global, strictly proper, and lack invariance under arbitrary smooth transformations (Du, 2020, Shao et al., 2024).

7. Practical Implications and Recommendations

Global scoring rules are indispensable for evaluating complex forecasts, supporting applications in forecast combination, probabilistic risk assessment, and multi-output prediction. Nevertheless, reliance on global scores alone can entail ambiguous rankings, lack of robustness to variable transformations, and potentially misleading assessments when evaluated outside their optimum. In domains such as language modeling, consideration of score-specific smoothing and calibration adaptations is critical for bounded global rules.

A consistent recommendation is that the logarithmic score—uniquely local, strictly proper, invariant, and directly interpretable—should always be reported alongside any global score in predictive performance evaluation (Du, 2020). Multi-score evaluation is advocated due to the divergent optimization trajectories and distinct distributional characteristics enforced by different strictly proper rules (Shao et al., 2024).
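In practice, multi-score reporting can be as simple as averaging several losses over the same set of forecasts and outcomes. The sketch below is one possible layout with hypothetical data; `score_report` and its dictionary keys are illustrative names.

```python
import math

def log_score(p, i):
    return -math.log(p[i])

def brier(p, i):
    return sum((pj - (1.0 if j == i else 0.0)) ** 2 for j, pj in enumerate(p))

def spherical(p, i):
    return -p[i] / math.sqrt(sum(pj * pj for pj in p))

def score_report(forecasts, outcomes):
    """Average the logarithmic score and two global scores over a data set."""
    rules = {"log": log_score, "brier": brier, "spherical": spherical}
    n = len(outcomes)
    return {name: sum(rule(p, i) for p, i in zip(forecasts, outcomes)) / n
            for name, rule in rules.items()}

# Hypothetical categorical forecasts and realized outcomes.
forecasts = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
outcomes = [0, 1]
report = score_report(forecasts, outcomes)
print({name: round(value, 4) for name, value in report.items()})
```

All three are expressed as losses, so lower is better in every column of the report.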

Monte Carlo methods are effective for the approximation of global scores in high-dimensional multivariate forecasting, broadening applicability to areas such as forecast mixing and conditional value-at-risk inference (Meng et al., 2020).
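A minimal sketch of the Monte Carlo idea, shown for the univariate CRPS: when the forecast is available only through samples, the identity $\mathrm{CRPS}(F, y) = \mathbb{E}|X - y| - \tfrac{1}{2}\mathbb{E}|X - X'|$ can be estimated by sample averages. The function name, sample size, and seed below are arbitrary choices for illustration.

```python
import random

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: mean|X - y| - 0.5 * mean|X - X'|.

    The double loop is O(n^2); it is kept for clarity, not speed.
    """
    n = len(samples)
    mean_abs_err = sum(abs(x - y) for x in samples) / n
    spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return mean_abs_err - 0.5 * spread

random.seed(0)
draws = [random.gauss(0.0, 1.0) for _ in range(500)]
estimate = crps_from_samples(draws, 0.0)
# The closed-form CRPS of a standard normal forecast at y = 0 is about 0.234;
# the estimate should be close to it, up to Monte Carlo error.
print(round(estimate, 3))
```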

In summary, global scoring rules provide flexible, theoretically sound mechanisms for evaluating distributional forecasts beyond the realized outcome, but must be carefully interpreted and accompanied by local (logarithmic) scoring to ensure interpretability, invariance, and robustness.
