Global Scoring Rule: Theory & Applications
- A global scoring rule scores a probabilistic forecast using the entire predictive distribution rather than only the probability assigned to the observed outcome; the canonical examples are strictly proper and are used in weather forecasting, language generation, and risk management.
- Canonical examples include CRPS, the Brier score, and multivariate extensions, each of which integrates or sums over the full forecast.
- Although strictly proper, global rules can yield ambiguous rankings and are sensitive to transformations of the outcome variable, which motivates reporting a local score (the logarithmic score) alongside them.
A global scoring rule, also referred to as a nonlocal scoring rule, assigns a numerical score to a probabilistic forecast based on the entire distribution, not solely on the predicted probability at the observed outcome. Such rules are central to the theory and practice of probabilistic forecast evaluation, encompassing both general frameworks (e.g., kernel-based constructions) and canonical metrics like the Continuous Ranked Probability Score (CRPS), the Brier score, and multivariate extensions. Unlike the local logarithmic score, the global rules discussed here, although strictly proper, may exhibit undesirable evaluation behavior, lack a direct probabilistic interpretation, and are sensitive to transformations of the outcome variable. Their structure and implications matter for applications in statistics, machine learning, and practical domains such as weather prediction, language generation, and risk management.
1. Formal Definitions and Locality Principle
Let $X$ be a random variable taking values in $\mathcal{X}$, either $\mathbb{R}^d$ or a finite set, and suppose a probabilistic forecast is made in the form of a distribution $P$ (with density or mass function $p$). A scoring rule is a mapping
$$S : (P, x) \mapsto S(P, x) \in \mathbb{R},$$
where $x \in \mathcal{X}$ is the realized outcome. The score is typically interpreted as a loss, with lower values preferable. For probability vectors $p = (p_1, \dots, p_K)$ over finite $\mathcal{X}$, the rule $S(p, i)$ evaluates the forecast $p$ on realization $i$.
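A scoring rule $S$ is strictly proper if, for any distributions $P$ and $Q$, it holds that
$$\mathbb{E}_{X \sim Q}\left[ S(Q, X) \right] \le \mathbb{E}_{X \sim Q}\left[ S(P, X) \right],$$
with equality only when $P = Q$. Thus, strictly proper rules uniquely incentivize reporting the true distribution.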
Locality distinguishes between rules based solely on the value $p(x)$ and those depending on the entire forecast. Specifically, $S$ is local if $S(P, x) = f(p(x))$ for some function $f$, and nonlocal (global) otherwise. For finite $\mathcal{X}$, $S(p, i)$ is local if it depends only on $p_i$; otherwise, it is global. Bernardo's theorem states that the logarithmic score $S(P, x) = -\log p(x)$ is, up to affine transformations, the only strictly proper local scoring rule (Du, 2020, Shao et al., 2024).
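The definitions above can be checked numerically on a small finite outcome space. The following is a minimal sketch, assuming the loss orientation used in this section; the function names and the three-outcome distributions are hypothetical choices made for illustration, not taken from the cited papers.

```python
import math

def log_score(p, i):
    """Logarithmic score as a loss: uses only p[i], so it is local."""
    return -math.log(p[i])

def brier_score(p, i):
    """Multi-category Brier score as a loss: uses every entry of p, so it is global."""
    return sum((pj - (1.0 if j == i else 0.0)) ** 2 for j, pj in enumerate(p))

def expected_score(score, p, q):
    """Expected loss of forecast p when the outcome is drawn from q."""
    return sum(qi * score(p, i) for i, qi in enumerate(q))

q = [0.5, 0.3, 0.2]      # hypothetical true distribution
p_a = [0.5, 0.4, 0.1]    # same mass on outcome 0 as p_b ...
p_b = [0.5, 0.25, 0.25]  # ... but the remaining mass is split differently

# Locality: with outcome 0 realized, the log score cannot tell p_a from p_b.
print(log_score(p_a, 0), log_score(p_b, 0))
# Non-locality: the Brier score does distinguish them.
print(brier_score(p_a, 0), brier_score(p_b, 0))
# Strict propriety: reporting q itself gives the smallest expected loss.
for score in (log_score, brier_score):
    print(score.__name__, expected_score(score, q, q), expected_score(score, p_a, q))
```

Both scores reward honest reporting in expectation, but only the Brier score reacts to how probability is spread over the outcomes that did not occur.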
2. Canonical Global Scoring Rules
Several archetypal global (nonlocal) strictly proper scoring rules are widely used:
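- Continuous Ranked Probability Score (CRPS): For a CDF $F$,
$$\mathrm{CRPS}(F, x) = \int_{-\infty}^{\infty} \left( F(y) - \mathbf{1}\{y \ge x\} \right)^2 dy.$$
This score requires integration over the full predictive distribution, not just its value at $x$ (Du, 2020).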
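- Ranked Probability Score (RPS): For categorical variables with $K$ ordered categories and realized category $i$:
$$\mathrm{RPS}(p, i) = \sum_{k=1}^{K} \left( F_k - \mathbf{1}\{k \ge i\} \right)^2,$$
where $F_k = \sum_{j \le k} p_j$. This aggregates information over all categories (Du, 2020).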
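- Brier Score: For binary event prediction with forecast probability $p$ and outcome $y \in \{0, 1\}$,
$$\mathrm{BS}(p, y) = (p - y)^2.$$
While simple, the Brier score in its general $K$-category form, $\sum_{j=1}^{K} (p_j - \mathbf{1}\{j = i\})^2$, depends on the whole probability vector and is not local in $p_i$ (Du, 2020, Shao et al., 2024).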
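- Multivariate Generalizations: The kernel (or mixture) framework constructs strictly proper global rules for multivariate distributions, such as the quadratic score and multivariate CRPS:
$$\mathrm{QS}(P, x) = \int_{\mathbb{R}^d} f(y)^2 \, dy - 2 f(x),$$
$$\mathrm{MCRPS}(P, x) = \int_{\mathbb{R}^d} \left( F(y) - \mathbf{1}\{x \le y\} \right)^2 w(y) \, dy.$$
Here, $f$ and $F$ are the density and CDF of $P$, the inequality $x \le y$ is componentwise, and $w \ge 0$ is a weight function (Meng et al., 2020).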
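The univariate scores in this list can be sketched in a few lines of Python. The sketch assumes the loss orientation, the integral form of CRPS with integrand $(F(y) - \mathbf{1}\{y \ge x\})^2$, and an unnormalized RPS; the function names, integration limits, and grid size are illustrative assumptions. The numerical CRPS is compared against the well-known closed form for a normal forecast.

```python
import math

def normal_cdf(y, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def crps_numeric(cdf, x, lo=-12.0, hi=12.0, n=48_000):
    """Midpoint-rule approximation of the integral of (F(y) - 1{y >= x})^2 over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        step = 1.0 if y >= x else 0.0
        total += (cdf(y) - step) ** 2
    return total * h

def crps_normal(x, mu=0.0, sigma=1.0):
    """Closed-form CRPS for a normal forecast N(mu, sigma^2)."""
    z = (x - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return sigma * (z * (2.0 * normal_cdf(z) - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def rps(p, i):
    """Unnormalized ranked probability score for ordered categories 0..K-1, realized category i."""
    total, cum = 0.0, 0.0
    for k, pk in enumerate(p):
        cum += pk  # cumulative forecast probability F_k
        total += (cum - (1.0 if k >= i else 0.0)) ** 2
    return total

print(crps_numeric(normal_cdf, 0.3), crps_normal(0.3))
# Mass placed two categories away from the outcome costs more than mass one category away.
print(rps([0.0, 1.0, 0.0], 0), rps([0.0, 0.0, 1.0], 0))
```

The last line illustrates why RPS is described as aggregating over all categories: it is sensitive to the distance between the forecast mass and the realized category, not only to the probability of that category.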
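Further generalizations include the $\alpha$-power and pseudo-spherical scores for finite spaces, written here as losses: $S_{\mathrm{pow}}(p, i) = (\alpha - 1) \sum_j p_j^{\alpha} - \alpha\, p_i^{\alpha - 1}$ and $S_{\mathrm{ps}}(p, i) = -\, p_i^{\alpha - 1} / \big( \sum_j p_j^{\alpha} \big)^{(\alpha - 1)/\alpha}$, with $\alpha > 1$ and $\alpha = 2$ recovering the Brier score (up to an additive constant) and the spherical score, respectively (Shao et al., 2024).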
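A short sketch of these two families follows, using the standard textbook forms negated so that lower is better; the cited paper's sign and normalization conventions may differ, and the function names and test distributions are hypothetical. It checks that $\alpha = 2$ gives the Brier score minus one and the spherical score, and that the truth has the lowest expected loss for a few values of $\alpha$.

```python
import math

def power_score(p, i, alpha=2.0):
    """alpha-power score as a loss: (alpha - 1) * sum_j p_j^alpha - alpha * p_i^(alpha - 1)."""
    return (alpha - 1.0) * sum(pj ** alpha for pj in p) - alpha * p[i] ** (alpha - 1.0)

def pseudo_spherical_score(p, i, alpha=2.0):
    """Pseudo-spherical score as a loss: -(p_i / ||p||_alpha)^(alpha - 1)."""
    norm = sum(pj ** alpha for pj in p) ** (1.0 / alpha)
    return -((p[i] / norm) ** (alpha - 1.0))

def brier(p, i):
    return sum((pj - (1.0 if j == i else 0.0)) ** 2 for j, pj in enumerate(p))

def expected(score, p, q, alpha):
    """Expected loss of forecast p under true distribution q."""
    return sum(qi * score(p, i, alpha) for i, qi in enumerate(q))

q = [0.5, 0.3, 0.2]  # hypothetical true distribution
p = [0.4, 0.4, 0.2]  # hypothetical imperfect forecast

print(power_score(q, 1), brier(q, 1) - 1.0)  # equal at alpha = 2
print(pseudo_spherical_score(q, 1), -q[1] / math.sqrt(sum(x * x for x in q)))
for alpha in (1.5, 2.0, 3.0):
    for score in (power_score, pseudo_spherical_score):
        print(alpha, score.__name__, expected(score, q, q, alpha) < expected(score, p, q, alpha))
```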
3. Theoretical Properties and Limitations
Global scoring rules, while strictly proper, exhibit several distinctive theoretical features and limitations:
- Ranking Ambiguity: Different global strictly proper rules can rank imperfect forecasts differently. For forecasts $P_1$, $P_2$ (neither equal to the true distribution $Q$), it can occur that $\mathbb{E}_{X \sim Q}[S(P_1, X)] < \mathbb{E}_{X \sim Q}[S(P_2, X)]$ for one score, and the reverse for another, making unambiguous performance comparison impossible in the absence of the true distribution (Du, 2020).
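- Transformation Sensitivity: Nonlocal scores are generally not invariant under smooth bijective transformations $g$ of the outcome. The rule's value and the induced order of forecasters can change under reparameterization; in general,
$$S(P_1, x) - S(P_2, x) \neq S(g_{\#}P_1, g(x)) - S(g_{\#}P_2, g(x)),$$
where $g_{\#}P$ denotes the distribution of $g(X)$ when $X \sim P$. This leads to potential inconsistencies across units or coordinate systems (Du, 2020).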
- Unintuitive ("Unfortunate") Evaluations: Certain global scores, notably CRPS, may favor outcomes to which the forecast assigns little probability mass, provided other aspects of the distribution (e.g., the median) line up with them. For example, for a fixed forecast, CRPS viewed as a function of the outcome is minimized when the outcome coincides with the predictive median, regardless of the probability density assigned there (Du, 2020).
- Boundedness and Smoothing: Global rules like Brier and spherical are bounded, which affects their sensitivity to rare events and motivates the use of masked log-score penalties to enforce strict calibration and regularization (Shao et al., 2024).
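The median effect described in the list above can be made concrete with a bimodal forecast. The sketch below assumes a hypothetical equal-weight mixture of two narrow normal components centered at -3 and 3, so the median (0) sits in a region of almost no density; the integration limits and grid size are likewise illustrative choices.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

MODES, SIGMA = (-3.0, 3.0), 0.5  # hypothetical two-component mixture forecast

def mix_pdf(y):
    return sum(0.5 * phi((y - m) / SIGMA) / SIGMA for m in MODES)

def mix_cdf(y):
    return sum(0.5 * Phi((y - m) / SIGMA) for m in MODES)

def crps(cdf, x, lo=-15.0, hi=15.0, n=30_000):
    """Midpoint-rule approximation of the integral of (F(y) - 1{y >= x})^2."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        total += (cdf(y) - (1.0 if y >= x else 0.0)) ** 2
    return total * h

crps_at_median = crps(mix_cdf, 0.0)       # outcome in the low-density gap
crps_at_mode = crps(mix_cdf, 3.0)         # outcome at a mode
log_at_median = -math.log(mix_pdf(0.0))
log_at_mode = -math.log(mix_pdf(3.0))

print("CRPS:", crps_at_median, crps_at_mode)
print("log score:", log_at_median, log_at_mode)
```

Under these assumptions CRPS is roughly 1.36 at the median versus 1.56 at the mode, so it rates the forecast as better when the outcome falls in the gap, while the logarithmic score (about 18.2 versus 0.92) penalizes that outcome heavily.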
4. Global Scoring Rules in Modern Machine Learning
Global strictly proper scoring rules have been adapted for high-dimensional predictive modeling, such as language generation. In these settings, the sample space is exponentially large (e.g., token sequences for LLMs). The challenge of intractable sequence-level evaluation is addressed by decomposing the global score into token-level components using the autoregressive factorization:
$$\mathcal{L}(p, y) = \sum_{t=1}^{T} S\big( p(\cdot \mid y_{<t}),\, y_t \big),$$
where $p(y) = \prod_{t=1}^{T} p(y_t \mid y_{<t})$ (Shao et al., 2024).
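The token-level decomposition amounts to summing a per-step score over positions. A minimal sketch, assuming a hypothetical three-token vocabulary and hand-written conditional distributions in place of a model's softmax outputs:

```python
import math

def log_loss(p, i):
    return -math.log(p[i])

def brier_loss(p, i):
    return sum((pj - (1.0 if j == i else 0.0)) ** 2 for j, pj in enumerate(p))

def sequence_score(token_score, step_distributions, tokens):
    """Sum a token-level score over positions t, using p(. | y_<t) at each step."""
    return sum(token_score(p, y) for p, y in zip(step_distributions, tokens))

# Hypothetical conditionals p(. | y_<t) for a two-token target sequence.
steps = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
tokens = [0, 1]

print(sequence_score(log_loss, steps, tokens))    # equals -log(0.7 * 0.6)
print(sequence_score(brier_loss, steps, tokens))  # 0.14 + 0.26
```

For the logarithmic score the token-level sum coincides exactly with the sequence-level score, because the log of a product is a sum of logs. For the Brier score the sum is a separate, tractable objective; it is not the Brier score computed over the full space of sequences.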
Empirical studies show that replacing the standard log-likelihood (local score) with global scores (Brier, spherical) in language generation—particularly during fine-tuning—can yield improved BLEU and ROUGE scores in machine translation and summarization. The effect is present across both Transformers and LLMs (LLaMA-7B, -13B) (Shao et al., 2024). Score smoothing techniques (convex combinations with uniform or log penalties) are employed to address the specific boundedness of global rules. The findings suggest that models trained with global rules can exhibit greater calibration or desirable tail behavior, though possibly at the cost of slower convergence or different early dynamics.
5. The Kernel Framework and Level Set Decomposition
For multivariate distributions, the kernel mixture framework provides a systematic method for constructing global scoring rules:
$$S(P, x) = \int_{\mathbb{R}^d} \big( (k * P)(y) - k(y - x) \big)^2 \, w(y) \, dy,$$
where $k$ is a smoothing kernel, $(k * P)(y) = \int k(y - z) \, dP(z)$, and $w$ a weight. The divergence between two forecasts $P$ and $Q$ is the $L^2$-distance between their convolved densities $k * P$ and $k * Q$, ensuring strict propriety when $k$ and $w$ satisfy certain conditions (Meng et al., 2020).
A salient feature is that such global scores admit decomposition, via the layer-cake theorem, into integrals over level-set scores. For any $\alpha > 0$, the $\alpha$-upper level set $L_\alpha = \{ y : h(y) \ge \alpha \}$ of the forecast function $h$ (for example, its density or CDF) can be individually scored:
$$S(P, x) = \int_0^{\infty} S_\alpha(L_\alpha, x) \, d\alpha.$$
This decomposition enables targeted evaluation for tasks such as anomaly detection, risk estimation (e.g., CoVaR), and combining forecasts by minimizing convex mixtures of proper scores (Meng et al., 2020).
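As a worked illustration of the layer-cake argument, consider the quadratic score $\int f(y)^2\,dy - 2 f(x)$ of a forecast density $f$. This derivation is a sketch of the general idea for one score, not a reproduction of the cited paper's notation; $\lambda$ denotes Lebesgue measure and $L_\alpha = \{y : f(y) \ge \alpha\}$.

```latex
\begin{align*}
f(x) &= \int_0^{\infty} \mathbf{1}\{ f(x) \ge \alpha \} \, d\alpha
      = \int_0^{\infty} \mathbf{1}\{ x \in L_\alpha \} \, d\alpha, \\
\int f(y)^2 \, dy &= \int \int_0^{f(y)} 2\alpha \, d\alpha \, dy
      = \int_0^{\infty} 2\alpha \, \lambda(L_\alpha) \, d\alpha, \\
\int f(y)^2 \, dy - 2 f(x) &= 2 \int_0^{\infty}
      \Big( \alpha \, \lambda(L_\alpha) - \mathbf{1}\{ x \in L_\alpha \} \Big) \, d\alpha .
\end{align*}
```

The integrand $\alpha\,\lambda(L) - \mathbf{1}\{x \in L\}$ is itself a score for a candidate set $L$. If the outcome has true density $g$, its expectation is $-\int_L (g(y) - \alpha)\,dy$, which is minimized by $L = \{y : g(y) \ge \alpha\}$, so each level set is scored consistently on its own.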
6. Comparative Table of Scoring Rule Properties
| Scoring Rule | Local or Global | Strictly Proper | Invariant under Reparametrization | Direct Probability Interpretation |
|---|---|---|---|---|
| Logarithmic (Ignorance) | Local | Yes | Yes | Yes (bits/info) |
| Brier | Global | Yes | No | No |
| CRPS | Global | Yes | No | No |
| Quadratic (L2) | Global | Yes | No | No |
The only local, strictly proper rule is the logarithmic score. The other rules in the table are global, strictly proper, and lack invariance under arbitrary smooth transformations (Du, 2020, Shao et al., 2024).
7. Practical Implications and Recommendations
Global scoring rules are indispensable for evaluating complex forecasts, supporting applications in forecast combination, probabilistic risk assessment, and multi-output prediction. Nevertheless, reliance on global scores alone can entail ambiguous rankings, lack of robustness to variable transformations, and potentially misleading assessments when evaluated outside their optimum. In domains such as language modeling, consideration of score-specific smoothing and calibration adaptations is critical for bounded global rules.
A consistent recommendation is that the logarithmic score—uniquely local, strictly proper, invariant, and directly interpretable—should always be reported alongside any global score in predictive performance evaluation (Du, 2020). Multi-score evaluation is advocated due to the divergent optimization trajectories and distinct distributional characteristics enforced by different strictly proper rules (Shao et al., 2024).
Monte Carlo methods are effective for the approximation of global scores in high-dimensional multivariate forecasting, broadening applicability to areas such as forecast mixing and conditional value-at-risk inference (Meng et al., 2020).
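One way to see how sampling makes a global multivariate score computable is the energy score, $\mathbb{E}\|X - x\| - \tfrac{1}{2}\mathbb{E}\|X - X'\|$, which reduces to CRPS in one dimension. The sketch below uses it as a stand-in, since the algorithm of the cited work is not described here; the function name, sample sizes, and test distributions are assumptions for illustration.

```python
import math
import random

def energy_score(samples, x):
    """Monte Carlo estimate of E||X - x|| - 0.5 * E||X - X'|| from forecast draws."""
    n = len(samples)
    term1 = sum(math.dist(s, x) for s in samples) / n
    # Disjoint consecutive pairs act as independent copies (X, X').
    pairs = n // 2
    term2 = sum(math.dist(samples[2 * k], samples[2 * k + 1]) for k in range(pairs)) / pairs
    return term1 - 0.5 * term2

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# One dimension: the estimate should approach the closed-form CRPS of N(0, 1).
draws_1d = [(rng.gauss(0.0, 1.0),) for _ in range(200_000)]
estimate_1d = energy_score(draws_1d, (0.3,))

# Two dimensions: a forecast centered on the outcome versus one centered far away.
good = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20_000)]
bad = [(rng.gauss(5.0, 1.0), rng.gauss(5.0, 1.0)) for _ in range(20_000)]
score_good = energy_score(good, (0.0, 0.0))
score_bad = energy_score(bad, (0.0, 0.0))

print(estimate_1d, score_good, score_bad)
```

Only draws from the forecast are needed, so the same estimator applies to forecasts that can be simulated but have no tractable density or CDF.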
In summary, global scoring rules provide flexible, theoretically sound mechanisms for evaluating distributional forecasts beyond the realized outcome, but must be carefully interpreted and accompanied by local (logarithmic) scoring to ensure interpretability, invariance, and robustness.