Strictly Consistent Scoring Functions
- Strictly consistent scoring functions are rigorous tools that uniquely minimize expected scores when the predictive distribution matches the true data-generating law.
- They are characterized via convexity principles, mixture representations, and identification functions, yielding explicit families of consistent scores for functionals such as the mean, quantile, and expectile.
- Applications span forecast verification, risk management, and model comparison, with extensions addressing robust evaluation and invariant transformations.
A strictly consistent scoring function, together with its distributional counterpart the strictly proper scoring rule, is a mathematical device for evaluating forecast quality in probabilistic and point prediction. Its defining property is that the expected score is uniquely minimized (or, in positively oriented conventions, maximized) when the point forecast matches the statistical functional of the true data-generating law, or when the predictive distribution matches that law itself. This makes truthful reporting of one's beliefs or estimates the uniquely optimal strategy, providing a rigorous foundation for forecast ranking, calibration, elicitation, and model comparison across statistics, machine learning, risk management, and the empirical sciences.
1. Mathematical Definition and Core Properties
Let $Y$ be a random variable with law $F$ in a class $\mathcal{F}$ of probability distributions, and let $T\colon \mathcal{F} \to \mathsf{A}$ be a statistical functional (e.g., mean, quantile, expectile, risk measure) with action domain $\mathsf{A}$. A scoring function $S\colon \mathsf{A} \times \mathbb{R} \to \mathbb{R}$ is called $\mathcal{F}$-consistent for $T$ if
$$\mathbb{E}_F\big[S(T(F), Y)\big] \;\le\; \mathbb{E}_F\big[S(x, Y)\big]$$
for all $x \in \mathsf{A}$ and all $F \in \mathcal{F}$. It is strictly $\mathcal{F}$-consistent if equality holds only for $x = T(F)$; i.e., $T(F)$ is the unique minimizer of the expected score for every $F \in \mathcal{F}$ (Fissler et al., 2022). For the identity functional $T(F) = F$, strict consistency coincides with strict propriety of a scoring rule. A functional is elicitable if there exists a strictly consistent scoring function for it (Fissler et al., 2017).
For distributional forecasts, a proper (strictly proper) scoring rule $S$ satisfies
$$\mathbb{E}_{Y \sim F}\big[S(F, Y)\big] \;\le\; \mathbb{E}_{Y \sim F}\big[S(G, Y)\big]$$
for all distributions $F, G$, with strict inequality if $G \neq F$ (Guan, 2021). This ensures, for probabilistic forecasts, that honest forecast reporting cannot be improved upon in expectation by reporting any other distribution.
Strictly consistent scoring functions thus guarantee that a forecaster's best possible strategy is to report their true model or point forecast, aligning incentives and supporting coherent comparison frameworks.
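As a numerical illustration, the following minimal Python sketch (function and variable names are illustrative, not drawn from the cited papers) demonstrates strict consistency empirically: under squared error the expected score is minimized at the mean, while under absolute error it is minimized at the median, so the two scores elicit different functionals of the same skewed distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # skewed: mean != median

candidates = np.linspace(0.1, 4.0, 400)

# Expected squared error: minimized at the mean (Bregman score with phi(t) = t^2).
sq = [np.mean((x - y) ** 2) for x in candidates]
# Expected absolute error: minimized at the median (pinball loss, alpha = 0.5, doubled).
ab = [np.mean(np.abs(x - y)) for x in candidates]

print("argmin squared error:", candidates[np.argmin(sq)], "vs sample mean:", y.mean())
print("argmin absolute error:", candidates[np.argmin(ab)], "vs sample median:", np.median(y))
```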
2. Structural Characterizations and Classes
Under regularity assumptions, all strictly consistent scoring functions for a given elicitable functional can be represented in explicit parametric or mixture forms, often mediated by identification functions and convexity principles (Fissler et al., 2017, Fissler et al., 2022). Key cases include (a code sketch of cases A-C follows the list):
A. Mean functional: Any strictly consistent score for the mean has the Bregman form
$$S(x, y) = \phi(y) - \phi(x) - \phi'(x)\,(y - x),$$
where $\phi$ is strictly convex (Fissler et al., 2022). The squared error (with $\phi(t) = t^2$) is a canonical example, and positive homogeneity further restricts $\phi$ to the Patton family (Miao et al., 6 Sep 2024).
B. Quantile functional (Value-at-Risk):
$$S(x, y) = \big(\mathbb{1}\{y \le x\} - \alpha\big)\big(g(x) - g(y)\big),$$
where $g$ is strictly increasing; $g(t) = t$ recovers the pinball (or "tick") loss (Ehm et al., 2015).
C. Expectile functional:
$$S(x, y) = \big|\mathbb{1}\{y \le x\} - \tau\big|\,\big(\phi(y) - \phi(x) - \phi'(x)\,(y - x)\big),$$
with $\phi$ strictly convex (Ehm et al., 2015). The asymmetric squared error arises for $\phi(t) = t^2$.
D. Multivariate functionals: Osband's principle characterizes the expected score $\bar{S}(x, F) = \mathbb{E}_F[S(x, Y)]$ via
$$\nabla_x \bar{S}(x, F) = h(x)\, \mathbb{E}_F\big[V(x, Y)\big]$$
for a strict identification function $V$ and a (positive definite) matrix function $h$ (Fissler et al., 2017, Fissler et al., 2022).
E. Mixture/Choquet representations: Every strictly consistent scoring function for quantiles or expectiles can be written as a mixture of elementary scores, i.e., as an integral against a suitable nonnegative measure; this representation plays a crucial role in constructing Murphy diagrams for universal forecast comparison (Ehm et al., 2015, Fissler et al., 2022).
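The parametric families in cases A-C translate directly into code. The following Python sketch (generator arguments and defaults are illustrative choices, not prescribed by the cited papers) implements the three scalar families in terms of their convex or increasing generators.

```python
import numpy as np

def bregman_score(x, y, phi, dphi):
    """Strictly consistent for the mean: S(x, y) = phi(y) - phi(x) - phi'(x)(y - x)."""
    return phi(y) - phi(x) - dphi(x) * (y - x)

def quantile_score(x, y, alpha, g=lambda t: t):
    """Strictly consistent for the alpha-quantile; g strictly increasing.
    The default g(t) = t recovers the pinball (tick) loss."""
    return (np.asarray(y <= x, dtype=float) - alpha) * (g(x) - g(y))

def expectile_score(x, y, tau, phi=lambda t: t**2, dphi=lambda t: 2 * t):
    """Strictly consistent for the tau-expectile; phi strictly convex.
    The default phi(t) = t^2 yields the asymmetric squared error."""
    return np.abs(np.asarray(y <= x, dtype=float) - tau) * (
        phi(y) - phi(x) - dphi(x) * (y - x)
    )

x, y = 1.5, np.array([1.0, 2.0, 3.0])
# Bregman score with phi(t) = t^2 reduces to the squared error (x - y)^2.
print(bregman_score(x, y, phi=lambda t: t**2, dphi=lambda t: 2 * t))
print(quantile_score(x, y, alpha=0.9))
print(expectile_score(x, y, tau=0.5))  # equals (x - y)^2 / 2
```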
3. Regularity, Uniqueness, and Transformation Principles
Strict consistency is sensitive to regularity conditions; the existence of a unique minimizer requires strict convexity of the generator $\phi$ (for mean/expectile), strict monotonicity of $g$ (for quantiles), or full support of the measure underlying the mixture representation (Ehm et al., 2015, Pruss, 2021).
Variable transformations generate new strictly consistent scores via the "revelation principle": if $S$ is strictly consistent for $T$ and $g$ is a bijection, then $S_g(x, y) := S(g^{-1}(x), y)$ is strictly consistent for $g \circ T$, extending the reach of scoring functions across transformed domains (Tyralis et al., 23 Feb 2025).
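A minimal numerical sketch of this principle, relying only on the fact that squared error elicits the mean: with $g = \exp$, the transformed score $(\log x - y)^2$ is strictly consistent for $\exp(\mathbb{E}[Y])$.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=0.5, scale=1.0, size=200_000)

# Revelation principle with S(x, y) = (x - y)^2 and g = exp:
# S_g(x, y) = (log(x) - y)^2 elicits g(E[Y]) = exp(E[Y]).
candidates = np.linspace(0.5, 4.0, 600)
scores = [np.mean((np.log(x) - y) ** 2) for x in candidates]
print("argmin:", candidates[np.argmin(scores)])  # close to exp(0.5) ~= 1.65
print("exp(sample mean):", np.exp(y.mean()))
```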
Equivariance and order-sensitivity further refine the class of admissible scores. For example, translation invariance and metrical order-sensitivity uniquely select the squared error for the mean and pinball loss for quantiles; only specific subclasses satisfy such invariance for vector-valued risk measures or higher-dimensional functionals (Fissler et al., 2017).
Existence and construction can be achieved by the Bayes-act ("properization") principle: any scoring rule can be made proper by evaluating it at the Bayes act (the forecast minimizing the expected score under the data-generating distribution), with strict propriety under additional uniqueness conditions (Brehmer et al., 2018).
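A minimal sketch of properization, assuming forecasts given as scipy.stats frozen distributions: the base score here is absolute error, whose Bayes act under $F$ is the median of $F$, so the properized rule scores a distributional forecast through its median. The resulting rule is proper but distinguishes forecasts only through their medians, which illustrates why extra conditions are needed for strict propriety.

```python
import numpy as np
from scipy import stats

def properized_abs_error(forecast_dist, y):
    """Properization of S(x, y) = |x - y|: evaluate at the Bayes act,
    which for absolute error is the median of the forecast distribution."""
    bayes_act = forecast_dist.ppf(0.5)
    return np.abs(bayes_act - y)

# Honest forecast (the true law) vs. a shifted competitor.
truth, competitor = stats.norm(0, 1), stats.norm(0.5, 1)
y = truth.rvs(size=100_000, random_state=2)
print(properized_abs_error(truth, y).mean())       # smaller on average
print(properized_abs_error(competitor, y).mean())  # larger on average
```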
4. Applications and Extended Domains
Strictly consistent scoring functions are foundational in:
- Forecast verification and comparison: Used in ranking models and forecast calibration, including in multivariate and structured prediction settings with (possibly nonlocal) scores such as the log score, Brier score, or energy score (Ziel et al., 2019, Shao et al., 29 May 2024, Tan et al., 2021).
- Risk management: Elicitability and strictly consistent scoring underpin risk measure backtesting and robustification, where functionals (e.g., Value-at-Risk, Expected Shortfall) are ranked using appropriately consistent scores, some of which are required to be homogeneous for scale-invariant sensitivity measures (Miao et al., 6 Sep 2024, Fissler et al., 2022).
- Classification and top-list predictions: Strictly consistent scores characterize elicitable set-valued procedures (e.g., top-$k$ lists), readily constructed from symmetric proper scoring rules such as the Brier score (Resin, 2023).
- Semi-parametric and survival models: Strictly proper adaptations of the log score apply in right-censored competing risks settings, retaining unique identification of the true model even under non-informative censoring (Guan, 2021).
- Language modeling: Strictly proper rules beyond the log score (e.g., Brier, spherical) have been shown to improve empirical performance in autoregressive LLMs via token-level loss decomposition, confirming their principled optimization properties (Shao et al., 29 May 2024); the three rules are sketched below.
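For concreteness, a minimal Python sketch (illustrative, not the cited papers' code) of three strictly proper scoring rules for a categorical forecast $p$ with observed class $i$, written in negatively oriented (loss) form:

```python
import numpy as np

def log_score(p, i):
    """Negatively oriented log score: -log p_i."""
    return -np.log(p[i])

def brier_score(p, i):
    """Brier score: squared distance between p and the one-hot outcome."""
    e = np.zeros_like(p)
    e[i] = 1.0
    return np.sum((p - e) ** 2)

def spherical_score(p, i):
    """Negatively oriented spherical score: -p_i / ||p||_2."""
    return -p[i] / np.linalg.norm(p)

p = np.array([0.7, 0.2, 0.1])
for score in (log_score, brier_score, spherical_score):
    print(score.__name__, score(p, 0))
```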
5. Practical Construction and Diagnostics
Systematic construction, verification, and selection of strictly consistent scores follow key technical principles:
- Checklist for verifying strict consistency: Confirm elicitability of the functional, explicit minimization of the expected score at $x = T(F)$, existence of a strict identification function, strict convexity (or monotonicity) of the expected score, and domain regularity (e.g., coercivity, compactness, or full support) (Brehmer et al., 2018, Miao et al., 6 Sep 2024).
- Murphy diagrams: By decomposing any score into elementary scores (via the mixture representation), one can visually and quantitatively assess whether one forecast uniformly dominates another across all strictly consistent scores for the target functional (Ehm et al., 2015, Fissler et al., 2022); see the sketch after this list.
- Tail- and region-sensitive variants: Weighted mixtures or component selection in the kernel representation yield strictly consistent scores emphasizing specific regions (e.g., upper tail, extremes), preserving incentive properties in targeted evaluation scenarios (Taggart, 2021).
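A minimal sketch of a Murphy diagram for quantile forecasts, using elementary scores from the mixture representation (written here as $S_\theta(x, y) = (\mathbb{1}\{y \le x\} - \alpha)(\mathbb{1}\{\theta < x\} - \mathbb{1}\{\theta < y\})$; tie-handling conventions vary across references):

```python
import numpy as np

def elementary_quantile_score(x, y, alpha, theta):
    """Elementary score S_theta for the alpha-quantile (cf. the mixture
    representation of Ehm et al., 2015)."""
    return (np.asarray(y <= x, float) - alpha) * (
        (theta < x).astype(float) - (theta < y).astype(float)
    )

rng = np.random.default_rng(3)
y = rng.normal(size=50_000)
alpha = 0.9
fcst_a = np.full_like(y, np.quantile(y, alpha))  # well-calibrated forecast
fcst_b = np.full_like(y, 0.0)                    # miscalibrated competitor

thetas = np.linspace(-3, 3, 121)
curve_a = [elementary_quantile_score(fcst_a, y, alpha, t).mean() for t in thetas]
curve_b = [elementary_quantile_score(fcst_b, y, alpha, t).mean() for t in thetas]

# Plotting the curves against theta gives the Murphy diagram; forecast A
# dominates B if its curve lies (weakly) below B's for every theta.
print(max(a - b for a, b in zip(curve_a, curve_b)) <= 0)  # expected: True
```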
6. Robustness, Homogeneity, and Extensions
Extensions to robust settings are enabled by formulating worst-case expected scores over KL-divergence neighborhoods of the reference distribution. Robust elicitable functionals retain their uniqueness and strict identification as long as the scoring function’s convexity and regularity constraints are preserved (Miao et al., 6 Sep 2024).
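An illustrative sketch of this robustification (not the paper's code), using the standard convex-duality identity $\sup_{Q:\,\mathrm{KL}(Q\|P)\le\varepsilon} \mathbb{E}_Q[S] = \inf_{\eta>0}\big\{\eta \log \mathbb{E}_P[e^{S/\eta}] + \eta\varepsilon\big\}$ to evaluate the worst-case expected score over a KL ball around the empirical distribution:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def worst_case_expected_score(scores, eps):
    """Dual of sup over {Q : KL(Q||P) <= eps} of E_Q[score], with P empirical."""
    scores = np.asarray(scores, float)

    def dual(log_eta):
        eta = np.exp(log_eta)  # optimize over log(eta) to enforce eta > 0
        m = scores.max()       # log-sum-exp stabilization
        return m + eta * np.log(np.mean(np.exp((scores - m) / eta))) + eta * eps

    return minimize_scalar(dual, bounds=(-5, 5), method="bounded").fun

rng = np.random.default_rng(4)
y = rng.normal(size=10_000)
scores = (0.3 - y) ** 2  # squared-error scores of a fixed mean forecast
print("nominal:", scores.mean(), "worst-case:", worst_case_expected_score(scores, 0.1))
```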
Homogeneous strictly consistent scores (i.e., $S(cx, cy) = c^b\,S(x, y)$ for all $c > 0$ and some degree $b$) are particularly important for scale-invariant applications and underlie families of scores tailored to invariant, robust, or economic settings (Miao et al., 6 Sep 2024, Fissler et al., 2022).
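For the mean on the positive half-line, degree-$b$ homogeneity singles out the Bregman scores with power generators; a commonly used parameterization of this family (cf. the Patton family referenced above; normalizations vary across references) is
$$S_b(x, y) = \begin{cases} \dfrac{y^b - x^b}{b(b-1)} - \dfrac{x^{b-1}}{b-1}(y - x), & b \in \mathbb{R} \setminus \{0, 1\}, \\[1ex] \dfrac{y}{x} - \log\dfrac{y}{x} - 1, & b = 0 \ (\text{QLIKE}), \\[1ex] y \log\dfrac{y}{x} - (y - x), & b = 1, \end{cases}$$
with $b = 2$ recovering half the squared error.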
Further, the theory accommodates variable transformations, whether on realizations, predictions, or jointly, providing a unifying framework for the systematic development of new scores and the reinterpretation of training objectives in applied machine learning (e.g., $g$-transformed expectiles, log-transformed means) (Tyralis et al., 23 Feb 2025).
7. Theoretical and Empirical Impact
Strictly consistent scoring functions constitute the mathematical backbone of elicitable functional identification, robust estimation, decision-theoretically coherent model comparison, and incentive-aligned forecast evaluation. Their rigorous characterization enables the design of evaluation metrics that guarantee unique minimization, behave predictably under transformations, and admit practical interpretation via mixture/Choquet decompositions. Modern applications from advanced risk management to LLM optimization explicitly rely on these foundational principles, underscoring their centrality in contemporary statistics and machine learning (Fissler et al., 2017, Guan, 2021, Shao et al., 29 May 2024).