Bounded Accuracy Scoring Metric
- The paper introduces a strictly proper scoring rule that uses a gain-and-loss function to assess individual probability estimates with clear calibration orientation.
- The metric is defined via a piecewise, bounded function that assigns scores in $[-1, 0.5]$, ensuring an absolute baseline and hyperparameter-free evaluation.
- Empirical results on esports and synthetic datasets demonstrate that the Balance score exhibits lower bias and better computational efficiency than traditional metrics such as the Brier score and ECE.
A Bounded Accuracy-Based Scoring Metric, exemplified by the "Balance score," is a single-instance, strictly proper scoring rule for probability estimation models. Developed in response to limitations of existing metrics in probabilistic classification and calibration assessment, this metric is notable for boundedness, known optimum, calibration orientation, and evaluation granularity. Its primary application addresses win probability estimation models in esports, but it is positioned as suitable for general probability estimation contexts due to its robust theoretical and empirical properties (Choi et al., 2023).
1. Formal Definition and Functional Form
Let $n$ be the number of test instances, $\hat{p}_i \in [0,1]$ the predicted probability for instance $i$, $y_i \in \{0,1\}$ the observed label, and $\hat{y}_i = \mathbb{1}[\hat{p}_i \ge 0.5]$ the thresholded class decision. The Balance score averages a pointwise gain-and-loss function $s$:

$$\mathrm{Balance} = \frac{1}{n} \sum_{i=1}^{n} s(\hat{p}_i, y_i),$$

where

$$s(\hat{p}, y) = \begin{cases} 1 - \hat{p} & \text{if } \hat{p} \ge 0.5 \text{ and } y = 1,\\ -\hat{p} & \text{if } \hat{p} \ge 0.5 \text{ and } y = 0,\\ \hat{p} & \text{if } \hat{p} < 0.5 \text{ and } y = 0,\\ \hat{p} - 1 & \text{if } \hat{p} < 0.5 \text{ and } y = 1. \end{cases}$$

Alternatively, using indicator notation:

$$s(\hat{p}, y) = \mathbb{1}[\hat{y} = y]\,\min(\hat{p},\, 1 - \hat{p}) \;-\; \mathbb{1}[\hat{y} \ne y]\,\max(\hat{p},\, 1 - \hat{p}).$$
This construction provides a geometric interpretation, where the score is linear in $\hat{p}$ on each region defined by the confidence regime ($\hat{p} \ge 0.5$ versus $\hat{p} < 0.5$) and correctness.
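For concreteness, evaluating $s$ at one high-confidence and one low-confidence prediction:

$$s(0.8, 1) = 1 - 0.8 = 0.2, \qquad s(0.8, 0) = -0.8,$$
$$s(0.3, 0) = 0.3, \qquad s(0.3, 1) = 0.3 - 1 = -0.7.$$

A correct decision earns $\min(\hat{p}, 1-\hat{p})$, the distance to the nearer certainty endpoint, while an incorrect one loses $\max(\hat{p}, 1-\hat{p})$, the full confidence placed on the chosen class.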
2. Boundedness and Score Interpretation
For all $\hat{p} \in [0,1]$ and $y \in \{0,1\}$, $s$ satisfies

$$-1 \le s(\hat{p}, y) \le 0.5,$$

ensuring the global boundedness $-1 \le \mathrm{Balance} \le 0.5$. The extremal value $-1$ corresponds to high-confidence, incorrect predictions ($\hat{p}$ at $0$ or $1$ with a wrong label), and the best-case gain of $0.5$ occurs at $\hat{p} = 0.5$ when the decision is correct. The ideal value for a perfectly calibrated model is $0$.
3. Desired Metric Properties
A robust probability estimation metric is expected to satisfy criteria enumerated in Table 1 of (Choi et al., 2023):
| Property | Requirement | Satisfied by Balance |
|---|---|---|
| Properness | Unique optimum at the true probability $p$ | Yes |
| Single-instance evaluation | No grouping/binning required | Yes |
| Known optimum | The perfectly calibrated value is specified | Yes |
| Absolute scoring | Comparable across datasets | Yes |
| No hyperparameters | Requires no choices such as a bin count | Yes |
| Calibration measurement | Truly reflects calibration, not just discrimination | Yes |
Properness is proven via the expected score $\mathbb{E}_{y \sim \mathrm{Bernoulli}(p)}[s(\hat{p}, y)]$, which has a unique zero at $\hat{p} = p$ and deviates from zero otherwise. Calibration orientation comes from the fact that the expected score vanishes if and only if the predicted and actual probabilities coincide.
4. Comparative Analysis: Brier Score and Expected Calibration Error
Brier Score
The Brier score, $\mathrm{Brier} = \frac{1}{n}\sum_{i=1}^{n} (\hat{p}_i - y_i)^2$, is strictly proper and free of hyperparameters, but its optimum depends on the underlying true probability distribution, precluding direct comparability across datasets.
Expected Calibration Error (ECE)
$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{n}\,\bigl|\mathrm{acc}(S_b) - \mathrm{conf}(S_b)\bigr|,$$

where $B$ is the number of bins and $S_b$ are the bin assignments (the instances whose $\hat{p}_i$ falls in bin $b$). ECE has a known optimum ($0$ for perfect calibration) and is absolute, but it requires selection of $B$ and does not operate at the single-instance level.
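As a point of comparison, a minimal sketch of the equal-width-bin ECE computation (the binning scheme here is one common choice; the paper's exact variant may differ, and the `ece` name is ours):

```python
import numpy as np

def ece(p_hat, y, n_bins=10):
    """Equal-width-bin ECE: bin-weighted mean |accuracy - confidence|."""
    p_hat, y = np.asarray(p_hat, dtype=float), np.asarray(y)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    idx = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            conf = p_hat[mask].mean()   # mean predicted probability in the bin
            acc = y[mask].mean()        # empirical frequency of label 1 in the bin
            total += mask.mean() * abs(acc - conf)
    return total
```

The bin count and boundary placement are exactly the hyperparameters that Balance avoids.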
Relationship to True ECE
Given true conditional probabilities $p_i$ and predictions $\hat{p}_i$, the true expected calibration error is $\frac{1}{n}\sum_{i=1}^{n} |\hat{p}_i - p_i|$. The Balance score approximates this without binning: for overconfident models, $\mathrm{Balance} < 0$; for underconfident models, $\mathrm{Balance} > 0$; and in both cases $|\mathrm{Balance}|$ tracks the mean absolute miscalibration. This bin-free property underpins its bias advantage over ECE.
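A quick Monte-Carlo illustration of this relationship, under assumed true probabilities and a synthetic overconfidence bias of $0.10$ (the setup is illustrative, not the paper's exact experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.5, 0.85, 200_000)   # assumed true win probabilities
p_hat = p + 0.10                      # systematically overconfident model
y = rng.binomial(1, p)

# Pointwise gain-and-loss; every p_hat here is >= 0.5 (high-confidence regime).
s = np.where(y == 1, 1 - p_hat, -p_hat)

print(f"Balance  = {s.mean():+.4f}")                  # ~ -0.10: negative, overconfident
print(f"true ECE = {np.abs(p_hat - p).mean():.4f}")   # exactly 0.10 by construction
```

Reversing the bias (underconfidence) yields $\mathrm{Balance} \approx +0.10$, computed with the full piecewise $s$ since some predictions then fall below $0.5$.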
5. Proof Sketches of Calibration and Discrimination
The properness of the Balance score is established by computing the expected pointwise score under $y \sim \mathrm{Bernoulli}(p)$:

$$\mathbb{E}\bigl[s(\hat{p}, y)\bigr] = \begin{cases} p - \hat{p} & \text{if } \hat{p} \ge 0.5,\\ \hat{p} - p & \text{if } \hat{p} < 0.5, \end{cases}$$

yielding $\mathbb{E}[s] = 0$ at $\hat{p} = p$ and $\mathbb{E}[s] \ne 0$ for $\hat{p} \ne p$, so the zero point is unique. The magnitude $|\mathbb{E}[s]| = |\hat{p} - p|$ is a direct measure of calibration error. For discrimination, the score changes only at second order for small perturbations that do not affect the class decision, so ranking by Balance closely aligns with ranking by accuracy, but with penalization for systematic miscalibration.
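The case analysis above can be checked directly; a minimal numeric verification (a sketch under the Bernoulli assumption, not code from the paper):

```python
def expected_balance(p_true, p_hat):
    """Expected pointwise score under y ~ Bernoulli(p_true), by case analysis."""
    if p_hat >= 0.5:
        return p_true * (1 - p_hat) + (1 - p_true) * (-p_hat)  # = p_true - p_hat
    return (1 - p_true) * p_hat + p_true * (p_hat - 1)         # = p_hat - p_true

assert abs(expected_balance(0.7, 0.7)) < 1e-12   # zero exactly at p_hat = p
print(expected_balance(0.7, 0.8))   # ~ -0.1: overconfident -> negative
print(expected_balance(0.7, 0.6))   # ~ +0.1: underconfident -> positive
```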
6. Empirical Evaluation: Simulation and Real Data
Synthetic Data
With labels drawn as $y_i \sim \mathrm{Bernoulli}(p_i)$ and predictions set to $\hat{p}_i = p_i$ (optimal calibration), the metrics behave as follows:
- Accuracy fluctuates depending on the distribution (69–82%);
- The Brier score varies with the spread of the true probabilities and is not distribution-invariant (0.124–0.200);
- ECE and Balance score both concentrate near 0, but Balance exhibits lower bias due to the absence of binning.
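A sketch reproducing this qualitative pattern under two assumed difficulty distributions (the distributions and sample sizes here are hypothetical, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

def calibrated_metrics(p):
    """Accuracy, Brier, and Balance when the model is perfectly calibrated (p_hat = p)."""
    y = rng.binomial(1, p)
    acc = ((p >= 0.5) == (y == 1)).mean()
    brier = ((p - y) ** 2).mean()
    s = np.where(p >= 0.5,
                 np.where(y == 1, 1 - p, -p),
                 np.where(y == 0, p, p - 1))
    return acc, brier, s.mean()

easy = rng.uniform(0.7, 1.0, 100_000)   # hypothetical easy-to-call instances
hard = rng.uniform(0.5, 0.7, 100_000)   # hypothetical coin-flip-like instances
print(calibrated_metrics(easy))  # high accuracy, low Brier, Balance ~ 0
print(calibrated_metrics(hard))  # low accuracy, high Brier, Balance ~ 0
```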
Esports Data
On 100,000 “League of Legends” matches (logistic regression model using 14 features):
- Accuracy increases (65.6% → 73.6% → 79.8%) as the game proceeds,
- Brier decreases (0.216 → 0.176 → 0.139),
- ECE and Balance are consistently near zero for perfectly calibrated models: ECE ≈ 0.0068, Balance ≈ −0.0016 at 10 minutes.
Only ECE and Balance are centered at zero for perfectly calibrated models, providing meaningful baseline anchoring, while accuracy and Brier drift with the underlying class difficulty distribution.
Binning Sensitivity and Sample Size
Synthetic overconfident models (bias $0.10$ vs. $0.11$) reveal that ECE's ranking of the two models can flip under different bin counts $B$, demonstrating sensitivity to the binning specification. In contrast, Balance converges rapidly to the true ECE value even at smaller sample sizes $n$.
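A sketch of this experiment, reusing the `ece` helper above (the flip is data-dependent and may require trying several bin counts):

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.uniform(0.5, 0.85, 5_000)   # modest n, where binning noise matters
y = rng.binomial(1, p)

for bias in (0.10, 0.11):
    p_hat = p + bias
    s = np.where(y == 1, 1 - p_hat, -p_hat)   # all p_hat >= 0.5 here
    eces = [round(ece(p_hat, y, b), 4) for b in (5, 10, 15, 20)]
    print(f"bias={bias:.2f}  Balance={s.mean():+.4f}  ECE by bin count: {eces}")
# On the same sample, Balance always ranks the 0.11-bias model worse (exactly
# 0.01 more negative), while the ECE ordering can change with the bin count.
```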
7. Practical Computation and Implementation
Given prediction and label arrays, the Balance score can be computed in linear time, with no binning or sorting required:

```python
def compute_balance_score(p_hat, y):
    """Mean of the pointwise gain-and-loss function s over all instances."""
    n = len(y)
    total = 0.0
    for i in range(n):
        p = p_hat[i]
        yi = y[i]
        if p >= 0.5:
            # High-confidence regime: gain 1 - p when correct, lose p when wrong.
            s = 1 - p if yi == 1 else -p
        else:
            # Low-confidence regime: gain p when correct, lose 1 - p when wrong.
            s = p if yi == 0 else p - 1
        total += s
    return total / n
```
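For array inputs, a vectorized NumPy equivalent (a convenience sketch, not part of the paper):

```python
import numpy as np

def balance_score(p_hat, y):
    """Vectorized Balance score; identical result to compute_balance_score."""
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y)
    s = np.where(p_hat >= 0.5,
                 np.where(y == 1, 1.0 - p_hat, -p_hat),
                 np.where(y == 0, p_hat, p_hat - 1.0))
    return float(s.mean())

# Example: balance_score([0.8, 0.3], [1, 1]) -> (0.2 - 0.7) / 2 = -0.25
```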
The Balance score thus provides an absolute, calibration-oriented evaluation with ideal value $0$, strict properness, and no hyperparameter sensitivity. All claims and evaluations are supported by simulation and real-world esports datasets as detailed in (Choi et al., 2023).