Bounded Accuracy Scoring Metric
- The paper introduces a strictly proper scoring rule that uses a gain-and-loss function to assess individual probability estimates with clear calibration orientation.
- The metric is defined via a piecewise, bounded function that assigns scores in $[-1, 0.5]$, ensuring an absolute baseline and hyperparameter-free evaluation.
- Empirical results on esports and synthetic datasets demonstrate that the Balance score exhibits lower bias and better computational efficiency than traditional metrics such as the Brier score and ECE.
A Bounded Accuracy-Based Scoring Metric, exemplified by the "Balance score," is a single-instance, strictly proper scoring rule for probability estimation models. Developed in response to limitations of existing metrics in probabilistic classification and calibration assessment, this metric is notable for boundedness, known optimum, calibration orientation, and evaluation granularity. Its primary application addresses win probability estimation models in esports, but it is positioned as suitable for general probability estimation contexts due to its robust theoretical and empirical properties (Choi et al., 2023).
1. Formal Definition and Functional Form
Let $n$ be the number of test instances, $\hat{p}_i \in [0,1]$ the predicted probability for instance $i$, $y_i \in \{0,1\}$ the observed label, and $\hat{y}_i = \mathbb{1}[\hat{p}_i \ge 0.5]$ the thresholded class decision. The Balance score averages a pointwise gain-and-loss function $s$:

$$\mathrm{Balance} = \frac{1}{n} \sum_{i=1}^{n} s(\hat{p}_i, y_i),$$

where

$$s(\hat{p}, y) = \begin{cases} 1 - \hat{p} & \text{if } \hat{p} \ge 0.5 \text{ and } y = 1,\\ -\hat{p} & \text{if } \hat{p} \ge 0.5 \text{ and } y = 0,\\ \hat{p} & \text{if } \hat{p} < 0.5 \text{ and } y = 0,\\ \hat{p} - 1 & \text{if } \hat{p} < 0.5 \text{ and } y = 1. \end{cases}$$

Alternatively, using indicator notation:

$$s(\hat{p}, y) = \mathbb{1}[\hat{y} = y]\,\min(\hat{p},\, 1 - \hat{p}) \;-\; \mathbb{1}[\hat{y} \ne y]\,\max(\hat{p},\, 1 - \hat{p}).$$
This construction provides a geometric interpretation, where the score is linear in $\hat{p}$ on each region defined by the confidence regime ($\hat{p} \ge 0.5$ versus $\hat{p} < 0.5$) and correctness.
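For concreteness, evaluating $s$ at one high-confidence and one low-confidence prediction:

$$s(0.8, 1) = 1 - 0.8 = 0.2, \qquad s(0.8, 0) = -0.8,$$
$$s(0.3, 0) = 0.3, \qquad s(0.3, 1) = 0.3 - 1 = -0.7.$$

A correct decision earns $\min(\hat{p}, 1-\hat{p})$, the distance to the nearer certainty endpoint, while an incorrect one loses $\max(\hat{p}, 1-\hat{p})$, the full confidence placed on the chosen class.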
2. Boundedness and Score Interpretation
For all $\hat{p} \in [0,1]$ and $y \in \{0,1\}$, $s$ satisfies

$$-1 \le s(\hat{p}, y) \le 0.5,$$

ensuring the global boundedness $-1 \le \mathrm{Balance} \le 0.5$. The extremal value $-1$ corresponds to high-confidence, incorrect predictions ($\hat{p}$ at $0$ or $1$ with a wrong label), and the best-case gain of $0.5$ occurs at $\hat{p} = 0.5$ when the decision is correct. The ideal value for a perfectly calibrated model is $0$.
3. Desired Metric Properties
A robust probability estimation metric is expected to satisfy criteria enumerated in Table 1 of (Choi et al., 2023):
| Property | Requirement | Satisfied by Balance |
|---|---|---|
| Properness | Unique optimum at the true probability $p$ | Yes |
| Single-instance evaluation | No grouping/binning required | Yes |
| Known optimum | The perfectly calibrated value is specified | Yes |
| Absolute scoring | Comparable across datasets | Yes |
| No hyperparameters | Requires no choices such as a bin count | Yes |
| Calibration measurement | Truly reflects calibration, not just discrimination | Yes |
Properness is proven via the expected score $\mathbb{E}_{y \sim \mathrm{Bernoulli}(p)}[s(\hat{p}, y)]$, which has a unique zero at $\hat{p} = p$ and deviates from zero otherwise. Calibration orientation comes from the fact that the expected score vanishes if and only if the predicted and actual probabilities coincide.
4. Comparative Analysis: Brier Score and Expected Calibration Error
Brier Score
The Brier score, $\mathrm{Brier} = \frac{1}{n}\sum_{i=1}^{n} (\hat{p}_i - y_i)^2$, is strictly proper and free of hyperparameters, but its optimum depends on the underlying true probability distribution, precluding direct comparability across datasets.
Expected Calibration Error (ECE)
$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{n}\,\bigl|\mathrm{acc}(S_b) - \mathrm{conf}(S_b)\bigr|,$$

where $B$ is the number of bins and $S_b$ are the bin assignments (the instances whose $\hat{p}_i$ falls in bin $b$). ECE has a known optimum ($0$ for perfect calibration) and is absolute, but it requires selection of $B$ and does not operate at the single-instance level.
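As a point of comparison, a minimal sketch of the equal-width-bin ECE computation (the binning scheme here is one common choice; the paper's exact variant may differ, and the `ece` name is ours):

```python
import numpy as np

def ece(p_hat, y, n_bins=10):
    """Equal-width-bin ECE: bin-weighted mean |accuracy - confidence|."""
    p_hat, y = np.asarray(p_hat, dtype=float), np.asarray(y)
    # Assign each prediction to one of n_bins equal-width bins over [0, 1].
    idx = np.clip((p_hat * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            conf = p_hat[mask].mean()   # mean predicted probability in the bin
            acc = y[mask].mean()        # empirical frequency of label 1 in the bin
            total += mask.mean() * abs(acc - conf)
    return total
```

The bin count and boundary placement are exactly the hyperparameters that Balance avoids.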
Relationship to True ECE
Given true conditional probabilities $p_i$ and predictions $\hat{p}_i$, the true expected calibration error is $\frac{1}{n}\sum_{i=1}^{n} |\hat{p}_i - p_i|$. The Balance score approximates this without binning: for overconfident models, $\mathrm{Balance} < 0$; for underconfident models, $\mathrm{Balance} > 0$; and in both cases $|\mathrm{Balance}|$ tracks the mean absolute miscalibration. This bin-free property underpins its bias advantage over ECE.
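A quick Monte-Carlo illustration of this relationship, under assumed true probabilities and a synthetic overconfidence bias of $0.10$ (the setup is illustrative, not the paper's exact experiment):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.5, 0.85, 200_000)   # assumed true win probabilities
p_hat = p + 0.10                      # systematically overconfident model
y = rng.binomial(1, p)

# Pointwise gain-and-loss; every p_hat here is >= 0.5 (high-confidence regime).
s = np.where(y == 1, 1 - p_hat, -p_hat)

print(f"Balance  = {s.mean():+.4f}")                  # ~ -0.10: negative, overconfident
print(f"true ECE = {np.abs(p_hat - p).mean():.4f}")   # exactly 0.10 by construction
```

Reversing the bias (underconfidence) yields $\mathrm{Balance} \approx +0.10$, computed with the full piecewise $s$ since some predictions then fall below $0.5$.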
5. Proof Sketches of Calibration and Discrimination
The properness of the Balance score is established by computing the expected pointwise score under $y \sim \mathrm{Bernoulli}(p)$:

$$\mathbb{E}\bigl[s(\hat{p}, y)\bigr] = \begin{cases} p - \hat{p} & \text{if } \hat{p} \ge 0.5,\\ \hat{p} - p & \text{if } \hat{p} < 0.5, \end{cases}$$

yielding $\mathbb{E}[s] = 0$ at $\hat{p} = p$ and $\mathbb{E}[s] \ne 0$ for $\hat{p} \ne p$, so the zero point is unique. The magnitude $|\mathbb{E}[s]| = |\hat{p} - p|$ is a direct measure of calibration error. For discrimination, the score changes only at second order for small perturbations that do not affect the class decision, so ranking by Balance closely aligns with ranking by accuracy, but with penalization for systematic miscalibration.
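The case analysis above can be checked directly; a minimal numeric verification (a sketch under the Bernoulli assumption, not code from the paper):

```python
def expected_balance(p_true, p_hat):
    """Expected pointwise score under y ~ Bernoulli(p_true), by case analysis."""
    if p_hat >= 0.5:
        return p_true * (1 - p_hat) + (1 - p_true) * (-p_hat)  # = p_true - p_hat
    return (1 - p_true) * p_hat + p_true * (p_hat - 1)         # = p_hat - p_true

assert abs(expected_balance(0.7, 0.7)) < 1e-12   # zero exactly at p_hat = p
print(expected_balance(0.7, 0.8))   # ~ -0.1: overconfident -> negative
print(expected_balance(0.7, 0.6))   # ~ +0.1: underconfident -> positive
```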
6. Empirical Evaluation: Simulation and Real Data
Synthetic Data
With labels drawn as $y_i \sim \mathrm{Bernoulli}(p_i)$ and predictions set to $\hat{p}_i = p_i$ (optimal calibration), the metrics behave as follows:
- Accuracy fluctuates depending on the distribution (69–82%);
- The Brier score varies with the spread of the true probabilities and is not distribution-invariant (0.124–0.200);
- ECE and Balance score both concentrate near 0, but Balance exhibits lower bias due to the absence of binning.
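A sketch reproducing this qualitative pattern under two assumed difficulty distributions (the distributions and sample sizes here are hypothetical, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

def calibrated_metrics(p):
    """Accuracy, Brier, and Balance when the model is perfectly calibrated (p_hat = p)."""
    y = rng.binomial(1, p)
    acc = ((p >= 0.5) == (y == 1)).mean()
    brier = ((p - y) ** 2).mean()
    s = np.where(p >= 0.5,
                 np.where(y == 1, 1 - p, -p),
                 np.where(y == 0, p, p - 1))
    return acc, brier, s.mean()

easy = rng.uniform(0.7, 1.0, 100_000)   # hypothetical easy-to-call instances
hard = rng.uniform(0.5, 0.7, 100_000)   # hypothetical coin-flip-like instances
print(calibrated_metrics(easy))  # high accuracy, low Brier, Balance ~ 0
print(calibrated_metrics(hard))  # low accuracy, high Brier, Balance ~ 0
```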
Esports Data
On 100,000 “League of Legends” matches (logistic regression model using 14 features):
- Accuracy increases (65.6% → 73.6% → 79.8%) as the game proceeds,
- Brier decreases (0.216 → 0.176 → 0.139),
- ECE and Balance are consistently near zero for perfectly calibrated models: ECE ≈ 0.0068, Balance ≈ −0.0016 at 10 minutes.
Only ECE and Balance are centered at zero for perfectly calibrated models, providing meaningful baseline anchoring, while accuracy and Brier drift with the underlying class difficulty distribution.
Binning Sensitivity and Sample Size
Synthetic overconfident models (bias $0.10$ vs. $0.11$) reveal that ECE's ranking of the two models can flip under different bin counts $B$, demonstrating sensitivity to the binning specification. In contrast, Balance converges rapidly to the true ECE value even at smaller sample sizes $n$.
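A sketch of this experiment, reusing the `ece` helper above (the flip is data-dependent and may require trying several bin counts):

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.uniform(0.5, 0.85, 5_000)   # modest n, where binning noise matters
y = rng.binomial(1, p)

for bias in (0.10, 0.11):
    p_hat = p + bias
    s = np.where(y == 1, 1 - p_hat, -p_hat)   # all p_hat >= 0.5 here
    eces = [round(ece(p_hat, y, b), 4) for b in (5, 10, 15, 20)]
    print(f"bias={bias:.2f}  Balance={s.mean():+.4f}  ECE by bin count: {eces}")
# On the same sample, Balance always ranks the 0.11-bias model worse (exactly
# 0.01 more negative), while the ECE ordering can change with the bin count.
```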
7. Practical Computation and Implementation
Given prediction and label arrays, the Balance score can be computed in linear time, with no binning or sorting required:

```python
def compute_balance_score(p_hat, y):
    """Mean of the pointwise gain-and-loss function s over all instances."""
    n = len(y)
    total = 0.0
    for i in range(n):
        p = p_hat[i]
        yi = y[i]
        if p >= 0.5:
            # High-confidence regime: gain 1 - p when correct, lose p when wrong.
            s = 1 - p if yi == 1 else -p
        else:
            # Low-confidence regime: gain p when correct, lose 1 - p when wrong.
            s = p if yi == 0 else p - 1
        total += s
    return total / n
```
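For array inputs, a vectorized NumPy equivalent (a convenience sketch, not part of the paper):

```python
import numpy as np

def balance_score(p_hat, y):
    """Vectorized Balance score; identical result to compute_balance_score."""
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y)
    s = np.where(p_hat >= 0.5,
                 np.where(y == 1, 1.0 - p_hat, -p_hat),
                 np.where(y == 0, p_hat, p_hat - 1.0))
    return float(s.mean())

# Example: balance_score([0.8, 0.3], [1, 1]) -> (0.2 - 0.7) / 2 = -0.25
```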
The Balance score thus provides an absolute, calibration-oriented evaluation with ideal value $0$, strict properness, and no hyperparameter sensitivity. All claims and evaluations are supported by simulation and real-world esports datasets as detailed in (Choi et al., 2023).