Bounded Accuracy Scoring Metric

Updated 30 November 2025
  • The paper introduces a strictly proper scoring rule that uses a gain-and-loss function to assess individual probability estimates with clear calibration orientation.
  • The metric is defined via a piecewise, bounded function that assigns scores from -1 to 1, ensuring an absolute baseline and hyperparameter-free evaluation.
  • Empirical results on esports and synthetic datasets demonstrate that the Balance score outperforms traditional metrics such as the Brier score and ECE in terms of bias and computational efficiency.

A Bounded Accuracy-Based Scoring Metric, exemplified by the "Balance score," is a single-instance, strictly proper scoring rule for probability estimation models. Developed in response to limitations of existing metrics in probabilistic classification and calibration assessment, this metric is notable for boundedness, known optimum, calibration orientation, and evaluation granularity. Its primary application addresses win probability estimation models in esports, but it is positioned as suitable for general probability estimation contexts due to its robust theoretical and empirical properties (Choi et al., 2023).

1. Formal Definition and Functional Form

Let $n$ be the number of test instances, $\hat p_i \in [0,1]$ the predicted probability for instance $i$, $y_i \in \{0,1\}$ the observed label, and $\hat y_i = \mathbf{1}\{\hat p_i \geq 0.5\}$ the thresholded class decision. The Balance score averages a pointwise gain-and-loss function $f_{ba}(\hat p, y)$:

$$\text{Balance score} = \frac{1}{n} \sum_{i=1}^n f_{ba}(\hat p_i, y_i),$$

where

$$f_{ba}(\hat p, y) = \begin{cases} 1 - \hat p, & \hat p \geq 0.5,\; y = 1 \\ \hat p, & \hat p < 0.5,\; y = 0 \\ -\hat p, & \hat p \geq 0.5,\; y = 0 \\ -1 + \hat p, & \hat p < 0.5,\; y = 1 \end{cases}$$

Alternatively, using indicator notation: letting $c = \hat y \hat p + (1 - \hat y)(1 - \hat p)$ denote the confidence assigned to the predicted class, $f_{ba}(\hat p, y) = \mathbf{1}\{y = \hat y\}\,(1 - c) - \mathbf{1}\{y \neq \hat y\}\, c$.

This construction provides a geometric interpretation: the score is piecewise linear in $\hat p$, with each linear region determined by the confidence regime ($\hat p \geq 0.5$ or $\hat p < 0.5$) and the correctness of the decision.
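Working the four cases on concrete inputs makes the gain-and-loss behavior visible:

$$f_{ba}(0.8, 1) = 0.2, \quad f_{ba}(0.3, 0) = 0.3, \quad f_{ba}(0.8, 0) = -0.8, \quad f_{ba}(0.3, 1) = -0.7.$$

Correct decisions earn the complement of the reported confidence, so the largest possible gain ($0.5$) goes to a correct but maximally uncertain prediction, while errors forfeit the full confidence placed on the wrong class.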

2. Boundedness and Score Interpretation

For all $\hat p$ and $y$, $f_{ba}(\hat p, y)$ satisfies

$$-1 \leq f_{ba}(\hat p, y) \leq 1,$$

ensuring the global bound $-1 \leq \text{Balance score} \leq 1$. The extremal value $f_{ba} = -1$ corresponds to a maximally confident, incorrect prediction ($\hat p = 0$ or $\hat p = 1$ with the wrong label), while the best-case gain of $0.5$ occurs at $\hat p = 0.5$ when the decision is correct. The ideal value for a perfectly calibrated model is $0$.

3. Desired Metric Properties

A robust probability estimation metric is expected to satisfy the criteria enumerated in Table 1 of Choi et al. (2023):

| Property | Requirement | Satisfied by Balance |
| --- | --- | --- |
| Properness | Expected score uniquely zero at the true $p$ | Yes |
| Single-instance evaluation | No grouping or binning required | Yes |
| Known optimum | The perfectly calibrated value is specified | Yes |
| Absolute scoring | Comparable across datasets | Yes |
| No hyperparameters | Does not require, e.g., a number of bins | Yes |
| Calibration measurement | Reflects calibration, not just discrimination | Yes |

Properness is proven via the expected score $g(q; p) = p\, f_{ba}(q, 1) + (1-p)\, f_{ba}(q, 0)$, which has a unique zero at $q = p$ and magnitude $|g(q; p)| = |q - p|$ otherwise. Calibration orientation follows from the fact that the expected score vanishes if and only if the predicted and actual probabilities coincide.

4. Comparative Analysis: Brier Score and Expected Calibration Error

Brier Score

$$\mathrm{Brier} = \frac{1}{n} \sum_{i=1}^n (\hat p_i - y_i)^2$$

The Brier score is strictly proper and free of hyperparameters, but its optimal value depends on the underlying true probability distribution, precluding direct comparison across datasets.
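To see why the Brier optimum drifts across datasets, note that even a perfectly calibrated model ($\hat p = p$) incurs the irreducible uncertainty of the outcome:

$$\mathbb{E}\left[(\hat p - y)^2 \mid p\right] = p(1-p)^2 + (1-p)\,p^2 = p(1-p),$$

so the best attainable Brier score is $\mathbb{E}[p(1-p)]$: roughly $0.25$ on a dataset where true probabilities cluster near $0.5$, but only $0.09$ where they cluster near $0.9$.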

Expected Calibration Error (ECE)

$$\mathrm{ECE} = \sum_{m=1}^M \frac{|B_m|}{n} \left| \overline{y}(B_m) - \overline{\hat p}(B_m) \right|$$

where $M$ is the number of bins, $B_m$ is the set of instances whose predictions fall in bin $m$, and $\overline{y}(B_m)$ and $\overline{\hat p}(B_m)$ are the mean label and mean prediction within that bin. ECE has a known optimum ($0$ for perfect calibration) and is absolute, but it requires choosing $M$ and does not operate at the single-instance level.
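For concreteness, a minimal equal-width-bin ECE sketch in Python follows; equal-width binning is one common convention, and the paper's exact scheme may differ:

def expected_calibration_error(p_hat, y, M=10):
    """Equal-width-bin ECE; the bin count M is the hyperparameter ECE requires."""
    n = len(y)
    ece = 0.0
    for m in range(M):
        lo, hi = m / M, (m + 1) / M
        # Half-open bins [lo, hi); the last bin also absorbs p_hat == 1.0.
        idx = [i for i in range(n)
               if lo <= p_hat[i] < hi or (m == M - 1 and p_hat[i] == hi)]
        if not idx:
            continue
        avg_y = sum(y[i] for i in idx) / len(idx)
        avg_p = sum(p_hat[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_y - avg_p)
    return ece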

Relationship to True ECE

Let $q_p = \Pr(Y = 1 \mid \hat P = p)$ denote the true outcome frequency among instances predicted at level $p$; the true expected calibration error is then $\mathbb{E}[|q_{\hat P} - \hat P|]$. The Balance score approximates this without binning: for overconfident models, $\mathrm{True\ ECE} \approx -\text{Balance score}$, and for underconfident models, $\mathrm{True\ ECE} \approx +\text{Balance score}$. This bin-free property underpins its bias advantage over ECE.
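The sign relationship follows from the expected value of $f_{ba}$ conditional on the prediction:

$$\mathbb{E}\left[f_{ba}(\hat p, Y) \mid \hat P = \hat p\right] = \begin{cases} q_{\hat p} - \hat p, & \hat p \geq 0.5 \\ \hat p - q_{\hat p}, & \hat p < 0.5 \end{cases}$$

An overconfident model pushes $\hat p$ further from $0.5$ than $q_{\hat p}$, making both branches negative with magnitude $|q_{\hat p} - \hat p|$, so averaging over instances recovers the true ECE up to sign; underconfidence flips both signs.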

5. Proof Sketches of Calibration and Discrimination

The properness of the Balance score is established by:

$$g(q; p) = p\, f_{ba}(q, 1) + (1-p)\, f_{ba}(q, 0),$$

yielding $g(p; p) = 0$ and $g(q; p) = q - p$ or $p - q$ for $q \neq p$, so the zero point is unique. The magnitude $|g(q; p)| = |q - p|$ is a direct measure of calibration error. For discrimination, the sign of $f_{ba}$ depends only on whether the thresholded decision is correct, so rankings under Balance closely track rankings under accuracy, while systematic miscalibration incurs an additional penalty.
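Substituting the piecewise definition of $f_{ba}$ makes both regimes explicit (a direct algebraic expansion of the paper's sketch):

$$g(q; p) = \begin{cases} p(1-q) + (1-p)(-q) = p - q, & q \geq 0.5 \\ p(q-1) + (1-p)\,q = q - p, & q < 0.5 \end{cases}$$

so $g$ vanishes only at $q = p$ and satisfies $|g(q; p)| = |q - p|$ everywhere else.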

6. Empirical Evaluation: Simulation and Real Data

Synthetic Data

When $p_i \sim \mathrm{Beta}(\alpha, \alpha)$ and $y_i \sim \mathrm{Bernoulli}(p_i)$, with $\hat p_i = p_i$ (optimal calibration), the metrics behave as follows:

  • Accuracy fluctuates depending on the $p_i$ distribution (69–82%);
  • the Brier score varies with the spread of the $p_i$ and is not distribution-invariant (0.124–0.200);
  • ECE ($M = 10$) and the Balance score both concentrate near $0$, but Balance exhibits lower bias due to the absence of binning (see the simulation sketch below).
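A minimal simulation in this spirit, assuming $\alpha = 1$ and reusing compute_balance_score from Section 7 (the paper's exact experimental setup may differ):

import random

def simulate(alpha=1.0, n=100_000, seed=0):
    rng = random.Random(seed)
    # True probabilities p_i ~ Beta(alpha, alpha), labels y_i ~ Bernoulli(p_i);
    # the "model" predicts p_hat_i = p_i, i.e., it is perfectly calibrated.
    p = [rng.betavariate(alpha, alpha) for _ in range(n)]
    y = [1 if rng.random() < pi else 0 for pi in p]
    acc = sum((pi >= 0.5) == (yi == 1) for pi, yi in zip(p, y)) / n
    brier = sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / n
    balance = compute_balance_score(p, y)  # defined in Section 7
    return acc, brier, balance

Under perfect calibration, the Balance score should hover near zero for any $\alpha$, while accuracy and Brier shift with the induced class difficulty.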

Esports Data

On 100,000 “League of Legends” matches (logistic regression model using 14 features):

  • Accuracy increases (65.6% → 73.6% → 79.8%) as the game proceeds,
  • Brier decreases (0.216 → 0.176 → 0.139),
  • ECE and Balance remain consistently near zero throughout, reflecting the model's good calibration: ECE ≈ 0.0068 and Balance ≈ −0.0016 at 10 minutes.

Only ECE and Balance are centered at zero for perfectly calibrated models, providing meaningful baseline anchoring, while accuracy and Brier drift with the underlying class difficulty distribution.

Binning Sensitivity and Sample Size

Synthetic overconfident models (biases of 0.10 vs. 0.11) reveal that ECE can rank the two models differently depending on the choice of $M$, demonstrating sensitivity to the binning specification. In contrast, Balance converges rapidly to the true ECE magnitude, reaching it at smaller sample sizes $n$.
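One way to probe this sensitivity, assuming predictions that overshoot the truth by fixed biases of 0.10 and 0.11 and reusing the ECE sketch above and compute_balance_score from Section 7 (the paper's exact construction may differ):

import random

def overconfident_data(bias, n=5_000, seed=1):
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        p = rng.uniform(0.5, 0.9)              # true win probability
        y = 1 if rng.random() < p else 0
        pairs.append((min(p + bias, 1.0), y))  # overconfident prediction
    return pairs

for bias in (0.10, 0.11):
    data = overconfident_data(bias)
    p_hat, y = [d[0] for d in data], [d[1] for d in data]
    for M in (5, 10, 20):
        print(bias, M, expected_calibration_error(p_hat, y, M))
    print(bias, "Balance:", compute_balance_score(p_hat, y))

The ECE values move with $M$, whereas the Balance score involves no binning choice at all.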

7. Practical Computation and Implementation

Given prediction and label arrays, the Balance score can be computed in linear time, with no binning or sorting. A reference implementation in Python:

def compute_balance_score(p_hat, y):
    """Average the pointwise gain-and-loss f_ba over all (prediction, label) pairs."""
    n = len(y)
    total = 0.0
    for i in range(n):
        p = p_hat[i]
        yi = y[i]
        if p >= 0.5:          # thresholded decision: class 1
            if yi == 1:
                s = 1 - p     # correct and confident -> small gain
            else:
                s = -p        # wrong and confident -> large loss
        else:                 # thresholded decision: class 0
            if yi == 0:
                s = p         # correct -> gain up to 0.5
            else:
                s = -1 + p    # wrong -> loss up to -1
        total += s
    return total / n
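For large arrays, an equivalent vectorized form replaces the explicit loop; the following NumPy version is a direct translation of the piecewise definition (a sketch, not code from the paper):

import numpy as np

def balance_score_np(p_hat, y):
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y)
    y_hat = (p_hat >= 0.5).astype(int)               # thresholded decisions
    conf = np.where(y_hat == 1, p_hat, 1 - p_hat)    # confidence in predicted class
    scores = np.where(y == y_hat, 1 - conf, -conf)   # gain if correct, loss if wrong
    return float(scores.mean())

For example, balance_score_np([0.8, 0.3, 0.6], [1, 0, 0]) averages the pointwise scores $(0.2, 0.3, -0.6)$ to about $-0.033$.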

The Balance score thus delivers an absolute, calibration-oriented evaluation with a known ideal value of 0, strict properness, and no hyperparameter sensitivity. All claims and evaluations are supported by the simulation and real-world esports datasets detailed in (Choi et al., 2023).
