ERCE: Ranking Calibration Error for LMs
- ERCE is a metric that assesses rank-calibration by comparing the ordering of uncertainty scores with expected correctness in language model outputs.
- It computes the average deviation between ranks, using bin-based averaging of uncertainty and correctness measures for robust evaluation.
- ERCE provides a threshold-free evaluation that decouples from absolute performance, enabling fine-grained analysis of LM uncertainty and calibration.
Expected Ranking Calibration Error (ERCE) is a metric designed to quantify the calibration quality of uncertainty measures for language model (LM) outputs, specifically focusing on the relative ranking of uncertainty vis-à-vis generation quality. ERCE addresses the limitations of threshold-dependent and range-sensitive calibration metrics, providing a unified, principled approach for evaluating how well an uncertainty or confidence measure predicts expected correctness in natural language generation tasks (Huang et al., 4 Apr 2024).
1. Rank-Calibration: Formal Definition and Motivation
Rank-calibration is the foundational principle underlying ERCE. Given a query $x$ sampled from a data distribution, an LM generates a response $r$. For each pair $(x, r)$, two quantities are computed:
- $A$: a correctness metric (e.g., ROUGE-L, METEOR, or human rating),
- $U$: an uncertainty measure (higher implies greater uncertainty).

The calibration (regression) function is defined as
$$\mathrm{reg}(u) := \mathbb{E}[A \mid U = u].$$
The rank-calibration property requires that lower uncertainty ($U$) values correspond to higher expected generation quality, and that the ranks of $U$ and $\mathrm{reg}(U)$ are mirror images:
$$\mathbb{P}_{U'}\big(\mathrm{reg}(U') \ge \mathrm{reg}(u)\big) = \mathbb{P}_{U'}\big(U' \le u\big) \quad \text{for all } u \text{ in the support of } U,$$
where $U'$ denotes an independent copy of $U$. An uncertainty measure that exactly satisfies this is termed rank-calibrated.
2. Mathematical Formulation of ERCE
ERCE quantifies the average magnitude of deviation from the ideal rank-calibration property. Let $U$ be the uncertainty score of a sample and $U'$ an independent copy. The population ERCE is defined as
$$\mathrm{ERCE} := \mathbb{E}_{U}\Big[\,\big|\,\mathbb{P}_{U'}\big(\mathrm{reg}(U') \ge \mathrm{reg}(U)\big) - \mathbb{P}_{U'}\big(U' \le U\big)\,\big|\,\Big].$$
This evaluates, across the distribution of uncertainty values, how far the practical ranking of the uncertainty measure disagrees with the implied ranking from expected correctness.
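As a brief sanity check on this definition (a short derivation added here, not taken from the source): if $U$ is exactly rank-calibrated, the two probabilities inside the absolute value coincide at every uncertainty level, so the integrand vanishes; and since both terms are probabilities, the metric is automatically bounded:
$$\mathbb{P}_{U'}\big(\mathrm{reg}(U') \ge \mathrm{reg}(u)\big) = \mathbb{P}_{U'}\big(U' \le u\big)\ \ \forall u \;\Longrightarrow\; \mathrm{ERCE} = 0, \qquad \text{and } 0 \le \mathrm{ERCE} \le 1 \text{ in general}.$$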
A finite-sample estimator of ERCE proceeds as follows:
- Collect $n$ i.i.d. pairs $(u_i, a_i)$, for $i = 1, \dots, n$.
- Sort and partition the $u_i$'s into $B$ equal-mass bins $I_1, \dots, I_B$.
- For bin $b$, compute:
  - the binwise average uncertainty: $\bar{u}_b = \frac{1}{|I_b|} \sum_{i \in I_b} u_i$,
  - the binwise average correctness: $\bar{a}_b = \frac{1}{|I_b|} \sum_{i \in I_b} a_i$.
- For any sample $i$ in bin $b$, estimate the two rank fractions as
$$\widehat{\mathbb{P}}\big(\mathrm{reg}(U') \ge \mathrm{reg}(u_i)\big) = \frac{1}{B-1} \sum_{b' \ne b} \mathbf{1}\{\bar{a}_{b'} \ge \bar{a}_b\}, \qquad \widehat{\mathbb{P}}\big(U' \le u_i\big) = \frac{1}{B-1} \sum_{b' \ne b} \mathbf{1}\{\bar{u}_{b'} \le \bar{u}_b\}.$$
- The empirical ERCE is:
$$\widehat{\mathrm{ERCE}} = \frac{1}{n} \sum_{b=1}^{B} \sum_{i \in I_b} \bigg| \frac{1}{B-1} \sum_{b' \ne b} \mathbf{1}\{\bar{a}_{b'} \ge \bar{a}_b\} - \frac{1}{B-1} \sum_{b' \ne b} \mathbf{1}\{\bar{u}_{b'} \le \bar{u}_b\} \bigg|.$$
This formulation is invariant to order-preserving (monotone increasing) transformations of $U$ and to its range, focusing solely on rank relationships.
3. Theoretical Properties and Connection to Classical Calibration
ERCE generalizes the classical calibration error to ranking settings with non-binary correctness measures. Key characterizations include:
- Theorem 1 (Equivalence to classical calibration for binary correctness): If $A \in \{0, 1\}$, then $U$ is rank-calibrated if and only if a strictly decreasing function $g$ exists such that the confidence $C = g(U)$ is classically calibrated: $\mathbb{P}(A = 1 \mid C = c) = c$.
- The converse direction also holds constructively: any calibrated confidence $C$ yields a rank-calibrated uncertainty measure $g(C)$ for any strictly decreasing $g$.
- Proposition 1 (ERCE is independent of ECE): For any attainable pair of target values $(\alpha, \beta)$, one can construct a joint distribution of $(U, A)$ with $\mathrm{ERCE} = \alpha$ and standard Expected Calibration Error $\mathrm{ECE} = \beta$.
These statements underline that ERCE and ECE capture distinct aspects of calibration: ERCE is purely about rank-consistency between uncertainty estimates and expected correctness, whereas ECE measures absolute probability calibration.
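To make this decoupling concrete, the following toy sketch (an illustration written for this article, not code from the paper) applies a strictly increasing distortion to a perfectly calibrated confidence: the binned ECE becomes large, while the ordering of samples is untouched, so any purely rank-based criterion such as ERCE is unaffected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): true correctness probabilities p_i, binary
# correctness a_i ~ Bernoulli(p_i), and two confidence measures:
#   c_cal  = p        (perfectly calibrated)
#   c_dist = p ** 3   (strictly increasing distortion: same ranks, poor ECE)
n = 50_000
p = rng.uniform(0.05, 0.95, size=n)
a = rng.binomial(1, p)
c_cal, c_dist = p, p ** 3

def binned_ece(conf, correct, n_bins=15):
    """Standard equal-width-bin ECE: weighted |bin accuracy - bin confidence|."""
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

print(f"ECE (calibrated confidence): {binned_ece(c_cal, a):.3f}")   # near 0
print(f"ECE (distorted confidence):  {binned_ece(c_dist, a):.3f}")  # clearly > 0
# The distortion preserves ranks, so rank-based diagnostics do not change.
print("Rank order preserved:", np.array_equal(np.argsort(c_cal), np.argsort(c_dist)))
```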
4. Algorithmic Procedure for Computing ERCE
A practical computation of ERCE for a dataset proceeds as follows:
```
1. Sort indices i by increasing u_i.
2. Partition indices into B bins of nearly equal size: I_b for b = 1..B.
3. For each bin b:
       uct[b] = (1/|I_b|) * sum_{i in I_b} u_i
       crc[b] = (1/|I_b|) * sum_{i in I_b} a_i
4. For each bin b:
       frac_u_le[b]   = (1/(B-1)) * sum_{b' != b} 1[uct[b'] <= uct[b]]
       frac_reg_ge[b] = (1/(B-1)) * sum_{b' != b} 1[crc[b'] >= crc[b]]
5. ERCE = 0
   For each bin b:
       For each i in I_b:
           ERCE += abs(frac_reg_ge[b] - frac_u_le[b])
   ERCE = ERCE / n
```
The procedure leverages binwise averages for robust, interpretable rank assignments, and computes deviations between the fractional rank of uncertainty and the corresponding ranking induced by correctness.
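A compact NumPy version of this procedure might look as follows. This is an illustrative sketch that follows the pseudocode above rather than the authors' released implementation; the function name `erce`, the default bin count, and the synthetic data at the end are choices made here for the example.

```python
import numpy as np

def erce(u, a, n_bins=20):
    """Empirical ERCE for uncertainty scores u and correctness scores a.

    Follows the binned procedure above: sort by uncertainty, form B nearly
    equal-mass bins, then compare each bin's fractional rank in average
    uncertainty with its fractional rank in average correctness.
    """
    u, a = np.asarray(u, dtype=float), np.asarray(a, dtype=float)
    n = len(u)
    order = np.argsort(u)                      # step 1: sort indices by u_i
    bins = np.array_split(order, n_bins)       # step 2: B equal-mass bins
    uct = np.array([u[idx].mean() for idx in bins])  # binwise avg uncertainty
    crc = np.array([a[idx].mean() for idx in bins])  # binwise avg correctness

    total, B = 0.0, len(bins)
    for b, idx in enumerate(bins):             # steps 4-5
        others = np.delete(np.arange(B), b)
        frac_u_le = np.mean(uct[others] <= uct[b])
        frac_reg_ge = np.mean(crc[others] >= crc[b])
        total += len(idx) * abs(frac_reg_ge - frac_u_le)
    return total / n

# Synthetic check: an informative uncertainty measure vs. an uninformative one.
rng = np.random.default_rng(0)
u = rng.uniform(size=5_000)
a_good = np.clip(1.0 - u + 0.1 * rng.normal(size=5_000), 0.0, 1.0)
a_rand = rng.uniform(size=5_000)
print(f"ERCE, correctness tracks uncertainty: {erce(u, a_good):.3f}")  # near 0
print(f"ERCE, correctness unrelated to u:     {erce(u, a_rand):.3f}")  # larger
```

With $B$ bins the rank fractions live on a grid of width $1/(B-1)$, so the bin count trades off rank resolution against per-bin estimation noise.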
5. Empirical Findings Across LLMs and Uncertainty Measures
ERCE was applied across four benchmarks (TriviaQA, Natural Questions, SQuAD, Meadow) and for three LMs (Llama-2-7b, Llama-2-7b-chat, GPT-3.5-turbo). Six representative uncertainty/confidence measures were evaluated:
- $U_{\mathrm{NLL}}$ (negative log-likelihood)
- $U_{\mathrm{SE}}$ (semantic entropy)
- $U_{\mathrm{Ecc}}$, $U_{\mathrm{Deg}}$, $U_{\mathrm{EigV}}$ (affinity-graph-based measures)
- $C_{\mathrm{Verb}}$ (verbalized confidence)
On TriviaQA with ROUGE-L correctness and GPT-3.5-turbo, negative log-likelihood uncertainty yields the lowest (best) ERCE and semantic entropy is competitive, while affinity-graph eccentricity and prompt-based verbalized confidence perform worst; the numerical ERCE scores are tabulated in the source paper. These trends persist across datasets and LM architectures.
6. Advantages Over Threshold-Based and Range-Sensitive Metrics
ERCE addresses several persistent challenges in evaluating LM uncertainty:
- No correctness threshold: Unlike metrics such as ECE, AUROC, or AUPRC that rely on thresholded or binarized correctness (e.g., ROUGE-L above a fixed cutoff), ERCE operates on raw, continuous scores, thus avoiding threshold sensitivity.
- Range-invariant comparison: By construction, ERCE depends only on the ranks of $U$ and of expected correctness $\mathrm{reg}(U)$, making it suitable for measures defined over $[0, \infty)$ (e.g., entropy), $[0, 1]$ (e.g., degree-based measures), or any other range.
- Decoupling from absolute LM performance: High overall correctness does not automatically yield low ERCE. Instead, ERCE directly probes the rank-ordered fidelity of uncertainty.
- Granular interpretability: Binwise differences between uncertainty and correctness ranks can be visualized (as “indication diagrams”), locally diagnosing systematic over- or under-confidence; a minimal plotting sketch is given below.
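The following is a rough sketch of such a diagram (written for this article, not the paper's plotting code; the helper `indication_points` and the toy data are assumptions of the example). It plots each bin's uncertainty rank fraction against its correctness rank fraction, the same quantities used by the ERCE estimator; points on the diagonal correspond to perfect rank-calibration, and deviations flag bins where the two rankings disagree.

```python
import numpy as np
import matplotlib.pyplot as plt

def indication_points(u, a, n_bins=20):
    """Per-bin rank fractions used by the binned ERCE estimator (illustrative)."""
    u, a = np.asarray(u, dtype=float), np.asarray(a, dtype=float)
    order = np.argsort(u)
    bins = np.array_split(order, n_bins)
    uct = np.array([u[idx].mean() for idx in bins])
    crc = np.array([a[idx].mean() for idx in bins])
    B = len(bins)
    frac_u_le = np.array([np.mean(np.delete(uct, b) <= uct[b]) for b in range(B)])
    frac_reg_ge = np.array([np.mean(np.delete(crc, b) >= crc[b]) for b in range(B)])
    return frac_u_le, frac_reg_ge

# Toy data: an uncertainty measure only loosely predictive of correctness.
rng = np.random.default_rng(1)
u = rng.uniform(size=4_000)
a = np.clip(1.0 - u + 0.4 * rng.normal(size=4_000), 0.0, 1.0)

x, y = indication_points(u, a)
plt.scatter(x, y, label="bins")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect rank-calibration")
plt.xlabel("uncertainty rank fraction")
plt.ylabel("correctness rank fraction")
plt.legend()
plt.show()
```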
In essence, ERCE provides a unified, direct, and threshold-free assessment of how well an uncertainty measure ranks outputs by their expected correctness, filling a critical methodological gap in the calibration of modern LLMs for generation tasks (Huang et al., 4 Apr 2024).