
ERCE: Ranking Calibration Error for LMs

Updated 9 November 2025
  • ERCE is a metric that assesses rank-calibration by comparing the ordering of uncertainty scores with expected correctness in language model outputs.
  • It computes the average deviation between ranks, using bin-based averaging of uncertainty and correctness measures for robust evaluation.
  • ERCE provides a threshold-free evaluation that decouples from absolute performance, enabling fine-grained analysis of LM uncertainty and calibration.

Expected Ranking Calibration Error (ERCE) is a metric designed to quantify the calibration quality of uncertainty measures for language model (LM) outputs, focusing specifically on the relative ranking of uncertainty vis-à-vis generation quality. ERCE addresses the limitations of threshold-dependent and range-sensitive calibration metrics, providing a unified, principled approach for evaluating how well an uncertainty or confidence measure predicts expected correctness in natural language generation tasks (Huang et al., 4 Apr 2024).

1. Rank-Calibration: Formal Definition and Motivation

Rank-calibration is the foundational principle underlying ERCE. Given a query $x$ sampled from a data distribution, an LM generates a response $\widehat{y} \sim P(\cdot \mid x)$. For each $(x, \widehat{y})$ pair, two quantities are computed:

  • $A(x; \widehat{y}) \in \mathbb{R}$: a correctness metric (e.g., ROUGE-L, METEOR, or human rating),
  • $U(x; \widehat{y})$: an uncertainty measure (higher implies greater uncertainty).

The calibration (regression) function is defined as:
$$\mathrm{reg}(u) = \mathbb{E}\big[A(x; \widehat{y}) \mid U(x; \widehat{y}) = u\big]$$

The rank-calibration property requires that lower uncertainty values $u$ correspond to higher expected generation quality, and that the ranks of $U$ and $\mathrm{reg}(U)$ are mirror images:
$$\Pr(U \le u) = \Pr(\mathrm{reg}(U) \ge \mathrm{reg}(u)), \quad \text{for all } u \text{ in the support of } U.$$
An uncertainty measure $U$ that exactly satisfies this is termed rank-calibrated.
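
As a simple illustration with purely hypothetical numbers: suppose $U$ is uniform on $\{1, 2, 3\}$ with $\mathrm{reg}(1) = 0.9$, $\mathrm{reg}(2) = 0.6$, and $\mathrm{reg}(3) = 0.2$. Then $\Pr(U \le 2) = \tfrac{2}{3}$ and $\Pr(\mathrm{reg}(U) \ge \mathrm{reg}(2)) = \Pr(\mathrm{reg}(U) \ge 0.6) = \tfrac{2}{3}$, and the analogous equalities hold at $u = 1$ and $u = 3$, so this $U$ is rank-calibrated. If instead $\mathrm{reg}(2)$ were smaller than $\mathrm{reg}(3)$, the equality would fail at $u = 2$.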

2. Mathematical Formulation of ERCE

ERCE quantifies the average magnitude of deviation from the ideal rank-calibration property. Let $U$ be the uncertainty score of a sample and $U'$ an independent copy. The population ERCE is defined as:
$$\mathrm{ERCE} = \mathbb{E}_{U} \left[ \left| \Pr_{U'}\big( \mathrm{reg}(U') \ge \mathrm{reg}(U) \big) - \Pr_{U'}\big( U' \le U \big) \right| \right]$$
This evaluates, across the distribution of uncertainty values, how far the ranking induced by the uncertainty measure disagrees with the ranking implied by expected correctness.

A finite-sample estimator of ERCE proceeds as follows:

  1. Collect $n$ i.i.d. pairs $(u_i, a_i)$, for $i = 1, \ldots, n$.
  2. Sort and partition the $u_i$'s into $B$ equal-mass bins.
  3. For bin $b$ (with index set $I_b$), compute:
    • binwise average uncertainty: $\mathrm{uct}_b = \frac{1}{|I_b|}\sum_{i \in I_b} u_i$
    • binwise average correctness: $\mathrm{crc}_b = \frac{1}{|I_b|}\sum_{i \in I_b} a_i$
  4. For any sample $i$ in bin $b$:
    • $\widehat{\Pr}(U' \le u_i) = \frac{1}{B-1}\sum_{b' \ne b} \mathbf{1}[\mathrm{uct}_{b'} \le \mathrm{uct}_b]$
    • $\widehat{\Pr}(\mathrm{reg}(U') \ge \mathrm{reg}(u_i)) = \frac{1}{B-1}\sum_{b' \ne b} \mathbf{1}[\mathrm{crc}_{b'} \ge \mathrm{crc}_b]$
  5. The empirical ERCE is:
$$\widehat{\mathrm{ERCE}} = \frac{1}{n} \sum_{i=1}^n \left| \widehat{\Pr}(\mathrm{reg}(U') \ge \mathrm{reg}(u_i)) - \widehat{\Pr}(U' \le u_i) \right|$$
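
As a toy walk-through of these steps with purely illustrative numbers: suppose $B = 3$ equal-mass bins yield binwise averages $\mathrm{uct} = (0.1, 0.5, 0.9)$ and $\mathrm{crc} = (0.8, 0.6, 0.7)$. For samples in the middle bin, $\widehat{\Pr}(U' \le u_i) = \tfrac{1}{2}$ (one of the two other bins has lower average uncertainty) while $\widehat{\Pr}(\mathrm{reg}(U') \ge \mathrm{reg}(u_i)) = 1$ (both other bins have higher average correctness), contributing $|1 - \tfrac{1}{2}| = \tfrac{1}{2}$; the first bin contributes $0$ and the third contributes $\tfrac{1}{2}$, so $\widehat{\mathrm{ERCE}} = \tfrac{1}{3}$.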

This formulation is invariant to monotonic transformations and to the range of $U$, focusing solely on rank relationships.

3. Theoretical Properties and Connection to Classical Calibration

ERCE generalizes the classical calibration error to ranking settings with non-binary correctness measures. Key characterizations include:

  • Theorem 1 (Equivalence to classical calibration for binary correctness): If $A \in \{0, 1\}$, then $\mathrm{ERCE} = 0$ if and only if there exists a strictly decreasing function $g$ such that $g(U)$ is classically calibrated: $\Pr(A = 1 \mid g(U) = c) = c$.
  • The converse also holds: any calibrated confidence $C$ yields a rank-calibrated uncertainty measure $h(C)$ for any strictly decreasing $h$.
  • Proposition 1 (ERCE is independent of ECE): For any pair $(\alpha, \beta) \in (0, 1/2]^2$, one can construct a confidence measure $C$ with $\mathrm{ERCE}(C) = \alpha$ and standard Expected Calibration Error $\mathrm{ECE}(C) = \beta$.

These statements underline that ERCE and ECE capture distinct aspects of calibration: ERCE is purely about rank-consistency between uncertainty estimates and expected correctness, whereas ECE measures absolute probability calibration.
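
A simple, purely illustrative way to see the separation: if a confidence measure reports $C = \tfrac{1}{2}\Pr(A = 1 \mid x)$, it is a strictly increasing transform of the true conditional probability, so the induced uncertainty $-C$ is perfectly rank-calibrated ($\mathrm{ERCE} = 0$); yet $C$ systematically understates the probability of correctness and therefore incurs a substantial ECE.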

4. Algorithmic Procedure for Computing ERCE

A practical computation of ERCE for a dataset proceeds as follows:

1. Sort indices i by increasing u_i.
2. Partition indices into B bins of nearly equal size: I_b for b=1..B.
3. For each bin b:
   uct[b] = (1/|I_b|) * sum_{i in I_b} u_i
   crc[b] = (1/|I_b|) * sum_{i in I_b} a_i
4. For each bin b:
   frac_u_le[b]  = (1/(B-1)) * sum_{b' != b} 1[uct[b'] <= uct[b]]
   frac_reg_ge[b]= (1/(B-1)) * sum_{b' != b} 1[crc[b'] >= crc[b]]
5. ERCE = 0
   For each bin b:
      For each i in I_b:
         ERCE += abs(frac_reg_ge[b] - frac_u_le[b])
   ERCE = ERCE / n

The procedure leverages binwise averages for robust, interpretable rank assignments, and computes deviations between the fractional rank of uncertainty and the corresponding ranking induced by correctness.
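
The following is a minimal NumPy sketch of this procedure, assuming equal-mass binning via sorting; the function name erce and the parameter n_bins are illustrative choices, not identifiers from the paper or its codebase.

import numpy as np

def erce(u, a, n_bins=20):
    # Empirical ERCE from uncertainty scores u and correctness scores a.
    u = np.asarray(u, dtype=float)
    a = np.asarray(a, dtype=float)
    n = len(u)
    order = np.argsort(u)                             # step 1: sort by increasing uncertainty
    bins = np.array_split(order, n_bins)              # step 2: B bins of nearly equal size
    uct = np.array([u[idx].mean() for idx in bins])   # step 3: binwise mean uncertainty
    crc = np.array([a[idx].mean() for idx in bins])   # step 3: binwise mean correctness
    total = 0.0
    for b in range(len(bins)):
        others = [bp for bp in range(len(bins)) if bp != b]
        frac_u_le = np.mean(uct[others] <= uct[b])    # step 4: fractional rank by uncertainty
        frac_reg_ge = np.mean(crc[others] >= crc[b])  # step 4: fractional rank by correctness
        total += len(bins[b]) * abs(frac_reg_ge - frac_u_le)  # step 5: weighted by bin size
    return total / n

On synthetic data where correctness is an exactly decreasing function of uncertainty, this sketch should return zero, matching the rank-calibrated ideal.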

5. Empirical Findings Across LLMs and Uncertainty Measures

ERCE was applied across four benchmarks (TriviaQA, Natural Questions, SQuAD, Meadow) and for three LMs (Llama-2-7b, Llama-2-7b-chat, GPT-3.5-turbo). Six representative uncertainty/confidence measures were evaluated:

  • $U_\mathrm{NLL}$ (negative log-likelihood)
  • $U_\mathrm{SE}$ (semantic entropy)
  • $U_\mathrm{EigV}$, $U_\mathrm{Deg}$, $U_\mathrm{Ecc}$ (affinity-graph-based measures)
  • $C_\mathrm{Verb}$ (verbalized confidence)

Representative ERCE values on TriviaQA (ROUGE-L correctness, GPT-3.5-turbo) are as follows:

Uncertainty/Confidence Measure | ERCE Score
$U_\mathrm{NLL}$ | $\approx 0.037$
$U_\mathrm{SE}$ | $\approx 0.051$
$U_\mathrm{Deg}$ | $\approx 0.050$
$U_\mathrm{EigV}$ | $\approx 0.065$
$U_\mathrm{Ecc}$ | $\approx 0.151$
$C_\mathrm{Verb}$ | $\approx 0.487$

Negative log-likelihood uncertainty yields the lowest (best) ERCE and semantic entropy is competitive, while affinity-graph eccentricity and prompt-based verbalized confidence perform worst. These trends persist across datasets and LM architectures.

6. Advantages Over Threshold-Based and Range-Sensitive Metrics

ERCE addresses several persistent challenges in evaluating LM uncertainty:

  • No correctness threshold: Unlike metrics such as ECE, AUROC, or AUPRC that rely on thresholded or binarized correctness (e.g., ROUGE $\ge \tau$), ERCE operates on raw, continuous scores, thus avoiding threshold sensitivity.
  • Range-invariant comparison: By construction, ERCE depends only on the ranks of $U$ and $\mathrm{reg}(U)$, making it suitable for measures defined over $[0, \infty)$ (e.g., entropy), $[0, 1]$ (e.g., degeneracy), or any other range; a brief numerical check of this invariance is sketched after this list.
  • Decoupling from absolute LM performance: High overall correctness does not automatically yield low ERCE. Instead, ERCE directly probes the rank-ordered fidelity of uncertainty.
  • Granular interpretability: Binwise differences between uncertainty and correctness ranks can be visualized (as "indication diagrams"), locally diagnosing systematic over- or under-confidence.
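
As a quick numerical check of the range-invariance property, the hypothetical erce helper sketched in Section 4 can be applied before and after a strictly increasing transform of the uncertainty scores; the value is unchanged because only binwise rank comparisons enter the computation. The synthetic data below and the erce name are assumptions carried over from that sketch, not from the paper.

import numpy as np

rng = np.random.default_rng(0)
u = rng.gamma(2.0, 1.0, size=2000)                                       # uncertainties on [0, inf)
a = np.clip(1.0 - 0.3 * u + 0.2 * rng.standard_normal(2000), 0.0, 1.0)   # correctness scores in [0, 1]

print(erce(u, a, n_bins=20))            # assumes erce() from the Section 4 sketch is in scope
print(erce(np.log1p(u), a, n_bins=20))  # identical value: a monotone transform leaves ranks intact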

In essence, ERCE provides a unified, direct, and threshold-free assessment of how well an uncertainty measure ranks outputs by their expected correctness, filling a critical methodological gap in the calibration of modern LLMs for generation tasks (Huang et al., 4 Apr 2024).

References (1)
