ERCE: Ranking Calibration Error for LMs
- ERCE is a metric that assesses rank-calibration by comparing the ordering of uncertainty scores with expected correctness in language model outputs.
- It computes the average deviation between ranks, using bin-based averaging of uncertainty and correctness measures for robust evaluation.
- ERCE provides a threshold-free evaluation that decouples from absolute performance, enabling fine-grained analysis of LM uncertainty and calibration.
Expected Ranking Calibration Error (ERCE) is a metric designed to quantify the calibration quality of uncertainty measures for language model (LM) outputs, specifically focusing on the relative ranking of uncertainty vis-à-vis generation quality. ERCE addresses the limitations of threshold-dependent and range-sensitive calibration metrics, providing a unified, principled approach for evaluating how well an uncertainty or confidence measure predicts expected correctness in natural language generation tasks (Huang et al., 4 Apr 2024).
1. Rank-Calibration: Formal Definition and Motivation
Rank-calibration is the foundational principle underlying ERCE. Given a query $x$ sampled from a data distribution, an LM generates a response $r$. For each pair $(x, r)$, two quantities are computed:
- $A$: a correctness metric (e.g., ROUGE-L, METEOR, or human rating),
- $U$: an uncertainty measure (higher implies greater uncertainty).

The calibration (regression) function is defined as
$$\mathrm{reg}(u) := \mathbb{E}[A \mid U = u].$$
The rank-calibration property requires that lower uncertainty ($U$) values correspond to higher expected generation quality, and that the ranks of $U$ and $\mathrm{reg}(U)$ are mirror images:
$$\mathbb{P}_{U'}\big(\mathrm{reg}(U') \ge \mathrm{reg}(u)\big) = \mathbb{P}_{U'}\big(U' \le u\big) \quad \text{for all } u \text{ in the support of } U,$$
where $U'$ denotes an independent copy of $U$. An uncertainty measure that exactly satisfies this is termed rank-calibrated.
2. Mathematical Formulation of ERCE
ERCE quantifies the average magnitude of deviation from the ideal rank-calibration property. Let $U$ be the uncertainty score of a sample and $U'$ an independent copy. The population ERCE is defined as
$$\mathrm{ERCE} := \mathbb{E}_{U}\Big[\,\big|\,\mathbb{P}_{U'}\big(\mathrm{reg}(U') \ge \mathrm{reg}(U)\big) - \mathbb{P}_{U'}\big(U' \le U\big)\,\big|\,\Big].$$
This evaluates, across the distribution of uncertainty values, how far the practical ranking of the uncertainty measure disagrees with the implied ranking from expected correctness.
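As a brief sanity check on this definition (a short derivation added here, not taken from the source): if $U$ is exactly rank-calibrated, the two probabilities inside the absolute value coincide at every uncertainty level, so the integrand vanishes; and since both terms are probabilities, the metric is automatically bounded:
$$\mathbb{P}_{U'}\big(\mathrm{reg}(U') \ge \mathrm{reg}(u)\big) = \mathbb{P}_{U'}\big(U' \le u\big)\ \ \forall u \;\Longrightarrow\; \mathrm{ERCE} = 0, \qquad \text{and } 0 \le \mathrm{ERCE} \le 1 \text{ in general}.$$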
A finite-sample estimator of ERCE proceeds as follows:
- Collect $n$ i.i.d. pairs $(u_i, a_i)$, for $i = 1, \dots, n$.
- Sort and partition the $u_i$'s into $B$ equal-mass bins $I_1, \dots, I_B$.
- For bin $b$, compute:
  - the binwise average uncertainty: $\bar{u}_b = \frac{1}{|I_b|} \sum_{i \in I_b} u_i$,
  - the binwise average correctness: $\bar{a}_b = \frac{1}{|I_b|} \sum_{i \in I_b} a_i$.
- For any sample $i$ in bin $b$, estimate the two rank fractions as
$$\widehat{\mathbb{P}}\big(\mathrm{reg}(U') \ge \mathrm{reg}(u_i)\big) = \frac{1}{B-1} \sum_{b' \ne b} \mathbf{1}\{\bar{a}_{b'} \ge \bar{a}_b\}, \qquad \widehat{\mathbb{P}}\big(U' \le u_i\big) = \frac{1}{B-1} \sum_{b' \ne b} \mathbf{1}\{\bar{u}_{b'} \le \bar{u}_b\}.$$
- The empirical ERCE is:
$$\widehat{\mathrm{ERCE}} = \frac{1}{n} \sum_{b=1}^{B} \sum_{i \in I_b} \bigg| \frac{1}{B-1} \sum_{b' \ne b} \mathbf{1}\{\bar{a}_{b'} \ge \bar{a}_b\} - \frac{1}{B-1} \sum_{b' \ne b} \mathbf{1}\{\bar{u}_{b'} \le \bar{u}_b\} \bigg|.$$
This formulation is invariant to order-preserving (monotone increasing) transformations of $U$ and to its range, focusing solely on rank relationships.
3. Theoretical Properties and Connection to Classical Calibration
ERCE generalizes the classical calibration error to ranking settings with non-binary correctness measures. Key characterizations include:
- Theorem 1 (Equivalence to classical calibration for binary correctness): If $A \in \{0, 1\}$, then $U$ is rank-calibrated if and only if a strictly decreasing function $g$ exists such that the confidence $C = g(U)$ is classically calibrated: $\mathbb{P}(A = 1 \mid C = c) = c$.
- The converse direction also holds constructively: any calibrated confidence $C$ yields a rank-calibrated uncertainty measure $g(C)$ for any strictly decreasing $g$.
- Proposition 1 (ERCE is independent of ECE): For any attainable pair of target values $(\alpha, \beta)$, one can construct a joint distribution of $(U, A)$ with $\mathrm{ERCE} = \alpha$ and standard Expected Calibration Error $\mathrm{ECE} = \beta$.
These statements underline that ERCE and ECE capture distinct aspects of calibration: ERCE is purely about rank-consistency between uncertainty estimates and expected correctness, whereas ECE measures absolute probability calibration.
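To make this decoupling concrete, the following toy sketch (an illustration written for this article, not code from the paper) applies a strictly increasing distortion to a perfectly calibrated confidence: the binned ECE becomes large, while the ordering of samples is untouched, so any purely rank-based criterion such as ERCE is unaffected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): true correctness probabilities p_i, binary
# correctness a_i ~ Bernoulli(p_i), and two confidence measures:
#   c_cal  = p        (perfectly calibrated)
#   c_dist = p ** 3   (strictly increasing distortion: same ranks, poor ECE)
n = 50_000
p = rng.uniform(0.05, 0.95, size=n)
a = rng.binomial(1, p)
c_cal, c_dist = p, p ** 3

def binned_ece(conf, correct, n_bins=15):
    """Standard equal-width-bin ECE: weighted |bin accuracy - bin confidence|."""
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

print(f"ECE (calibrated confidence): {binned_ece(c_cal, a):.3f}")   # near 0
print(f"ECE (distorted confidence):  {binned_ece(c_dist, a):.3f}")  # clearly > 0
# The distortion preserves ranks, so rank-based diagnostics do not change.
print("Rank order preserved:", np.array_equal(np.argsort(c_cal), np.argsort(c_dist)))
```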
4. Algorithmic Procedure for Computing ERCE
A practical computation of ERCE for a dataset proceeds as follows:
```
1. Sort indices i by increasing u_i.
2. Partition indices into B bins of nearly equal size: I_b for b = 1..B.
3. For each bin b:
       uct[b] = (1/|I_b|) * sum_{i in I_b} u_i
       crc[b] = (1/|I_b|) * sum_{i in I_b} a_i
4. For each bin b:
       frac_u_le[b]   = (1/(B-1)) * sum_{b' != b} 1[uct[b'] <= uct[b]]
       frac_reg_ge[b] = (1/(B-1)) * sum_{b' != b} 1[crc[b'] >= crc[b]]
5. ERCE = 0
   For each bin b:
       For each i in I_b:
           ERCE += abs(frac_reg_ge[b] - frac_u_le[b])
   ERCE = ERCE / n
```
The procedure leverages binwise averages for robust, interpretable rank assignments, and computes deviations between the fractional rank of uncertainty and the corresponding ranking induced by correctness.
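A compact NumPy version of this procedure might look as follows. This is an illustrative sketch that follows the pseudocode above rather than the authors' released implementation; the function name `erce`, the default bin count, and the synthetic data at the end are choices made here for the example.

```python
import numpy as np

def erce(u, a, n_bins=20):
    """Empirical ERCE for uncertainty scores u and correctness scores a.

    Follows the binned procedure above: sort by uncertainty, form B nearly
    equal-mass bins, then compare each bin's fractional rank in average
    uncertainty with its fractional rank in average correctness.
    """
    u, a = np.asarray(u, dtype=float), np.asarray(a, dtype=float)
    n = len(u)
    order = np.argsort(u)                      # step 1: sort indices by u_i
    bins = np.array_split(order, n_bins)       # step 2: B equal-mass bins
    uct = np.array([u[idx].mean() for idx in bins])  # binwise avg uncertainty
    crc = np.array([a[idx].mean() for idx in bins])  # binwise avg correctness

    total, B = 0.0, len(bins)
    for b, idx in enumerate(bins):             # steps 4-5
        others = np.delete(np.arange(B), b)
        frac_u_le = np.mean(uct[others] <= uct[b])
        frac_reg_ge = np.mean(crc[others] >= crc[b])
        total += len(idx) * abs(frac_reg_ge - frac_u_le)
    return total / n

# Synthetic check: an informative uncertainty measure vs. an uninformative one.
rng = np.random.default_rng(0)
u = rng.uniform(size=5_000)
a_good = np.clip(1.0 - u + 0.1 * rng.normal(size=5_000), 0.0, 1.0)
a_rand = rng.uniform(size=5_000)
print(f"ERCE, correctness tracks uncertainty: {erce(u, a_good):.3f}")  # near 0
print(f"ERCE, correctness unrelated to u:     {erce(u, a_rand):.3f}")  # larger
```

With $B$ bins the rank fractions live on a grid of width $1/(B-1)$, so the bin count trades off rank resolution against per-bin estimation noise.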
5. Empirical Findings Across LLMs and Uncertainty Measures
ERCE was applied across four benchmarks (TriviaQA, Natural Questions, SQuAD, Meadow) and for three LMs (Llama-2-7b, Llama-2-7b-chat, GPT-3.5-turbo). Six representative uncertainty/confidence measures were evaluated:
- $U_{\mathrm{NLL}}$ (negative log-likelihood)
- $U_{\mathrm{SE}}$ (semantic entropy)
- $U_{\mathrm{Ecc}}$, $U_{\mathrm{Deg}}$, $U_{\mathrm{EigV}}$ (affinity-graph-based measures)
- $C_{\mathrm{Verb}}$ (verbalized confidence)
On TriviaQA with ROUGE-L correctness and GPT-3.5-turbo, negative log-likelihood uncertainty yields the lowest (best) ERCE and semantic entropy is competitive, while affinity-graph eccentricity and prompt-based verbalized confidence perform worst; the numerical ERCE scores are tabulated in the source paper. These trends persist across datasets and LM architectures.
6. Advantages Over Threshold-Based and Range-Sensitive Metrics
ERCE addresses several persistent challenges in evaluating LM uncertainty:
- No correctness threshold: Unlike metrics such as ECE, AUROC, or AUPRC that rely on thresholded or binarized correctness (e.g., ROUGE-L above a fixed cutoff), ERCE operates on raw, continuous scores, thus avoiding threshold sensitivity.
- Range-invariant comparison: By construction, ERCE depends only on the ranks of $U$ and of expected correctness $\mathrm{reg}(U)$, making it suitable for measures defined over $[0, \infty)$ (e.g., entropy), $[0, 1]$ (e.g., degree-based measures), or any other range.
- Decoupling from absolute LM performance: High overall correctness does not automatically yield low ERCE. Instead, ERCE directly probes the rank-ordered fidelity of uncertainty.
- Granular interpretability: Binwise differences between uncertainty and correctness ranks can be visualized (as “indication diagrams”), locally diagnosing systematic over- or under-confidence; a minimal plotting sketch is given below.
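The following is a rough sketch of such a diagram (written for this article, not the paper's plotting code; the helper `indication_points` and the toy data are assumptions of the example). It plots each bin's uncertainty rank fraction against its correctness rank fraction, the same quantities used by the ERCE estimator; points on the diagonal correspond to perfect rank-calibration, and deviations flag bins where the two rankings disagree.

```python
import numpy as np
import matplotlib.pyplot as plt

def indication_points(u, a, n_bins=20):
    """Per-bin rank fractions used by the binned ERCE estimator (illustrative)."""
    u, a = np.asarray(u, dtype=float), np.asarray(a, dtype=float)
    order = np.argsort(u)
    bins = np.array_split(order, n_bins)
    uct = np.array([u[idx].mean() for idx in bins])
    crc = np.array([a[idx].mean() for idx in bins])
    B = len(bins)
    frac_u_le = np.array([np.mean(np.delete(uct, b) <= uct[b]) for b in range(B)])
    frac_reg_ge = np.array([np.mean(np.delete(crc, b) >= crc[b]) for b in range(B)])
    return frac_u_le, frac_reg_ge

# Toy data: an uncertainty measure only loosely predictive of correctness.
rng = np.random.default_rng(1)
u = rng.uniform(size=4_000)
a = np.clip(1.0 - u + 0.4 * rng.normal(size=4_000), 0.0, 1.0)

x, y = indication_points(u, a)
plt.scatter(x, y, label="bins")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect rank-calibration")
plt.xlabel("uncertainty rank fraction")
plt.ylabel("correctness rank fraction")
plt.legend()
plt.show()
```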
In essence, ERCE provides a unified, direct, and threshold-free assessment of how well an uncertainty measure ranks outputs by their expected correctness, filling a critical methodological gap in the calibration of modern LLMs for generation tasks (Huang et al., 4 Apr 2024).